machinelearning/src/Microsoft.ML.Transforms/Text/doc.xml at master · devhttps/machinelearning

237 lines (229 loc) · 12.3 KB
﻿<?xml version="1.0" encoding="utf-8" ?>
  <members>
    <member name="TextTransform">
      <summary>
        A transform that turns a collection of text documents into numerical feature vectors.
        The feature vectors are normalized counts of (word and/or character) ngrams in a given tokenized text.
      </summary>
      <remarks>
        The TextFeaturizer transform gives user one-stop solution for doing:
        <list type="bullet">
          <item><description>Language Detection</description></item>
          <item><description>Tokenzation​</description></item>
          <item><description>Text normalization</description></item>
          <item><description>Predefined and custom stopwords removal.</description></item>
          <item><description>Word-based or character-based Ngram and SkipGram extraction.​</description></item>
          <item><description>TF, IDF or TF-IDF.</description></item>
          <item><description>L-p vector normalization.​</description></item>
        </list>
        The TextFeaturizer will show the transformed text, after being applied.
        It converts a collection of text columns to a matrix of token  ngrams/skip-grams counts.
        Features are made of (word/character) n-grams/skip-grams​ and the number of features are equal to the vocabulary size found by analyzing the data.
      </remarks>
    </member>
    <example name="TextTransform">
      <example>
        <code language="csharp">
          pipeline.Add(new TextFeaturizer(&quot;Features&quot;, &quot;SentimentText&quot;)
            KeepDiacritics = false,
            KeepPunctuations = false,
            TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
            OutputTokens = true,
            StopWordsRemover = new PredefinedStopWordsRemover(),
            VectorNormalizer = TextTransformTextNormKind.L2,
            CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 3, AllLengths = false },
            WordFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = true }
          });
        </code>
      </example>
    </example>
    <member name="WordTokenizer">
      <summary>
        This transform splits the text into words using the separator character(s).
      </summary>
      <remarks>
        The input for this transform is a ReadOnlyMemory or a vector of ReadOnlyMemory,
        and its output is a vector of ReadOnlyMemory, corresponding to the tokens in the input text.
        The output is generated by splitting the input text, using a set of user specified separator characters.
        Empty strings and strings containing only spaces are dropped.
        This transform is not typically used on its own, but it is one of the transforms composing the Text Featurizer.
      </remarks>
    </member>
    <example name="WordTokenizer">
      <example>
        <code language="csharp">
          pipeline.Add( new WordTokenizer(&quot;TextColumn&quot;)
            TermSeparators = &quot;&apos; &apos;, &apos;\t&apos;, &apos;;&apos;&quot;  
          });
        </code>
      </example>
    </example>
    <member name="NgramTranslator">
      <summary>
        This transform produces a bag of counts of n-grams (sequences of consecutive values of length 1-n) in a given vector of keys. 
        It does so by building a dictionary of n-grams and using the id in the dictionary as the index in the bag.
      </summary>
      <remarks>
        This transform produces a matrix of token ngrams/skip-grams counts for a given corpus of text.
        The n-grams are represented as count vectors, with vector slots corresponding to n-grams.
        Embedding ngrams in a vector space allows their contents to be compared in an efficient manner. 
        The slot values in the vector can be weighted by the following factors:
        <list type="bullet">
          <item>
            <term>term frequency</term>
            <description> the number of occurrences of the slot in the text</description>
          </item>
          <item>
            <term>inverse document frequency</term>
            <description> a ratio (the logarithm of inverse relative slot frequency)
              that measures the information a slot provides by determining how common or rare it is across the entire text.</description>
          </item>
            <item>
              <term>term frequency-inverse document frequency</term>
              <description> the product term frequency and the inverse document frequency.</description>
            </item>
        </list>
        This transform is not typically used on its own, but it is one of the transforms composing the <see cref="Microsoft.ML.Transforms.TextFeaturizer">Text Featurizer</see> .
      </remarks>
      <seealso cref="Microsoft.ML.Transforms.WordTokenizer"/>
      <seealso cref="Microsoft.ML.Transforms.TextToKey"/>
      <seealso cref="Microsoft.ML.Transforms.TextFeaturizer"/>
      <seealso cref="Microsoft.ML.Transforms.CharacterTokenizer"/>
      <example>
        <code language="csharp">
          pipeline.Add(new NGramTranslator(&quot;TextColumn&quot;)
            Weighting=NgramTransformWeightingCriteria.TfIdf  
          });
      </code>
      </example>
    </member>
    <member name="SentimentAnalyzer">
      <summary>
        Uses a pretrained sentiment model to score input strings.
      </summary>
      <remarks>
        <para>The Sentiment transform returns the probability that the sentiment of a natural text is positive. </para>
        <para>
          The model was trained with the <a href="https://anthology.aclweb.org/P/P14/P14-1146.pdf">Sentiment-specific word embedding (SSWE)</a>  and NGramFeaturizer on Twitter sentiment data,
          similarly to the sentiment analysis part of the
          <a href="https://www.microsoft.com/cognitive-services/en-us/text-analytics-api">Text Analytics cognitive service</a>. 
          The transform outputs a score between 0 and 1 as a sentiment prediction 
          (where 0 is a negative sentiment and 1 is a positive sentiment).</para> 
          <para>Currently it supports only English.</para>
      </remarks>
    </member>
    <exaple>
      <example name="SentimentAnalyzer">
        <code language="csharp">
          pipeline.Add(new SentimentAnalyzer()
            Source = &quot;TextColumn&quot; 
          });
        </code>
      </example>
    </exaple>
    <member name="CharacterTokenizer">
      <summary>
        This transform breaks text into individual tokens, each consisting of an individual character.
      </summary>
      <remarks>
      This transform is not typically used on its own, but it is one of the transforms composing the 
      <see cref="Microsoft.ML.Transforms.TextFeaturizer">Text Featurizer</see>. 
      </remarks>
      <seealso cref="Microsoft.ML.Transforms.WordTokenizer"/>
      <seealso cref="Microsoft.ML.Transforms.TextToKey"/>
      <seealso cref="Microsoft.ML.Transforms.NGramTranslator"/>
      <seealso cref="Microsoft.ML.Transforms.TextFeaturizer"/>
      <example>
        <code language="csharp">
          pipeline.Add(new CharacterTokenizer(&quot;TextCol1&quot; , &quot;TextCol2&quot; ));
        </code>
      </example>
    </member>
    <member name="LightLDA">
      <summary>
        The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.
      </summary>
      <remarks>
        Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data,
        and can be used to featurize any text fields as low-dimensional topical vectors. 
        <para>LightLDA is an extremely efficient implementation of LDA developed in MSR-Asia that incorporates a number of 
         optimization techniques. See <a href="https://arxiv.org/abs/1412.1576">LightLDA: Big Topic Models on Modest Compute Clusters</a>.
        </para>
        <para>
          With the LDA transform, ML.NET users can train a topic model to produce 1 million topics with 1 million vocabulary
          on a 1-billion-token document set one a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
          The most significant innovation is a super-efficient O(1) <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings sampling algorithm</a>,
          whose running cost is (surprisingly) agnostic of model size,
          allowing it to converges nearly an order of magnitude faster than other <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers.</a>
        </para>
        <para>
          For more details please see original LightLDA paper, and its open source implementation. 
          <list type="bullet">
            <item><description><a href="https://arxiv.org/abs/1412.1576"> LightLDA: Big Topic Models on Modest Computer Clusters</a></description></item>
            <item><description><a href=" https://github.com/Microsoft/LightLDA">LightLDA </a></description></item>
          </list>
        </para>
      </remarks>
    </member>
    <example name="LightLDA">
      <example>
        <code language="csharp">
          pipeline.Add(new LightLda((&quot;InTextCol&quot; , &quot;OutTextCol&quot;)));
        </code>
      </example>
    </example>
    <member name="WordEmbeddings">
      <summary>
        Word Embeddings transform is a text featurizer which converts vectors of text tokens into sentence vectors using a pre-trained model.
      </summary>
      <remarks>
        <para>WordEmbeddings wrap different embedding models, such as GloVe. Users can specify which embedding to use. 
        The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">fastText</a>, and <a href="https://anthology.aclweb.org/P/P14/P14-1146.pdf">SSWE</a>.
        </para>
        <para>Note: As WordEmbedding requires a column with text vector, for example, &apos;this&apos;, &apos;is&apos;, &apos;good&apos;, users need to create an input column by
          using the output_tokens=True for TextTransform to convert a column with sentences like &apos;This is good&apos; into &apos;this&apos;, &apos;is&apos;, &apos;good&apos;.
          The suffix of &apos;_TransformedText&apos; is added to the original column name to create the output token column. For instance if the input column is &apos;body&apos;,
          the output tokens column is named &apos;body_TransformedText&apos;.</para>
        <para>
          License attributes for pretrained models:
          <list type="bullet">
            <item>
              <description>
                &quot;fastText Wikipedia 300D&quot; by Facebook, Inc. is licensed under <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA 3.0</a> based on:
                P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov,<a href="https://arxiv.org/abs/1607.04606">Enriching Word Vectors with Subword Information</a>
                More information can be found <a href="https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md">here</a>.
              </description>
            </item>
            <item>
              <description>
                GloVe models by Stanford University, or (Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. 
                <a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe: Global Vectors for Word Representation</a>) is licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
                More information can be found <a href="https://nlp.stanford.edu/projects/glove/">here</a>. 
                Repository can be found <a href="https://github.com/stanfordnlp/GloVe">here</a>.
              </description>
          </item>
        </list>
        </para>
      </remarks>
    </member>
    <example name="WordEmbeddings">
      <example>
        <code language="csharp">
          pipeline.Add(new TextFeaturizer(&quot;Words&quot;, &quot;InputTextCol&quot;)
              TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
              OutputTokens = true,
              CharFeatureExtractor=null,
              WordFeatureExtractor = null
          });
          pipeline.Add(new WordEmbeddings((&quot;Words_TransformedText&quot;, &quot;OutTextCol&quot;)));
        </code>
      </example>
    </example>
  </members>
Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FilesExpand file tree

doc.xml

Latest commit

History

doc.xml

File metadata and controls