<?xml version="1.0" encoding="utf-8" ?>
<doc>
<members>
<member name="TextTransform">
<summary>
A transform that turns a collection of text documents into numerical feature vectors.
The feature vectors are normalized counts of (word and/or character) ngrams in a given tokenized text.
</summary>
<remarks>
The TextFeaturizer transform gives users a one-stop solution for:
<list type="bullet">
<item><description>Language Detection</description></item>
<item><description>Tokenization</description></item>
<item><description>Text normalization</description></item>
<item><description>Predefined and custom stopwords removal.</description></item>
<item><description>Word-based or character-based Ngram and SkipGram extraction.</description></item>
<item><description>TF, IDF or TF-IDF.</description></item>
<item><description>L-p vector normalization.</description></item>
</list>
After it is applied, the TextFeaturizer can also output the transformed text.
It converts a collection of text columns to a matrix of token n-gram/skip-gram counts.
Features are made of (word/character) n-grams/skip-grams, and the number of features is equal to the vocabulary size found by analyzing the data.
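<para>
The count-and-normalize idea described above can be sketched in plain C#. This is only an illustration under stated assumptions, not the TextFeaturizer implementation: it assumes lowercasing, whitespace tokenization, word unigrams, and L2 normalization, and the names CountVectorSketch and Featurize are hypothetical.
</para>
<code language="csharp">
<![CDATA[
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
static class CountVectorSketch
{
    // Maps each document to an L2-normalized vector of word counts.
    // The vector length equals the size of the vocabulary found in the corpus.
    public static double[][] Featurize(string[] documents)
    {
        // Assumption: lowercasing and whitespace splitting stand in for the
        // real text normalization and tokenization steps.
        string[][] tokenized = documents
            .Select(doc => doc.ToLowerInvariant()
                              .Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            .ToArray();

        // Assign one feature slot per distinct token.
        var slots = new Dictionary<string, int>();
        foreach (string word in tokenized.SelectMany(t => t))
            if (!slots.ContainsKey(word))
                slots.Add(word, slots.Count);

        var rows = new double[documents.Length][];
        for (int r = 0; r < tokenized.Length; r++)
        {
            var row = new double[slots.Count];
            foreach (string token in tokenized[r])
                row[slots[token]] += 1.0;

            // L2 normalization: divide by the Euclidean norm; all-zero rows stay zero.
            double norm = Math.Sqrt(row.Sum(v => v * v));
            if (norm > 0)
                for (int i = 0; i < row.Length; i++)
                    row[i] /= norm;

            rows[r] = row;
        }
        return rows;
    }

    static void Main()
    {
        double[][] features = Featurize(new[] { "good movie", "good good plot" });
        foreach (double[] row in features)
            Console.WriteLine(string.Join(" ", row.Select(v => v.ToString("F3"))));
    }
}
]]>
</code>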
</remarks>
</member>
<example name="TextTransform">
<example>
<code language="csharp">
pipeline.Add(new TextFeaturizer("Features", "SentimentText")
{
    KeepDiacritics = false,
    KeepPunctuations = false,
    TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
    OutputTokens = true,
    StopWordsRemover = new PredefinedStopWordsRemover(),
    VectorNormalizer = TextTransformTextNormKind.L2,
    CharFeatureExtractor = new NGramNgramExtractor() { NgramLength = 3, AllLengths = false },
    WordFeatureExtractor = new NGramNgramExtractor() { NgramLength = 2, AllLengths = true }
});
</code>
</example>
</example>
<member name="WordTokenizer">
<summary>
This transform splits the text into words using the separator character(s).
</summary>
<remarks>
The input for this transform is a ReadOnlyMemory or a vector of ReadOnlyMemory,
and its output is a vector of ReadOnlyMemory, corresponding to the tokens in the input text.
The output is generated by splitting the input text on a set of user-specified separator characters.
Empty strings and strings containing only spaces are dropped.
This transform is not typically used on its own, but it is one of the transforms composing the Text Featurizer.
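<para>
The splitting behavior described above can be sketched as follows. This is a minimal illustration rather than the transform itself: it works on plain strings instead of ReadOnlyMemory values, and the names WordTokenizerSketch and Tokenize are hypothetical.
</para>
<code language="csharp">
<![CDATA[
using System;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
static class WordTokenizerSketch
{
    // Splits the text on the given separators and drops tokens that are
    // empty or contain only whitespace.
    public static string[] Tokenize(string text, char[] separators)
    {
        return text
            .Split(separators, StringSplitOptions.RemoveEmptyEntries)
            .Where(token => !string.IsNullOrWhiteSpace(token))
            .ToArray();
    }

    static void Main()
    {
        string[] tokens = Tokenize("good;  very\tgood", new[] { ' ', '\t', ';' });
        Console.WriteLine(string.Join("|", tokens)); // prints "good|very|good"
    }
}
]]>
</code>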
</remarks>
</member>
<example name="WordTokenizer">
<example>
<code language="csharp">
pipeline.Add(new WordTokenizer("TextColumn")
{
    TermSeparators = "' ', '\t', ';'"
});
</code>
</example>
</example>
<member name="NgramTranslator">
<summary>
This transform produces a bag of counts of n-grams (sequences of consecutive values of length 1-n) in a given vector of keys.
It does so by building a dictionary of n-grams and using the id in the dictionary as the index in the bag.
</summary>
<remarks>
This transform produces a matrix of token n-gram/skip-gram counts for a given corpus of text.
The n-grams are represented as count vectors, with vector slots corresponding to n-grams.
Embedding n-grams in a vector space allows their contents to be compared efficiently.
The slot values in the vector can be weighted by the following factors:
<list type="bullet">
<item>
<term>term frequency</term>
<description> the number of occurrences of the slot in the text</description>
</item>
<item>
<term>inverse document frequency</term>
<description> a ratio (the logarithm of inverse relative slot frequency)
that measures the information a slot provides by determining how common or rare it is across the entire text.</description>
</item>
<item>
<term>term frequency-inverse document frequency</term>
<description> the product of the term frequency and the inverse document frequency.</description>
</item>
</list>
This transform is not typically used on its own, but it is one of the transforms composing the <see cref="Microsoft.ML.Transforms.TextFeaturizer">Text Featurizer</see>.
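<para>
The dictionary-based bag and the weighting factors listed above can be sketched as follows. This is an illustration under stated assumptions, not the transform's implementation: n-grams of length 1 to maxLength are taken over integer keys, the inverse document frequency is assumed to be log(N / df) for N documents, and the names NgramBagSketch, Ngrams, and TfIdf are hypothetical.
</para>
<code language="csharp">
<![CDATA[
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of ML.NET.
static class NgramBagSketch
{
    // Enumerates all n-grams of length 1..maxLength, encoded as strings such as "3 7".
    static IEnumerable<string> Ngrams(int[] keys, int maxLength)
    {
        for (int start = 0; start < keys.Length; start++)
            for (int len = 1; len <= maxLength && start + len <= keys.Length; len++)
                yield return string.Join(" ", keys.Skip(start).Take(len));
    }

    // Returns one TF-IDF weighted bag per document.
    // Slot i of every bag corresponds to the n-gram with dictionary id i.
    public static double[][] TfIdf(int[][] documents, int maxLength)
    {
        var dictionary = new Dictionary<string, int>();   // n-gram -> slot id
        var counts = new List<Dictionary<int, int>>();    // per-document term frequencies
        foreach (int[] doc in documents)
        {
            var docCounts = new Dictionary<int, int>();
            foreach (string gram in Ngrams(doc, maxLength))
            {
                if (!dictionary.TryGetValue(gram, out int id))
                {
                    id = dictionary.Count;
                    dictionary.Add(gram, id);
                }
                docCounts.TryGetValue(id, out int seen);
                docCounts[id] = seen + 1;
            }
            counts.Add(docCounts);
        }

        // Document frequency: the number of documents in which each slot occurs.
        var df = new int[dictionary.Count];
        foreach (var docCounts in counts)
            foreach (int id in docCounts.Keys)
                df[id]++;

        // Assumed weighting: tf * log(N / df).
        return counts.Select(bag =>
        {
            var row = new double[dictionary.Count];
            foreach (var kv in bag)
                row[kv.Key] = kv.Value * Math.Log((double)documents.Length / df[kv.Key]);
            return row;
        }).ToArray();
    }

    static void Main()
    {
        double[][] bags = TfIdf(new[] { new[] { 1, 2 }, new[] { 1, 3 } }, 2);
        foreach (double[] row in bags)
            Console.WriteLine(string.Join(" ", row.Select(v => v.ToString("F3"))));
    }
}
]]>
</code>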
</remarks>
<seealso cref="Microsoft.ML.Transforms.WordTokenizer"/>
<seealso cref="Microsoft.ML.Transforms.TextToKey"/>
<seealso cref="Microsoft.ML.Transforms.TextFeaturizer"/>
<seealso cref="Microsoft.ML.Transforms.CharacterTokenizer"/>
<example>
<code language="csharp">
pipeline.Add(new NGramTranslator("TextColumn")
{
    Weighting = NgramTransformWeightingCriteria.TfIdf
});
</code>
</example>
</member>
<member name="SentimentAnalyzer">
<summary>
Uses a pretrained sentiment model to score input strings.
</summary>
<remarks>
<para>The Sentiment transform returns the probability that the sentiment of a natural text is positive. </para>
<para>
The model was trained with the <a href="https://anthology.aclweb.org/P/P14/P14-1146.pdf">Sentiment-specific word embedding (SSWE)</a> and NGramFeaturizer on Twitter sentiment data,
similarly to the sentiment analysis part of the
<a href="https://www.microsoft.com/cognitive-services/en-us/text-analytics-api">Text Analytics cognitive service</a>.
The transform outputs a score between 0 and 1 as a sentiment prediction
(where 0 is a negative sentiment and 1 is a positive sentiment).</para>
<para>Currently it supports only English.</para>
</remarks>
</member>
<example name="SentimentAnalyzer">
<example>
<code language="csharp">
pipeline.Add(new SentimentAnalyzer()
{
    Source = "TextColumn"
});
</code>
</example>
</example>
<member name="CharacterTokenizer">
<summary>
This transform breaks text into individual tokens, each consisting of an individual character.
</summary>
<remarks>
This transform is not typically used on its own, but it is one of the transforms composing the
<see cref="Microsoft.ML.Transforms.TextFeaturizer">Text Featurizer</see>.
</remarks>
<seealso cref="Microsoft.ML.Transforms.WordTokenizer"/>
<seealso cref="Microsoft.ML.Transforms.TextToKey"/>
<seealso cref="Microsoft.ML.Transforms.NGramTranslator"/>
<seealso cref="Microsoft.ML.Transforms.TextFeaturizer"/>
<example>
<code language="csharp">
pipeline.Add(new CharacterTokenizer("TextCol1", "TextCol2"));
</code>
</example>
</member>
<member name="LightLDA">
<summary>
The LDA transform implements LightLDA, a state-of-the-art implementation of Latent Dirichlet Allocation.
</summary>
<remarks>
Latent Dirichlet Allocation is a well-known topic modeling algorithm that infers topical structure from text data,
and can be used to featurize any text fields as low-dimensional topical vectors.
<para>LightLDA is an extremely efficient implementation of LDA developed at MSR-Asia that incorporates a number of
optimization techniques. See <a href="https://arxiv.org/abs/1412.1576">LightLDA: Big Topic Models on Modest Compute Clusters</a>.
</para>
<para>
With the LDA transform, ML.NET users can train a topic model that produces 1 million topics with a 1-million-word vocabulary
on a 1-billion-token document set on a single machine in a few hours (typically, LDA at this scale takes days and requires large clusters).
The most significant innovation is a super-efficient O(1) <a href="https://en.wikipedia.org/wiki/Metropolis–Hastings_algorithm">Metropolis-Hastings sampling algorithm</a>,
whose running cost is (surprisingly) agnostic of model size,
allowing it to converge nearly an order of magnitude faster than other <a href="https://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs samplers</a>.
</para>
<para>
For more details, see the original LightLDA paper and its open-source implementation:
<list type="bullet">
<item><description><a href="https://arxiv.org/abs/1412.1576">LightLDA: Big Topic Models on Modest Compute Clusters</a></description></item>
<item><description><a href="https://github.com/Microsoft/LightLDA">LightLDA</a></description></item>
</list>
</para>
</remarks>
</member>
<example name="LightLDA">
<example>
<code language="csharp">
pipeline.Add(new LightLda(("InTextCol", "OutTextCol")));
</code>
</example>
</example>
<member name="WordEmbeddings">
<summary>
The Word Embeddings transform is a text featurizer that converts vectors of text tokens into sentence vectors using a pre-trained model.
</summary>
<remarks>
<para>WordEmbeddings wraps different embedding models, such as GloVe. Users can specify which embedding to use.
The available options are various versions of <a href="https://nlp.stanford.edu/projects/glove/">GloVe Models</a>, <a href="https://en.wikipedia.org/wiki/FastText">fastText</a>, and <a href="https://anthology.aclweb.org/P/P14/P14-1146.pdf">SSWE</a>.
</para>
<para>Note: WordEmbeddings requires a column containing a vector of text tokens, for example 'this', 'is', 'good'. To create such an input column,
set output_tokens=True on TextTransform, which converts a column with sentences like 'This is good' into 'this', 'is', 'good'.
The suffix '_TransformedText' is appended to the original column name to create the output token column. For instance, if the input column is 'body',
the output token column is named 'body_TransformedText'.</para>
<para>
License attributions for pretrained models:
<list type="bullet">
<item>
<description>
"fastText Wikipedia 300D" by Facebook, Inc. is licensed under <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA 3.0</a>, based on:
P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, <a href="https://arxiv.org/abs/1607.04606">Enriching Word Vectors with Subword Information</a>.
More information can be found <a href="https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md">here</a>.
</description>
</item>
<item>
<description>
GloVe models by Stanford University (Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014.
<a href="https://nlp.stanford.edu/pubs/glove.pdf">GloVe: Global Vectors for Word Representation</a>) are licensed under <a href="https://opendatacommons.org/licenses/pddl/1.0/">PDDL</a>.
More information can be found <a href="https://nlp.stanford.edu/projects/glove/">here</a>.
Repository can be found <a href="https://github.com/stanfordnlp/GloVe">here</a>.
</description>
</item>
</list>
</para>
</remarks>
</member>
<example name="WordEmbeddings">
<example>
<code language="csharp">
pipeline.Add(new TextFeaturizer("Words", "InputTextCol")
{
    TextCase = TextNormalizerTransformCaseNormalizationMode.Lower,
    OutputTokens = true,
    CharFeatureExtractor = null,
    WordFeatureExtractor = null
});
pipeline.Add(new WordEmbeddings(("Words_TransformedText", "OutTextCol")));
</code>
</example>
</example>
</members>
</doc>