<?xml version="1.0" encoding="utf-8"?>
<doc>
<members>
<member name="CategoricalHashOneHotVectorizer">
<summary>
Encodes the categorical variable with hash-based encoding.
</summary>
<remarks>
CategoricalHashOneHotVectorizer converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag.
If the input column is a vector, a single indicator bag is returned for it.
</remarks>
</member>
<example name="CategoricalHashOneHotVectorizer">
<example>
<code language="csharp">
pipeline.Add(new CategoricalHashOneHotVectorizer("Text1")
{
HashBits = 10,
Seed = 314489979,
OutputKind = CategoricalTransformOutputKind.Bag
});
</code>
</example>
</example>
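<example name="CategoricalHashOneHotVectorizerSketch">
<example>
A minimal, self-contained sketch of the hashing idea described above: a value is hashed, and the low HashBits bits of the hash
select the slot that is set in the indicator array. It does not call the ML.NET API. The FNV-1a hash, the HashOneHot helper,
and the way the seed is mixed in are assumptions made for illustration; the transform's actual hash function may differ.
<code language="csharp"><![CDATA[
using System;

public static class HashOneHotSketch
{
    // Hypothetical helper: 32-bit FNV-1a over the UTF-16 code units of the value.
    // The real transform may use a different hash function.
    private static uint Fnv1a(string value, uint seed)
    {
        uint hash = 2166136261u ^ seed;
        foreach (char c in value)
        {
            hash ^= c;
            hash *= 16777619u; // wraps on overflow (unchecked arithmetic)
        }
        return hash;
    }

    // Maps a value to an indicator array with 2^hashBits slots.
    public static float[] HashOneHot(string value, int hashBits, uint seed)
    {
        int slots = 1 << hashBits;
        var indicator = new float[slots];
        uint index = Fnv1a(value, seed) & (uint)(slots - 1); // keep the low hashBits bits
        indicator[index] = 1f;
        return indicator;
    }

    public static void Main()
    {
        float[] encoded = HashOneHot("red", hashBits: 10, seed: 314489979);
        Console.WriteLine(encoded.Length);                  // 1024 slots
        Console.WriteLine(Array.IndexOf(encoded, 1f) >= 0); // one slot is set
    }
}
]]></code>
Because distinct values can hash to the same slot, the encoding is lossy; a larger HashBits value reduces collisions at the cost of a wider vector.
</example>
</example>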
<member name="CategoricalOneHotVectorizer">
<summary>
Converts a categorical value into an indicator array by building a dictionary of categories from the data and using each category's ID in the dictionary as the index in the array.
</summary>
<remarks>
<para>
The CategoricalOneHotVectorizer transform passes through a data set, operating on text columns, to
build a dictionary of categories.
For each row, the entire text string appearing in the input column is defined as a category.</para>
<para>The output of this transform is an indicator vector.
Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary.</para>
<para>The CategoricalOneHotVectorizer can be applied to one or more columns, in which case it builds and uses a separate dictionary
for each column that it is applied to.</para>
<para>The <see cref="T:Microsoft.ML.Transforms.CategoricalTransformOutputKind"/> setting determines the form of the output.
The Key option produces integer values in KeyType columns.
The key value is the one-based index of the slot that would be set under the Ind or Bag options.
If a value is not found in the dictionary, the Key option assigns it the value zero.
Under the <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Ind"/> and <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Bag"/> options, a value that is not found results in an all-zero bit vector.
<see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Ind"/> and <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Bag"/> differ only in how the bit vectors generated from individual slots are aggregated:
for Ind they are concatenated, and for Bag they are added.
When the source column is a singleton, the Ind and Bag options are identical.</para>
</remarks>
</member>
<example name="CategoricalOneHotVectorizer">
<example>
An example of how to add the CategoricalOneHotVectorizer transform to a pipeline for two text feature
columns named "Text1" and "Text2".
<code language="csharp">
pipeline.Add(new CategoricalOneHotVectorizer("Text1", "Text2"));
</code>
</example>
</example>
<member name="CountFeatureSelection">
<summary>
Selects the slots for which the count of non-default values is greater than or equal to a threshold.
</summary>
<remarks>
<para>
This transform uses a set of aggregators to count the number of non-default values for each slot and
instantiates a <see cref="SlotsDroppingTransformer"/> to actually drop the slots.
This transform is useful when applied together with a <see cref="T:Microsoft.ML.Transforms.OneHotHashEncodingTransformer"/>.
The count feature selection can remove those features generated by the hash transform that have no data in the examples.
</para>
</remarks>
</member>
<example name="CountFeatureSelection">
<example>
<code language="csharp">
pipeline.Add(new FeatureSelectorByCount
{
Column = new[]{ "Feature1" },
Count = 2
});
</code>
</example>
</example>
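<example name="CountFeatureSelectionSketch">
<example>
A minimal sketch of the selection rule described above, written without the ML.NET API: count the non-default values in each slot
and keep the slots whose count reaches the threshold. The SelectSlots helper is hypothetical, and treating zero as the default value
is an assumption that fits numeric columns.
<code language="csharp"><![CDATA[
using System;
using System.Linq;

public static class CountFeatureSelectionSketch
{
    // Returns the indices of slots whose number of non-default (here: non-zero)
    // values across all rows is at least 'count'.
    // Assumption: every row has the same number of slots.
    public static int[] SelectSlots(double[][] rows, int count)
    {
        int slotCount = rows.Length == 0 ? 0 : rows[0].Length;
        return Enumerable.Range(0, slotCount)
            .Where(slot => rows.Count(row => row[slot] != 0.0) >= count)
            .ToArray();
    }

    public static void Main()
    {
        var rows = new[]
        {
            new[] { 1.0, 0.0, 5.0 },
            new[] { 2.0, 0.0, 0.0 },
            new[] { 0.0, 0.0, 7.0 },
        };
        // Slot 1 never holds a non-zero value, so it is dropped.
        Console.WriteLine(string.Join(", ", SelectSlots(rows, count: 2))); // 0, 2
    }
}
]]></code>
</example>
</example>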
<member name="MutualInformationFeatureSelection">
<summary>
Selects the top k slots across all specified columns ordered by their mutual information with the label column.
</summary>
<remarks>
<para>
The mutual information of two random variables X and Y is a measure of the mutual dependence between the variables.
Formally, the mutual information can be written as:
</para>
<para>I(X;Y) = E[log(p(x,y)) - log(p(x)) - log(p(y))]</para>
<para>where the expectation is taken over the joint distribution of X and Y.
Here p(x,y) is the joint probability density function of X and Y, p(x) and p(y) are the marginal probability density functions of X and Y respectively.
In general, a higher mutual information between the dependent variable (or label) and an independent variable (or feature) means
that the label depends more strongly on that feature.
The transform keeps the top SlotsInOutput features with the largest mutual information with the label.
</para>
</remarks>
</member>
<example name="MutualInformationFeatureSelection">
<example>
<code language="csharp">
pipeline.Add(new FeatureSelectorByMutualInformation
{
Column = new[]{ "Feature1" },
SlotsInOutput = 6
});
</code>
</example>
</example>
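<example name="MutualInformationSketch">
<example>
A minimal sketch of the formula above for discrete variables, where the expectation becomes a sum over observed (x, y) pairs
and the probabilities are estimated from empirical frequencies. The Estimate and Increment helpers are hypothetical and use natural
logarithms; how the transform itself handles continuous features is not shown here.
<code language="csharp"><![CDATA[
using System;
using System.Collections.Generic;

public static class MutualInformationSketch
{
    private static void Increment<TKey>(Dictionary<TKey, int> counts, TKey key)
    {
        counts.TryGetValue(key, out int current); // current stays 0 when the key is absent
        counts[key] = current + 1;
    }

    // Estimates I(X;Y) = sum over (x,y) of p(x,y) * (log p(x,y) - log p(x) - log p(y)).
    public static double Estimate(int[] x, int[] y)
    {
        if (x.Length != y.Length)
            throw new ArgumentException("x and y must have the same length.");

        var joint = new Dictionary<(int, int), int>();
        var countX = new Dictionary<int, int>();
        var countY = new Dictionary<int, int>();
        for (int i = 0; i < x.Length; i++)
        {
            Increment(joint, (x[i], y[i]));
            Increment(countX, x[i]);
            Increment(countY, y[i]);
        }

        double n = x.Length;
        double mi = 0.0;
        foreach (var pair in joint)
        {
            double pxy = pair.Value / n;
            double px = countX[pair.Key.Item1] / n;
            double py = countY[pair.Key.Item2] / n;
            mi += pxy * (Math.Log(pxy) - Math.Log(px) - Math.Log(py));
        }
        return mi;
    }

    public static void Main()
    {
        int[] label = { 0, 0, 1, 1 };
        // A feature identical to the label: the estimate is close to ln 2.
        Console.WriteLine(Estimate(new[] { 0, 0, 1, 1 }, label));
        // A feature independent of the label: the estimate is close to 0.
        Console.WriteLine(Estimate(new[] { 0, 1, 0, 1 }, label));
    }
}
]]></code>
Ranking features by this estimate and keeping the k largest corresponds to the selection rule described in the remarks.
</example>
</example>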
<member name="OptionalColumnTransform">
<summary>
Creates a new column with the specified type and default values.
</summary>
<remarks>
If the user wishes to create additional columns with a particular type and default values, or to replicate the values from one column to another while changing their type, they can do so using this transform.
This transform can be used as a workaround to create a Label column after deserializing a model, for prediction.
Some transforms in the serialized model operate on the Label column, and would throw errors during prediction if such a column is not found.
</remarks>
</member>
<example name="OptionalColumnTransform">
<example>
<code language="csharp">
pipeline.Add(new OptionalColumnCreator
{
Column = new[]{ "OptColumn"}
});
</code>
</example>
</example>
<member name="HashJoin">
<summary>
Converts multiple column values into hashes.
This transform accepts both numeric and text inputs, both single and vector-valued columns.
</summary>
<remarks>
This transform can be helpful for ranking and cross-validation. In ranking, the GroupIdColumn column is required and must be of a key type;
you can use the <see cref="T:Microsoft.ML.Transforms.CategoricalHashOneHotVectorizer" /> to hash the text value of a single GroupID column into a key value.
If the GroupID is a combination of the values from multiple columns, you can use the HashConverter to hash multiple text columns into one key column.
The same approach applies to the CrossValidator and its StratificationColumn.
</remarks>
</member>
<example name="HashJoin">
<example>
<code language="csharp">
pipeline.Add(new HashConverter("Column1", "Column2"));
</code>
</example>
</example>
<member name="NADrop">
<summary>
Removes missing values from vector type columns.
</summary>
</member>
<example name="NADrop">
<example>
<code language="csharp">
pipeline.Add(new MissingValuesDropper("Column1"));
</code>
</example>
</example>
<member name="NAIndicator">
<summary>
This transform operates on either scalars or vectors (both fixed and variable size),
creating output columns of true/false values that indicate whether the corresponding input value is missing.
</summary>
</member>
<example name="NAIndicator">
<example>
<code language="csharp">
pipeline.Add(new MissingValueIndicator("Column1"));
</code>
</example>
</example>
<member name="NAReplace">
<summary>
Creates an output column of the same type and size as the input column,
where missing values are replaced with either the default value or the mean/min/max value (for non-text columns only).
</summary>
<remarks>
This transform can transform either scalars or vectors (both fixed and variable size),
creating output columns that are identical to the input columns except for replacing NA values
with either the default value, user input, or imputed values (min/max/mean are currently supported).
Imputation modes are supported for vectors both by slot and across all slots.
</remarks>
</member>
<example name="NAReplace">
<example>
<code language="csharp">
pipeline.Add(new MissingValueSubstitutor("FeatureCol")
{
ReplacementKind = NAReplaceTransformReplacementKind.Mean
});
</code>
</example>
</example>
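<example name="NAReplaceSketch">
<example>
A minimal sketch of mean imputation on a single numeric column, written without the ML.NET API. The ReplaceWithMean helper is
hypothetical, missing values are represented as NaN, and falling back to zero when every value is missing is an assumption.
<code language="csharp"><![CDATA[
using System;
using System.Linq;

public static class NaReplaceSketch
{
    // Replaces NaN entries with the mean of the non-missing entries.
    public static double[] ReplaceWithMean(double[] column)
    {
        double[] present = column.Where(v => !double.IsNaN(v)).ToArray();
        // Assumption: when every value is missing, fall back to default(double), i.e. 0.
        double mean = present.Length > 0 ? present.Average() : 0.0;
        return column.Select(v => double.IsNaN(v) ? mean : v).ToArray();
    }

    public static void Main()
    {
        double[] filled = ReplaceWithMean(new[] { 1.0, double.NaN, 3.0 });
        Console.WriteLine(string.Join(", ", filled)); // 1, 2, 3
    }
}
]]></code>
</example>
</example>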
<member name="LpNormalize">
<summary>
The LpNormalizer transform normalizes vectors (rows) individually by rescaling them to unit norm (L2, L1, or LInf).
<para>Performs the following operation on a vector X:</para>
<para>Y = (X - M) / D</para>
<para>where M is the mean and D is the L2 norm, L1 norm, or LInf norm.</para>
</summary>
<remarks>
Scaling inputs to unit norms is a common operation for text classification or clustering.
For more information see: <a href="https://www.cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf">An Analysis of Single-Layer Networks in Unsupervised Feature Learning</a>
</remarks>
<seealso cref="Microsoft.ML.Transforms.Projections.GlobalContrastNormalizingEstimator"></seealso>
<example>
<code language="csharp">
pipeline.Add(new LpNormalizer("FeatureCol")
{
NormKind = LpNormNormalizerTransformNormalizerKind.L1Norm
});
</code>
</example>
</member>
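<example name="LpNormalizeSketch">
<example>
A minimal sketch of the operation Y = (X - M) / D on a single vector, written without the ML.NET API. The Normalize helper and
NormKind enum are hypothetical. Computing the norm on the centered vector, making mean subtraction optional, and returning an
all-zero vector unchanged are assumptions made for illustration.
<code language="csharp"><![CDATA[
using System;
using System.Linq;

public static class LpNormalizeSketch
{
    public enum NormKind { L1, L2, LInf }

    // Computes Y = (X - M) / D, where M is the mean of X (subtracted only when
    // subtractMean is true) and D is the chosen norm of the centered vector.
    public static double[] Normalize(double[] x, NormKind kind, bool subtractMean = false)
    {
        if (x.Length == 0) return x;

        double mean = subtractMean ? x.Average() : 0.0;
        double[] centered = x.Select(v => v - mean).ToArray();

        double norm;
        switch (kind)
        {
            case NormKind.L1: norm = centered.Sum(v => Math.Abs(v)); break;
            case NormKind.L2: norm = Math.Sqrt(centered.Sum(v => v * v)); break;
            default: norm = centered.Max(v => Math.Abs(v)); break; // LInf
        }

        // Assumption: avoid dividing by zero for an all-zero vector.
        if (norm == 0.0) return centered;
        return centered.Select(v => v / norm).ToArray();
    }

    public static void Main()
    {
        // The L2 norm of (3, 4) is 5, so the result has unit L2 norm.
        double[] y = Normalize(new[] { 3.0, 4.0 }, NormKind.L2);
        Console.WriteLine(string.Join(", ", y));
    }
}
]]></code>
</example>
</example>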
<member name="GcNormalize">
<summary>
<para>Performs a global contrast normalization on input values:</para>
<para>Y = (s * X - M) / D</para>
<para>where s is a scale, M is the mean, and D is either the L2 norm or the standard deviation.</para>
</summary>
<remarks>
Scaling inputs to unit norms is a common operation for text classification or clustering.
For more information see:
<a href="https://www.cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf">An Analysis of Single-Layer Networks in Unsupervised Feature Learning</a>
</remarks>
<seealso cref="Microsoft.ML.Transforms.Projections.LpNormalizingEstimator"></seealso>
<example>
<code language="csharp">
pipeline.Add(new GlobalContrastNormalizer("FeatureCol")
{
SubMean = false
});
</code>
</example>
</member>
<member name="Ungroup">
<summary>
Ungroups vector columns into sequences of rows. This is the inverse of the Group transform.
</summary>
<remarks>
<para>This can be thought of as an inverse of the <see cref="T:Microsoft.ML.Transforms.CombinerByContiguousGroupId"/>.
For all specified vector columns ("pivot" columns), performs the "ungroup" (or "unroll") operation as outlined below.
</para>
<para>If the only pivot column is called P, and has size K, then for every row of the input we will produce
K rows, that are identical in all columns except P. The column P will become a scalar column, and this
column will hold all the original values of input's P, one value per row, in order. The order of columns
will remain the same.
</para>
<para>Variable-length pivot columns are supported (including zero, which will eliminate the row from the result).</para>
<para>Multiple pivot columns are also supported:</para>
<list type="bullet">
<item><description>The number of output rows is controlled by the 'mode' parameter.
<list type="bullet">
<item><term>outer</term><description> it is equal to the maximum length of pivot columns</description></item>
<item><term>inner</term><description> it is equal to the minimum length of pivot columns</description></item>
<item><term>first</term><description> it is equal to the length of the first pivot column</description></item>
</list>
</description>
</item>
<item><description>
If a particular pivot column has a size that differs from the number of output rows, the extra slots will
be ignored, and the missing slots will be 'padded' with default values.
</description></item>
</list>
<para>All metadata are preserved for the retained columns. For 'unrolled' columns, all known metadata
except slot names are preserved.
</para>
</remarks>
</member>
<example name="Ungroup">
<example>
<code language="csharp">
pipeline.Add(new Segregator
{
Column = new[]{"Column1" },
Mode = UngroupTransformUngroupMode.First
});
</code>
</example>
</example>
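<example name="UngroupSketch">
<example>
A minimal sketch of how the three modes described above determine the number of output rows for one input row with several
pivot columns, and how short columns are padded. It does not call the ML.NET API. The UngroupRow helper and Mode enum are
hypothetical, and the sketch assumes integer pivot columns and at least one pivot column.
<code language="csharp"><![CDATA[
using System;
using System.Collections.Generic;
using System.Linq;

public static class UngroupSketch
{
    public enum Mode { Inner, Outer, First }

    // Unrolls one input row whose pivot columns are the given vectors.
    // Returns one int[] per output row, holding one scalar per pivot column.
    public static List<int[]> UngroupRow(Mode mode, params int[][] pivots)
    {
        int rows;
        switch (mode)
        {
            case Mode.Inner: rows = pivots.Min(p => p.Length); break;
            case Mode.Outer: rows = pivots.Max(p => p.Length); break;
            default: rows = pivots[0].Length; break; // First
        }

        var output = new List<int[]>();
        for (int r = 0; r < rows; r++)
        {
            // Slots beyond 'rows' are ignored; missing slots are padded with default(int).
            output.Add(pivots.Select(p => r < p.Length ? p[r] : default(int)).ToArray());
        }
        return output;
    }

    public static void Main()
    {
        var a = new[] { 1, 2, 3 };
        var b = new[] { 10, 20 };
        foreach (Mode mode in new[] { Mode.Inner, Mode.Outer, Mode.First })
        {
            var rows = UngroupRow(mode, a, b);
            Console.WriteLine(mode + ": " + string.Join(" | ", rows.Select(r => string.Join(",", r))));
        }
        // Inner: 1,10 | 2,20
        // Outer: 1,10 | 2,20 | 3,0
        // First: 1,10 | 2,20 | 3,0
    }
}
]]></code>
</example>
</example>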
<member name="KeyToText">
<summary>
Helps retrieve the original values from a key column.
</summary>
<remarks>
The KeyToTextConverter is the complement of the <see cref="TextToKeyConverter"/> transform.
Since key values are an enumeration into the set of keys, most transforms that produce key-valued outputs
corresponding to input values will, wherever possible, associate a piece of KeyValue metadata with that dataset.
Transforming values into a categorical variable would be of limited use,
if we couldn't somehow backtrack to figure out what those categories actually mean.
The KeyToTextConverter enables that functionality.
</remarks>
<seealso cref="Microsoft.ML.Transforms.HashConverter"/>
<seealso cref="Microsoft.ML.Transforms.TextToKeyConverter"/>
<example>
<code language="csharp">
pipeline.Add(new KeyToTextConverter(("InColumn", "OutColumn" )));
</code>
</example>
</member>
<member name="Group">
<summary>
Groups values of a scalar column into a vector, by a contiguous group ID.
</summary>
<remarks>
The CombinerByContiguousGroupId transform groups the consecutive rows that share the specified group key (or keys).
Both group keys and the aggregated values can be of arbitrary non-vector types.
The resulting data will have all the group key columns preserved,
and the aggregated columns will become variable-length vectors of the original types.
<para>This transform essentially performs the following SQL-like operation:</para>
<para>SELECT GroupKey1, GroupKey2, ... GroupKeyK, LIST(Value1), LIST(Value2), ... LIST(ValueN)</para>
<para>FROM Data</para>
<para>GROUP BY GroupKey1, GroupKey2, ... GroupKeyK.</para>
</remarks>
<seealso cref="Microsoft.ML.Transforms.Segregator"/>
<example>
<code language="csharp">
pipeline.Add(new CombinerByContiguousGroupId
{
GroupKey = new []{"Key1", "Key2" }
});
</code>
</example>
</member>
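<example name="GroupSketch">
<example>
A minimal sketch of grouping by a contiguous key, written without the ML.NET API. The GroupContiguous helper is hypothetical and
uses a single string key with integer values. Unlike a SQL GROUP BY, only consecutive rows with the same key are merged,
so a key that reappears later starts a new group.
<code language="csharp"><![CDATA[
using System;
using System.Collections.Generic;

public static class GroupSketch
{
    // Groups consecutive rows that share the same key into one output row
    // whose value column is a variable-length list.
    public static List<(string Key, List<int> Values)> GroupContiguous(
        IEnumerable<(string Key, int Value)> rows)
    {
        var result = new List<(string Key, List<int> Values)>();
        foreach (var row in rows)
        {
            // Start a new group whenever the key differs from the previous row's key.
            if (result.Count == 0 || result[result.Count - 1].Key != row.Key)
                result.Add((row.Key, new List<int>()));
            result[result.Count - 1].Values.Add(row.Value);
        }
        return result;
    }

    public static void Main()
    {
        var rows = new[] { ("a", 1), ("a", 2), ("b", 3), ("a", 4) };
        foreach (var group in GroupContiguous(rows))
            Console.WriteLine(group.Key + ": [" + string.Join(", ", group.Values) + "]");
        // a: [1, 2]
        // b: [3]
        // a: [4]
    }
}
]]></code>
</example>
</example>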
<member name="Whitening">
<summary>
Implements PCA (Principal Component Analysis) and ZCA (Zero phase Component Analysis) whitening.
The whitening process consists of two steps:
1. Decorrelating the input data. The input data is assumed to have zero mean.
2. Rescaling the decorrelated features to have unit variance.
That is, PCA whitening is essentially PCA followed by a rescale.
ZCA whitening tries to make the resulting data look more like the input data by rotating it back to the
original input space.
More information: <a href="http://ufldl.stanford.edu/wiki/index.php/Whitening">http://ufldl.stanford.edu/wiki/index.php/Whitening</a>
</summary>
</member>
</members>
</doc>