<?xml version="1.0" encoding="utf-8"?>
<doc>
<members>
<member name="FastTree">
<summary>
Trains gradient boosted decision trees to the LambdaRank quasi-gradient.
</summary>
<remarks>
<para>
FastTree is an efficient implementation of the <a href='https://arxiv.org/abs/1505.01866'>MART</a> gradient boosting algorithm.
Gradient boosting is a machine learning technique for regression problems.
It builds each regression tree in a step-wise fashion, using a predefined loss function to measure the error at each step and correct for it in the next.
The resulting prediction model is therefore an ensemble of weaker prediction models. In regression problems, boosting builds a series of such trees in a step-wise fashion and then selects the optimal tree using an arbitrary differentiable loss function.
</para>
<para>
MART learns an ensemble of regression trees. A regression tree is a decision tree with scalar values in its leaves.
A decision (or regression) tree is a binary tree-like flow chart, where at each interior node one decides which of the two child nodes to continue to based on one of the feature values from the input.
At each leaf node, a value is returned. In the interior nodes, the decision is based on the test 'x &lt;= v', where x is the value of the feature in the input sample and v is one of the possible values of this feature.
The functions that can be produced by a regression tree are all the piece-wise constant functions.
</para>
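<para>
As a minimal sketch of the tree evaluation described above, the following walks from the root to a leaf using the 'x &lt;= v' test. The TreeNode type and the RegressionTreeSketch helper are hypothetical names introduced for illustration and are not part of the library.
</para>
<code language="csharp"><![CDATA[
using System;

// Hypothetical type for illustration only; this is not a library class.
public sealed class TreeNode
{
    public int FeatureIndex;   // feature tested at an interior node
    public double Threshold;   // the value v in the test 'x <= v'
    public TreeNode Left;      // followed when x <= v
    public TreeNode Right;     // followed when x > v
    public double LeafValue;   // scalar returned when the node is a leaf

    public bool IsLeaf => Left == null && Right == null;
}

public static class RegressionTreeSketch
{
    // Walks from the root to a leaf and returns the leaf's scalar value.
    public static double Evaluate(TreeNode root, double[] features)
    {
        TreeNode node = root;
        while (!node.IsLeaf)
        {
            node = features[node.FeatureIndex] <= node.Threshold ? node.Left : node.Right;
        }
        return node.LeafValue;
    }

    // A depth-one tree: 1.0 when feature 0 is at most 2.5, otherwise 3.0.
    public static TreeNode BuildStump()
    {
        return new TreeNode
        {
            FeatureIndex = 0,
            Threshold = 2.5,
            Left = new TreeNode { LeafValue = 1.0 },
            Right = new TreeNode { LeafValue = 3.0 }
        };
    }
}
]]></code>
<para>
Because every input ends in exactly one leaf, the function computed this way is piece-wise constant over the feature space.
</para>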
<para>
The ensemble of trees is produced by computing, in each step, a regression tree that approximates the gradient of the loss function, and adding it to the previous tree with coefficients that minimize the loss of the new tree.
The output of the ensemble produced by MART on a given instance is the sum of the tree outputs.
</para>
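<para>
The step-wise construction can be sketched for squared loss, where the negative gradient is simply the residual. This is an illustrative simplification and not the FastTree implementation: it assumes a single feature, uses depth-one trees, and replaces the loss-minimizing coefficients with a fixed learning rate. BoostingSketch, FitStump and Boost are hypothetical names.
</para>
<code language="csharp"><![CDATA[
using System;
using System.Linq;

public static class BoostingSketch
{
    // Least-squares fit of a depth-one tree on one feature.
    // Candidate thresholds are the observed feature values.
    public static (double Threshold, double Left, double Right) FitStump(double[] x, double[] target)
    {
        if (x.Length == 0 || x.Length != target.Length)
            throw new ArgumentException("Expected one target per feature value.");

        // Fallback when no split separates the data: predict the mean on both sides.
        var best = (Threshold: x[0], Left: target.Average(), Right: target.Average());
        double bestError = double.MaxValue;
        foreach (double v in x.Distinct())
        {
            var left = Enumerable.Range(0, x.Length).Where(i => x[i] <= v).Select(i => target[i]).ToArray();
            var right = Enumerable.Range(0, x.Length).Where(i => x[i] > v).Select(i => target[i]).ToArray();
            if (right.Length == 0) continue; // not a real split
            double l = left.Average(), r = right.Average();
            double error = left.Sum(t => (t - l) * (t - l)) + right.Sum(t => (t - r) * (t - r));
            if (error < bestError) { bestError = error; best = (v, l, r); }
        }
        return best;
    }

    // Each round fits a stump to the residuals and adds its scaled output to the ensemble.
    // Returns the ensemble's predictions on the training inputs.
    public static double[] Boost(double[] x, double[] y, int rounds, double learningRate)
    {
        var prediction = new double[x.Length];
        for (int round = 0; round < rounds; round++)
        {
            // Negative gradient of 0.5 * (y - f)^2 with respect to f is the residual y - f.
            var residual = y.Zip(prediction, (t, p) => t - p).ToArray();
            var stump = FitStump(x, residual);
            for (int i = 0; i < x.Length; i++)
                prediction[i] += learningRate * (x[i] <= stump.Threshold ? stump.Left : stump.Right);
        }
        return prediction;
    }
}
]]></code>
<para>
With a learning rate below 1, each round removes only part of the remaining residual, which is why boosted ensembles typically use many small steps.
</para>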
<list type='bullet'>
<item><description>In the case of a binary classification problem, the output is converted to a probability by using some form of calibration.</description></item>
<item><description>In the case of a regression problem, the output is the predicted value of the function.</description></item>
<item><description>In the case of a ranking problem, the instances are ordered by the output value of the ensemble.</description></item>
</list>
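<para>
The three uses of the ensemble output listed above can be sketched as follows. The logistic calibration shown here is an assumption chosen for illustration, since the text only says 'some form of calibration', and its slope and offset are placeholders rather than fitted values. EnsembleOutputSketch and its members are hypothetical names.
</para>
<code language="csharp"><![CDATA[
using System;
using System.Linq;

public static class EnsembleOutputSketch
{
    // Raw ensemble score: the sum of the individual tree outputs.
    // For regression, this sum is the predicted value.
    public static double Score(double[] treeOutputs) => treeOutputs.Sum();

    // Binary classification: map the score to a probability with a logistic function.
    // Assumption: slope and offset would normally be fitted on held-out data.
    public static double Probability(double score, double slope = 1.0, double offset = 0.0)
        => 1.0 / (1.0 + Math.Exp(-(slope * score + offset)));

    // Ranking: indices of the instances ordered from highest to lowest score.
    public static int[] RankOrder(double[] scores)
        => Enumerable.Range(0, scores.Length).OrderByDescending(i => scores[i]).ToArray();
}
]]></code>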
<para>For more information see:</para>
<list type="bullet">
<item><description><a href='https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting'>Wikipedia: Gradient boosting (Gradient tree boosting).</a></description></item>
<item><description><a href='https://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aos/1013203451'>Greedy function approximation: A gradient boosting machine.</a></description></item>
</list>
</remarks>
</member>
<example name='FastTreeRanker'>
<example>
<code language="csharp">
new FastTreeRanker
{
SortingAlgorithm = "DescendingReverse",
OptimizationAlgorithm = BoostedTreeArgsOptimizationAlgorithmType.AcceleratedGradientDescent
}
</code>
</example>
</example>
<example name='FastTreeRegressor'>
<example>
<code language="csharp">
new FastTreeRegressor
{
NumTrees = 200,
EarlyStoppingRule = new GLEarlyStoppingCriterion(),
LearningRates = 0.4f,
DropoutRate = 0.05f
}
</code>
</example>
</example>
<example name='FastTreeBinaryClassifier'>
<example>
<code language="csharp">
new FastTreeBinaryClassifier
{
NumTrees = 100,
EarlyStoppingRule = new PQEarlyStoppingCriterion(),
LearningRates = 0.4f,
DropoutRate = 0.05f
}
</code>
</example>
</example>
<member name="FastForest">
<summary>
Trains a random forest to fit target values using least-squares.
</summary>
<remarks>
Decision trees are non-parametric models that perform a sequence of simple tests on inputs.
This decision procedure maps them to outputs found in the training dataset whose inputs were similar to the instance being processed.
A decision is made at each node of the binary tree data structure based on a measure of similarity that maps each instance recursively through the branches of the tree until the appropriate leaf node is reached and the output decision returned.
<para>Decision trees have several advantages:</para>
<list type='bullet'>
<item><description>They are efficient in both computation and memory usage during training and prediction. </description></item>
<item><description>They can represent non-linear decision boundaries.</description></item>
<item><description>They perform integrated feature selection and classification. </description></item>
<item><description>They are resilient in the presence of noisy features.</description></item>
</list>
<para>Fast forest is a random forest implementation.
The model consists of an ensemble of decision trees. Each tree in a decision forest outputs a Gaussian distribution by way of prediction.
An aggregation is performed over the ensemble of trees to find a Gaussian distribution closest to the combined distribution for all trees in the model.</para>
<para>Generally, ensemble models provide better coverage and accuracy than single decision trees.</para>
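<para>
One way to read the aggregation step is moment matching: if the combined distribution is taken to be an equally weighted mixture of the per-tree Gaussians, the single Gaussian closest to it in the Kullback-Leibler sense has the mixture's mean and variance. Whether FastForest uses exactly this weighting and this notion of closeness is an assumption of the sketch, and ForestAggregationSketch is a hypothetical name.
</para>
<code language="csharp"><![CDATA[
using System;
using System.Linq;

public static class ForestAggregationSketch
{
    // Moment-matches an equally weighted mixture of per-tree Gaussians.
    // means[i] and variances[i] describe the Gaussian predicted by tree i.
    public static (double Mean, double Variance) Aggregate(double[] means, double[] variances)
    {
        if (means.Length == 0 || means.Length != variances.Length)
            throw new ArgumentException("Expected one mean and one variance per tree.");

        double mean = means.Average();

        // Second moment of the mixture: average of (variance + mean^2) over the trees.
        double secondMoment = 0.0;
        for (int i = 0; i < means.Length; i++)
            secondMoment += variances[i] + means[i] * means[i];
        secondMoment /= means.Length;

        return (mean, secondMoment - mean * mean);
    }
}
]]></code>
<para>
The aggregated variance grows when the trees disagree on their means, so it reflects both the spread inside each tree's prediction and the spread between trees.
</para>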
<para>For more information see:</para>
<list type='bullet'>
<item><description><a href='https://en.wikipedia.org/wiki/Random_forest'>Wikipedia: Random forest</a></description></item>
<item><description><a href='http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf'>Quantile regression forest</a></description></item>
<item><description><a href='https://blogs.technet.microsoft.com/machinelearning/2014/09/10/from-stumps-to-trees-to-forests/'>From Stumps to Trees to Forests</a></description></item>
</list>
</remarks>
</member>
<example name='FastForestBinaryClassifier'>
<example>
<code language="csharp">
new FastForestBinaryClassifier
{
NumTrees = 100,
NumLeaves = 50,
Calibrator = new FixedPlattCalibratorCalibratorTrainer()
}
</code>
</example>
</example>
<example name='FastForestRegressor'>
<example>
<code language="csharp">
new FastForestRegressor
{
NumTrees = 100,
NumLeaves = 50,
NumThreads = 5,
EntropyCoefficient = 0.3
}
</code>
</example>
</example>
<member name="FastTreeTweedieRegression">
<summary>
Trains gradient boosted decision trees to fit target values using a Tweedie loss function.
This learner is a generalization of Poisson, compound Poisson, and gamma regression.
</summary>
<remarks>
The Tweedie boosting model follows the mathematics established in <a href="https://arxiv.org/pdf/1508.06378.pdf">
Insurance Premium Prediction via Gradient Tree-Boosted Tweedie Compound Poisson Models</a> by Yang, Qian, and Zou.
<para>For an introduction to Gradient Boosting, and more information, see:</para>
<para><a href='https://en.wikipedia.org/wiki/Gradient_boosting#Gradient_tree_boosting'>Wikipedia: Gradient boosting (Gradient tree boosting)</a></para>
<para><a href='https://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aos/1013203451'>Greedy function approximation: A gradient boosting machine</a></para>
</remarks>
</member>
<member name="TreeEnsembleFeaturizerTransform">
<summary>
Trains a tree ensemble, or loads it from a file, then maps a numeric feature vector to outputs.
</summary>
<remarks>
In machine learning, a common and powerful approach is to use an already trained model when defining features.
<para>One such example is the use of a model's scores as features for downstream models. For example, we might run clustering on the original features,
and use the cluster distances as the new feature set.
Instead of consuming the model's output, we could go deeper, and extract the 'intermediate outputs' that are used to produce the final score. </para>
There are a number of famous or popular examples of this technique:
<list type='bullet'>
<item><description>A deep neural net trained on the ImageNet dataset, with the last layer removed, is commonly used to compute the 'projection' of the image into the 'semantic feature space'.
It is observed that the Euclidean distance in this space often correlates with the 'semantic similarity': that is, all pictures of pizza are located close together,
and far away from pictures of kittens. </description></item>
<item><description>A matrix factorization and/or LDA model is also often used to extract the 'latent topics' or 'latent features' associated with users and items.</description></item>
<item><description>The weights of the linear model are often used as a crude indicator of 'feature importance'. At the very minimum, the 0-weight features are not needed by the model,
and there's no reason to compute them. </description></item>
</list>
<para>
Tree featurizer uses the decision tree ensembles for feature engineering in the same fashion as above.
It trains a tree ensemble, or loads it from a file, then maps a numeric feature vector to three outputs:
</para>
<list type='number'>
<item><description>A vector containing the individual tree outputs of the tree ensemble.</description></item>
<item><description>A vector indicating the leaves that the feature vector falls on in the tree ensemble.</description></item>
<item><description>A vector indicating the paths that the feature vector falls on in the tree ensemble.</description></item>
</list>
If both a model file and a trainer are specified, the model file is used. If neither is specified,
a default FastTree model is trained.
This can handle key labels by training a regression model towards their optionally permuted indices.
<para>Let's assume that we've built a tree ensemble of 100 trees with 100 leaves each (it doesn't matter whether boosting was used or not in training).
If we associate each leaf of each tree with a sequential integer, we can, for every incoming example x,
produce an indicator vector L(x), where Li(x) = 1 if the example x 'falls' into the leaf #i, and 0 otherwise.</para>
<para>Thus, for every example x, we produce a 10000-valued vector L, with exactly 100 1s and the rest zeroes.
This 'leaf indicator' vector can be considered the ensemble-induced 'footprint' of the example.</para>
<para>The 'distance' between two examples in the L-space is a Hamming distance: each tree that places the two examples in different leaves contributes two differing positions, so the distance equals twice the number of trees that distinguish the two examples.</para>
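<para>
The leaf indicator vector and its Hamming distance can be sketched as below. The sketch assumes the leaf reached in each tree is already known and that every tree has the same number of leaves, as in the 100-tree example. LeafFootprintSketch and its members are hypothetical names and not the featurizer's API.
</para>
<code language="csharp"><![CDATA[
using System;

public static class LeafFootprintSketch
{
    // Builds L(x) from the index of the leaf reached in each tree.
    // Leaves are numbered sequentially: tree t, leaf j maps to position t * leavesPerTree + j.
    public static int[] LeafIndicator(int[] leafPerTree, int leavesPerTree)
    {
        var indicator = new int[leafPerTree.Length * leavesPerTree];
        for (int tree = 0; tree < leafPerTree.Length; tree++)
        {
            int leaf = leafPerTree[tree];
            if (leaf < 0 || leaf >= leavesPerTree)
                throw new ArgumentOutOfRangeException(nameof(leafPerTree));
            indicator[tree * leavesPerTree + leaf] = 1;
        }
        return indicator;
    }

    // Counts the positions at which the two vectors differ.
    public static int HammingDistance(int[] a, int[] b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have equal length.");
        int distance = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) distance++;
        }
        return distance;
    }
}
]]></code>
<para>
For instance, with 3 trees of 4 leaves each, examples that reach leaves (0, 2, 3) and (0, 1, 3) differ only in the second tree, and that one tree contributes two differing positions to the distance.
</para>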
<para>We could repeat the same thought process for the non-leaf, or internal, nodes of the trees (we know that each tree has exactly 99 of them in our 100-leaf example),
and produce another indicator vector, N (size 9900), for each example, indicating the 'trajectory' of each example through each of the trees.</para>
<para>The distance in the combined 19900-dimensional LN-space reflects how many 'decisions' across all trees 'disagree' on the given pair of examples: the more decisions agree, the smaller the distance.</para>
<para>The TreeLeafFeaturizer also produces a third vector, T, which is defined as Ti(x) = output of tree #i on example x.</para>
</remarks>
<example>
<code language="csharp">
pipeline.Add(new TreeLeafFeaturizer());
</code>
</example>
</member>
</members>
</doc>