﻿<?xml version="1.0" encoding="utf-8"?>
  <members>
    <member name="CategoricalHashOneHotVectorizer">
      <summary>
        Encodes the categorical variable with hash-based encoding. 
      </summary>
      <remarks>
        CategoricalHashOneHotVectorizer converts a categorical value into an indicator array by hashing the value and using the hash as an index in the bag.
        If the input column is a vector, a single indicator bag is returned for it.
      </remarks>
    </member>
    <example name="CategoricalHashOneHotVectorizer">
      <example>
        <code language="csharp">
          pipeline.Add(new CategoricalHashOneHotVectorizer(&quot;Text1&quot;)
          {
            HashBits = 10,
            OutputKind = CategoricalTransformOutputKind.Bag
          });
        </code>
      </example>
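      <example>
        A minimal, library-independent sketch of the hashing idea described in the remarks: hash each value,
        keep the low HashBits bits as a slot index, and add one to that slot of the indicator bag.
        The Fnv1a and HashToSlot helpers are assumptions made for illustration; they are not the hash function
        or the API that the transform itself uses.
        <code language="csharp"><![CDATA[
          using System;

          static class HashOneHotSketch
          {
              // FNV-1a: a simple deterministic string hash (an assumption, for illustration only).
              static uint Fnv1a(string s)
              {
                  uint h = 2166136261;
                  foreach (char c in s)
                  {
                      h ^= c;
                      h *= 16777619; // wraps around on overflow in the default unchecked context
                  }
                  return h;
              }

              // Maps a category to a slot index in [0, 2^hashBits).
              static int HashToSlot(string value, int hashBits)
              {
                  uint mask = (1u << hashBits) - 1;
                  return (int)(Fnv1a(value) & mask);
              }

              static void Main()
              {
                  const int hashBits = 10;
                  var bag = new float[1 << hashBits];

                  // For a vector-valued input, every value contributes to the same single bag.
                  foreach (string value in new[] { "red", "green", "red" })
                      bag[HashToSlot(value, hashBits)] += 1f;

                  Console.WriteLine("Bag length: " + bag.Length);
                  Console.WriteLine("Slot of red: " + HashToSlot("red", hashBits));
              }
          }
        ]]></code>
        Distinct categories can collide in the same slot; a larger HashBits value makes collisions less likely at the cost of a longer vector.
      </example>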
    </example>
    <member name="CategoricalOneHotVectorizer">
      <summary>
        Converts the categorical value into an indicator array by building a dictionary of categories based on the data and using the id in the dictionary as the index in the array.
      </summary>
      <remarks>
        <para>
        The CategoricalOneHotVectorizer transform passes through a data set, operating on text columns, to
        build a dictionary of categories.
        For each row, the entire text string appearing in the input column is defined as a category.</para>
        <para>The output of this transform is an indicator vector.
        Each slot in this vector corresponds to a category in the dictionary, so its length is the size of the built dictionary.</para>
        <para>The CategoricalOneHotVectorizer can be applied to one or more columns, in which case it builds and uses a separate dictionary
        for each column that it is applied to.</para>
        <para>The Key option of <see cref="T:Microsoft.ML.Transforms.CategoricalTransformOutputKind"/> produces integer values and KeyType columns.
        The Key value is the one-based index of the slot that would be set with the Ind/Bag options.
        If the category is not found in the dictionary, the Key is assigned the value zero.
        With the <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Ind"/> and <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Bag"/> options, a category that is not found results in an all-zero bit vector.
        <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Ind"/> and <see cref="F:Microsoft.ML.Transforms.CategoricalTransformOutputKind.Bag"/> differ simply in how the bit-vectors generated from individual slots are aggregated:
        for Ind they are concatenated and for Bag they are added.
        When the source column is a singleton, the Ind and Bag options are identical.</para>
      </remarks>
    </member>
    <example name="CategoricalOneHotVectorizer">
      <example>
        An example of how to add the CategoricalOneHotVectorizer transform to a pipeline with two text column
        features named &quot;Text1&quot; and &quot;Text2&quot;.
        <code language="csharp">
          pipeline.Add(new CategoricalOneHotVectorizer(&quot;Text1&quot;, &quot;Text2&quot;));
        </code>
      </example>
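      <example>
        A library-independent sketch of how the Key, Ind and Bag outputs described in the remarks relate to each other
        for one row whose input column is a vector of two values. The data and all names are hypothetical;
        the code only illustrates the behavior described above and is not the transform's implementation.
        <code language="csharp"><![CDATA[
          using System;
          using System.Collections.Generic;
          using System.Linq;

          static class OneHotSketch
          {
              static void Main()
              {
                  // Dictionary of categories built in one pass over the data.
                  var dictionary = new Dictionary<string, int>();
                  foreach (string category in new[] { "cat", "dog", "bird", "dog" })
                      if (!dictionary.ContainsKey(category))
                          dictionary[category] = dictionary.Count;

                  // One row with a vector-valued input; "fish" is not in the dictionary.
                  string[] row = { "dog", "fish" };

                  // Key: one-based slot index, or zero when the category is not found.
                  int[] keys = row
                      .Select(v => dictionary.TryGetValue(v, out int slot) ? slot + 1 : 0)
                      .ToArray();

                  // One indicator vector per value; an unknown category gives an all-zero vector.
                  float[][] indicators = keys.Select(k =>
                  {
                      var bits = new float[dictionary.Count];
                      if (k > 0) bits[k - 1] = 1f;
                      return bits;
                  }).ToArray();

                  // Ind concatenates the per-value vectors; Bag adds them.
                  float[] ind = indicators.SelectMany(bits => bits).ToArray();
                  var bag = new float[dictionary.Count];
                  foreach (float[] bits in indicators)
                      for (int i = 0; i < bits.Length; i++)
                          bag[i] += bits[i];

                  Console.WriteLine("Key: " + string.Join(",", keys));
                  Console.WriteLine("Ind: " + string.Join(",", ind));
                  Console.WriteLine("Bag: " + string.Join(",", bag));
              }
          }
        ]]></code>
        With a single input value per row there is only one indicator vector, so concatenating and adding give the same result.
      </example>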
    </example>
    <member name="CountFeatureSelection">
      <summary>
        Selects the slots for which the count of non-default values is greater than or equal to a threshold.
      </summary>
      <remarks>
        <para>
          This transform uses a set of aggregators to count the number of non-default values for each slot and
          instantiates a <see cref="SlotsDroppingTransformer"/> to actually drop the slots.
          This transform is useful when applied together with a <see cref="T:Microsoft.ML.Transforms.OneHotHashEncodingTransformer"/>. 
          The count feature selection can remove those features generated by the hash transform that have no data in the examples.
        </para>
      </remarks>
    </member>
    <example name="CountFeatureSelection">
       <example>
        <code language="csharp">
          pipeline.Add(new FeatureSelectorByCount
          {
            Column = new[]{ &quot;Feature1&quot; },
            Count = 2
          });
        </code>
      </example>
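      <example>
        A library-independent sketch of the selection rule from the summary: count the non-default values in each slot
        and keep the slots whose count reaches the threshold. The data and names are hypothetical, and zero is assumed
        to be the default value.
        <code language="csharp"><![CDATA[
          using System;
          using System.Linq;

          static class CountSelectionSketch
          {
              static void Main()
              {
                  // Three rows of a four-slot feature vector; 0 is treated as the default value.
                  float[][] rows =
                  {
                      new float[] { 1, 0, 0, 3 },
                      new float[] { 2, 0, 5, 0 },
                      new float[] { 4, 0, 0, 6 },
                  };
                  const int count = 2; // threshold, playing the role of the Count option

                  // Count the non-default values in each slot.
                  var nonDefault = new int[rows[0].Length];
                  foreach (float[] row in rows)
                      for (int slot = 0; slot < row.Length; slot++)
                          if (row[slot] != 0f)
                              nonDefault[slot]++;

                  // Keep only the slots whose count is greater than or equal to the threshold.
                  int[] kept = Enumerable.Range(0, nonDefault.Length)
                      .Where(slot => nonDefault[slot] >= count)
                      .ToArray();
                  float[][] selected = rows
                      .Select(row => kept.Select(slot => row[slot]).ToArray())
                      .ToArray();

                  Console.WriteLine("Kept slots: " + string.Join(",", kept));
                  Console.WriteLine("First row after selection: " + string.Join(",", selected[0]));
              }
          }
        ]]></code>
      </example>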
    </example>
    <member name="MutualInformationFeatureSelection">
      <summary>
        Selects the top k slots across all specified columns ordered by their mutual information with the label column.
      </summary>
      <remarks>
        <para>
          The mutual information of two random variables X and Y is a measure of the mutual dependence between the variables.
          Formally, the mutual information can be written as:
        </para>
        <para>I(X;Y) = E[log(p(x,y)) - log(p(x)) - log(p(y))]</para>
        <para>where the expectation is taken over the joint distribution of X and Y. 
        Here p(x,y) is the joint probability density function of X and Y, p(x) and p(y) are the marginal probability density functions of X and Y respectively. 
        In general, a higher mutual information between the dependent variable (or label) and an independent variable (or feature) means
        that the label depends more strongly on that feature.
        The transform keeps the top SlotsInOutput features with the largest mutual information with the label.
        </para>
      </remarks>
    </member>
    <example name="MutualInformationFeatureSelection">
      <example>
        <code language="csharp">
          pipeline.Add(new FeatureSelectorByMutualInformation
          {
            Column = new[]{ &quot;Feature1&quot; },
            SlotsInOutput = 6
          });
        </code>
      </example>
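      <example>
        A library-independent sketch of the formula in the remarks for the discrete case, where the probabilities are
        estimated from counts, followed by keeping the top slots. Treating the values as small non-negative integers,
        the MutualInformation helper and the sample data are assumptions of this sketch, not the transform's implementation.
        <code language="csharp"><![CDATA[
          using System;
          using System.Linq;

          static class MutualInformationSketch
          {
              // I(X;Y) = sum over (x, y) of p(x,y) * (log p(x,y) - log p(x) - log p(y)).
              static double MutualInformation(int[] x, int[] y, int xValues, int yValues)
              {
                  int n = x.Length;
                  var joint = new double[xValues, yValues];
                  var px = new double[xValues];
                  var py = new double[yValues];
                  for (int i = 0; i < n; i++)
                  {
                      int a = x[i], b = y[i];
                      joint[a, b] += 1.0 / n;
                      px[a] += 1.0 / n;
                      py[b] += 1.0 / n;
                  }

                  double mi = 0;
                  for (int a = 0; a < xValues; a++)
                      for (int b = 0; b < yValues; b++)
                          if (joint[a, b] > 0) // empty cells contribute nothing
                              mi += joint[a, b] * (Math.Log(joint[a, b]) - Math.Log(px[a]) - Math.Log(py[b]));
                  return mi;
              }

              static void Main()
              {
                  int[] label = { 0, 0, 1, 1 };
                  int[][] slots =
                  {
                      new[] { 0, 0, 1, 1 }, // identical to the label
                      new[] { 0, 1, 0, 1 }, // independent of the label
                  };
                  const int slotsInOutput = 1;

                  // Rank the slots by mutual information with the label and keep the top ones.
                  int[] kept = Enumerable.Range(0, slots.Length)
                      .OrderByDescending(s => MutualInformation(slots[s], label, 2, 2))
                      .Take(slotsInOutput)
                      .ToArray();

                  Console.WriteLine("Kept slots: " + string.Join(",", kept));
              }
          }
        ]]></code>
      </example>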
    </example>
    <member name="OptionalColumnTransform">
      <summary>
        Creates a new column with the specified type and default values.
      </summary>
      <remarks>
        If the user wishes to create additional columns with a particular type and default values, or to replicate the values from one column to another while changing their type, they can do so using this transform.
        This transform can be used as a workaround to create a Label column after deserializing a model, for prediction.
        Some transforms in the serialized model operate on the Label column, and would throw errors during prediction if such a column is not found.
      </remarks>
    </member>
    <example name="OptionalColumnTransform">
      <example>
        <code language="csharp">
          pipeline.Add(new OptionalColumnCreator
          {
            Column = new[]{ &quot;OptColumn&quot; }
          });
        </code>
      </example>
    </example>
    <member name="HashJoin">
      <summary>
        Converts multiple column values into hashes. 
        This transform accepts both numeric and text inputs, both single and vector-valued columns. 
      </summary>
      <remarks>
        This transform can be helpful for ranking and cross-validation. In the case of ranking, where the GroupIdColumn column is required
        and needs to be of a key type, you can use the <see cref="T:Microsoft.ML.Transforms.CategoricalHashOneHotVectorizer" /> to hash the text value of a single GroupID column into a key value.
        If the GroupID is the combination of the values from multiple columns, you can use the HashConverter to hash multiple text columns into one key column.
        The same applies to the CrossValidator and its StratificationColumn.
      </remarks>
    </member>
    <example name="HashJoin">
      <example>
        <code language="csharp">
          pipeline.Add(new HashConverter(&quot;Column1&quot;, &quot;Column2&quot;));
        </code>
      </example>
    </example>
    <member name="NADrop">
      <summary>
        Removes missing values from vector type columns.
      </summary>
    </member>
    <example name="NADrop">
      <example>
        <code language="csharp">
          pipeline.Add(new MissingValuesDropper(&quot;Column1&quot;));
        </code>
      </example>
    </example>
    <member name="NAIndicator">
      <summary>
        This transform can transform either scalars or vectors (both fixed and variable size),
        creating output columns that indicate, through true/false Boolean values, whether the row has a missing value.
      </summary>
    </member>
    <example name="NAIndicator">
      <example>
        <code language="csharp">
          pipeline.Add(new MissingValueIndicator(&quot;Column1&quot;));
        </code>
      </example>
    </example>
    <member name="NAReplace">
      <summary>
        Creates an output column of the same type and size as the input column,
        where missing values are replaced with either the default value or the mean/min/max value (for non-text columns only). 
      </summary>
      <remarks>
        This transform can transform either scalars or vectors (both fixed and variable size),
        creating output columns that are identical to the input columns except for replacing NA values
        with either the default value, user input, or imputed values (min/max/mean are currently supported).
        Imputation modes are supported for vectors both by slot and across all slots.
      </remarks>
    </member>
    <example name="NAReplace">
      <example>
        <code language="csharp">
          pipeline.Add(new MissingValueSubstitutor(&quot;FeatureCol&quot;)
          {
              ReplacementKind = NAReplaceTransformReplacementKind.Mean
          });
        </code>
      </example>
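      <example>
        A library-independent sketch of mean imputation on a scalar column, as described in the remarks.
        Encoding missing values as NaN and the sample data are assumptions of this sketch.
        <code language="csharp"><![CDATA[
          using System;
          using System.Linq;

          static class NaReplaceSketch
          {
              static void Main()
              {
                  // A scalar float column with missing values encoded as NaN.
                  float[] column = { 1f, float.NaN, 3f, float.NaN, 8f };

                  // The imputed value is the mean of the values that are present.
                  float mean = column.Where(v => !float.IsNaN(v)).Average();

                  // The output has the same type and size as the input; only missing values change.
                  float[] replaced = column.Select(v => float.IsNaN(v) ? mean : v).ToArray();

                  Console.WriteLine("Mean of present values: " + mean);
                  Console.WriteLine("Replaced column: " + string.Join(" ", replaced));
              }
          }
        ]]></code>
        For a vector column the same computation can be done either separately for each slot or once across all slots.
      </example>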
    </example>
    <member name="LpNormalize">
      <summary>
         The LpNormalizer transform normalizes vectors (rows) individually by rescaling them to unit norm (L2, L1 or LInf).
         <para>Performs the following operation on a vector X:</para> 
         <para>Y = (X - M) / D</para> 
         <para>where M is mean and D is either L2 norm, L1 norm or LInf norm.</para>
       </summary>
      <remarks>
        Scaling inputs to unit norms is a common operation for text classification or clustering.
        For more information see: <a href="https://www.cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf">An Analysis of Single-Layer Networks in Unsupervised Feature Learning</a>
      </remarks>
      <seealso cref="Microsoft.ML.Transforms.Projections.GlobalContrastNormalizingEstimator"></seealso>
      <example>
        <code language="csharp">
          pipeline.Add(new LpNormalizer(&quot;FeatureCol&quot;)
          {
              NormKind = LpNormNormalizerTransformNormalizerKind.L1Norm
          });
        </code>
      </example>
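      <example>
        A library-independent sketch of the operation Y = (X - M) / D from the summary, applied to a single vector.
        The Normalize helper, its string-valued norm argument and the optional subtractMean flag are assumptions of this sketch;
        they do not mirror the transform's actual options.
        <code language="csharp"><![CDATA[
          using System;
          using System.Linq;

          static class LpNormalizeSketch
          {
              // Rescales one vector to unit norm; M is 0 unless subtractMean is set.
              static double[] Normalize(double[] x, string norm, bool subtractMean = false)
              {
                  double m = subtractMean ? x.Average() : 0.0;
                  double[] centered = x.Select(v => v - m).ToArray();

                  double d;
                  switch (norm)
                  {
                      case "L1": d = centered.Sum(v => Math.Abs(v)); break;
                      case "LInf": d = centered.Max(v => Math.Abs(v)); break;
                      default: d = Math.Sqrt(centered.Sum(v => v * v)); break; // L2
                  }

                  // A zero vector has no direction, so it is returned unchanged.
                  return d == 0 ? centered : centered.Select(v => v / d).ToArray();
              }

              static void Main()
              {
                  double[] x = { 3, 4 };
                  foreach (string norm in new[] { "L2", "L1", "LInf" })
                      Console.WriteLine(norm + ": " + string.Join(" ", Normalize(x, norm)));
              }
          }
        ]]></code>
        Each row is normalized on its own, so no statistics need to be collected over the data set beforehand.
      </example>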
    </member>
    <member name="GcNormalize">
      <summary>
        <para>Performs a global contrast normalization on input values:</para>
        <para>Y = (s * X - M) / D</para> 
        <para>where s is a scale, M is mean and D is either the L2 norm or standard deviation.</para>
       </summary>
      <remarks>
        Scaling inputs to unit norms is a common operation for text classification or clustering.
        For more information see: 
        <a href="https://www.cs.stanford.edu/~acoates/papers/coatesleeng_aistats_2011.pdf">An Analysis of Single-Layer Networks in Unsupervised Feature Learning</a>
      </remarks>
      <seealso cref="Microsoft.ML.Transforms.Projections.LpNormalizingEstimator"></seealso>
      <example>
        <code language="csharp">
          pipeline.Add(new GlobalContrastNormalizer(&quot;FeatureCol&quot;)
          {
              SubMean = false
          });
        </code>
      </example>
    </member>
    <member name="Ungroup">
      <summary>
        Un-groups vector columns into sequences of rows; this is the inverse of the Group transform.
       </summary>
      <remarks>
        <para>This can be thought of as an inverse of the <see cref="T:Microsoft.ML.Transforms.CombinerByContiguousGroupId"/>.  
        For all specified vector columns ("pivot" columns), performs the "ungroup" (or "unroll") operation as outlined below.
        </para>
        <para>If the only pivot column is called P, and has size K, then for every row of the input we will produce 
         K rows, that are identical in all columns except P. The column P will become a scalar column, and this 
         column will hold all the original values of input's P, one value per row, in order. The order of columns 
         will remain the same.
        </para>
        <para>Variable-length pivot columns are supported (including zero, which will eliminate the row from the result).</para>
        <para>Multiple pivot columns are also supported:</para>
        <list type="bullet">
          <item><description>The number of output rows is controlled by the 'mode' parameter.
            <list type="bullet">
              <item><term>outer</term><description> it is equal to the maximum length of pivot columns</description></item>
              <item><term>inner</term><description> it is equal to the minimum length of pivot columns</description></item>
              <item><term>first</term><description> it is equal to the length of the first pivot column</description></item>
            </list>
            </description>
          </item>
          <item><description>
              If a particular pivot column has a size that differs from the number of output rows, the extra slots will
              be ignored, and the missing slots will be 'padded' with default values.
            </description></item>
        </list>
        <para>All metadata are preserved for the retained columns. For 'unrolled' columns, all known metadata
        except slot names are preserved.
        </para>
      </remarks>
    </member>
    <example name="Ungroup">
      <example>
        <code language="csharp">
          pipeline.Add(new Segregator
          {
              Column = new[]{ &quot;Column1&quot; },
              Mode = UngroupTransformUngroupMode.First
          });
        </code>
      </example>
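      <example>
        A library-independent sketch of how one input row with two pivot columns is unrolled under the outer, inner
        and first modes described in the remarks. The Ungroup helper, the string-valued mode argument and the data are
        hypothetical; non-pivot columns, which would simply be repeated on every output row, are left out.
        <code language="csharp"><![CDATA[
          using System;
          using System.Collections.Generic;
          using System.Linq;

          static class UngroupSketch
          {
              // Unrolls the pivot columns of ONE input row into a sequence of output rows.
              static IEnumerable<int[]> Ungroup(int[][] pivots, string mode)
              {
                  int rowCount;
                  switch (mode)
                  {
                      case "outer": rowCount = pivots.Max(p => p.Length); break;
                      case "inner": rowCount = pivots.Min(p => p.Length); break;
                      default: rowCount = pivots[0].Length; break; // "first"
                  }

                  for (int i = 0; i < rowCount; i++)
                  {
                      int index = i;
                      // Missing slots are padded with the default value; extra slots are never read.
                      yield return pivots
                          .Select(p => index < p.Length ? p[index] : default(int))
                          .ToArray();
                  }
              }

              static void Main()
              {
                  int[][] pivots = { new[] { 1, 2, 3 }, new[] { 10, 20 } };
                  foreach (string mode in new[] { "outer", "inner", "first" })
                  {
                      Console.WriteLine(mode + ":");
                      foreach (int[] row in Ungroup(pivots, mode))
                          Console.WriteLine("  " + string.Join(",", row));
                  }
              }
          }
        ]]></code>
        When the row count is zero, for example when the first pivot is empty in first mode, the loop yields nothing and the input row disappears from the result.
      </example>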
    </example>
    <member name="KeyToText">
      <summary>
        Retrieves the original values from a key column.
      </summary>
      <remarks>
        The KeyToTextConverter is the complement of the <see cref="TextToKeyConverter"/> transform.
        Since key values are an enumeration into the set of keys, most transforms that produce key-valued outputs
        corresponding to input values will, wherever possible, associate a piece of KeyValue metadata with that dataset.
        Transforming values into a categorical variable would be of limited use,
        if we couldn't somehow backtrack to figure out what those categories actually mean.
        The KeyToTextConverter enables that functionality.
      </remarks>
      <seealso cref="Microsoft.ML.Transforms.HashConverter"/>
      <seealso cref="Microsoft.ML.Transforms.TextToKeyConverter"/>
      <example>
        <code language="csharp">
          pipeline.Add(new KeyToTextConverter((&quot;InColumn&quot;, &quot;OutColumn&quot; )));
        </code>
      </example>
    </member>
    <member name="Group">
      <summary>
       Groups values of a scalar column into a vector, by a contiguous group ID.
      </summary>
      <remarks>
       The CombinerByContiguousGroupId transform groups the consecutive rows that share the specified group key (or keys). 
       Both group keys and the aggregated values can be of arbitrary non-vector types. 
       The resulting data will have all the group key columns preserved, 
       and the aggregated columns will become variable-length vectors of the original types.
       <para>This transform essentially performs the following SQL-like operation:</para> 
       <para>SELECT GroupKey1, GroupKey2, ... GroupKeyK, LIST(Value1), LIST(Value2), ... LIST(ValueN)</para> 
       <para>FROM Data</para> 
       <para>GROUP BY GroupKey1, GroupKey2, ... GroupKeyK.</para> 
      </remarks>
       <seealso cref="Microsoft.ML.Transforms.Segregator"/>
      <example>
        <code language="csharp">
          pipeline.Add(new CombinerByContiguousGroupId
          {
              GroupKey = new []{ &quot;Key1&quot;, &quot;Key2&quot; }
          });
        </code>
      </example>
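      <example>
        A library-independent sketch of grouping by a contiguous key, as described in the remarks: consecutive rows
        that share a key are collected into one variable-length vector. The GroupContiguous helper and the data are
        hypothetical and use a single key column and a single value column for brevity.
        <code language="csharp"><![CDATA[
          using System;
          using System.Collections.Generic;

          static class GroupSketch
          {
              // Groups consecutive rows that share a key; a key that reappears later starts a new group.
              static IEnumerable<(string Key, List<int> Values)> GroupContiguous(
                  IEnumerable<(string Key, int Value)> rows)
              {
                  string currentKey = null;
                  List<int> values = null;
                  foreach (var (key, value) in rows)
                  {
                      if (values == null || key != currentKey)
                      {
                          if (values != null)
                              yield return (currentKey, values);
                          currentKey = key;
                          values = new List<int>();
                      }
                      values.Add(value);
                  }
                  if (values != null)
                      yield return (currentKey, values);
              }

              static void Main()
              {
                  var rows = new[] { ("a", 1), ("a", 2), ("b", 3), ("a", 4) };
                  foreach (var (key, values) in GroupContiguous(rows))
                  {
                      string joined = string.Join(",", values);
                      Console.WriteLine(key + ": [" + joined + "]");
                  }
              }
          }
        ]]></code>
        Because only adjacent rows are compared, the final row with key a forms its own group; sorting the data by key first would give the usual GROUP BY result.
      </example>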
    </member>
    <member name="Whitening">
      <summary>
        Implements PCA (Principal Component Analysis) and ZCA (Zero phase Component Analysis) whitening.
        The whitening process consists of 2 steps:
            1. Decorrelation of the input data. Input data is assumed to have zero mean.
            2. Rescaling of the decorrelated features to have unit variance.
        That is, PCA whitening is essentially just a PCA + rescale.
        ZCA whitening tries to make the resulting data look more like the input data by rotating it back to the
        original input space.
        More information: <a href="http://ufldl.stanford.edu/wiki/index.php/Whitening">http://ufldl.stanford.edu/wiki/index.php/Whitening</a>
      </summary>
    </member>
  </members>