description: Tools for Preprocessing (Encoding/Scaling)

3. Data Prep

  1. Click on Data Prep in the Machine Learning category.

  1. Model Type: Select the preprocessing task to perform.
  2. Allocate to: Enter the variable name to which the selected preprocessing model will be assigned.
  3. Code View: Preview the code that will be output.
  4. Run: Execute the code.

Encoding

  1. Sparse (OneHotEncoder): If true, returns the encoding result as a sparse matrix.
  2. Handle unknown (OneHotEncoder, OrdinalEncoder): Controls what happens when the data being encoded contains a category that was not present in the training data. If ignore is selected, the unknown category is set to 0 (an all-zero row for OneHotEncoder); if error is selected, a ValueError is raised.
  3. Unknown values (OrdinalEncoder): The specific value used to encode unknown categories, as an alternative to ignore or error.
  4. Cols (TargetEncoder): Select the columns to encode.
  5. Handle missing (TargetEncoder): Choose how to handle missing values.
  6. Smoothing (TargetEncoder): When a category contains only a few samples, its mean is unreliable. The entered value is added into the averaging so that the estimate for a small category is pulled toward the overall mean, which helps prevent overfitting.
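
The Handle unknown and Unknown values options can be illustrated with scikit-learn directly. This is a minimal sketch, not the code that Data Prep generates: it assumes the options map onto scikit-learn's `handle_unknown` and `unknown_value` parameters, and the color data is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Categories seen while fitting (training data). Illustrative values only.
X_train = np.array([["red"], ["green"], ["blue"]])
# "purple" never appeared during fitting, so it is an unknown category.
X_test = np.array([["green"], ["purple"]])

# handle_unknown="ignore": an unknown category becomes an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(X_train)
onehot_result = onehot.transform(X_test)  # sparse matrix by default
onehot_dense = onehot_result.toarray()    # dense copy for inspection

# OrdinalEncoder: unknown categories receive a chosen value instead.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(X_train)
ordinal_result = ordinal.transform(X_test)

print(onehot_dense)
print(ordinal_result)
```

With `handle_unknown="error"` (the scikit-learn default), the same `transform` calls would raise a ValueError for "purple".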

Scaling

  1. With mean (StandardScaler): Center the mean of the data to zero.
  2. With std (StandardScaler): Scale the standard deviation of the data to 1.
  3. With centering (RobustScaler): Centers the data by subtracting the median from each attribute (column).
  4. With scaling (RobustScaler): Scales each attribute by dividing it by its interquartile range (IQR).
  5. Feature range (MinMaxScaler): Sets the minimum and maximum values for the scaled result.
  6. Norm (Normalizer):
    1. L1: Scales each sample (row) so that the sum of the absolute values of its elements is 1.
    2. L2: Scales each sample (row) so that its Euclidean length is 1.
    3. Max Norm: Divides each sample (row) by its largest absolute value, so that no scaled value exceeds 1 in magnitude.
  7. N bins (KBinsDiscretizer): Determines how many bins to divide the variable into.
  8. Strategy (KBinsDiscretizer):
    1. uniform: Divides the range into bins of equal width.
    2. quantile: Divides the data so that each bin holds approximately the same number of data points.
  9. Encode (KBinsDiscretizer): Specify the encoding method.
    1. ordinal: Encodes each interval as an integer.
    2. onehot: Encodes each interval as a binary vector.
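
The scaling options above correspond closely to scikit-learn constructor parameters. The following is a hedged sketch rather than the tool's generated output: the parameter names are scikit-learn's, while the small arrays (including the outlier 100) are invented to make the differences visible.

```python
import numpy as np
from sklearn.preprocessing import (
    KBinsDiscretizer,
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# One feature (column) with an outlier. Illustrative values only.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# With mean / With std: zero mean and unit standard deviation per column.
standard = StandardScaler(with_mean=True, with_std=True).fit_transform(X)

# With centering / With scaling: subtract the median, divide by the IQR.
robust = RobustScaler(with_centering=True, with_scaling=True).fit_transform(X)

# Feature range: rescale each column into the given (min, max) interval.
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Norm: Normalizer works on each sample (row), not on each column.
rows = np.array([[3.0, 4.0], [2.0, 8.0]])
l1 = Normalizer(norm="l1").fit_transform(rows)   # abs values of a row sum to 1
l2 = Normalizer(norm="l2").fit_transform(rows)   # Euclidean length of a row is 1
mx = Normalizer(norm="max").fit_transform(rows)  # largest abs value of a row is 1

# N bins / Strategy / Encode: equal-width bins versus equal-count bins.
uniform = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform"
).fit_transform(X)
quantile = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="quantile"
).fit_transform(X)

print(robust.ravel())
print(uniform.ravel(), quantile.ravel())
```

Because of the outlier, the uniform strategy places almost every value in the first bin, while the quantile strategy spreads the values more evenly across the two bins.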

Etc. (SimpleImputer / SMOTE / MakeColumnTransformer)

  1. Missing values (SimpleImputer): Treats the entered values as missing.
  2. Fill value (SimpleImputer): Replaces the missing value with the input value.
  3. Copy (SimpleImputer): If true, imputation is performed on a copy of the data and the original data is left unchanged.
  4. Add indicator (SimpleImputer): Adds indicator columns of 0s and 1s, with a 1 where a value was missing and a 0 where it was not.
  5. K neighbors (SMOTE): Specifies the number of nearest neighbors of each minority-class sample that are used to generate synthetic samples.
  6. Sampling strategy (SMOTE):
    1. auto: Automatically resamples every class except the majority class to balance out class imbalances.
    2. minority: Resamples only the minority class, making it equal in size to the majority class.
    3. float: You can specify the desired ratio of minority class samples to majority class samples after resampling (binary classification only). For example, setting it to 0.5 makes the minority class dataset half the size of the majority class dataset.
  7. Estimator (MakeColumnTransformer): Select a previously defined preprocessing model to apply to specific columns. A different model can be chosen for each set of columns. The model selected here will be applied to the columns selected in Columns below.
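
The SimpleImputer and MakeColumnTransformer options can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the code Data Prep emits: it assumes the options map onto `SimpleImputer` parameters and `make_column_transformer`, and the arrays are invented. SMOTE is provided by the separate imbalanced-learn package and is not shown here.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative data: the first column has one missing entry.
X = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, 30.0]])

imputer = SimpleImputer(
    missing_values=np.nan,  # Missing values: what counts as missing
    strategy="constant",
    fill_value=0.0,         # Fill value: replacement for missing entries
    copy=True,              # Copy: work on a copy, leave X untouched
    add_indicator=True,     # Add indicator: append 0/1 missing-flag columns
)
imputed = imputer.fit_transform(X)
# Columns of imputed: filled column 0, column 1, missing flag for column 0.

# Estimator / Columns: pair each transformer with the columns it handles.
X_full = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
column_transformer = make_column_transformer(
    (StandardScaler(), [0]),  # standardize column 0
    (MinMaxScaler(), [1]),    # rescale column 1 into [0, 1]
)
transformed = column_transformer.fit_transform(X_full)

print(imputed)
print(transformed)
```

Because `copy=True`, the original array `X` still contains the missing entry after imputation; only the returned array is filled.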