3.-data-prep.md

description	Tools for Preprocessing (Encoding/Scaling)

3. Data Prep

Model Type: Select the preprocessing task to perform.
Allocate to: Assign a variable name to the model that performs the selected preprocessing task.
Code View: Preview the code that will be output.
Run: Execute the code.

Sparse (OneHotEncoder): If true, returns the encoding result as a sparse matrix.
Handle unknown (OneHotEncoder, OrdinalEncoder): Determines what happens when the data being encoded contains a category that was not present in the training data (for example, a category that appears only in the test data). If ignore is selected, the encoded columns for that category are all set to 0; if error is selected, a ValueError is raised.
Unknown values (OrdinalEncoder): Encodes unknown categories with the specified value instead of ignoring them or raising an error.
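The unknown-category options above can be sketched as follows. This is a minimal illustration, assuming the menu generates scikit-learn's `OneHotEncoder` and `OrdinalEncoder`; the color data and the placeholder value `-1` are made up for the example.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: "purple" appears only in the test set.
train = [["red"], ["green"], ["blue"]]
test = [["green"], ["purple"]]

# handle_unknown="ignore": an unseen category becomes an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(train)
onehot_result = onehot.transform(test).toarray()

# handle_unknown="use_encoded_value": an unseen category gets unknown_value.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal.fit(train)
ordinal_result = ordinal.transform(test)

print(onehot_result)   # categories are sorted as blue, green, red
print(ordinal_result)
```

With the default `handle_unknown="error"`, both `transform` calls would instead raise a `ValueError` on "purple".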
Cols (TargetEncoder): Select the columns to encode.
Handle missing (TargetEncoder): Choose how to handle missing values.
Smoothing (TargetEncoder): Controls how strongly the target mean of each category is pulled toward the overall target mean. When a category has only a few rows, the entered value is blended into its average so that the encoding does not overfit to those few rows.
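To give a feel for what smoothing does, here is a simplified sketch using plain additive smoothing. The helper `smoothed_target_means` is hypothetical, and the formula is an assumption chosen for illustration; the encoder used by the tool may weight the two means differently.

```python
from collections import defaultdict

def smoothed_target_means(categories, targets, smoothing=1.0):
    """Blend each category's target mean with the global mean (illustrative only)."""
    prior = sum(targets) / len(targets)  # overall target mean
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    # Categories with few rows are pulled more strongly toward the prior.
    return {
        cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
        for cat in counts
    }

cats = ["a", "a", "a", "a", "b"]
y = [1, 1, 1, 0, 0]

plain = smoothed_target_means(cats, y, smoothing=0.0)     # raw category means
smoothed = smoothed_target_means(cats, y, smoothing=2.0)  # pulled toward 0.6
print(plain, smoothed)
```

Category "b" has a single row, so its encoded value moves much further toward the overall mean than category "a" does.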

With mean (StandardScaler): If true, centers the data by subtracting the mean, so that each column has a mean of zero.
With std (StandardScaler): If true, scales the data so that each column has a standard deviation of 1.
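The effect of the two StandardScaler switches can be checked with a small sketch, assuming scikit-learn's `StandardScaler` and a made-up single-column array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # mean 3, std sqrt(2)

# Both options on (the default): zero mean and unit standard deviation.
full = StandardScaler().fit_transform(X)

# with_std=False: only the mean is subtracted.
center_only = StandardScaler(with_std=False).fit_transform(X)

# with_mean=False: only the division by the standard deviation is applied.
scale_only = StandardScaler(with_mean=False).fit_transform(X)

print(full.ravel(), center_only.ravel(), scale_only.ravel())
```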
With centering (RobustScaler): Centers each attribute (column) by subtracting its median.
With scaling (RobustScaler): Scales each attribute by dividing it by its IQR.
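A short sketch of the RobustScaler options, assuming scikit-learn's `RobustScaler` with its default 25th to 75th percentile range; the data, including the outlier 100, is invented for illustration:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Median is 3 and the IQR is 4 - 2 = 2; the outlier 100 affects neither.
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

both = RobustScaler().fit_transform(X)                          # (x - 3) / 2
center_only = RobustScaler(with_scaling=False).fit_transform(X)  # x - 3
scale_only = RobustScaler(with_centering=False).fit_transform(X) # x / 2

print(both.ravel())
```

Because the median and IQR ignore the extreme value, the first four rows land in a narrow range around zero even though the outlier is present.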
Feature range (MinMaxScaler): Sets the minimum and maximum values for the scaled result.
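As a minimal example of the feature range, assuming scikit-learn's `MinMaxScaler` and an arbitrary target range of -1 to 1:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[0.0], [5.0], [10.0]])

# The column minimum maps to -1 and the column maximum maps to 1.
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(X)

print(scaled.ravel())
```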
Norm (Normalizer): Selects the norm used to rescale each sample (row).
1. L1: The sum of the absolute values in each sample will be 1.
2. L2: Scales each sample so that its Euclidean length (L2 norm) is 1.
3. Max Norm: Divides each sample by its largest absolute value, so that the largest absolute value in the sample becomes 1.
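The three norms can be compared on a single sample. This sketch assumes scikit-learn's `Normalizer`, which rescales each row independently; the row `[3, -4]` is chosen so the results are easy to verify by hand.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, -4.0]])  # one sample with two attributes

l1 = Normalizer(norm="l1").fit_transform(X)    # divide by |3| + |-4| = 7
l2 = Normalizer(norm="l2").fit_transform(X)    # divide by sqrt(9 + 16) = 5
mx = Normalizer(norm="max").fit_transform(X)   # divide by max(|3|, |-4|) = 4

print(l1, l2, mx)
```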
N bins (KBinsDiscretizer): Determines how many bins to divide the variable into.
Strategy (KBinsDiscretizer):
1. uniform: Divides the range into bins of equal width.
2. quantile: Divides the data so that each bin contains an equal number of data points.
Encode (KBinsDiscretizer): Specify the encoding method.
1. ordinal: Encodes each bin as an integer.
2. onehot: Encodes each bin as a one-hot (binary) vector.
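The difference between the strategies and encodings shows up clearly on skewed data. The following is a sketch assuming scikit-learn's `KBinsDiscretizer`, with a made-up column containing one large value:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [30.0]])

# uniform: two bins of equal width over [0, 30], split at 15.
uniform = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="uniform")
uniform_bins = uniform.fit_transform(X).ravel()

# quantile: two bins with the same number of points, split at the median.
quantile = KBinsDiscretizer(n_bins=2, encode="ordinal", strategy="quantile")
quantile_bins = quantile.fit_transform(X).ravel()

# onehot: each bin becomes its own 0/1 column (returned as a sparse matrix).
onehot = KBinsDiscretizer(n_bins=2, encode="onehot", strategy="quantile")
onehot_bins = onehot.fit_transform(X).toarray()

print(uniform_bins, quantile_bins)
print(onehot_bins)
```

With uniform bins, five of the six values fall into the first bin; with quantile bins, each bin receives three values.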

Missing values (SimpleImputer): Specifies the value that should be treated as missing.
Fill value (SimpleImputer): Replaces missing values with the entered value.
Copy (SimpleImputer): If true, the imputation is performed on a copy of the data, so the original data is left unchanged.
Add indicator (SimpleImputer): Adds a column of 0s and 1s for each column that contained missing values, with a 1 for rows where the value was missing and a 0 otherwise.
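These options can be combined in one small sketch. It assumes scikit-learn's `SimpleImputer` with the constant strategy; the sentinel `-1` and the fill value `0.0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data where -1 marks a missing entry in the second column.
X = np.array([[1.0, -1.0], [2.0, 4.0], [3.0, 6.0]])

imputer = SimpleImputer(
    missing_values=-1,     # treat -1 as missing
    strategy="constant",   # fill with a fixed value
    fill_value=0.0,
    add_indicator=True,    # append a 0/1 column for the column that had gaps
    copy=True,             # leave X itself untouched
)
result = imputer.fit_transform(X)

print(result)  # the last column is the missing-value indicator
print(X)       # still contains -1
```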
K neighbors (SMOTE): Specifies the number of nearest neighbors used to generate synthetic samples around each minority-class data point.
Sampling strategy (SMOTE):
1. auto: Resamples every class except the majority class so that each matches the size of the majority class.
2. minority: Resamples only the minority class, making it equal in size to the majority class.
3. float: You can specify the desired ratio of minority to majority class data after resampling. For example, setting it to 0.5 makes the minority class dataset half the size of the majority class dataset.
Estimator (MakeColumnTransformer): Specifies the preprocessing model to apply to a group of columns, so that different models can be applied to different columns. The model selected here will be applied to the columns selected in Columns below.
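Pairing an estimator with a set of columns might look like the sketch below. It assumes scikit-learn's `make_column_transformer`; the DataFrame and its column names `age` and `city` are hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame(
    {
        "age": [20.0, 30.0, 40.0, 50.0],
        "city": ["Seoul", "Busan", "Seoul", "Daegu"],
    }
)

# Each tuple pairs one estimator with the columns it should be applied to.
transformer = make_column_transformer(
    (StandardScaler(), ["age"]),
    (OneHotEncoder(), ["city"]),
)
result = transformer.fit_transform(df)

print(result.shape)  # one scaled column plus one column per city
```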