|
1 |
| -# 3. Data Prep |
2 |
| - |
3 |
| - |
| 1 | +--- |
| 2 | +description: Tools for Preprocessing (Encoding/Scaling) |
| 3 | +--- |
4 | 4 |
|
5 |
| -<figure><img src="../.gitbook/assets/image (148).png" alt="" width="211"><figcaption></figcaption></figure> |
6 |
| - |
7 |
| -1. Click on Data Prep in the Machine Learning category. |
| 5 | +# 3. Data Prep |
8 | 6 |
|
| 7 | +<figure><img src="../.gitbook/assets/image (322).png" alt="" width="529"><figcaption></figcaption></figure> |
9 | 8 |
|
| 9 | +1. Click on **Data Prep** in the **Machine Learning** category. |
10 | 10 |
|
11 |
| -<figure><img src="../.gitbook/assets/image (149).png" alt="" width="563"><figcaption></figcaption></figure> |
| 11 | +<figure><img src="../.gitbook/assets/image (323).png" alt="" width="563"><figcaption></figcaption></figure> |
12 | 12 |
|
13 | 13 | 2. _**Model Type**_: You can perform various preprocessing tasks:
|
14 |
| - * Encoding |
15 |
| - * Scaling |
16 |
| - * ETC |
| 14 | + * [**Encoding**](3.-data-prep.md#encoding) |
| 15 | + * [**Scaling**](3.-data-prep.md#scaling) |
| 16 | + * [**ETC**](3.-data-prep.md#etc-simpleimputer-smote-makecolumntransformer) |
17 | 17 | 3. _**Allocate to**_: Assign variable names for the model to perform the selected preprocessing tasks.
|
18 | 18 | 4. _**Code View**_: Preview the code that will be output.
|
19 | 19 | 5. _**Run**_: Execute the code.
|
20 | 20 |
|
| 21 | + |
| 22 | + |
| 23 | +*** |
| 24 | + |
| 25 | +## Encoding |
| 26 | + |
| 27 | +<figure><img src="../.gitbook/assets/image (324).png" alt="" width="563"><figcaption></figcaption></figure> |
| 28 | + |
| 29 | +1. _**Sparse (OneHotEncoder)**_: If _**true**_, returns the encoding result as a sparse matrix. |
| 30 | +2. _**Handle unknown (OneHotEncoder, OrdinalEncoder)**_: Controls what happens when the data being encoded contains a category that did not appear in the training data. If _**ignore**_ is selected, the unknown category is encoded as all zeros, and if _**error**_ is selected, a ValueError is raised. |
| 31 | +3. _**Unknown values (OrdinalEncoder)**_: Encodes unknown categories with the specified value instead of ignoring them or raising an error. |
| 32 | +4. _**Cols (TargetEncoder)**_: Select the columns to encode. |
| 33 | +5. _**Handle missing (TargetEncoder)**_: Choose how to handle missing values. |
| 34 | +6. _**Smoothing (TargetEncoder)**_: When a category contains few samples, the entered value is used to blend that category's average with the overall average, which helps prevent overfitting. |
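
The _Handle unknown_ and _Unknown values_ options can be illustrated with a minimal sketch. It assumes the options map to scikit-learn's `handle_unknown` and `unknown_value` parameters; the sample data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: "purple" appears only in the data to transform.
train = np.array([["red"], ["green"], ["blue"]])
test = np.array([["green"], ["purple"]])

# handle_unknown="ignore": an unseen category becomes an all-zero row.
# The result is a sparse matrix by default, so convert it to inspect it.
onehot = OneHotEncoder(handle_unknown="ignore").fit(train)
onehot_test = onehot.transform(test).toarray()

# handle_unknown="error": an unseen category raises a ValueError.
try:
    OneHotEncoder(handle_unknown="error").fit(train).transform(test)
    raised = False
except ValueError:
    raised = True

# OrdinalEncoder: unseen categories are encoded with unknown_value.
ordinal = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
ordinal_test = ordinal.fit(train).transform(test)
```

Categories are sorted alphabetically during fitting, so `green` is the second one-hot column and receives ordinal code 1.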
| 35 | + |
| 36 | + |
| 37 | + |
| 38 | +*** |
| 39 | + |
| 40 | +## Scaling |
| 41 | + |
| 42 | +<figure><img src="../.gitbook/assets/image (325).png" alt="" width="563"><figcaption></figcaption></figure> |
| 43 | + |
| 44 | +1. _**With mean (StandardScaler)**_: Center the mean of the data to zero. |
| 45 | +2. _**With std (StandardScaler)**_: Scale the standard deviation of the data to 1. |
| 46 | +3. _**With centering (RobustScaler)**_: Centers the data by subtracting the median from each attribute (column). |
| 47 | +4. _**With scaling (RobustScaler)**_: Scales each attribute by dividing it by its IQR. |
| 48 | +5. _**Feature range (MinMaxScaler)**_: Sets the minimum and maximum values for the scaled result. |
| 49 | +6. _**Norm (Normalizer)**_: Specify the norm used to rescale each sample (row). |
| 50 | + 1. _**L1**_: Scales each sample so that the sum of the absolute values of its attributes is 1. |
| 51 | + 2. _**L2**_: Scales each sample so that its Euclidean length (L2 norm) is 1. |
| 52 | + 3. _**Max Norm**_: Scales each sample by its largest absolute value, so that no scaled value exceeds 1 in absolute value. |
| 53 | +7. _**N bins (KBins Discretizer)**_: Determines how many bins to divide the variable into. |
| 54 | +8. _**Strategy (KBins Discretizer)**_: |
| 55 | + 1. _**uniform**_: Divides the range into bins of equal width. |
| 56 | + 2. _**quantile**_: Divides the data so that each bin contains the same number of data points. |
| 57 | +9. _**Encode (KBins Discretizer)**_: Specify the encoding method. |
| 58 | + 1. _**ordinal**_: Encodes each interval as an integer. |
| 59 | + 2. _**onehot**_: Encodes each interval as a binary vector. |
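
The scaler and normalizer options above can be compared side by side. This is a minimal sketch that assumes the options correspond to the scikit-learn parameters of the same name; the array `X` is made-up sample data.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    Normalizer,
    RobustScaler,
    StandardScaler,
)

# Hypothetical data: 4 samples (rows), 2 attributes (columns).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 100.0]])

# Column-wise scalers: each attribute is scaled independently.
standard = StandardScaler(with_mean=True, with_std=True).fit_transform(X)
robust = RobustScaler(with_centering=True, with_scaling=True).fit_transform(X)
minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Row-wise normalization: each sample is rescaled to unit norm.
l1 = Normalizer(norm="l1").fit_transform(X)
l2 = Normalizer(norm="l2").fit_transform(X)
mx = Normalizer(norm="max").fit_transform(X)

print(standard.mean(axis=0), standard.std(axis=0))  # per-column mean and std
print(np.abs(l1).sum(axis=1))                       # per-row L1 norm
```

Note the difference in direction: the scalers work per column, while `Normalizer` works per row.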
| 60 | + |
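
The binning strategies differ most visibly on skewed data. The sketch below assumes the _N bins_, _Strategy_, and _Encode_ options map to `n_bins`, `strategy`, and `encode` of scikit-learn's `KBinsDiscretizer`; the input values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical single-column data with one large value.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [10.0]])

# uniform: two bins of equal width over the range [0, 10].
uniform = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="uniform"
).fit_transform(x)

# quantile: two bins that each hold half of the data points.
quantile = KBinsDiscretizer(
    n_bins=2, encode="ordinal", strategy="quantile"
).fit_transform(x)

# onehot: each bin becomes a column of a sparse binary matrix.
onehot = KBinsDiscretizer(
    n_bins=2, encode="onehot", strategy="uniform"
).fit_transform(x).toarray()

print(uniform.ravel())
print(quantile.ravel())
```

With uniform width, only the largest value lands in the second bin, whereas the quantile strategy splits the six points three and three.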
| 61 | + |
| 62 | + |
| 63 | +*** |
| 64 | + |
| 65 | +## ETC(SimpleImputer / SMOTE / MakeColumnTransformer) |
| 66 | + |
| 67 | +<figure><img src="../.gitbook/assets/image (326).png" alt="" width="563"><figcaption></figcaption></figure> |
| 68 | + |
| 69 | +1. _**Missing values (SimpleImputer)**_: Treats the entered values as missing. |
| 70 | +2. _**Fill value (SimpleImputer)**_: Replaces missing values with the entered value. |
| 71 | +3. _**Copy (SimpleImputer)**_: If _**true**_, imputation is performed on a copy of the data, leaving the original data unchanged. |
| 72 | +4. _**Add indicator (SimpleImputer)**_: Adds a new column with 0s and 1s, with a 1 for rows with missing values and a 0 for rows without. |
| 73 | +5. _**K neighbors (SMOTE)**_: Specifies the number of nearest neighbors of each minority-class sample that are used to generate synthetic samples. |
| 74 | +6. _**Sampling strategy (SMOTE)**_: |
| 75 | + 1. _**auto**_: Automatically resamples every class except the majority class to balance out class imbalances. |
| 76 | + 2. _**minority**_: Resamples only the minority class, making it the same size as the majority class. |
| 77 | + 3. _**float**_: Specifies the desired ratio of minority-class to majority-class samples after resampling. For example, setting it to 0.5 makes the minority class dataset half the size of the majority class dataset. |
| 78 | +7. _**Estimator (MakeColumnTransformer)**_: Specifies the model to apply to a group of columns, so that different models can be applied to different columns. The model selected here is applied to the columns selected in _**Columns**_ below. |
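
The imputer and column-transformer options can be sketched as follows. This assumes they map to scikit-learn's `SimpleImputer` and `make_column_transformer`; the arrays and column indices are made up for illustration. SMOTE is left out because it comes from the separate imbalanced-learn package.

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data with one missing entry in each column.
X = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 8.0]])

# Treat NaN as missing, fill with a constant, and append indicator columns
# (1 where the value was missing, 0 otherwise). copy=True is the default,
# so X itself is left untouched.
imputer = SimpleImputer(
    missing_values=np.nan,
    strategy="constant",
    fill_value=0.0,
    add_indicator=True,
)
filled = imputer.fit_transform(X)

# Apply a different transformer to each column, selected by index.
data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
transformer = make_column_transformer(
    (StandardScaler(), [0]),  # standardize column 0
    (MinMaxScaler(), [1]),    # rescale column 1 to [0, 1]
)
transformed = transformer.fit_transform(data)

print(filled)
print(transformed)
```

The first two columns of `filled` hold the imputed data, and the last two are the missing-value indicators for each original column.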
| 79 | + |