Commit c3a0abc

minjk-bl authored and gitbook-bot committed
GITBOOK-69: Data Prep
1 parent 890f3ec commit c3a0abc

6 files changed

+69
-10
lines changed

docs/.gitbook/assets/image (322).png

236 KB

docs/.gitbook/assets/image (323).png

48.9 KB

docs/.gitbook/assets/image (324).png

66.2 KB

docs/.gitbook/assets/image (325).png

63.6 KB

docs/.gitbook/assets/image (326).png

77.6 KB

docs/machine-learning/3.-data-prep.md

Lines changed: 69 additions & 10 deletions
@@ -1,20 +1,79 @@
1-
# 3. Data Prep
2-
3-
1+
---
2+
description: Tools for Preprocessing (Encoding/Scaling)
3+
---
44

5-
<figure><img src="../.gitbook/assets/image (148).png" alt="" width="211"><figcaption></figcaption></figure>
6-
7-
1. Click on Data Prep in the Machine Learning category.
5+
# 3. Data Prep
86

7+
<figure><img src="../.gitbook/assets/image (322).png" alt="" width="529"><figcaption></figcaption></figure>
98

9+
1. Click on **Data Prep** in the **Machine Learning** category.
1010

11-
<figure><img src="../.gitbook/assets/image (149).png" alt="" width="563"><figcaption></figcaption></figure>
11+
<figure><img src="../.gitbook/assets/image (323).png" alt="" width="563"><figcaption></figcaption></figure>
1212

1313
2. _**Model Type**_: Select the preprocessing task to perform:
14-
* Encoding
15-
* Scaling
16-
* ETC
14+
* [**Encoding**](3.-data-prep.md#encoding)
15+
* [**Scaling**](3.-data-prep.md#scaling)
16+
* [**ETC**](3.-data-prep.md#etc-simpleimputer-smote-makecolumntransformer)
1717
3. _**Allocate to**_: Assign a variable name to the model that performs the selected preprocessing task.
1818
4. _**Code View**_: Preview the code that will be output.
1919
5. _**Run**_: Execute the code.
2020

21+
22+
23+
***
24+
25+
## Encoding
26+
27+
<figure><img src="../.gitbook/assets/image (324).png" alt="" width="563"><figcaption></figcaption></figure>
28+
29+
1. _**Sparse (OneHotEncoder)**_: If _**true**_, returns the encoding result as a sparse matrix.
30+
2. _**Handle unknown (OneHotEncoder, OrdinalEncoder)**_: Controls what happens when the data being transformed (for example, the test data) contains a category that was not present in the training data. If _**ignore**_ is selected, the unknown category is encoded as 0; if _**error**_ is selected, a ValueError is raised.
31+
3. _**Unknown values (OrdinalEncoder)**_: Encodes unknown categories with a specified value instead of ignoring them or raising an error.
32+
4. _**Cols (TargetEncoder)**_: Select the columns to encode.
33+
5. _**Handle missing (TargetEncoder)**_: Choose how to handle missing values.
34+
6. _**Smoothing (TargetEncoder)**_: When a category contains only a few samples, its target mean is blended with the overall target mean, weighted by the entered value, to reduce overfitting. Larger values pull small categories more strongly toward the overall mean.
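
The smoothing idea in item 6 can be sketched in plain Python. This is a minimal illustration that assumes a simple additive (m-estimate) blend; the encoder library behind the tool may weight the two means differently, and the function name `smoothed_target_means` is hypothetical.

```python
from collections import defaultdict


def smoothed_target_means(categories, targets, smoothing=1.0):
    """Return {category: target mean blended with the overall mean}.

    Assumed formula (illustrative only):
        (sum_of_targets + smoothing * overall_mean) / (count + smoothing)
    """
    overall_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return {
        cat: (sums[cat] + smoothing * overall_mean) / (counts[cat] + smoothing)
        for cat in counts
    }


cats = ["a", "a", "a", "a", "b"]
ys = [1, 1, 0, 0, 1]
encoding = smoothed_target_means(cats, ys, smoothing=2.0)
# Category "b" has a single sample with target 1, so its raw mean is 1.0.
# With smoothing, its encoded value is pulled toward the overall mean (0.6).
```

With `smoothing=0` the function returns the raw per-category means, which is where rare categories are most prone to overfitting.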
35+
36+
37+
38+
***
39+
40+
## Scaling
41+
42+
<figure><img src="../.gitbook/assets/image (325).png" alt="" width="563"><figcaption></figcaption></figure>
43+
44+
1. _**With mean (StandardScaler)**_: Centers each attribute so that its mean is zero.
45+
2. _**With std (StandardScaler)**_: Scale the standard deviation of the data to 1.
46+
3. _**With centering (RobustScaler)**_: Centers the data by subtracting the median from each attribute (column).
47+
4. _**With scaling (RobustScaler)**_: Scales each attribute by dividing it by its IQR.
48+
5. _**Feature range (MinMaxScaler)**_: Sets the minimum and maximum values for the scaled result.
49+
6. _**Norm (Normalizer)**_: Select the norm used to rescale each sample (row):
50+
1. _**L1**_: Scales each sample (row) so that the sum of the absolute values of its elements is 1.
51+
2. _**L2**_: Scales each sample (row) so that its Euclidean norm (vector length) is 1.
52+
3. _**Max Norm**_: Divides each sample (row) by its largest absolute value, so that no element of the result exceeds 1 in absolute value.
53+
7. _**N bins (KBinsDiscretizer)**_: Determines how many bins to divide the variable into.
54+
8. _**Strategy (KBinsDiscretizer)**_:
55+
1. _**uniform**_: Divides the value range into bins of equal width.
56+
2. _**quantile**_: Divides the values so that each bin contains approximately the same number of samples.
57+
9. _**Encode (KBinsDiscretizer)**_: Specify the encoding method.
58+
1. _**ordinal**_: Encodes each interval as an integer.
59+
2. _**onehot**_: Encodes each interval as a binary vector.
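
To make the arithmetic behind the RobustScaler and Normalizer options concrete, here is a small plain-Python sketch. It is not the library implementation: the helper names `robust_scale` and `normalize_row` are hypothetical, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is an assumption about the interpolation rule.

```python
import math
import statistics


def robust_scale(column):
    """Subtract the median, then divide by the interquartile range (IQR)."""
    median = statistics.median(column)
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = (q3 - q1) or 1.0  # avoid dividing by zero for constant columns
    return [(x - median) / iqr for x in column]


def normalize_row(row, norm="l2"):
    """Rescale one sample (row) so that the chosen norm equals 1."""
    if norm == "l1":
        size = sum(abs(x) for x in row)
    elif norm == "l2":
        size = math.sqrt(sum(x * x for x in row))
    elif norm == "max":
        size = max(abs(x) for x in row)
    else:
        raise ValueError(f"unknown norm: {norm!r}")
    size = size or 1.0  # leave all-zero rows unchanged
    return [x / size for x in row]


scaled = robust_scale([1, 2, 3, 4, 100])  # the outlier 100 does not affect median or IQR
l1_row = normalize_row([3, -4], norm="l1")
l2_row = normalize_row([3, -4], norm="l2")
max_row = normalize_row([3, -4], norm="max")
```

Note the difference in direction: `robust_scale` works on one attribute (column) at a time, while `normalize_row` works on one sample (row) at a time.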
60+
61+
62+
63+
***
64+
65+
## ETC(SimpleImputer / SMOTE / MakeColumnTransformer)
66+
67+
<figure><img src="../.gitbook/assets/image (326).png" alt="" width="563"><figcaption></figcaption></figure>
68+
69+
1. _**Missing values (SimpleImputer)**_: Specifies which value in the data should be treated as missing (for example, NaN).
70+
2. _**Fill value (SimpleImputer)**_: Replaces missing values with the entered value. This applies when the imputation strategy is constant.
71+
3. _**Copy (SimpleImputer)**_: If _**true**_, imputation is performed on a copy of the data, so the original data is left unchanged.
72+
4. _**Add indicator (SimpleImputer)**_: Adds indicator columns of 0s and 1s to the output, with a 1 where a value was missing and a 0 where it was not.
73+
5. _**K neighbors (SMOTE)**_: Specifies the number of nearest neighbors of each minority-class sample that are used to generate synthetic samples.
74+
6. _**Sampling strategy (SMOTE)**_:
75+
1. _**auto**_: Resamples every class except the majority class so that the classes are balanced.
76+
2. _**minority**_: Resamples only the minority class, making it the same size as the majority class.
77+
3. _**float**_: You can specify the desired class ratio. For example, setting it to 0.5 makes the minority class dataset half the size of the majority class dataset.
78+
7. _**Estimator (MakeColumnTransformer)**_: Specify the model to apply to a set of columns; different models can be assigned to different columns. The model selected here is applied to the columns selected in _**Columns**_ below.
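
The SMOTE options above come down to two pieces of arithmetic: how many synthetic samples are needed for a given sampling strategy, and how one synthetic sample is placed between a minority sample and one of its neighbors. The sketch below illustrates both in plain Python. The helper names `n_synthetic` and `interpolate` are hypothetical, and the rounding rule is an assumption rather than the exact library behavior.

```python
import random


def n_synthetic(n_majority, n_minority, ratio=1.0):
    """Number of synthetic minority samples needed to reach
    n_minority / n_majority == ratio (assumed to round down)."""
    target = int(n_majority * ratio)
    return max(target - n_minority, 0)


def interpolate(sample, neighbor, rng):
    """Create one synthetic point on the line segment between a minority
    sample and one of its nearest minority-class neighbors."""
    gap = rng.random()  # random position in [0, 1) along the segment
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]


# 100 majority samples and 20 minority samples:
needed_half = n_synthetic(100, 20, ratio=0.5)  # minority grows to 50
needed_full = n_synthetic(100, 20)             # minority grows to 100

rng = random.Random(0)  # fixed seed so the sketch is reproducible
new_point = interpolate([0.0, 0.0], [2.0, 4.0], rng)
```

A larger **K neighbors** value gives each minority sample more candidate neighbors to interpolate toward, which spreads the synthetic points over a wider region.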
79+
