Example Dataset
Let’s say we have a small dataset with two features: Age and Income.
Person  Age  Income (in thousands)
A       25   50
B       30   60
C       35   80
D       40   100
Step 1: Min-Max Normalization
Goal: Scale the data to a range of [0, 1].
Formula:
$$X_{\text{normalized}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
Step-by-Step Calculation:
1. Find Min and Max for Each Feature:
o Age: $X_{\min} = 25$, $X_{\max} = 40$
o Income: $X_{\min} = 50$, $X_{\max} = 100$
2. Normalize Age:
o For Person A: $\frac{25 - 25}{40 - 25} = 0$
o For Person B: $\frac{30 - 25}{40 - 25} \approx 0.33$
o For Person C: $\frac{35 - 25}{40 - 25} \approx 0.67$
o For Person D: $\frac{40 - 25}{40 - 25} = 1$
3. Normalize Income:
o For Person A: $\frac{50 - 50}{100 - 50} = 0$
o For Person B: $\frac{60 - 50}{100 - 50} = 0.2$
o For Person C: $\frac{80 - 50}{100 - 50} = 0.6$
o For Person D: $\frac{100 - 50}{100 - 50} = 1$
4. Normalized Dataset:
Person  Age (Normalized)  Income (Normalized)
A       0                 0
B       0.33              0.2
C       0.67              0.6
D       1                 1
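The min-max calculation above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `min_max_scale` and the handling of a constant feature are assumptions made for illustration, not part of the original example.

```python
# Toy dataset from the example: ages and incomes (in thousands) for A-D.
people = ["A", "B", "C", "D"]
ages = [25, 30, 35, 40]
incomes = [50, 60, 80, 100]


def min_max_scale(values):
    """Scale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0 to avoid
        # dividing by zero. The toy dataset never hits this branch.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


ages_scaled = min_max_scale(ages)
incomes_scaled = min_max_scale(incomes)

for person, a, i in zip(people, ages_scaled, incomes_scaled):
    print(f"{person}: age={a:.2f}, income={i:.2f}")
# A: age=0.00, income=0.00
# B: age=0.33, income=0.20
# C: age=0.67, income=0.60
# D: age=1.00, income=1.00
```

Each feature is scaled independently, using its own minimum and maximum.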
Step 2: Z-Score Normalization (Standardization)
Goal: Center the data around 0 with a standard deviation of 1.
Formula:
$$X_{\text{standardized}} = \frac{X - \mu}{\sigma}$$
$\mu$ = mean, $\sigma$ = standard deviation.
Step-by-Step Calculation:
1. Calculate Mean ($\mu$) and Standard Deviation ($\sigma$) for Each Feature (using the sample standard deviation, which divides by $n - 1 = 3$):
o Age:
Mean: $\mu = \frac{25 + 30 + 35 + 40}{4} = 32.5$
Standard Deviation: $\sigma = \sqrt{\frac{125}{3}} \approx 6.45$
o Income:
Mean: $\mu = \frac{50 + 60 + 80 + 100}{4} = 72.5$
Standard Deviation: $\sigma = \sqrt{\frac{1475}{3}} \approx 22.17$
2. Standardize Age:
o For Person A: $\frac{25 - 32.5}{6.45} \approx -1.16$
o For Person B: $\frac{30 - 32.5}{6.45} \approx -0.39$
o For Person C: $\frac{35 - 32.5}{6.45} \approx 0.39$
o For Person D: $\frac{40 - 32.5}{6.45} \approx 1.16$
3. Standardize Income:
o For Person A: $\frac{50 - 72.5}{22.17} \approx -1.01$
o For Person B: $\frac{60 - 72.5}{22.17} \approx -0.56$
o For Person C: $\frac{80 - 72.5}{22.17} \approx 0.34$
o For Person D: $\frac{100 - 72.5}{22.17} \approx 1.24$
4. Standardized Dataset:
Person  Age (Standardized)  Income (Standardized)
A       -1.16               -1.01
B       -0.39               -0.56
C       0.39                0.34
D       1.16                1.24
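The standardization steps can also be checked in code. The sketch below uses Python's `statistics` module and assumes the sample standard deviation (`statistics.stdev`, which divides by n - 1), since that is what the Age value of 6.45 corresponds to. The helper name `standardize` is hypothetical.

```python
import statistics

# Toy dataset from the example: ages and incomes (in thousands) for A-D.
ages = [25, 30, 35, 40]
incomes = [50, 60, 80, 100]


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    # Use statistics.pstdev instead for the population version.
    sigma = statistics.stdev(values)
    return [(x - mu) / sigma for x in values]


ages_std = standardize(ages)
incomes_std = standardize(incomes)

print([round(z, 2) for z in ages_std])
# [-1.16, -0.39, 0.39, 1.16]
print(round(statistics.stdev(incomes), 2), [round(z, 2) for z in incomes_std])
```

Note that the choice between sample and population standard deviation changes the z-scores slightly, so it is worth stating which one is used. The sketch also assumes the feature is not constant, since a standard deviation of 0 would cause a division by zero.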
Step 3: Impact on Distance-Based Algorithms
Why Scaling Matters:
Without Scaling:
o Income spans 50 units (50-100) while Age spans only 15 (25-40), so
differences in Income dominate distance calculations.
o Distance-based algorithms such as KNN and clustering will
therefore be biased toward Income.
With Scaling:
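o Both features are on a comparable scale, so they contribute
comparably to distance calculations.
o This often improves the accuracy of distance-based models, because
no feature dominates purely because of its units.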
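To make the dominance effect concrete, the sketch below compares the Euclidean distance between Person A and Person D before and after min-max scaling, and reports what fraction of the squared distance comes from Income. The helper name `income_share` is an illustrative assumption.

```python
import math


def income_share(p, q):
    """Fraction of the squared Euclidean distance contributed by Income.

    Points are (age, income) pairs.
    """
    age_sq = (p[0] - q[0]) ** 2
    income_sq = (p[1] - q[1]) ** 2
    return income_sq / (age_sq + income_sq)


# Person A and Person D, raw and min-max scaled.
a_raw, d_raw = (25, 50), (40, 100)
a_scaled, d_scaled = (0.0, 0.0), (1.0, 1.0)

print(round(math.dist(a_raw, d_raw), 2))        # 52.2
print(round(income_share(a_raw, d_raw), 2))     # 0.92
print(round(math.dist(a_scaled, d_scaled), 2))  # 1.41
print(round(income_share(a_scaled, d_scaled), 2))  # 0.5
```

On the raw data, Income accounts for about 92% of the squared distance between A and D. After scaling, the two features contribute equally for this pair.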
Example: KNN
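Suppose we want to classify a new person with Age = 28 and
Income = 55.
The new point must be scaled with the same minimum and maximum as
the original data: Age becomes $\frac{28 - 25}{40 - 25} = 0.2$ and
Income becomes $\frac{55 - 50}{100 - 50} = 0.1$.
Distances to Persons A-D are then computed on the normalized values,
so neither Age nor Income outweighs the other because of its units.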
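The neighbour search for this new person might look like the following sketch. The example dataset has no class labels, so the code stops at ranking the existing people by distance, which is the step KNN would perform before voting. The `scale` helper and the constant names are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math

# Raw (age, income in thousands) values from the example dataset.
people = {"A": (25, 50), "B": (30, 60), "C": (35, 80), "D": (40, 100)}
new_person = (28, 55)

# Min and max taken from the existing data, not from the new point.
AGE_MIN, AGE_MAX = 25, 40
INCOME_MIN, INCOME_MAX = 50, 100


def scale(point):
    """Min-max scale an (age, income) pair using the dataset's ranges."""
    age, income = point
    return (
        (age - AGE_MIN) / (AGE_MAX - AGE_MIN),
        (income - INCOME_MIN) / (INCOME_MAX - INCOME_MIN),
    )


scaled_new = scale(new_person)
distances = {
    name: math.dist(scale(point), scaled_new) for name, point in people.items()
}

# Rank existing people from nearest to farthest.
neighbors = sorted(distances, key=distances.get)
for name in neighbors:
    print(name, round(distances[name], 3))
# B 0.167
# A 0.224
# C 0.684
# D 1.204
```

With labelled data, the class of the new person would be decided by a vote among the first k entries of `neighbors`.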
Example: Clustering
Clusters will group people based on patterns across both features,
not just Income.
For example, younger people with lower incomes will form a
distinct cluster.
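As a rough illustration, the sketch below runs a very small k-means on the min-max normalized data. The text does not name a specific clustering algorithm, so k-means, the choice of k = 2, and the use of Persons A and D as initial centroids are all assumptions made for this example.

```python
import math

# Min-max normalized (age, income) values for Persons A-D.
scaled = {"A": (0.0, 0.0), "B": (1 / 3, 0.2), "C": (2 / 3, 0.6), "D": (1.0, 1.0)}


def mean_point(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(data, initial_centroids, max_iter=10):
    """Minimal k-means: assign to the nearest centroid, then update."""
    centroids = list(initial_centroids)
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            name: min(
                range(len(centroids)),
                key=lambda k: math.dist(point, centroids[k]),
            )
            for name, point in data.items()
        }
        if new_assignment == assignment:
            break  # assignments stopped changing, so we have converged
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [data[n] for n, c in assignment.items() if c == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = mean_point(members)
    return assignment, centroids


# Assumption: start from Persons A and D as the two initial centroids.
assignment, centroids = k_means(scaled, [scaled["A"], scaled["D"]])
print(assignment)
# {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```

Under these assumptions, the younger, lower-income Persons A and B end up in one cluster and Persons C and D in the other. A different initialization or number of clusters could give a different grouping.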