K-Nearest Neighbors (KNN)
1. Introduction to K-Nearest Neighbors (KNN)
K-Nearest Neighbors (KNN) is a supervised learning algorithm used for both classification
and regression tasks. It is an instance-based learning algorithm, meaning it does not
explicitly learn a model but rather memorizes the training dataset and makes predictions
based on similarity. The key idea behind KNN is that similar data points tend to belong to
the same class or have similar output values.
2. How KNN Works
KNN does not involve an explicit training phase. Instead, it simply stores the feature vectors
and corresponding labels from the training dataset. For a given test point X, KNN follows
these steps:
1. Compute the distance between X and all points in the training dataset.
2. Select the K nearest neighbors based on the computed distances.
3. Assign a class label by majority vote (for classification) or compute the average of
the neighbors' target values (for regression), as in the sketch below.
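A minimal NumPy sketch of these three steps for classification, using Euclidean distance; the function name knn_predict and the toy data are illustrative, not taken from any particular library:

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 1: distances from x_test to every training point (Euclidean).
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # Step 2: indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote over the neighbors' labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage:
X_train = np.array([[1.0, 2.0], [2.0, 3.0], [8.0, 8.0], [9.0, 10.0]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.5, 2.5]), k=3))  # -> 0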
3. Distance Metrics in KNN
KNN relies on distance metrics to find the closest neighbors. Some commonly used distance
measures include:
3.1. Euclidean Distance
The most widely used distance metric in KNN is Euclidean distance, which measures the
straight-line distance between two points in an n-dimensional space. Given two points X =
(x₁, x₂, ..., xₙ) and Y = (y₁, y₂, ..., yₙ), the Euclidean distance is defined as:
d(X, Y) = sqrt(Σ (x_i - y_i)^2)
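A direct NumPy translation of the formula (the function name euclidean is illustrative):

import numpy as np

def euclidean(x, y):
    # Square root of the summed squared coordinate differences.
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0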
3.2. Manhattan Distance
Manhattan distance computes the sum of absolute differences between corresponding
coordinates:
d(X, Y) = Σ |x_i - y_i|
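The corresponding one-liner (the function name manhattan is illustrative):

import numpy as np

def manhattan(x, y):
    # Sum of absolute coordinate differences ("city-block" distance).
    return np.sum(np.abs(x - y))

print(manhattan(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 7.0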
3.3. Minkowski Distance
Minkowski distance generalizes Euclidean and Manhattan distances. It is given by:
d(X, Y) = (Σ |x_i - y_i|^p)^(1/p)
Setting p = 1 recovers Manhattan distance, and p = 2 recovers Euclidean distance.
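A sketch that makes the generalization concrete, checked against the two special cases above (the function name minkowski is illustrative):

import numpy as np

def minkowski(x, y, p=2):
    # p = 1 gives Manhattan distance; p = 2 gives Euclidean distance.
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

x, y = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(x, y, p=1))  # 7.0, matches Manhattan
print(minkowski(x, y, p=2))  # 5.0, matches Euclidean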
4. Choosing the Right K Value
The choice of K (the number of neighbors) significantly affects KNN's performance. A small K
makes predictions sensitive to noise and may lead to overfitting, while a large K smooths the
decision boundary but may ignore local patterns and underfit. In practice, K is often chosen
by comparing candidate values with cross-validation, as sketched below; an odd K also avoids
voting ties in binary classification.
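One common selection procedure, sketched here with scikit-learn; the Iris dataset and the candidate range 1 to 15 are illustrative choices, not prescribed by the text:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score candidate K values with 5-fold cross-validation and keep
# the one with the best mean accuracy.
scores = {}
for k in range(1, 16, 2):  # odd values avoid ties in binary problems
    scores[k] = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])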
5. KNN for Classification
In classification, KNN assigns a test point to the majority class among its K nearest
neighbors.
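A minimal classification example, assuming scikit-learn is available; the dataset and the split parameters are illustrative:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Predict by majority vote among the 5 nearest training points.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # mean accuracy on the held-out split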
6. KNN for Regression
In regression, KNN predicts the output as the average of the target values of the K nearest
neighbors. Weighted KNN instead gives closer neighbors higher weights, typically in inverse
proportion to their distance, so nearer points influence the prediction more.
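A sketch of weighted KNN regression with scikit-learn's KNeighborsRegressor, where weights="distance" implements the inverse-distance weighting described above; the noisy sine data is made up for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data: y = sin(x) plus noise.
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 40)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(40)

# weights="distance" weights each neighbor by 1/distance, so closer
# neighbors influence the prediction more than the plain average
# (the default, weights="uniform").
reg = KNeighborsRegressor(n_neighbors=5, weights="distance")
reg.fit(X, y)
print(reg.predict([[2.5]]))  # should land near sin(2.5) ~ 0.60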
7. Advantages and Disadvantages of KNN
Advantages:
- Simple and intuitive
- No explicit training phase (a "lazy" learner); training reduces to storing the data
- Works for both classification and regression
Disadvantages:
- Computationally expensive at prediction time, since distances to all training points must
be computed
- Sensitive to feature scaling: features with large numeric ranges dominate the distance
unless the data is normalized (see the sketch after this list)
- Sensitive to noisy data and irrelevant features
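To illustrate the sensitivity to feature scaling, a sketch comparing KNN with and without standardization, assuming scikit-learn; the Wine dataset is an illustrative choice:

from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# Without scaling, features with large numeric ranges dominate the distance.
raw = KNeighborsClassifier(n_neighbors=5)
# Standardizing first lets every feature contribute comparably.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

print(cross_val_score(raw, X, y, cv=5).mean())     # typically lower
print(cross_val_score(scaled, X, y, cv=5).mean())  # typically higher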
8. Applications of KNN
KNN is widely used in:
- Image recognition
- Recommendation systems
- Medical diagnosis
- Anomaly detection