Unit-4
• The goal is to infer the natural structure present within a set of data points. Unlike
supervised learning, there are no predefined labels or outcomes associated with the
data.
• In unsupervised learning the objective is to observe only the features X1, X2, …, Xn; we do
not predict any outcome variable, but rather try to find associations between the features,
or groupings of the observations, in order to understand the nature of the data.
Unsupervised Learning: Introduction
• To understand the concept of unsupervised learning, two methods need to be
explained - Clustering and Association Analysis.
1. Clustering Algorithms:
1. K-Means Clustering
2. Hierarchical Clustering
3. Autoencoders
2. Association Analysis Algorithms:
1. Apriori Algorithm
2. Eclat Algorithm
3. Anomaly Detection Algorithms:
• Isolation Forest
• One-Class SVM
APPLICATION OF UNSUPERVISED LEARNING
• There are many domains where unsupervised learning finds its
application.
• Chatbots, self-driving cars, and many other recent innovations are the result of
combining unsupervised and supervised learning.
CLUSTERING
• Clustering is a method for sorting data into groups, called clusters, based
on their characteristics.
• The goal is to group items that are similar to each other within the same
cluster, while ensuring that items in different clusters are as different
from each other as possible.
• For example, when clustering retail customers, one cluster might be people who frequently
buy sports equipment, while another might be people who mostly buy books.
• Data mining involves extracting useful information from large data sets; as part of this,
similar items or features are often grouped together.
• Text data mining: this includes tasks such as text categorization, text clustering,
document summarization, concept extraction, sentiment analysis, and entity
relation modelling.
1. How clustering tasks differ from classification tasks and how clustering
defines groups.
• In the clustering task the data inside a cluster should be very similar to
each other but very different from those outside the cluster.
• Using this principle, whenever a large set of diverse and varied data is
presented for analysis, clustering enables us to represent the data in a
smaller number of groups.
1. Clustering as a machine learning task
• Hence, the effectiveness of clustering task is measured by the
homogeneity within a group as well as the difference between distinct
groups.
• Objective:
• Clustering aims to discover natural groupings or patterns in the data without any
prior knowledge.
• Evaluation:
• Clusters are evaluated by the homogeneity within each group and the separation
between distinct groups, rather than by comparison with known labels.
• Clustering techniques can be broadly divided into:
1. Partitioning methods,
2. Hierarchical methods,
3. Density-based methods.
• The overall approach to creating the clusters is the same; however, the categories
differ in how cluster quality is measured and in where they are applicable.
• The principle of the k-means algorithm is to assign each of the ‘n’ data points to
one of K clusters, where ‘K’ is a user-defined parameter specifying the desired number
of clusters; each cluster is represented by the mean (centroid) of its points.
• Similarly, the k-medoid algorithm identifies the medoid, which is the most
representative point for a group of points.
Partitioning methods
• The objective is to maximize the homogeneity within the clusters and also
to maximize the differences between the clusters.
K-means - A centroid-based technique
• Step 1: Select K points in the data space and mark them as initial centroids
• loop
• Step 2: Assign each point in the data space to the nearest centroid to form K
clusters.
• Step 3: Measure the distance of each point in the cluster from the centroid.
• Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the
clusters.
• Step 5: Identify the new centroid of each cluster as the mean position of the points
assigned to it, and repeat from Step 2 until the centroids no longer change.
• end loop
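As a rough illustration of the loop above, here is a minimal NumPy sketch (the data matrix X, the parameter k, and the convergence test are assumptions made for this example, not part of the original notes):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K points from the data as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: distance of every point to every centroid, assign to the nearest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: Sum of Squared Error (SSE) measures cluster quality
        sse = (dists[np.arange(len(X)), labels] ** 2).sum()
        # Step 5: new centroid of each cluster = mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # stop when centroids no longer move
            break
        centroids = new_centroids
    return labels, centroids, sse

# Example usage with random 2-D data
labels, centroids, sse = k_means(np.random.rand(100, 2), k=3)
```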
• Strengths and Weaknesses of K-means algorithm
• Types of K-Means
• There are different ways to find the optimal number of clusters.
• 1. For a small data set, a rule of thumb that is sometimes followed is K ≈ √(n/2),
where ‘n’ is the number of data points.
• But unfortunately, this rule of thumb does not work well for large data sets.
• 2. Elbow method
• This method uses the WCSS value. WCSS stands for Within-Cluster Sum of Squares,
and it measures the total variation within the clusters.
• The formula to calculate the value of WCSS (for 3 clusters) is given below:
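In its standard form (C1, C2, C3 are the three cluster centroids and Pi denotes a data point belonging to the corresponding cluster):

WCSS = Σ distance(Pi, C1)² + Σ distance(Pi, C2)² + Σ distance(Pi, C3)²

where the first sum runs over the points in Cluster 1, the second over the points in Cluster 2, and the third over the points in Cluster 3.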
• It is the sum of the square of the distances between each data point and
its centroid within a cluster.
• To measure the distance between a data point and a centroid, we can use the
Euclidean distance formula.
• To find the optimal value of K, the elbow method follows these steps:
1. Run K-means clustering on the data set for a range of K values (e.g., K = 1 to 10).
2. For each value of K, calculate the WCSS value.
3. Plot the WCSS values against the number of clusters K.
4. The point where the plot bends sharply, like an elbow, is taken as the optimal value of K.
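A short scikit-learn sketch of these steps (the random sample data, the K range 1–10, and the use of KMeans.inertia_ as the WCSS value are illustrative assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(200, 2)          # illustrative data; replace with the real data set

wcss = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)        # inertia_ is the WCSS for this value of K

plt.plot(ks, wcss, marker="o")      # the sharp bend ("elbow") suggests the optimal K
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```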
Applications of K-means
• Image Compression
• Document Clustering
• Market Segmentation
• Anomaly Detection
K-Medoids: a representative object-based technique
• In the K-means algorithm, it is common practice to run the algorithm multiple times
with different initial cluster centres in order to identify the optimal clusters.
• The necessity to set the initial ‘K’ values is perceived as a disadvantage of the k-
means algorithm.
• The k-means algorithm is sensitive to outliers in the data set and inadvertently
produces skewed clusters.
• Instead of considering the mean of the data points in the cluster, k-medoids considers
k representative data points (medoids), chosen from the existing points in the data set,
as the centres of the clusters.
• It then assigns the data points according to their distance from these centers to
form k clusters.
• K-Medoids Algorithm:
1. Initialization:
• Randomly select k data points from the data set as the initial medoids.
2. Assignment:
• Assign each data point to the nearest medoid based on a chosen distance
metric (e.g., Euclidean distance, Manhattan distance).
3. Update Medoids:
• For each cluster, find the data point within the cluster that minimizes the total distance to
all other points in that cluster. This point becomes the new medoid.
4. Reassign Points:
• Reassign points to the nearest medoid based on the updated medoids.
5. Repeat:
• Repeat steps 3 and 4 until the medoids stabilize (no changes) or a stopping criterion is met
(e.g., maximum iterations).
6. Output:
• The final medoids and their respective clusters.
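A compact Python sketch of these steps, using the Manhattan distance (the swap strategy shown here, which tries every medoid/non-medoid exchange and keeps it only if the total cost drops, is one simple way to realise steps 3–5; function and variable names are illustrative):

```python
import numpy as np
from itertools import product

def manhattan(a, b):
    return np.abs(np.asarray(a) - np.asarray(b)).sum()

def total_cost(X, medoid_idx):
    # Sum of Manhattan distances from every point to its nearest medoid
    return sum(min(manhattan(x, X[m]) for m in medoid_idx) for x in X)

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: randomly pick k existing points as medoids
    medoids = list(rng.choice(len(X), size=k, replace=False))
    cost = total_cost(X, medoids)
    improved = True
    while improved:                                   # 3-5. swap medoids while the cost decreases
        improved = False
        for m_pos, p in product(range(k), range(len(X))):
            if p in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = p
            c = total_cost(X, candidate)
            if c < cost:                              # keep the swap only if it lowers the cost
                medoids, cost, improved = candidate, c, True
    # 2./4. Assignment: each point goes to its nearest medoid
    labels = [min(range(k), key=lambda j: manhattan(x, X[medoids[j]])) for x in X]
    return medoids, labels, cost

# Example usage on the data set from the worked example below
X = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
              (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)])
print(k_medoids(X, k=2))
```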
• Example K-Medoids Clustering Algorithm
1. Consider the following example data set.
2. Apply the K-medoids clustering algorithm to form two clusters.
3. Use the Manhattan distance to find the distance between each data point and the medoids.
Point X Y
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6
• Step 1: Choose two of the data points as initial medoids, C1 = (3, 4) and C2 = (7, 4),
compute the Manhattan distance from every point to each medoid, and assign each point
to the nearer medoid.
Point X Y Dist. to C1 (3,4) Dist. to C2 (7,4) Cluster
X1 2 6 3 7 C1
X2 3 4 0 4 C1
X3 3 8 4 8 C1
X4 4 7 4 6 C1
X5 6 2 5 3 C2
X6 6 4 3 1 C2
X7 7 3 5 1 C2
X8 7 4 4 0 C2
X9 8 5 6 2 C2
X10 7 6 6 2 C2
Each point joins the cluster (C1 or C2) for which the distance to the medoid is smaller.
• Next, a non-medoid point P = (7, 3) is considered as a replacement for medoid C2 = (7, 4),
and the total cost (the sum of Manhattan distances of every point to its medoid) is
compared before and after the swap:
Total Cost (medoids (3,4) and (7,4)) = Cost((3,4),(2,6)) + Cost((3,4),(3,4)) + Cost((3,4),(3,8)) +
Cost((3,4),(4,7)) + Cost((7,4),(6,2)) + Cost((7,4),(6,4)) + Cost((7,4),(7,3)) +
Cost((7,4),(7,4)) + Cost((7,4),(8,5)) + Cost((7,4),(7,6)) = 20
Total Cost (medoids (3,4) and (7,3)) = Cost((3,4),(2,6)) + Cost((3,4),(3,4)) + Cost((3,4),(3,8)) +
Cost((3,4),(4,7)) + Cost((7,3),(6,2)) + Cost((7,3),(6,4)) + Cost((7,3),(7,3)) +
Cost((7,3),(7,4)) + Cost((7,3),(8,5)) + Cost((7,3),(7,6)) = 22
• In the k-medoids clustering algorithm we therefore find the Current Total Cost = 22 and
the Previous Total Cost = 20.
• Since the current total cost is greater than the previous total cost, replacing C2 with P
is not a good idea; the swap is rejected and the final medoids are C1 = (3, 4) and
C2 = (7, 4).
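The two totals above can be checked with a few lines of Python (a quick verification sketch, not part of the original example):

```python
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def total_cost(medoids):
    # Each point contributes its Manhattan distance to the nearer medoid
    return sum(min(abs(x - mx) + abs(y - my) for mx, my in medoids) for x, y in points)

print(total_cost([(3, 4), (7, 4)]))   # previous total cost -> 20
print(total_cost([(3, 4), (7, 3)]))   # cost after swapping C2 with P = (7, 3) -> 22
```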
DBSCAN clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular
clustering algorithm designed to identify clusters of varying shapes and sizes within
a dataset.
• How It Works
1. Parameters:
o Epsilon (ε): The maximum distance between two points for them to be
considered neighbors.
o MinPts: The minimum number of points required to form a dense region (a
cluster).
2. Core Points, Border Points, and Noise:
o Core Point: A point is a core point if it has at least MinPts points within
its ε-neighborhood.
o Border Point: A point is a border point if it has fewer than MinPts points
within its ε-neighborhood but lies within the ε-neighborhood of a core point.
o Noise: Points that are neither core points nor border points are classified as
noise.
Algorithm Steps:
1. Pick an unvisited point and count the points in its ε-neighborhood to determine
whether it is a core point.
2. If the point is a core point, create a new cluster. Expand the cluster by adding
all points that are density-reachable from the core point (directly or indirectly
connected via other core points).
3. If the point is a border point and not part of any cluster, it is assigned to the
nearest cluster.
4. Repeat until every point has been visited; points that belong to no cluster are
labeled as noise.
• Limitation: Choosing appropriate values for ε and MinPts can be challenging and may
require domain knowledge or experimentation.
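A brief scikit-learn sketch of DBSCAN on data with non-spherical clusters (the moon-shaped sample data and the parameter values eps=0.2 and min_samples=5 are illustrative choices, not values from the notes):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: clusters of arbitrary shape, which DBSCAN handles well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = ε, min_samples = MinPts
labels = db.labels_                          # cluster index per point; -1 marks noise

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(f"clusters found: {n_clusters}, noise points: {n_noise}")
```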