Unit-4


Unsupervised Learning

• SYLLABUS: Unsupervised Learning: Introduction, Unsupervised vs Supervised Learning, Application of Unsupervised Learning, Clustering: K-Means, K-Medoid, DBSCAN.
Unsupervised Learning: Introduction
• Unsupervised learning is a machine learning concept where the unlabelled and
unclassified information is analysed to discover hidden knowledge.

• The goal is to infer the natural structure present within a set of data points. Unlike
supervised learning, there are no predefined labels or outcomes associated with the
data.

• Unsupervised learning can be used to discover patterns, groupings, and associations


within the data.

• The objective of unsupervised learning is to observe only the features X1, X2, …, Xn; we are not going to predict any outcome variable, but rather our intention is to find the associations between the features, or their groupings, to understand the nature of the data.
Unsupervised Learning: Introduction
• To understand the concept of unsupervised learning, two methods need to be
explained - Clustering and Association Analysis.

• Clustering is a broad class of methods used for discovering unknown subgroups


in data, which is the most important concept in unsupervised learning.

• Association Analysis is a technique used to capture the key patterns or relationships (also called association rules) that hold between variables or items in the data.
Unsupervised Learning: Introduction
• Types of Unsupervised Learning

1. Clustering: Grouping data points into clusters based on their similarities.

2. Dimensionality Reduction: Reducing the number of features while


preserving the significant information.

3. Anomaly Detection: Identifying rare items or events which differ


significantly from the majority of the data.
4. Association: Discovering interesting relationships between variables in large
databases.
Unsupervised Learning: Introduction
• List of Unsupervised Learning Algorithms

1. Clustering Algorithms:

1. K-Means Clustering

2. Hierarchical Clustering

3. DBSCAN (Density-Based Spatial Clustering of Applications with


Noise)

4. Gaussian Mixture Models (GMM)

5. Mean Shift Clustering


Unsupervised Learning: Introduction
2. Dimensionality Reduction Algorithms:

1. Principal Component Analysis (PCA)

2. t-Distributed Stochastic Neighbour Embedding (t-SNE)

3. Linear Discriminant Analysis (LDA)

4. Independent Component Analysis (ICA)

5. Autoencoders
Unsupervised Learning: Introduction
3. Anomaly Detection Algorithms:

• Isolation forest

• Local Outlier Factor (LOF)

• One-Class SVM

4. Association Rule Learning Algorithms:

1. Apriori Algorithm

2. Eclat Algorithm
APPLICATION OF UNSUPERVISED LEARNING
• There are many domains where unsupervised learning finds its
application.

• Few examples of such applications are as follows:

1. Segmentation of target consumer populations by an advertisement consulting agency on the basis of a few dimensions such as demographics, financial data, purchasing habits, etc.

2. Anomaly or fraud detection in the banking sector by identifying the


pattern of loan defaulters.
APPLICATION OF UNSUPERVISED LEARNING
3. Image processing and image segmentation such as face recognition,
expression identification, etc.

4. Grouping of important characteristics in genes to identify important


influencers in new areas of genetics.

5. Document Clustering: Grouping similar documents for organizing large


collections of text data.

• Chatbots, self-driving cars, and many more recent innovations are the result of combining unsupervised and supervised learning.
CLUSTERING
• Clustering is a method for sorting data into groups, called clusters, based
on their characteristics.

• The goal is to group items that are similar to each other within the same
cluster, while ensuring that items in different clusters are as different
from each other as possible.

• The effectiveness of clustering depends on how similar or related the


objects within a group are or how different or unrelated the objects in
different groups are from each other.
CLUSTERING
• For example, consider a set of customer data.

• Clustering might group customers with similar buying habits together.

• So, one cluster might be people who frequently buy sports equipment,
while another cluster might be people who mostly buy books.

• This helps in understanding distinct customer types and tailoring


marketing strategies accordingly.
Application areas of cluster analysis
• Customer segmentation: creating clusters of customers on the basis of
parameters such as demographics, financial conditions, buying habits, etc.,

• Data mining: extracting useful information from large datasets often involves grouping similar features together.

• Text data mining: this includes tasks such as text categorization, text clustering,
document summarization, concept extraction, sentiment analysis, and entity
relation modelling.

• Anomaly checking: detecting anomalous behaviors such as fraudulent bank transactions, unauthorized computer intrusions, suspicious movements on a radar scanner, etc.
Application areas of cluster analysis
• The methods related to the machine learning task of clustering :

1. How clustering tasks differ from classification tasks and how clustering
defines groups.

2. A classic and easy-to-understand clustering algorithm, namely k-means,


which is used for clustering along with the k-medoids algorithm.
1. Clustering as a machine learning task
• The task of clustering is primarily motivated by discovery rather than prediction.

• So, clustering is defined as an unsupervised machine learning task that


automatically divides the data into clusters or groups of similar items.

• In the clustering task the data inside a cluster should be very similar to
each other but very different from those outside the cluster.

• Using this principle, whenever a large set of diverse and varied data is presented for analysis, clustering enables us to represent the data in a smaller number of groups.
1. Clustering as a machine learning task
• Hence, the effectiveness of clustering task is measured by the
homogeneity within a group as well as the difference between distinct
groups.
1. Clustering as a machine learning task
• How clustering tasks differ from classification tasks ?

1. Labeled vs. Unlabeled Data:

Clustering is an unsupervised learning method. It works with unlabeled


data—there’s no predefined category or label assigned to the data points.

Classification is a supervised learning method. It works with labeled data—


each data point has a known category or label.

2. Objective:
• Clustering aims to discover natural groupings or patterns in the data without any
prior knowledge.

Classification aims to assign predefined labels to data points based on learned


characteristics.
1. Clustering as a machine learning task
3. Output:

• Clustering produces clusters, or groups of similar data points. The clusters


are often labeled arbitrarily (like Cluster 1, Cluster 2, etc.), as there’s no
predefined label.

• Classification produces specific class labels for each data point.

4. Evaluation:

• Clustering is harder to evaluate because there's no ground truth; internal metrics like the silhouette score or intra-cluster distance are used (a short code sketch follows this list).

• Classification can be evaluated with accuracy metrics (e.g., accuracy,


precision, recall, F1-score).
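
Since there are no ground-truth labels, an internal metric such as the silhouette score is computed from the data and the cluster assignments alone. A minimal scikit-learn sketch (the synthetic dataset and the choice of K-means are purely illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative unlabeled data (the generated labels are discarded)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Cluster without any ground truth, then score the result internally
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("Silhouette score:", silhouette_score(X, labels))  # nearer +1 = tighter, better-separated clusters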
Different types of clustering techniques
• The major clustering techniques are :

1. Partitioning methods,

2. Hierarchical methods and,

3. Density-based methods.

• The basic approach towards creating the clusters is the same in all of them; however, the way the quality of the clusters is measured, and their applicability, differ.

• Table below summarizes the main characteristics of each method.


Different types of clustering techniques
Partitioning methods
• Two of the most important algorithms for partitioning-based clustering are k-
means and k-medoid.

• In the k-means algorithm, the prototype of a cluster is its centroid, which is normally the mean of the group of points.

• Similarly, the k-medoid algorithm identifies the medoid which is the most
representative point for a group of points.

• 1. K-means - A centroid-based technique

• The principle of the k-means algorithm is to assign each of the 'n' data points to one of the K clusters, where 'K' is a user-defined parameter specifying the number of clusters desired.
Partitioning methods
• The objective is to maximize the homogeneity within the clusters and also
to maximize the differences between the clusters.

• The homogeneity and differences are measured in terms of the distance


between the objects or points in the data set.

• The steps of the simple K-means algorithm are shown below:

• Step 1: Select K points in the data space and mark them as initial
centroids
K-means - A centroid-based technique
• loop

• Step 2: Assign each point in the data space to the nearest centroid to form K

clusters.

• Step 3: Measure the distance of each point in the cluster from the centroid.

• Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the

clusters.

• Step 5: Identify the new centroid of each cluster on the basis of distance between

points.

• Step 6: Repeat Steps 2 to 5 to refine until centroids do not change.

• end loop
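
A plain NumPy sketch of the loop above (illustrative only; it assumes X is an array of shape (n_samples, n_features) and that every cluster keeps at least one member):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # Step 1: select K points from the data as the initial centroids
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 2-3: measure distances and assign each point to the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: Sum of Squared Error (SSE) of the current clustering
        sse = (dists[np.arange(len(X)), labels] ** 2).sum()
        # Step 5: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: repeat until the centroids stop changing
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids, sse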
K-means - A centroid-based technique
• Strengths and Weaknesses of K-means algorithm
K-means - A centroid-based technique

• Types of K-Means

1. Standard K-Means: The basic form of the algorithm described above.

2. K-Means++: An improved initialization technique that spreads out the


initial centroids, reducing the chances of poor clustering results.

3. Mini-Batch K-Means: A variation that processes small random batches of


the dataset, which improves efficiency and scalability for large datasets.
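
In scikit-learn these three variants map onto existing estimator options; a brief sketch (the choice of K = 4 and the batch size are arbitrary):

from sklearn.cluster import KMeans, MiniBatchKMeans

# 1. Standard K-Means: purely random initial centroids
km_standard = KMeans(n_clusters=4, init="random", n_init=10)

# 2. K-Means++: spread-out initial centroids (scikit-learn's default init)
km_plus = KMeans(n_clusters=4, init="k-means++", n_init=10)

# 3. Mini-Batch K-Means: fits on small random batches for large datasets
km_minibatch = MiniBatchKMeans(n_clusters=4, batch_size=256)

# Each estimator is then fitted the same way, e.g. km_plus.fit(X)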
K-means - A centroid-based technique
• Example
• Consider the set of data points shown in the figure.
• Apply the k-means algorithm to find the clusters generated from the given data set.
• Let us fix K = 4, implying that we want to create four clusters out of this data set.
• As the first step, we assign four random points from the data set as the centroids, represented by the * signs.
K-means - A centroid-based technique
• Assign the data points to the nearest centroid to create four clusters.
• In the second step, on the basis of the distance of the points from the corresponding centroids, the centroids are updated and the points are reassigned to the updated centroids.
• After three iterations, the centroids stop moving as there is no further scope for refinement, and thus the k-means algorithm terminates.
K-means - A centroid-based technique
• How to select the correct number of clusters?
• The performance of the K-means clustering algorithm depends on the quality of the clusters it forms.
• But choosing the optimal number of clusters is a big task.
• There are different ways to find the optimal number of clusters:
• 1. For a small data set, a rule of thumb that is sometimes followed is:
• K is set as the square root of n/2 for a data set of n examples (for example, n = 200 gives K ≈ √100 = 10).
• But unfortunately, this thumb rule does not work well for large data sets.
K-means - A centroid-based technique
• 2. Elbow method
• This method measures the homogeneity (or heterogeneity) within the clusters for various values of 'K' and helps in arriving at the optimal 'K'.
• This method uses the concept of the WCSS value. WCSS stands for Within Cluster Sum of Squares, which defines the total variation within a cluster.
• The formula to calculate the value of WCSS (for 3 clusters) is given below:
• WCSS = ∑Pi in Cluster1 dist(Pi, C1)² + ∑Pi in Cluster2 dist(Pi, C2)² + ∑Pi in Cluster3 dist(Pi, C3)²
K-means - A centroid-based technique
• It is the sum of the squares of the distances between each data point and its centroid within a cluster.
• To measure the distance between data points and the centroid, we can use the Euclidean distance formula.
• To find the optimal number of clusters, the elbow method follows the steps below:
• Execute K-means clustering on the given dataset for different K values (typically ranging from 1 to 10).
• For each value of K, calculate the WCSS value.


K-means - A centroid-based technique
• Plot a curve between the calculated WCSS values and the number of clusters K.
• The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
• Because the graph shows a sharp bend that looks like an elbow, this is known as the elbow method.
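
A short scikit-learn sketch of the elbow method, where the fitted model's inertia_ attribute gives the WCSS described above (a feature matrix X is assumed to be available):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

k_values = range(1, 11)                # try K from 1 to 10
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)        # inertia_ is the within-cluster sum of squares

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.title("Elbow method")              # the bend ('elbow') in the curve suggests the best K
plt.show()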
K-means - A centroid-based technique
• Example:
• Suppose we have two variables, M1 and M2. The x-y scatter plot of these two variables is given in the figure.
• The number of clusters is set to K = 2, i.e., the data points are to be grouped into two different clusters as shown in the figure.
K-means - A centroid-based technique
Real-Time Applications of K-means algorithm
• Customer Segmentation

• Image Compression

• Document Clustering

• Market Segmentation

• Anomaly Detection
K-Medoids: a representative object-based technique
• In the k-means algorithm, it is a common practice to run the algorithm multiple times with different initial cluster centers to identify the optimal clusters.
• The necessity to set the initial 'K' values is perceived as a disadvantage of the k-means algorithm.
• The k-means algorithm is also sensitive to outliers in the data set and may inadvertently produce skewed clusters.
• k-medoids provides a solution to the above problems.
• Instead of considering the mean of the data points in the cluster, k-medoids considers k representative data points from the existing points in the data set as the centers of the clusters.
K-Medoids: a representative object-based technique
• It then assigns the data points according to their distance from these centers to
form k clusters.

• A medoid is defined as the object of a cluster, whose average dissimilarity to all


the objects in the cluster is minimal.

• K-Medoid Algorithm:

1. Initialization:

• Choose k initial medoids randomly from the dataset.

2. Assign Points to Medoids:

• Assign each data point to the nearest medoid based on a chosen distance
metric (e.g., Euclidean distance, Manhattan distance).
K-Medoids: a representative object-based technique
3. Update Medoids:
• For each cluster, find the data point within the cluster that minimizes the total distance to
all other points in that cluster. This point becomes the new medoid.

4. Reassign Points:
• Reassign points to the nearest medoid based on the updated medoids.

5. Repeat:
• Repeat steps 3 and 4 until the medoids stabilize (no changes) or a stopping criterion is met
(e.g., maximum iterations).

6. Output:
• The final medoids and their respective clusters.
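
A plain NumPy sketch of these steps using Manhattan distance (illustrative only, not an optimized PAM implementation; it assumes no cluster ever becomes empty):

import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: choose k medoids randomly from the data points
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # 2./4. Assign every point to its nearest medoid (Manhattan distance)
        dists = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: in each cluster, pick the member with the smallest total
        #    distance to all other members as the new medoid
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            intra = np.abs(X[members][:, None, :] - X[members][None, :, :]).sum(axis=2)
            new_idx[j] = members[intra.sum(axis=1).argmin()]
        # 5. Repeat until the medoids stabilize
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
            break
        medoid_idx = new_idx
    # 6. Output: the final medoids and the cluster label of every point
    return X[medoid_idx], labels

Third-party implementations also exist (for example, a KMedoids estimator in the scikit-learn-extra package), but the sketch above keeps the individual steps visible.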
K-Medoids: a representative object-based technique
• Example: K-Medoids Clustering Algorithm
1. Take the example dataset below.
2. Apply the K-Medoids clustering algorithm to form two clusters.
3. Use Manhattan distance to find the distance between each data point and the medoids.

I X Y
X1 2 6
X2 3 4
X3 3 8
X4 4 7
X5 6 2
X6 6 4
X7 7 3
X8 7 4
X9 8 5
X10 7 6
Example K-Medoids Clustering Algorithm
• STEP 1:
1. Select two medoids randomly.
2. C1 = (3, 4), C2 = (7, 4)
3. (x1, y1) and (x2, y2) are data points.
4. Manhattan Distance formula = |x1-x2| + |y1-y2|
5. Manhattan Distance C1 [(2,6), (3,4)] = |2-3| + |6-4| = 3
6. Manhattan Distance C2 [(2,6), (7,4)] = |2-7| + |6-4| = 7
7. Since C1 < C2, the data point (2, 6) is assigned to cluster C1.
8. Similarly, we compute the distances for all points up to X10.


Example K-Medoids Clustering Algorithm
I X Y C1 C2 CLUSTER

X1 2 6 3 7 C1
X2 3 4 0 4 C1
X3 3 8 4 8 C1
X4 4 7 4 6 C1
X5 6 2 5 3 C2
X6 6 4 3 1 C2
X7 7 3 5 1 C2
X8 7 4 4 0 C2
X9 8 5 6 2 C2
X10 7 6 6 2 C2

The distances above are calculated with respect to medoid C1 = (3, 4) and medoid C2 = (7, 4); each point is assigned to the cluster whose medoid gives the minimum distance.
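
The table above can be reproduced with a few lines of NumPy; the points and initial medoids are taken directly from the example:

import numpy as np

points = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
                   [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])
c1, c2 = np.array([3, 4]), np.array([7, 4])       # initial medoids

d1 = np.abs(points - c1).sum(axis=1)              # Manhattan distance to C1
d2 = np.abs(points - c2).sum(axis=1)              # Manhattan distance to C2
cluster = np.where(d1 <= d2, "C1", "C2")          # assign to the nearer medoid

for i, (d_c1, d_c2, c) in enumerate(zip(d1, d2, cluster), start=1):
    print(f"X{i}: C1={d_c1}  C2={d_c2}  -> {c}")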
Example K-Medoids Clustering Algorithm
Step 9. The clusters are
a. C1 = {(2,6), (3,4), (3,8), (4,7)}
b. C2 = {(6,2), (6,4), (7,3), (7,4), (8,5), (7,6)}
c. Calculate the Cost (C, X) = ∑𝒊 |𝒄𝒊 − 𝒙𝒊|
d. Total Cost = Cost((3,4), (2,6)) + Cost((3,4), (3,4)) + Cost((3,4), (3,8)) + Cost((3,4), (4,7)) + Cost((7,4), (6,2)) + Cost((7,4), (6,4)) + Cost((7,4), (7,3)) + Cost((7,4), (7,4)) + Cost((7,4), (8,5)) + Cost((7,4), (7,6))
Total cost = 3 + 0 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 = 20


Example K-Medoids Clustering Algorithm
• Step 10
1. Select new medoids: keep C1 and randomly pick a non-medoid point P to replace C2.
2. New medoids are C1 = (3, 4), P = (7, 3)
a. Here the C1 values stay the same; only the second medoid changes.
3. Previous medoids were C1 = (3, 4), C2 = (7, 4)
4. (x1, y1) and (x2, y2) are data points
5. Manhattan Distance formula = |x1-x2| + |y1-y2|
a. For example, for (2,6) and P = (7,3) the values are x1=2, y1=6, x2=7, y2=3; the same substitution is made for every point.
6. Manhattan Distance [(2,6), (7,3)] = |2-7| + |6-3| = 8
a. Manhattan Distance [(3,4), (7,3)] = |3-7| + |4-3| = 5
b. Manhattan Distance [(3,8), (7,3)] = |3-7| + |8-3| = 9
c. Similarly, we compute the distances for all points up to X10.
Example K-Medoids Clustering Algorithm
I X Y C1 P CLUSTER
X1 2 6 3 8 C1
X2 3 4 0 5 C1
X3 3 8 4 9 C1
X4 4 7 4 7 C1
X5 6 2 5 2 P
X6 6 4 3 2 P
X7 7 3 5 0 P
X8 7 4 4 1 P
X9 8 5 6 3 P
X10 7 6 6 3 P
Example K-Medoids Clustering Algorithm
Step 11. The new clusters are
a. C1 = {(2,6), (3,4), (3,8), (4,7)}
b. P = {(6,2), (6,4), (7,3), (7,4), (8,5), (7,6)}
c. Calculate the Cost (C, X) = ∑𝒊 |𝒄𝒊 − 𝒙𝒊|
d. Total Cost = Cost((3,4), (2,6)) + Cost((3,4), (3,4)) + Cost((3,4), (3,8)) + Cost((3,4), (4,7)) + Cost((7,3), (6,2)) + Cost((7,3), (6,4)) + Cost((7,3), (7,3)) + Cost((7,3), (7,4)) + Cost((7,3), (8,5)) + Cost((7,3), (7,6))
Current Total Cost = 3 + 0 + 4 + 4 + 2 + 2 + 0 + 1 + 3 + 3 = 22


Example K-Medoids Clustering Algorithm
• STEP 12:
1. Cost of swapping medoid C2 with P:
2. S = Current Total Cost - Previous Total Cost = 22 - 20 = 2 > 0
• In the K-Medoids clustering algorithm, we find the current total cost = 22 and the previous total cost = 20.
• Since the current total cost is greater than the previous total cost (S > 0), replacing C2 with P is not a good idea, so the final medoids remain C1 = (3, 4) and C2 = (7, 4).
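
The two total costs and the swap decision can be verified with a short NumPy snippet using the same data and medoid pairs:

import numpy as np

points = np.array([[2, 6], [3, 4], [3, 8], [4, 7], [6, 2],
                   [6, 4], [7, 3], [7, 4], [8, 5], [7, 6]])

def total_cost(medoids):
    # Sum of each point's Manhattan distance to its nearest medoid
    dists = np.abs(points[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return dists.min(axis=1).sum()

previous = total_cost([(3, 4), (7, 4)])   # original medoids C1, C2 -> 20
current = total_cost([(3, 4), (7, 3)])    # after swapping C2 with P -> 22
print(previous, current, "S =", current - previous)   # swap cost S = 2 > 0, so keep C2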
DBSCAN clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a popular
clustering algorithm designed to identify clusters of varying shapes and sizes within
a dataset.

• It relies on density criteria to form clusters, distinguishing itself by effectively


handling noise (outliers).

• How It Works

1. Parameters:

o Epsilon (ε): The maximum distance between two points for them to be
considered neighbors.
o MinPts: The minimum number of points required to form a dense region (a cluster).
DBSCAN clustering
2. Core Points, Border Points, and Noise:

o Core Point: A point is a core point if it has at least MinPts points within its ε-neighborhood.
o Border Point: A point is a border point if it has fewer than MinPts points within its ε-neighborhood but is within the ε-neighborhood of a core point.
o Noise: Points that are neither core points nor border points are classified as noise.
DBSCAN clustering
Algorithm Steps:

1. Randomly select a point.

2. If the point is a core point, create a new cluster. Expand the cluster by adding
all points that are density-reachable from the core point (directly or indirectly
connected via other core points).

3. If the point is a border point and not part of any cluster, it is assigned to the
nearest cluster.

4. Continue until all points are processed.
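
A minimal scikit-learn sketch of DBSCAN, where eps and min_samples correspond to ε and MinPts (the two-moons data and the parameter values are illustrative):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters that a centroid-based method would struggle with
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = epsilon, min_samples = MinPts

labels = db.labels_                          # cluster index per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int(np.sum(labels == -1)))
print("Core points:", len(db.core_sample_indices_))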


DBSCAN clustering
• Advantages

1. Ability to Find Arbitrarily Shaped Clusters: Unlike k-means, DBSCAN can


find clusters of arbitrary shapes and sizes, making it suitable for more complex
datasets.

2. No Need to Specify Number of Clusters: Unlike k-means, DBSCAN does not


require the number of clusters to be specified beforehand.

3. Robust to Noise: Effectively identifies outliers as noise points, improving the


quality of clustering results.
4. Minimal Parameters: Only requires two parameters (ε and MinPts),
simplifying the model tuning process.
DBSCAN clustering
• Limitations

1. Difficulty in Parameter Selection:

o Choosing appropriate values for ε and MinPts can be challenging and may
require domain knowledge or experimentation.

2. Inefficiency with Varying Density:

• Struggles with datasets containing clusters of varying densities, as the same


ε and MinPts values may not work well for all clusters.
DBSCAN clustering
3. Computational Complexity:

o DBSCAN has a time complexity of O(n log n) with spatial indexing structures, but without such optimizations it can be O(n²), making it less efficient for very large datasets.
4. High-Dimensional Data:

o The performance of DBSCAN degrades with high-dimensional


data because distance metrics become less meaningful in high-
dimensional spaces, leading to poor clustering results.
DBSCAN clustering
• Real-Time Applications

1. Geospatial Data Analysis:

o Application: Identifying geographical clusters, such as hotspots of disease


outbreaks or crime areas.
o Example: Analyzing spatial distribution of earthquakes to detect regions
with high seismic activity.
2. Market Segmentation:

o Application: Clustering customers based on purchasing behavior to identify


distinct market segments.

• Example: Segmenting customers of an e-commerce platform based on their


browsing and purchasing patterns.
DBSCAN clustering
3. Anomaly Detection:

o Application: Detecting anomalies in data, such as fraud detection in


financial transactions or identifying network intrusions.
o Example: Identifying unusual patterns in network traffic that may
indicate security breaches.
4. Image Processing:

o Application: Clustering similar images or regions within images for tasks


like image segmentation.
o Example: Grouping pixels in satellite images to identify different land cover
types.
