Unit 5 ML
Chapter 9
Unsupervised Learning
Introduction
• Unsupervised learning is a machine learning approach in which unlabelled and unclassified data is analysed to discover hidden knowledge.
• The algorithms work on the data without any prior training, but they are constructed in such a way that they can identify patterns, groupings, sorting orders, and other interesting knowledge from the data set.
• Unsupervised vs Supervised
• Unsupervised: this analysis may reveal an interesting correlation between the features or a common behaviour within a subgroup of the data, which provides a better understanding of the data.
• Association
• Grouping
• Supervised
• Prediction
• Classification
• Data scientists use unsupervised learning to reduce the dimensionality of sample data in order to simplify modelling.
[Figure: data set for the conference attendees]
CLUSTERING
Text data mining: this includes tasks such as text categorization, text
clustering, document summarization, concept extraction, sentiment
analysis, and entity relation modelling.
Data mining: clustering simplifies the data mining task by grouping a large number of
features from an extremely large data set, making the analysis manageable.
Clustering Techniques
• Partitioning methods
• Hierarchical methods
• Density-based methods
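Before looking at each family in detail, here is a minimal sketch, assuming scikit-learn is available, that runs one representative algorithm from each family on the same synthetic data; the data set and parameter values are illustrative, not from the slides.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Synthetic 2-D data: three Gaussian blobs (illustrative only).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Partitioning method: K-Means with k = 3 clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Hierarchical method: bottom-up (agglomerative) merging.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Density-based method: DBSCAN; the label -1 marks points treated as noise.
dbscan_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

print(set(kmeans_labels), set(agglo_labels), set(dbscan_labels))
```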
Partition Methods - K-Means
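As a compact illustration of the two alternating K-Means steps (assigning points to the nearest centroid, then recomputing centroids as cluster means), here is a minimal NumPy sketch; the kmeans helper and the toy data are illustrative, not taken from the slides.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct random data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [8.5, 9.0], [7.8, 8.2]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```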
Advantages:
• K-means is relatively scalable and efficient in processing large data sets.
Disadvantages:
• K-means is not suitable for discovering clusters with nonconvex shapes or clusters of very different sizes.
• It is sensitive to noise and outlier data points (outliers can strongly influence the mean value).
Partition Methods - K-Medoids
• Minimizes the sensitivity of k-means to outliers.
• Each remaining object is clustered with the representative object (medoid) to which it is the most similar.
• The algorithm minimizes the sum of the dissimilarities between each object and its corresponding representative object:

E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - o_i|

where E is the sum of absolute error for all objects in the data set, p is the point in space representing an object, and o_i is the representative object (medoid) of cluster C_i.
9.4.3 Partitioning methods

Elbow method: a common way to choose the number of clusters k is to plot the within-cluster sum of squared errors against k and pick the point where the curve bends (see the sketch below).
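A minimal sketch of the elbow idea, assuming scikit-learn is available; KMeans exposes the within-cluster sum of squares as inertia_, and the synthetic data is illustrative.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic data with four blobs, so the elbow should appear near k = 4.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squared distances (WSS);
# it always decreases with k, but the drop flattens past the "elbow".
for k in range(1, 9):
    wss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k={k}: WSS={wss:.1f}")
```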
Partitioning Around Medoids (PAM)
Example: Partitioning Around Medoids (PAM)
Iteration 2
Now, we will select another non-medoid point, (7, 4), and make it a temporary medoid for the second cluster. Hence,
• M1 = (3, 4)
• M2 = (7, 4)
Now, let us calculate the Manhattan distance between all the data points and the current medoids. The resulting total cost is 3 + 4 + 4 + 3 + 1 + 1 + 0 + 2 + 2 + 0 = 20.
Iteration 3
Now, let us again change the medoid of cluster 2 to (6, 4). Hence, the new medoids for the clusters are M1 = (3, 4) and M2 = (6, 4).
Let us calculate the Manhattan distance between the data points and the above medoids to find the new clusters. The results are tabulated as follows:

Point | Distance to M1 = (3, 4) | Distance to M2 = (6, 4) | Cluster
(2, 6) | 3 | 6 | 1
(3, 8) | 4 | 7 | 1
(4, 7) | 4 | 5 | 1
(6, 2) | 5 | 2 | 2
(7, 3) | 5 | 2 | 2
(7, 4) | 4 | 1 | 2
(8, 5) | 6 | 3 | 2
(7, 6) | 6 | 3 | 2
Again, the clusters haven't changed. Now, let us calculate the cost for each cluster and find their sum. The total cost this time will be 3 + 4 + 4 + 2 + 0 + 2 + 1 + 3 + 3 + 0 = 22. Since 22 is higher than the previous cost of 20, we will revert the change, and the point (7, 4) will again be made the medoid for cluster 2.
• So, the clusters after this iteration will be cluster 1 = {(2, 6), (3, 8), (4, 7), (3, 4)} and cluster 2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}. The medoids are (3, 4) and (7, 4).
• The set of medoids for which the cost is the least is made permanent, along with the associated clusters. So, after all the iterations, you will get the final clusters and their medoids.
• The K-Medoids clustering algorithm is a computation-intensive algorithm that requires many iterations.
• In each iteration, we need to calculate the distance between the medoids and the data points, assign clusters, and
compute the cost.
• Hence, K-Medoids clustering is not well suited for large data sets.
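The costs quoted in this example can be recomputed directly from the ten points listed above. A small plain-Python sketch, using the Manhattan distance (which matches the arithmetic in the example):

```python
# The ten points from the worked example (union of cluster 1 and cluster 2).
points = [(2, 6), (3, 8), (4, 7), (6, 2), (6, 4),
          (7, 3), (7, 4), (8, 5), (7, 6), (3, 4)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

print(total_cost(points, [(3, 4), (7, 4)]))  # 20 -> kept configuration
print(total_cost(points, [(3, 4), (6, 4)]))  # 22 -> rejected swap
```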
Introduction to hierarchical clustering methods
• Situations arise when the data needs to be partitioned into groups at different levels such as in a hierarchy.
• The hierarchical clustering methods are used to group the data into hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing employees of a university in different departments,
first the employees are grouped under the different departments in the university, and then within each
department, the employees can be grouped according to their roles such as professors, assistant professors,
supervisors, lab assistants, etc. This creates a hierarchical structure of the employee data and eases
visualization and analysis.
• There are two main hierarchical clustering methods: agglomerative clustering and divisive clustering.
• Agglomerative clustering is a bottom-up technique which starts with individual objects as clusters and then
iteratively merges them to form larger clusters.
• On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
A dendrogram is a commonly used tree-structure representation of the step-by-step creation of a hierarchical clustering.
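A minimal sketch of building such a dendrogram, assuming SciPy and Matplotlib are available; the five 2-D points and their labels are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Tiny illustrative data set of five 2-D points.
X = np.array([[1, 2], [2, 2], [8, 8], [8, 9], [1, 1]])

# Agglomerative (bottom-up) merge history; 'single' merges on minimum distance.
Z = linkage(X, method="single")
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.show()
```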
There are four standard methods to measure the distance between clusters: minimum distance, maximum distance, mean distance, and average distance.
Clustering Algorithms
• A distance measure is used to decide when to terminate the clustering algorithm. For example, an agglomerative clustering may stop when the MIN distance between two neighbouring clusters becomes less than the user-defined threshold.
• When an algorithm uses the minimum distance Dmin to measure the distance between the clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmin, then it is called a single-linkage algorithm.
• When an algorithm uses the maximum distance Dmax to measure the distance between the clusters, it is referred to as a furthest-neighbour clustering algorithm, and if the decision to stop the algorithm is based on a user-defined limit on Dmax, then it is called a complete-linkage algorithm.
• As the minimum and maximum measures provide two extreme options for measuring the distance between clusters, they are prone to outliers and noisy data. Instead, using the mean and average distance helps avoid such problems and provides more consistent results (see the sketch below).
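To make the naming concrete, here is a minimal sketch contrasting single (Dmin-based) and complete (Dmax-based) linkage, assuming scikit-learn is available; the half-moon data is illustrative and was chosen because the linkage choice visibly changes the result on non-convex shapes.

```python
from sklearn.datasets import make_moons
from sklearn.cluster import AgglomerativeClustering

# Two interleaving half-moons: non-convex clusters.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# Single linkage merges on the MINIMUM inter-cluster distance (Dmin).
single = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)

# Complete linkage merges on the MAXIMUM inter-cluster distance (Dmax).
complete = AgglomerativeClustering(n_clusters=2, linkage="complete").fit_predict(X)

print("single:", single[:10], "complete:", complete[:10])
```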
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
• Density-based methods – DBSCAN: for clusters of other shapes, such as S-shaped or unevenly shaped clusters, the two types of methods above do not provide accurate results.
• The density-based clustering approach provides a solution to identify clusters of arbitrary shapes.
• The principle is based on identifying the dense areas and sparse areas within the data set and then running the clustering algorithm.
• DBSCAN is one of the popular density-based algorithms; it creates clusters by using connected regions of high density.
• Clusters are dense regions in the data space, separated by regions of lower density.
• The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”.
• The key idea is that, for each point of a cluster, the neighbourhood of a given radius has to contain at least a minimum number of points.
Real-life data may contain irregularities, like:
1. Clusters can be of arbitrary shape.
2. Data may contain noise.
Parameters required for the DBSCAN algorithm:
• Eps: defines the neighbourhood around a data point, i.e., if the distance between two points is less than or equal to eps, they are considered neighbours.
• If the eps value is chosen too small, then a large part of the data will be considered outliers.
• If it is chosen very large, then the clusters will merge and the majority of the data points will be in the same cluster.
• One way to find the eps value is based on the k-distance graph (see the sketch after this list).
• MinPts: as a general rule, the minimum MinPts can be derived from the number of dimensions D in the data set as MinPts >= D + 1. The value of MinPts must be chosen to be at least 3.
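A minimal sketch of the k-distance graph idea, assuming scikit-learn's NearestNeighbors is available; the synthetic data and the choice k = 4 are illustrative.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

k = 4  # illustrative; often tied to the MinPts choice
# +1 because each point is returned as its own nearest neighbour (distance 0).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])  # each point's distance to its k-th neighbour

# Plotting k_dist and reading off the value at the sharp bend ("knee")
# gives a candidate eps.
print(k_dist[:5], "...", k_dist[-5:])
```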
• In this algorithm, we have three types of data points.
• Core point: a point is a core point if it has more than MinPts points within eps.
• Border point: a point which has fewer than MinPts points within eps but is in the neighbourhood of a core point.
• Noise or outlier: a point which is neither a core point nor a border point.
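A minimal sketch, assuming scikit-learn, showing how the three point types can be recovered from a fitted DBSCAN model; the data set and the eps/min_samples values are illustrative.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=200, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points: dense eps-neighbourhood
noise_mask = db.labels_ == -1               # noise/outliers get the label -1
border_mask = ~core_mask & ~noise_mask      # in a cluster but not core

print(f"core={core_mask.sum()}, border={border_mask.sum()}, noise={noise_mask.sum()}")
```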
There are three points in the data set, (2.8, 4.5), (1.2, 2.5), and (1, 2.5), that have four neighbourhood points around them; hence they would be called core points and, as already mentioned, if a core point is not assigned to any cluster, a new cluster is formed.
FINDING PATTERNS USING ASSOCIATION RULES

Application: Market Basket Analysis, which retailers use for cross-selling their products.