MDA Session 4
15-06-2023
[Figure: stages of a cluster analysis — select the clustering procedure, decide on the number of clusters, interpret the profile of the clusters.]
Distance and (dis-)similarity measures between cases
Popular choices of distance / dissimilarity between multivariate observations
Basic principle of clustering: STEP I
Use of Euclidean distance (other choices are possible):

  d_E(x, y) = \left( \sum_{i=1}^{p} (x_i - y_i)^2 \right)^{0.5} = \left( (x - y)^t (x - y) \right)^{0.5}

Mahalanobis distance:

  d_M(x, y) = \left( (x - y)^t S^{-1} (x - y) \right)^{0.5}

City-block (Manhattan) distance:

  d_{CB}(x, y) = \sum_{i=1}^{p} |x_i - y_i|

Use correlation to cluster variables.
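As a quick illustration (not part of the slides), the three distances can be computed directly from their formulas for a pair of observation vectors; the iris data and S (the sample covariance matrix) below are only assumptions chosen to make the sketch self-contained.

x <- as.numeric(iris[1, 1:4])   # first observation (p = 4 numeric variables)
y <- as.numeric(iris[2, 1:4])   # second observation
S <- cov(iris[, 1:4])           # sample covariance matrix

d_E  <- sqrt(sum((x - y)^2))                                 # Euclidean distance
d_M  <- as.numeric(sqrt(t(x - y) %*% solve(S) %*% (x - y)))  # Mahalanobis distance
d_CB <- sum(abs(x - y))                                      # city-block (Manhattan) distance

c(Euclidean = d_E, Mahalanobis = d_M, CityBlock = d_CB)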
dist in R
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist
method = "euclidean" (default), "maximum", "manhattan", "canberra", "binary", or "minkowski"
Mahalanobis distance in R: the stats package provides the mahalanobis() function.
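A minimal usage sketch of dist() and mahalanobis(); the iris columns are just an arbitrary numeric example data set, not one used in the session.

X <- iris[, 1:4]                          # numeric example data
d_euc <- dist(X, method = "euclidean")    # pairwise Euclidean distances (the default)
d_man <- dist(X, method = "manhattan")    # pairwise city-block distances
# mahalanobis() returns SQUARED Mahalanobis distances of each row from a given centre:
d2_mah <- mahalanobis(X, center = colMeans(X), cov = cov(X))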
Clustering methods
There are three major classes of clustering methods, from oldest to newest:
1. Hierarchical methods (agglomerative or divisive; linkage based)
2. Partitioning methods (sequential, parallel, or optimization based)
3. Model-based methods
Agglomerative hierarchical clustering begins with 𝑛 clusters, each containing a single object.
At each step, the two "closest" clusters are merged together.
So, as the steps are iterated, there are 𝑛 clusters, then 𝑛 − 1 clusters, then 𝑛 − 2, ..., and finally 1 cluster.
The R function hclust will perform a variety of hierarchical clustering methods (a short sketch follows below).

The key in a hierarchical clustering algorithm is specifying how to determine the two "closest" clusters at any given step.
For the first step, join the two objects whose (Euclidean, say) distance is smallest.
In subsequent stages: should we join two individual objects together, or merge an object into a cluster that already has multiple objects?
Join on the basis of inter-cluster dissimilarity (distance), measured by:
- one of three linkage methods (single, complete, or average linkage),
- the distance between cluster/group centroids, or
- the within-group sum of squares (Ward's method).
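A minimal sketch of agglomerative clustering with different linkage choices in R; the built-in USArrests data set is only an assumed stand-in example, not one referenced in the slides.

d <- dist(scale(USArrests))                    # Euclidean distances on standardized data
hc_single   <- hclust(d, method = "single")    # single linkage (nearest neighbour)
hc_complete <- hclust(d, method = "complete")  # complete linkage (furthest neighbour)
hc_average  <- hclust(d, method = "average")   # average linkage
hc_centroid <- hclust(d, method = "centroid")  # distance between group centroids
plot(hc_complete)                              # dendrogram of the complete-linkage solution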
• How is it different from 2001?
At each step, Ward's method searches over all possible ways to join a pair of clusters so that the K-means criterion 𝑊𝑆𝑆 (the within-cluster sum of squares) is minimized for that step.
It begins with each object as its own cluster (so that 𝑊𝑆𝑆 = 0) and concludes with all objects in one cluster.
The R function hclust performs Ward's method if the option method = "ward.D" or method = "ward.D2" is specified (older versions of R used method = "ward").

A hierarchical algorithm actually produces not one partition of the data, but many partitions: there is a clustering partition for each step 1, 2, . . . , 𝑛.
The series of merges can be represented at a glance by a tree-like structure called a dendrogram.
To get a single 𝑘-cluster solution, we cut the dendrogram horizontally at a height that produces 𝑘 groups; the R function cutree can do this (see the sketch below).
It is strongly recommended to examine the full dendrogram before deciding where to cut it.
A natural set of clusters may be apparent from a glance at the dendrogram.
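A short sketch of Ward's method followed by cutting the dendrogram; again, USArrests is only an assumed example data set and the choice of k = 4 is arbitrary.

d <- dist(scale(USArrests))                # Euclidean distances on standardized data
hc_ward <- hclust(d, method = "ward.D2")   # Ward's minimum-variance method
plot(hc_ward)                              # inspect the full dendrogram first
groups <- cutree(hc_ward, k = 4)           # cut the tree into k = 4 clusters
table(groups)                              # sizes of the resulting clusters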
Standardization of Observations
If the variables in our data set are of different types or are measured on very
different scales, then some variables may play an inappropriately dominant role
in the clustering process.
In this case, it is recommended to standardize the variables in some way before
clustering the objects.
Possible standardization approaches:
1. Divide each column by its sample standard deviation, so that all variables have standard
deviation 1.
2. Divide each variable by its sample range (max − min); Milligan and Cooper (1988) found
that this approach best preserved the clustering structure.
3. Convert data to 𝑧-scores by (for each variable) subtracting the sample mean and then
dividing by the sample standard deviation – a common option in clustering software
packages.
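A sketch of the three standardization options in R; the data frame X (here again USArrests) is only an illustrative assumption.

X <- USArrests                                # any numeric data frame
X_sd    <- sweep(X, 2, apply(X, 2, sd), "/")  # 1. divide each column by its standard deviation
rng     <- apply(X, 2, function(v) diff(range(v)))
X_range <- sweep(X, 2, rng, "/")              # 2. divide each column by its range (max - min)
X_z     <- scale(X)                           # 3. z-scores (subtract mean, divide by sd)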