MDA Session 4


15-06-2023

Review: Project formulation and proposal


Identify the goal by defining the (business) perspective:
What are the research questions that you are hoping to answer? Who are the target respondents? How do you envisage and plan to use survey responses to arrive at these answers?

The questionnaire (ideally one, keeping the time frame in mind – but in exceptional cases, subsequent surveys can be taken) should be designed keeping the following additional aspects in mind:
• Univariate analysis (based on analyzing questions in isolation) should yield useful results (techniques already known to you) relevant to the project goal
• Multivariate analysis – of four types – can be carried out meaningfully based on the survey data, addressing the stated business problem and helping achieve the goal
• At least 15-20 items
• A 'sufficiently large' number of responses (how do you plan to conduct the survey / reach the target?)

Proposal format (standard guideline): title page + max 3 pages (be as precise and contextual as possible) including references + appendix containing the questionnaire

Cluster Analysis

Objective: to separate individual observations/items/objects into groups, or clusters, on the basis of the values of the q variables measured on each object.
The objects within each cluster should be similar, and objects in different clusters should be dissimilar.
The dissimilarity between any two objects is typically quantified using a distance measure (like Euclidean distance).
Cluster analysis is a type of unsupervised classification, as the nature of the groups (and, typically, the number of groups) is unknown before the objects are classified into clusters.


Applications of Cluster Analysis

In marketing: researchers attempt to find distinct clusters of the consumer population so that several distinct marketing strategies can be used for the clusters.
In ecology: scientists classify plants or animals into various groups based on some measurable characteristics.
In genetics: researchers may separate genes into several classes based on their expression ratios measured at different time points.

The Need for Clustering Algorithms

Suppose n objects are to be segregated into a small number (say, k) of clusters, where k < n.
If the number of clusters k is decided ahead of time, then the number of ways to partition the n objects into k clusters is a "Stirling number of the second kind."
Example: there are 46,771,289,738,810 ways to form k = 4 (non-empty) clusters from n = 25 objects.

> library(copula)
> options(scipen = 999)
> Stirling2(25, 4)
[1] 46771289738810


The Need for Clustering Algorithms (Continued)

If the number of clusters (k) is flexible, the possible number of partitions is much higher.
Example: there are over 4,638,590 trillion possible partitions of n = 25 objects if we let the number of clusters k vary. (This is called the Bell number for n.)
It is challenging even for a modern-day computer to explore ALL the possible clustering partitions to see which is best.
Hence the need for intelligently designed algorithms that will search among the best possible partitions relatively quickly.

> library(numbers)
> options(scipen = 999)
> bell(25)
[1] 4638590332229998592
> x <- seq(10, 50, 10)
> sapply(x, bell)   # Bell numbers for n = 10, 20, ..., 50

[Figure: thirteen objects (1-13) grouped into example partitions such as {1,3,6,7,9}, {11,13}, {2,5,8}, {12}, {4}, {10}, and coarser partitions such as {1,3,6,7,9,11,13}, {2,5,8,12}, {4,10}]



Which attribute to use for clustering?

• Color combination & design
• Size
• Location
• All of the above

What is their relative importance?

Lessons to be learnt from the clustering game

Differences (distinctions) are almost always unavoidable, both within clusters and between clusters, but the between-cluster variation should be dominant.
The notion of difference/similarity/distance may vary from one analyst to another, possibly depending on what factors are perceived to be more important.
The choice of the NUMBER of clusters can be tricky / controversial.


Clustering based on a single attribute is relatively easy (quiz!)

[Figure: dot plot of standardized cost (Zcost), axis from -2.00 to 3.00]

Steps in a cluster analysis:
Formulate the problem
Define/select the variables to be used for clustering
Select a distance measure
Select a clustering procedure
Decide on the number of clusters
Interpret the profile of the clusters

Distance and (dis-)similarity measures between cases (STEP I)

Basic principles of clustering:
• Use of Euclidean distance (other choices are possible)
• Standardize observations to give equal weights to all variables (not always appropriate)
• Similarity and distance for presence/absence type attributes:
  – define similarity s = (# of attributes that match) / (# of attributes)
  – s = 1/(1 + d)
• Differential weights for 1-1 and 0-0 matches (presence/absence) can be accommodated
• Use correlation to cluster variables

Popular choices of distance / dissimilarity between multivariate observations

Euclidean distance:
d_E(x, y) = [ Σ_{i=1}^p (x_i − y_i)² ]^{1/2} = [ (x − y)ᵗ (x − y) ]^{1/2}

City-block distance (Manhattan distance):
d_CB(x, y) = Σ_{i=1}^p |x_i − y_i|

(Squared) Mahalanobis distance:
d_M(x, y) = (x − y)ᵗ S⁻¹ (x − y), where S is the sample covariance matrix

Other choices: maximum norm, Minkowski distance, etc.
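As a quick sketch (not from the slides; the function name and data are made up for illustration), the matching-based similarity for presence/absence attributes, and the distance implied by s = 1/(1 + d), can be computed as:

# Simple matching similarity for presence/absence attributes:
# s = (# of matching attributes) / (# of attributes), and d = 1/s - 1.
simple_match <- function(x, y) mean(x == y)

a <- c(1, 0, 1, 1, 0)     # presence/absence profile of object 1
b <- c(1, 1, 1, 0, 0)     # presence/absence profile of object 2
s <- simple_match(a, b)   # 3 matches out of 5, so s = 0.6
d <- 1 / s - 1            # implied distance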



dist in R
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist
The method argument selects the distance measure (e.g., "euclidean", "manhattan", "minkowski").

Mahalanobis distance in R:
mahalanobis(x, center, cov)
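A small usage sketch (illustrative data, not from the slides): dist() computes pairwise distances between rows, while mahalanobis() returns squared Mahalanobis distances, here measured from the sample mean:

X <- matrix(rnorm(20), nrow = 5)   # 5 objects measured on 4 variables

dist(X)                            # Euclidean distances (the default method)
dist(X, method = "manhattan")      # city-block distances

# squared Mahalanobis distance of each row from the sample mean vector
mahalanobis(X, center = colMeans(X), cov = cov(X))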


Types of Clustering Algorithms

There are three major classes of clustering methods – from oldest to newest:
• Hierarchical methods
• Partitioning methods
• Model-based methods

Hierarchical methods cluster the data in a series of n steps, typically joining observations together step by step to form clusters.
Partitioning methods first determine k, and then typically attempt to find the partition into k clusters that optimizes some objective function of the data.
Model-based clustering takes a statistical approach, formulating a model that categorizes the data into subpopulations and using maximum likelihood to estimate the model parameters.

Clustering methods (taxonomy):
• Non-hierarchical: K-means; sequential, parallel and optimizing partitioning
• Hierarchical: agglomerative and divisive
  – Linkage based: single linkage (min), complete linkage (max), average linkage
  – Variance based: Ward's method (SS within clusters is minimized)
  – Centroid based: distance between centroids
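As a quick sketch of a partitioning method (an illustrative example on R's built-in iris data, not from the slides), kmeans() finds a k-cluster partition once k is fixed in advance:

set.seed(1)                                # k-means uses random starting centers
Z  <- scale(iris[, 1:4])                   # standardize the four measurements
km <- kmeans(Z, centers = 3, nstart = 25)  # k = 3 chosen ahead of time
table(km$cluster, iris$Species)            # compare clusters with known species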


Hierarchical Clustering

Agglomerative hierarchical clustering begins with n clusters, each containing a single object.
At each step, the two "closest" clusters are merged together.
So, as the steps are iterated, there are n clusters, then n − 1 clusters, then n − 2, ..., then 1 cluster.
The R function hclust will perform a variety of hierarchical clustering methods.

Defining Closeness of Clusters

The key in a hierarchical clustering algorithm is specifying how to determine the two "closest" clusters at any given step.
For the first step, join the two objects whose (Euclidean?) distance is smallest.
In subsequent stages: should we join two individual objects together, or merge an object into a cluster that already has multiple objects?
Join on the basis of inter-cluster dissimilarity (distance):
• one of three linkage methods
• distance between cluster/group centroids
• within-group sum of squares (Ward's method)
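A minimal hclust() sketch (illustrative, using R's built-in USArrests data, not from the slides): compute a distance matrix, cluster agglomeratively, and inspect the merge sequence:

X  <- scale(USArrests)                # standardize the variables first
d  <- dist(X)                         # Euclidean distances between the 50 states
hc <- hclust(d, method = "complete")  # also: "single", "average", "centroid"
plot(hc)                              # dendrogram of the merge sequence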



• Cluster the Canadian provinces in terms of density of health-care workers.
• How is it different from 2001?

[Figure: Canadian health-workers density in 2001]

Linkage Methods in Hierarchical Clustering

Linkage between two groups:
• Single linkage
• Complete linkage
• Average linkage
• Centroid method

The single linkage algorithm ("nearest neighbor" clustering), at each step, joins the clusters whose minimum distance between objects is smallest, i.e., joins the clusters A and B with the smallest
d_AB = min_{i∈A, j∈B} d_ij

The complete linkage algorithm ("farthest neighbor" clustering), at each step, joins the clusters whose maximum distance between objects is smallest, i.e., joins the clusters A and B with the smallest
d_AB = max_{i∈A, j∈B} d_ij

The average linkage algorithm, at each step, joins the clusters whose average distance between objects is smallest, i.e., joins the clusters A and B with the smallest
d_AB = (1 / (n_A n_B)) Σ_{i∈A} Σ_{j∈B} d_ij
where n_A and n_B are the numbers of objects in clusters A and B, respectively.
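In R, the three criteria only change the method argument of hclust(); a sketch (illustrative data) comparing them on the same distance matrix:

d <- dist(scale(USArrests))
hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # farthest neighbor
hc_average  <- hclust(d, method = "average")   # mean pairwise distance
par(mfrow = c(1, 3))                           # the merge orders generally differ
plot(hc_single); plot(hc_complete); plot(hc_average)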



Ward’s Method (1963) Dendrograms

At each step, the method searches over all possible ways to join a pair of clusters A hierarchical algorithm actually produces not one partition of the data, but lots of
so that the K-means criterion 𝑊𝑆𝑆 is minimized for that step. partitions.
It begins with each object as its own cluster (so that 𝑊𝑆𝑆 = 0) and concludes There is a clustering partition for each step 1, 2, . . . , 𝑛.
with all objects in one cluster. The series of merging can be represented at a glance by a treelike structure called a
The R function hclust performs Ward’s method if the option method = dendrogram.
’ward’ is specified. To get a single 𝑘-cluster solution, we would cut the dendrogram horizontally at a point
that would produce 𝑘 groups (The R function cutree can do this).
It is strongly recommended to examine the full dendrogram first before determining
where to cut it.
A natural set of clusters may be apparent from a glance at the dendrogram.
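A short sketch tying these together (illustrative data, not from the slides): fit Ward's method, inspect the dendrogram, then cut it with cutree():

d  <- dist(scale(USArrests))
hw <- hclust(d, method = "ward.D2")  # Ward's criterion (current spelling in R)
plot(hw)                             # examine the full dendrogram first
grp <- cutree(hw, k = 4)             # then cut to obtain a 4-cluster solution
table(grp)                           # cluster sizes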


Standardization of Observations

If the variables in our data set are of different types or are measured on very
different scales, then some variables may play an inappropriately dominant role
in the clustering process.
In this case, it is recommended to standardize the variables in some way before
clustering the objects.
Possible standardization approaches:
1. Divide each column by its sample standard deviation, so that all variables have standard
deviation 1.
2. Divide each variable by its sample range (max − min); Milligan and Cooper (1988) found
that this approach best preserved the clustering structure.
3. Convert data to 𝑧-scores by (for each variable) subtracting the sample mean and then
dividing by the sample standard deviation – a common option in clustering software
packages.
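A sketch of the three approaches in R (illustrative data; using scale() with explicit scale vectors is one way to implement options 1 and 2):

X <- as.matrix(USArrests)

# 1. divide each column by its sample standard deviation
X_sd    <- scale(X, center = FALSE, scale = apply(X, 2, sd))

# 2. divide each column by its sample range (Milligan and Cooper 1988)
X_range <- scale(X, center = FALSE, scale = apply(X, 2, function(v) max(v) - min(v)))

# 3. convert to z-scores (center, then divide by the standard deviation)
X_z     <- scale(X)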

