MDA Session 4


15-06-2023

Review: Project formulation and proposal


Identify the goal by defining the (business) perspective:
What are the research questions that you are hoping to answer? Who are the target respondents? How do you envisage and plan to use survey responses to arrive at these answers?

The questionnaire (ideally one, keeping the time frame in mind – but in exceptional cases, subsequent surveys can be taken) should be designed keeping the following additional aspects in mind:
• Univariate analysis (based on analyzing questions in isolation) should yield useful results (techniques already known to you) relevant to the project goal
• Multivariate analysis – of four types – can be carried out meaningfully based on the survey data, addressing the stated business problem and helping achieve the goal
• At least 15-20 items
• A 'sufficiently large' number of responses (how do you plan to conduct the survey / reach the target?)

Proposal format (standard guideline): title page + max 3 pages (be as precise and contextual as possible) including references + appendix containing the questionnaire

Cluster Analysis

Objective: to separate individual observations/items/objects into groups, or clusters, on the basis of the values of the q variables measured on each object.
The objects within each cluster should be similar, and objects in different clusters should be dissimilar.
The dissimilarity between any two objects is typically quantified using a distance measure (like Euclidean distance).
Cluster analysis is a type of unsupervised classification, as the nature of the groups (and, typically, the number of groups) is unknown before the objects are classified into clusters.


Applications of Cluster Analysis

In marketing: researchers attempt to find distinct clusters of the consumer population so that several distinct marketing strategies can be used for the clusters.
In ecology: scientists classify plants or animals into various groups based on some measurable characteristics.
In genetics: researchers may separate genes into several classes based on their expression ratios measured at different time points.

The Need for Clustering Algorithms

Suppose n objects are to be segregated into a small number (say, k) of clusters, where k < n.
If the number of clusters k is decided ahead of time, then the number of ways to partition the n objects into k clusters is a "Stirling number of the second kind."
Example: there are 46,771,289,738,810 ways to form k = 4 (non-empty) clusters from n = 25 objects.

> library(copula)
> options(scipen = 999)
> Stirling2(25, 4)
[1] 46771289738810


The Need for Clustering Algorithms (Continued)

If the number of clusters (k) is flexible, the possible number of partitions is much higher.
Example: there are over 4,638,590 trillion possible partitions of n = 25 objects if we let the number of clusters k vary. (This is called the Bell number for n.)
It is challenging even for a modern-day computer to explore ALL the possible clustering partitions to see which is best.
Hence the need for intelligently designed algorithms that will search among the best possible partitions relatively quickly.

> library(numbers)
> options(scipen = 999)
> bell(25)
[1] 4638590332229998592
> x <- seq(10, 50, 10)
> sapply(x, bell)   # Bell numbers for n = 10, 20, ..., 50

[Figure: thirteen objects (1-13) grouped into example partitions such as {1,3,6,7,9}, {11,13}, {2,5,8}, {12}, {4}, {10}, and coarser partitions such as {1,3,6,7,9,11,13}, {2,5,8,12}, {4,10}]



Which attribute to use for clustering?

• Color combination & design
• Size
• Location
• All of the above

What is their relative importance?

Lessons to be learnt from the clustering game

Differences (distinctions) are almost always unavoidable, both within clusters and between clusters, but the between-cluster variation should be dominant.
The notion of difference/similarity/distance may vary from one analyst to another, possibly depending on what factors are perceived to be more important.
The choice of the NUMBER of clusters can be tricky / controversial.


Clustering based on a single attribute is relatively easy (quiz!)

[Figure: dot plot of standardized cost (Zcost), axis from -2.00 to 3.00]

Steps in a cluster analysis:
Formulate the problem
Define/select the variables to be used for clustering
Select a distance measure
Select a clustering procedure
Decide on the number of clusters
Interpret the profile of the clusters

Distance and (dis-)similarity measures between cases (STEP I)

Basic principles of clustering:
• Use of Euclidean distance (other choices are possible)
• Standardize observations to give equal weights to all variables (not always appropriate)
• Similarity and distance for presence/absence type attributes:
  – define similarity s = (# of attributes that match) / (# of attributes)
  – s = 1/(1 + d)
• Differential weights for 1-1 and 0-0 matches (presence/absence) can be accommodated
• Use correlation to cluster variables

Popular choices of distance / dissimilarity between multivariate observations

Euclidean distance:
d_E(x, y) = [ Σ_{i=1}^p (x_i − y_i)² ]^{1/2} = [ (x − y)ᵗ (x − y) ]^{1/2}

City-block distance (Manhattan distance):
d_CB(x, y) = Σ_{i=1}^p |x_i − y_i|

(Squared) Mahalanobis distance:
d_M(x, y) = (x − y)ᵗ S⁻¹ (x − y), where S is the sample covariance matrix

Other choices: maximum norm, Minkowski distance, etc.
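As a quick sketch (not from the slides; the function name and data are made up for illustration), the matching-based similarity for presence/absence attributes, and the distance implied by s = 1/(1 + d), can be computed as:

# Simple matching similarity for presence/absence attributes:
# s = (# of matching attributes) / (# of attributes), and d = 1/s - 1.
simple_match <- function(x, y) mean(x == y)

a <- c(1, 0, 1, 1, 0)     # presence/absence profile of object 1
b <- c(1, 1, 1, 0, 0)     # presence/absence profile of object 2
s <- simple_match(a, b)   # 3 matches out of 5, so s = 0.6
d <- 1 / s - 1            # implied distance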



dist in R
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/dist
The method argument selects the distance measure (e.g., "euclidean", "manhattan", "minkowski").

Mahalanobis distance in R:
mahalanobis(x, center, cov)
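A small usage sketch (illustrative data, not from the slides): dist() computes pairwise distances between rows, while mahalanobis() returns squared Mahalanobis distances, here measured from the sample mean:

X <- matrix(rnorm(20), nrow = 5)   # 5 objects measured on 4 variables

dist(X)                            # Euclidean distances (the default method)
dist(X, method = "manhattan")      # city-block distances

# squared Mahalanobis distance of each row from the sample mean vector
mahalanobis(X, center = colMeans(X), cov = cov(X))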


Types of Clustering Algorithms

There are three major classes of clustering methods – from oldest to newest:
• Hierarchical methods
• Partitioning methods
• Model-based methods

Hierarchical methods cluster the data in a series of n steps, typically joining observations together step by step to form clusters.
Partitioning methods first determine k, and then typically attempt to find the partition into k clusters that optimizes some objective function of the data.
Model-based clustering takes a statistical approach, formulating a model that categorizes the data into subpopulations and using maximum likelihood to estimate the model parameters.

Clustering methods (taxonomy):
• Non-hierarchical: K-means; sequential, parallel and optimizing partitioning
• Hierarchical: agglomerative and divisive
  – Linkage based: single linkage (min), complete linkage (max), average linkage
  – Variance based: Ward's method (SS within clusters is minimized)
  – Centroid based: distance between centroids
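As a quick sketch of a partitioning method (an illustrative example on R's built-in iris data, not from the slides), kmeans() finds a k-cluster partition once k is fixed in advance:

set.seed(1)                                # k-means uses random starting centers
Z  <- scale(iris[, 1:4])                   # standardize the four measurements
km <- kmeans(Z, centers = 3, nstart = 25)  # k = 3 chosen ahead of time
table(km$cluster, iris$Species)            # compare clusters with known species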


Hierarchical Clustering

Agglomerative hierarchical clustering begins with n clusters, each containing a single object.
At each step, the two "closest" clusters are merged together.
So, as the steps are iterated, there are n clusters, then n − 1 clusters, then n − 2, ..., then 1 cluster.
The R function hclust will perform a variety of hierarchical clustering methods.

Defining Closeness of Clusters

The key in a hierarchical clustering algorithm is specifying how to determine the two "closest" clusters at any given step.
For the first step, join the two objects whose (Euclidean?) distance is smallest.
In subsequent stages: should we join two individual objects together, or merge an object into a cluster that already has multiple objects?
Join on the basis of inter-cluster dissimilarity (distance):
• one of three linkage methods
• distance between cluster/group centroids
• within-group sum of squares (Ward's method)
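A minimal hclust() sketch (illustrative, using R's built-in USArrests data, not from the slides): compute a distance matrix, cluster agglomeratively, and inspect the merge sequence:

X  <- scale(USArrests)                # standardize the variables first
d  <- dist(X)                         # Euclidean distances between the 50 states
hc <- hclust(d, method = "complete")  # also: "single", "average", "centroid"
plot(hc)                              # dendrogram of the merge sequence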



• Cluster the Canadian provinces in terms of density of health-care workers.
• How is it different from 2001?

[Figure: Canadian health-workers density in 2001]

Linkage Methods in Hierarchical Clustering

Linkage between two groups:
• Single linkage
• Complete linkage
• Average linkage
• Centroid method

The single linkage algorithm ("nearest neighbor" clustering), at each step, joins the clusters whose minimum distance between objects is smallest, i.e., joins the clusters A and B with the smallest
d_AB = min_{i∈A, j∈B} d_ij

The complete linkage algorithm ("farthest neighbor" clustering), at each step, joins the clusters whose maximum distance between objects is smallest, i.e., joins the clusters A and B with the smallest
d_AB = max_{i∈A, j∈B} d_ij

The average linkage algorithm, at each step, joins the clusters whose average distance between objects is smallest, i.e., joins the clusters A and B with the smallest
d_AB = (1 / (n_A n_B)) Σ_{i∈A} Σ_{j∈B} d_ij
where n_A and n_B are the numbers of objects in clusters A and B, respectively.
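In R, the three criteria only change the method argument of hclust(); a sketch (illustrative data) comparing them on the same distance matrix:

d <- dist(scale(USArrests))
hc_single   <- hclust(d, method = "single")    # nearest neighbor
hc_complete <- hclust(d, method = "complete")  # farthest neighbor
hc_average  <- hclust(d, method = "average")   # mean pairwise distance
par(mfrow = c(1, 3))                           # the merge orders generally differ
plot(hc_single); plot(hc_complete); plot(hc_average)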



Ward’s Method (1963) Dendrograms

At each step, the method searches over all possible ways to join a pair of clusters A hierarchical algorithm actually produces not one partition of the data, but lots of
so that the K-means criterion 𝑊𝑆𝑆 is minimized for that step. partitions.
It begins with each object as its own cluster (so that 𝑊𝑆𝑆 = 0) and concludes There is a clustering partition for each step 1, 2, . . . , 𝑛.
with all objects in one cluster. The series of merging can be represented at a glance by a treelike structure called a
The R function hclust performs Ward’s method if the option method = dendrogram.
’ward’ is specified. To get a single 𝑘-cluster solution, we would cut the dendrogram horizontally at a point
that would produce 𝑘 groups (The R function cutree can do this).
It is strongly recommended to examine the full dendrogram first before determining
where to cut it.
A natural set of clusters may be apparent from a glance at the dendrogram.
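A short sketch tying these together (illustrative data, not from the slides): fit Ward's method, inspect the dendrogram, then cut it with cutree():

d  <- dist(scale(USArrests))
hw <- hclust(d, method = "ward.D2")  # Ward's criterion (current spelling in R)
plot(hw)                             # examine the full dendrogram first
grp <- cutree(hw, k = 4)             # then cut to obtain a 4-cluster solution
table(grp)                           # cluster sizes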


Standardization of Observations

If the variables in our data set are of different types or are measured on very
different scales, then some variables may play an inappropriately dominant role
in the clustering process.
In this case, it is recommended to standardize the variables in some way before
clustering the objects.
Possible standardization approaches:
1. Divide each column by its sample standard deviation, so that all variables have standard
deviation 1.
2. Divide each variable by its sample range (max − min); Milligan and Cooper (1988) found
that this approach best preserved the clustering structure.
3. Convert data to 𝑧-scores by (for each variable) subtracting the sample mean and then
dividing by the sample standard deviation – a common option in clustering software
packages.
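A sketch of the three approaches in R (illustrative data; using scale() with explicit scale vectors is one way to implement options 1 and 2):

X <- as.matrix(USArrests)

# 1. divide each column by its sample standard deviation
X_sd    <- scale(X, center = FALSE, scale = apply(X, 2, sd))

# 2. divide each column by its sample range (Milligan and Cooper 1988)
X_range <- scale(X, center = FALSE, scale = apply(X, 2, function(v) max(v) - min(v)))

# 3. convert to z-scores (center, then divide by the standard deviation)
X_z     <- scale(X)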

