Block 18 ST3188
Learning Objectives
describe the basic concept and scope of cluster analysis and its importance in market research
discuss the statistics associated with cluster analysis
explain the procedure for conducting cluster analysis, including formulating the problem,
selecting a distance measure, selecting a clustering procedure, deciding on the number of
clusters, interpreting clusters and profiling clusters
describe the purpose and methods for evaluating the quality of clustering results and assessing
reliability and validity.
Readings
Malhotra, N.K., D. Nunan and D.F. Birks. Marketing Research: An Applied Approach. (Pearson,
2017) 5th edition [ISBN 9781292103129] Chapter 25.
Activity 18.1
Discuss the similarity and difference between cluster analysis and discriminant analysis.
Activity 18.2
What is a ‘cluster’?
Uses of cluster analysis
Segmenting the market: Recognising customers’ differences is the key to successful marketing,
which can lead to a closer matching between products and customer needs. Consumers may be
clustered on the basis of benefits sought from the purchase of a product (benefit segmentation).
Understanding buyer behaviour: For example, what kind of strategies do car buyers use for buying
a car?
Identifying new product opportunities: Clustering brands and products so that competitive sets
within the market can be determined.
Selecting test markets: Grouping cities into homogeneous clusters in order to test various marketing
strategies.
Reducing data: Achieve simplicity through reducing the original dimensionality of the data.
However, note that cluster analysis is a distribution-free method.
Activity 18.3
What are some of the uses of cluster analysis in marketing?
Activity 18.4
Briefly define the following terms: dendrogram, icicle plot, agglomeration schedule and cluster
membership.
Perhaps the most important part of formulating the clustering problem is selecting the variables on
which the clustering is based. Inclusion of even one or two irrelevant variables may distort an
otherwise useful clustering solution. Basically, the set of variables selected should describe the
similarity between objects in terms which are relevant to the market research problem. The variables
should be selected based on past research, theory or a consideration of the hypotheses being tested. In
exploratory research, the researcher should exercise judgement and intuition.
The most commonly-used measure of similarity is the Euclidean distance, or its square. The
Euclidean distance is the square root of the sum of the squared differences in values for each variable.
Other distance measures are also available. The city-block or Manhattan distance between two
objects is the sum of the absolute differences in values for each variable. The Chebychev
distance between two objects is the maximum absolute difference in values for any variable.
The Euclidean distance between objects i and j is:
\[ \delta_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2} \]
where p is the number of clustering variables.
For $p = 2$, the Euclidean distance corresponds to the ‘straight line’ distance between the two points $(x_{i1}, x_{i2})$ and $(x_{j1}, x_{j2})$.
A weight, $w_k$, could be assigned to variable $k$ if it was believed that more importance should be attached to some variables over others, giving:
\[ \delta_{ij} = \sqrt{\sum_{k=1}^{p} w_k (x_{ik} - x_{jk})^2} \]
The city-block or Manhattan distance is:
\[ \delta_{ij} = \sum_{k=1}^{p} |x_{ik} - x_{jk}| \]
Compared to Euclidean distance, this gives less relative weight to large differences.
The Chebychev distance is:
\[ \delta_{ij} = \max_{k} |x_{ik} - x_{jk}| \]
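These measures are simple to compute directly. Below is a minimal sketch, assuming Python with NumPy rather than the SPSS used elsewhere in this block; the two objects and the weights are hypothetical.

```python
import numpy as np

# Two hypothetical objects measured on p = 4 variables
x_i = np.array([6.0, 4.0, 7.0, 3.0])
x_j = np.array([2.0, 3.0, 1.0, 4.0])
w = np.array([1.0, 2.0, 1.0, 0.5])  # illustrative variable weights

diff = x_i - x_j

euclidean = np.sqrt(np.sum(diff ** 2))               # straight-line distance
weighted_euclidean = np.sqrt(np.sum(w * diff ** 2))  # weighted version
city_block = np.sum(np.abs(diff))                    # Manhattan distance
chebychev = np.max(np.abs(diff))                     # largest absolute difference

print(euclidean, weighted_euclidean, city_block, chebychev)
```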
Activity 18.5
What is the most commonly-used measure of similarity in cluster analysis?
Clustering procedures
Figure 25.4 of the textbook provides a classification of clustering procedures.
Hierarchical clustering is characterised by the development of a hierarchy or tree-like structure.
Hierarchical methods can be agglomerative or divisive.
Agglomerative clustering starts with each object in a separate cluster. Clusters are formed by
grouping objects into larger and larger clusters. This process is continued until all objects are
members of a single cluster.
Divisive clustering starts with all the objects grouped in a single cluster. Clusters are divided or split
until each object is in a separate cluster.
Agglomerative methods are commonly used in market research. They consist of linkage methods,
error sums of squares or variance methods and centroid methods.
The single linkage method is based on the minimum distance or the nearest neighbour rule. At every
stage, the distance between two clusters is the distance between their two closest points.
The complete linkage method is similar to single linkage, except that it is based on the maximum
distance or the farthest neighbour approach. In complete linkage, the distance between two clusters is
calculated as the distance between their two farthest points.
The average linkage method works similarly. However, in this method, the distance between two
clusters is defined as the average of the distances between all pairs of objects, where one member of
the pair is from each of the clusters.
Figure 25.5 of the textbook shows the linkage methods of clustering.
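The linkage rules differ only in how the distance between two clusters is obtained from the pairwise distances between their members. A minimal sketch, assuming Python with NumPy and two hypothetical clusters of points:

```python
import numpy as np

# Two hypothetical clusters of objects, each described by two variables
cluster_a = np.array([[1.0, 2.0], [2.0, 1.0]])
cluster_b = np.array([[6.0, 5.0], [7.0, 7.0], [8.0, 6.0]])

# Euclidean distances between every member of cluster_a and every member of cluster_b
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

single_linkage = pairwise.min()     # nearest neighbour rule (minimum distance)
complete_linkage = pairwise.max()   # farthest neighbour rule (maximum distance)
average_linkage = pairwise.mean()   # average over all pairs

print(single_linkage, complete_linkage, average_linkage)
```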
A variance method attempts to generate clusters to minimise the within-cluster variance. A
commonly-used variance method is Ward’s procedure. For each cluster, the means for all the
variables are computed. Next, for each object, the squared Euclidean distance to the cluster means is
calculated. These distances are summed for all the objects. At each stage, the two clusters with the
smallest increase in the overall sum of squares within cluster distances are combined.
In the centroid method, the distance between two clusters is the distance between their centroids
(means for all the variables). Every time objects are grouped, a new centroid is computed.
Of the hierarchical methods, average linkage and Ward’s methods have been shown to perform better
than the other procedures.
Figure 25.6 of the textbook shows other agglomerative clustering methods.
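In practice these agglomerative procedures are run with library routines rather than by hand. The sketch below, assuming Python with SciPy and a small hypothetical data matrix, applies Ward's procedure; changing the method argument to 'single', 'complete', 'average' or 'centroid' gives the other procedures discussed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: 8 objects measured on 3 variables, in two loose groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (4, 3)), rng.normal(5, 1, (4, 3))])

# Ward's procedure; each row of Z records the clusters merged,
# the merge distance and the size of the new cluster
Z = linkage(X, method="ward")

# Cut the tree to obtain, say, a two-cluster solution
labels = fcluster(Z, t=2, criterion="maxclust")
print(Z)
print(labels)
```

The array Z plays much the same role as the SPSS agglomeration schedule discussed later in this block.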
The non-hierarchical clustering methods are frequently referred to as k-means clustering. These
methods include sequential threshold, parallel threshold and optimising partitioning.
In the sequential threshold method, a cluster centre is selected and all objects within a pre-specified
threshold value from the centre are grouped together. Next a new cluster centre or seed is selected,
and the process is repeated for the unclustered points. Once an object is clustered with a seed, it is no
longer considered for clustering with subsequent seeds.
The parallel threshold method operates similarly, except that several cluster centres are selected
simultaneously and objects within the threshold level are grouped with the nearest centre.
The optimising partitioning method differs from the two threshold procedures in that objects can
later be reassigned to clusters to optimise an overall criterion, such as average within-cluster distance
for a given number of clusters.
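Optimising partitioning is essentially what the k-means algorithm implements: objects are repeatedly reassigned to the nearest cluster centre so as to reduce the total within-cluster sum of squares. A minimal sketch, assuming Python with scikit-learn and hypothetical data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two loose groups of 10 objects on 2 variables
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (10, 2)), rng.normal(6, 1, (10, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster membership for each object
print(km.cluster_centers_)  # cluster centres (centroids)
print(km.inertia_)          # total within-cluster sum of squares
```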
It has been suggested that the hierarchical and non-hierarchical methods be used in tandem. First,
an initial clustering solution is obtained using a hierarchical procedure, such as average linkage or
Ward’s. The number of clusters and cluster centroids so obtained are used as inputs to the optimising
partitioning method.
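A sketch of this tandem approach, again assuming Python (SciPy and scikit-learn) and hypothetical data: Ward's procedure supplies the number of clusters and the cluster centroids, which are then used as starting values for the optimising partitioning (k-means) step.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical data: three loose groups of 10 objects on 3 variables
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1, (10, 3)) for m in (0, 4, 8)])

# Step 1: hierarchical (Ward) solution used to choose k and compute initial centroids
k = 3
hier_labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")
centroids = np.vstack([X[hier_labels == c].mean(axis=0) for c in range(1, k + 1)])

# Step 2: optimising partitioning (k-means) started from those centroids
km = KMeans(n_clusters=k, init=centroids, n_init=1).fit(X)
print(km.labels_)
```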
The choice of a clustering method and the choice of a distance measure are interrelated. For example,
squared Euclidean distances should be used with Ward’s procedure and centroid methods. Several
non-hierarchical procedures also use squared Euclidean distances.
Activity 18.6
Present a classification of clustering procedures.
Activity 18.7
On what basis may a researcher decide which variables should be selected to formulate a clustering
problem?
Activity 18.8
Why is the average linkage method usually preferred to single linkage and complete linkage?
Activity 18.9
What are the two major disadvantages of non-hierarchical clustering procedures?
Activity 18.10
What guidelines are available for deciding on the number of clusters?
Activity 18.11
What is involved in the interpretation of clusters?
Activity 18.12
What role may qualitative methods play in the interpretation of clusters?
Activity 18.13
What are some of the additional variables used for profiling the clusters?
Activity 18.14
Describe some procedures available for assessing the quality of clustering solutions.
Activity 18.15
How is cluster analysis used to group variables?
The worked example that follows uses the following attitudinal data on shopping, collected from 20 participants on six variables, V1 to V6.
Number V1 V2 V3 V4 V5 V6
1 6 4 7 3 2 3
2 2 3 1 4 5 4
3 7 2 6 4 1 3
4 4 6 4 5 3 6
5 1 3 2 2 6 4
6 6 4 6 3 3 4
7 5 3 6 3 3 4
8 7 3 7 4 1 4
9 2 4 3 3 6 3
10 3 5 3 6 4 6
11 1 3 2 3 5 3
12 5 4 5 4 2 4
13 2 2 1 5 4 4
14 4 6 4 6 4 7
15 6 5 4 2 1 4
16 3 5 4 6 4 7
17 4 4 7 2 2 5
18 3 7 2 6 4 3
19 4 6 3 7 2 7
20 2 3 2 4 7 2
Table 25.2 of the textbook shows the results of hierarchical clustering using squared Euclidean
distance and Ward’s procedure. The output includes the agglomeration schedule that reports which
clusters were combined at which stage of the agglomeration, as well as cluster membership for two,
three and four cluster solutions.
Figure 25.7 of the textbook shows the vertical icicle plot for this cluster analysis, which details the
cases belonging to each cluster for different numbers of clusters.
Figure 25.8 of the textbook provides the dendrogram which suggests a three-cluster solution may be
appropriate.
Table 25.3 of the textbook shows the cluster centroids (i.e. the means of variables) for the three-
cluster solution. Cluster centroids are useful for profiling clusters as the mean values indicate a
representative member of each cluster.
As a further worked example, consider the following matrix of distances between five individuals, A to E.
A B C D E
A -
B 3 -
C 8 7 -
D 11 9 6 -
E 10 9 7 5 -
Using the nearest neighbour (single linkage) method, first we look for the closest pair of individuals,
i.e. Adam and Brian. We construct a new distance table appropriate for the four clusters existing at the
end of the first stage - the distance between two clusters is defined as the distance between their
nearest members.
(A, B) C D E
(A, B) -
C 7 -
D 9 6 -
E 9 7 5 -
We repeat the procedure until all objects are in one cluster. More specifically, the next most similar
pair of objects is Donna and Eve.
(A, B) C (D, E)
(A, B) -
C 7 -
(D, E) 9 6 -
The clusters C and (D, E) are now the closest, at distance 6, and are merged; finally (A, B) and (C, D, E) merge at distance 7. The full agglomeration schedule for the nearest neighbour method is:
Stage  Number of clusters  Clusters             Distance level
1      4                   (A, B), C, D, E      3
2      3                   (A, B), C, (D, E)    5
3      2                   (A, B), (C, D, E)    6
4      1                   (A, B, C, D, E)      7
Using the farthest neighbour (complete linkage) method, we again merge the closest pair of clusters at each stage, but the distance between two clusters is now the distance between their most remote members. The agglomeration schedule is:
Stage  Number of clusters  Clusters             Distance level
1      4                   (A, B), C, D, E      3
2      3                   (A, B), C, (D, E)    5
3      2                   (A, B), (C, D, E)    7
4      1                   (A, B, C, D, E)      11
The sets of clusters produced by the farthest neighbour method coincide with those from the nearest
neighbour method, although the distance levels differ at which the clusters merge. Generally, the
nearest and farthest neighbour methods give different results, sometimes very different, especially
when there are many objects or individuals to be clustered. Both methods only depend on the ordinal
properties of the distances.
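The hand calculations above can be checked by supplying the distance matrix for A to E in 'condensed' form to SciPy (an assumption; the subject guide itself works these examples by hand). The merge heights reported should reproduce the distance levels in the two agglomeration schedules.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed distance matrix for A, B, C, D, E:
# pairs in the order AB, AC, AD, AE, BC, BD, BE, CD, CE, DE
d = np.array([3, 8, 11, 10, 7, 9, 9, 6, 7, 5], dtype=float)

for method in ("single", "complete"):
    Z = linkage(d, method=method)
    # Each row: indices of the clusters merged, merge distance, size of the new cluster
    print(method)
    print(Z)
```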
[Figure: dendrogram for the nearest neighbour (single linkage) clustering.]
[Figure: dendrogram for the farthest neighbour (complete linkage) clustering.]
Discussion points
1. ‘The consequences of inappropriate validation of cluster analysis solutions can be disastrous.’
2. ‘User-friendly statistical packages can create cluster solutions in situations where naturally-
occurring clusters do not exist.’
Solution to Activity 18.10
Theoretical, conceptual or practical considerations may dictate the choice of the number of clusters.
In hierarchical clustering, the distances at which clusters are being combined have a large bearing on
the choice of the number of clusters.
The number of clusters should be selected so as to make the relative size of the clusters meaningful.
Solution to Activity 18.11
The clusters should be interpreted in terms of cluster centroids which represent the average values of
the objects contained in the cluster on each of the variables. This enables the researcher to describe
each cluster by assigning it a name or label, which may be obtained from the cluster programme or
through discriminant analysis.
The following procedures provide checks on the quality and stability of clustering solutions.
Comparison of cluster analysis results obtained on the same data using different distance measures determines the stability of the solutions.
Comparison of results obtained with different methods can also serve as a check.
Often, clustering is done separately on each half of the data, split randomly into two halves. Subsequently, the cluster centroids can be compared across the two subsamples.
Sometimes clustering is done on a reduced set of variables (using a random deletion of variables) and the results are compared with those obtained by clustering based on the entire set of variables.
In non-hierarchical clustering, the solution may depend on the order of cases in the dataset. Multiple runs should be made using different orders of cases until the solution stabilises.
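As an illustration of the split-half check, the sketch below (assuming Python with SciPy and hypothetical data) clusters each random half separately and prints the two sets of cluster centroids; because cluster labels are arbitrary, the centroids must be matched across the halves by inspection.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two groups of 20 objects on 3 variables
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (20, 3)), rng.normal(5, 1, (20, 3))])

# Randomly split the sample into two halves
idx = rng.permutation(len(X))
halves = (X[idx[: len(X) // 2]], X[idx[len(X) // 2:]])

for h, data in enumerate(halves, start=1):
    labels = fcluster(linkage(data, method="ward"), t=2, criterion="maxclust")
    centroids = np.vstack([data[labels == c].mean(axis=0) for c in (1, 2)])
    print(f"Half {h} cluster centroids:")
    print(centroids)
```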
The first box of the SPSS output confirms the number of observations used in the cluster analysis and reports any
missing values (here there are no missing values).
The proximity matrix reports how far apart each pair of observations is based on the distance measure
specified when running the cluster analysis (here squared Euclidean distance was used). Hierarchical
clustering procedures base the agglomeration by combining objects which are in closest proximity to
each other.
Clearly, for large numbers of observations, the proximity matrix becomes very large so typically
would not be reported. Nevertheless, when reviewing the agglomeration schedule it can be
worthwhile to cross-reference this with the proximity matrix.
Agglomeration Schedule
Stage   Cluster 1 Combined   Cluster 2 Combined   Coefficients   Stage Cluster 1 First Appears   Stage Cluster 2 First Appears   Next Stage
1 14 16 1.000 0 0 6
2 6 7 2.000 0 0 7
3 2 13 3.500 0 0 15
4 5 11 5.000 0 0 11
5 3 8 6.500 0 0 16
6 10 14 8.167 0 1 9
7 6 12 10.500 2 0 10
8 9 20 13.000 0 0 11
9 4 10 15.583 0 6 12
10 1 6 18.500 0 7 13
11 5 9 23.000 4 8 15
12 4 19 27.750 9 0 17
13 1 17 33.100 10 0 14
14 1 15 41.333 13 0 16
15 2 5 51.833 3 11 18
16 1 3 64.500 14 5 19
17 4 18 79.667 12 0 18
18 2 4 172.667 15 17 19
19 1 2 328.600 16 18 0
The agglomeration schedule gives information on the cases being combined at each stage of a
hierarchical clustering process. We begin with all n = 20 cases as individual clusters each of size
1. At the first stage, SPSS combines the two cases which are closest together based on the distance
measure used (again, squared Euclidean distance was specified). We see that cases 14 and 16 were
combined. Subsequently, cases 6 and 7 were combined, then 2 and 13 etc. The process continues until
we have one cluster with all n = 20 cases within it. The right-hand columns report when a
previously combined case first appeared. For example, case 14 was first combined in stage 1 and next
appeared in stage 6.
Note the ‘Coefficient’ column provides a (scaled) measure of how close the objects are at each stage
of the clustering. As we advance through the stages of hierarchical clustering, we have to combine
clusters which are further and further apart. When there is a ‘large’ increase in the Coefficient
column, at that stage we are combining quite distant clusters and so we may wish to halt the
agglomeration just before this happens - otherwise we combine distant objects which we might be
unwilling to consider as ‘similar’ or ‘homogeneous’.
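That search for a 'large' increase can be made explicit by differencing the Coefficients column. The sketch below (assuming Python with NumPy) uses the coefficients from the agglomeration schedule above; the first big jump occurs when moving from three clusters to two, which is consistent with stopping at a three-cluster solution.

```python
import numpy as np

# Coefficients from the agglomeration schedule above (stages 1 to 19)
coef = np.array([1.000, 2.000, 3.500, 5.000, 6.500, 8.167, 10.500, 13.000,
                 15.583, 18.500, 23.000, 27.750, 33.100, 41.333, 51.833,
                 64.500, 79.667, 172.667, 328.600])

n = 20                                     # number of cases
jumps = np.diff(coef)                      # increase in the coefficient from one stage to the next
clusters_before = n - np.arange(1, n - 1)  # clusters remaining just before each of stages 2 to 19

for c, j in zip(clusters_before, jumps):
    print(f"merging from {c} to {c - 1} clusters raises the coefficient by {j:.3f}")
```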
Cluster Membership
Case   4 Clusters   3 Clusters   2 Clusters
1 1 1 1
2 2 2 2
3 1 1 1
4 3 3 2
5 2 2 2
6 1 1 1
7 1 1 1
8 1 1 1
9 2 2 2
10 3 3 2
11 2 2 2
12 1 1 1
13 2 2 2
14 3 3 2
15 1 1 1
16 3 3 2
17 1 1 1
18 4 3 2
19 3 3 2
20 2 2 2
If we asked SPSS to consider a range of cluster solutions based on our desire to identify an
approximate number of clusters (for example, from 2 to 4 clusters), then the cluster membership box
reports to which cluster each individual case is assigned, for each type of solution. Note these results
also appear as new columns in your original data matrix.
Eyeballing the four-cluster solution column, we see that if we had four clusters then only one
observation (participant 18) would appear in cluster 4. We might be unwilling to entertain such a
small cluster (although 1 observation in 20 might represent approximately 5% of the population,
assuming the (random) sample was fairly representative of the population from which it was drawn)
and we see that with three clusters this individual would be assigned to cluster 3. (Looking back at the
proximity matrix, we see observation 18 is quite far from all other observations.)
The icicle plot shows the cluster compositions for all possible cluster solutions, i.e. from 1 to 20
clusters for this example. The vertical axis details the number of clusters and the individual
participants are at the top of the plot (the numbers correspond to the participant number). The icicles
indicate the split between cluster solutions in terms of which observations are included in which
cluster. For example, for two clusters, the shortest icicle is between participants 2 and 8, so a two-cluster solution would have the clusters {1, 3, 6, 7, 8, 12, 15, 17} and {2, 4, 5, 9, 10, 11, 13, 14, 16, 18, 19, 20}.
For three clusters, the next shortest icicle is between participants 4 and 20, hence splitting the second cluster to give the clusters {1, 3, 6, 7, 8, 12, 15, 17}, {2, 5, 9, 11, 13, 20} and {4, 10, 14, 16, 18, 19}.
For four clusters, the next shortest icicle is between participants 18 and 19, hence splitting the third cluster to give the clusters {1, 3, 6, 7, 8, 12, 15, 17}, {2, 5, 9, 11, 13, 20}, {4, 10, 14, 16, 19} and {18}.
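The same memberships can be extracted programmatically by cutting the hierarchical tree at 2, 3 and 4 clusters. The sketch below (assuming Python with SciPy rather than SPSS) re-enters the 20 × 6 data table shown earlier and applies Ward's method; the numeric cluster labels are arbitrary and need not coincide with the SPSS labels, but the groupings can be compared with the cluster membership box above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Attitudinal data from the table above: 20 participants, variables V1 to V6
X = np.array([
    [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4], [7, 2, 6, 4, 1, 3], [4, 6, 4, 5, 3, 6],
    [1, 3, 2, 2, 6, 4], [6, 4, 6, 3, 3, 4], [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4],
    [2, 4, 3, 3, 6, 3], [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3], [5, 4, 5, 4, 2, 4],
    [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7], [6, 5, 4, 2, 1, 4], [3, 5, 4, 6, 4, 7],
    [4, 4, 7, 2, 2, 5], [3, 7, 2, 6, 4, 3], [4, 6, 3, 7, 2, 7], [2, 3, 2, 4, 7, 2],
], dtype=float)

Z = linkage(X, method="ward")  # Ward's procedure on Euclidean distances

# Membership for the 2-, 3- and 4-cluster solutions (participants numbered 1 to 20)
for k in (2, 3, 4):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(f"{k}-cluster solution:", labels)
```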
The dendrogram provides a convenient way to determine the appropriate number of clusters.
Although the final decision is subjective, the dendrogram clearly shows the proximity when cases are
combined. On the left-hand side we see the individual participants (the numbers correspond to the
participant number). The horizontal axis is a rescaled distance measure showing the (rescaled)
distance when combining occurs. The vertical lines in the diagram depict the combining of cases -
simply trace back the cluster members to the left-hand side.
Using the dendrogram, it seems reasonable to identify three distinct clusters.
o Cluster 1: 1, 3, 6, 7, 8, 12, 15 and 17.
o Cluster 2: 2, 5, 9, 11, 13 and 20.
o Cluster 3: 4, 10, 14, 16, 18 and 19.
Finally, we would like to profile the clusters by examining the group centroids. Given we have saved
the cluster membership in the data matrix, we can obtain the means of each variable for each cluster
as follows. Use Analyze >> Compare Means >> Means…, then move the variables used for clustering (here V1 to V6) into the ‘Dependent List:’, and move the three-cluster solution column (called ‘CLU3_1’) to the ‘Independent List:’. Click ‘OK’.
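Outside SPSS, this profiling step amounts to grouping the cases by their saved cluster membership and averaging each variable. A sketch, assuming Python with pandas and SciPy and a hypothetical stand-in for the data (only the variable names V1 to V6 mirror the example):

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in for an attitudinal data matrix: three loose groups of 7 cases on 6 variables
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(m, 1, (7, 6)) for m in (2, 4, 6)])
df = pd.DataFrame(X, columns=[f"V{i}" for i in range(1, 7)])

# Save the three-cluster Ward solution as a new column, then profile each cluster
df["cluster"] = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")
print(df.groupby("cluster").agg(["mean", "std"]).round(2))
```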
Report
The clustering variables are: V1 = ‘Shopping is fun’; V2 = ‘Shopping is bad for your budget’; V3 = ‘I combine shopping with eating out’; V4 = ‘I try to get the best buys while shopping’; V5 = ‘I don’t care about shopping’; V6 = ‘You can save a lot of money by comparing prices’.
Ward Method                V1      V2      V3      V4      V5      V6
1       Mean             5.75    3.63    6.00    3.13    1.88    3.88
        N                   8       8       8       8       8       8
        Std. Deviation  1.035    .916   1.069    .835    .835    .641
2       Mean             1.67    3.00    1.83    3.50    5.50    3.33
        N                   6       6       6       6       6       6
        Std. Deviation   .516    .632    .753   1.049   1.049    .816
3       Mean             3.50    5.83    3.33    6.00    3.50    6.00
        N                   6       6       6       6       6       6
        Std. Deviation   .548    .753    .816    .632    .837   1.549
Total   Mean             3.85    4.10    3.95    4.10    3.45    4.35
        N                  20      20      20      20      20      20
        Std. Deviation  1.899   1.410   2.012   1.518   1.761   1.496
We see that cluster 1 members seem to be the shopaholics (high mean scores for 𝑉1 and 𝑉3 , and a low
mean score for 𝑉5 ). Cluster 2 members seem to be those who hate shopping (low mean scores
for 𝑉1 and 𝑉3 , and a high mean score for 𝑉5 ), while cluster 3 members seem to be price-conscious
consumers (high mean scores for 𝑉2 , 𝑉4 and 𝑉6 ).
2. Selected SPSS output follows (using the default cluster analysis options in SPSS).
Agglomeration Schedule
Stage   Cluster 1 Combined   Cluster 2 Combined   Coefficients   Stage Cluster 1 First Appears   Stage Cluster 2 First Appears   Next Stage
1 2 20 0.000 0 0 12
2 18 19 0.000 0 0 3
3 17 18 0.000 0 2 10
4 3 6 0.000 0 0 15
5 14 15 0.500 0 0 9
6 8 12 1.000 0 0 11
7 5 11 1.500 0 0 13
8 1 4 2.000 0 0 14
9 10 14 2.833 0 5 14
10 16 17 4.333 0 3 13
11 8 9 5.833 6 0 18
12 2 13 7.833 1 0 16
13 5 16 11.000 7 10 16
14 1 10 14.367 8 9 17
15 3 7 18.367 4 0 17
16 2 5 31.867 12 13 19
17 1 3 46.417 14 15 18
18 1 8 65.758 17 11 19
19 1 2 138.050 18 16 0
Cluster Membership
Case 4 Clusters 3 Clusters 2 Clusters
1 1 1 1
2 2 2 2
3 3 1 1
4 1 1 1
5 2 2 2
6 3 1 1
7 3 1 1
8 4 3 1
9 4 3 1
10 1 1 1
11 2 2 2
12 4 3 1
13 2 2 2
14 1 1 1
15 1 1 1
16 2 2 2
17 2 2 2
18 2 2 2
19 2 2 2
20 2 2 2
Report
Ward Method              Comfort   Style   Durability
1       Mean                4.91    4.82         5.36
        N                     11      11           11
        Std. Deviation      .944   1.328        1.362
2       Mean                3.22    2.44         2.89
        N                      9       9            9
        Std. Deviation      .667    .882        1.167
Total   Mean                4.15    3.75         4.25
        N                     20      20           20
        Std. Deviation     1.182    1.65        1.773
An examination of the agglomeration schedule reveals that the coefficient suddenly jumps from stage
18 to 19 (from 65.758 to 138.050). Therefore, it appears that a two-cluster solution is appropriate (the
jump from stage 18 to 19 indicates we would be merging two distant clusters together). The same
conclusion is reached by looking at the dendrogram (on the rescaled distance axis, the final two
clusters would be merged at a distance of 25, which is far greater than the distances when smaller
clusters were combined).
o Cluster 3: 15 (members: 4, 5, 6, 7, 11, 12, 14, 17, 18, 20, 35, 36, 37, 38 and 39).
o Cluster 4: 10 (members: 8, 10, 13, 16, 19, 22, 23, 25, 42 and 45).
2. The following procedures can provide adequate checks on the quality of clustering results.
These are vital if managers are to appreciate what constitutes robust clustering solutions.
Perform cluster analysis on the same data using different distance measures. Compare the
results across measures to determine the stability of the solutions.
Use different methods of clustering and compare the results.
Split the data randomly into halves. Perform clustering separately on each half. Compare
cluster centroids across the two subsamples.
Delete variables randomly. Perform clustering based on the reduced set of variables. Compare
the results with those obtained by clustering based on the entire set of variables.
In non-hierarchical clustering, the solution may depend on the order of cases in the dataset.
Make multiple runs using different orders of cases until the solution stabilises.