Clustering Lab Notebook Assignment
Katelyn Smith
December 2, 2022
Purpose
To demonstrate the application of clustering using K-means and hierarchical clustering.
Directions
1. Read about the 2019 Survey of Consumer Finances (SCF) dataset in this file. The variable names and
descriptions are here along with directions on how to upload and read the data.
2. Create an RMD file with minimal code to accomplish the following
• Use hierarchical clustering to cluster the scfp200 dataset using complete and single linkage, and plot
the associated dendrograms.
• Choose a sensible value for K (the number of clusters) based on your dendrograms.
• Use K-means clustering to cluster the scfp200 dataset.
• Retrieve the cluster assignments for each of your clustering methods, and comment on the differences in
the results (particularly the differences between K-means and complete linkage hierarchical clustering
results).
3. Complete the RMarkdown file to report on the work and results. The RMarkdown should include the
following.
• An overview of the question to be addressed and the method that is used.
• A discussion of the data, including variables and source.
• Explanation of the process used.
• Summary and discussion of the results.
4. Upload the RMarkdown file AND the pdf or html output file here.
The data
This dataset was obtained from the Survey of Consumer Finances (SCF) page of the Federal Reserve website.
More information and a link to download the “Summary Extract Public Data” csv file are available on that
page.
For the purposes of this assignment, we will use a random sample of 200 individuals from the SCF 2019
Summary Extract Public Data file (the original file contains almost 29000 observations). I have omitted the
first 3 columns, which were random identifier variables and a weighting variable (i.e. those variables did not
encode actual information about the consumer surveyed).
In an RMarkdown file, run the following code. This loads the dataset. No other preprocessing is needed, as
all variables are either numeric or already encoded as dummy variables.
# Load the tidyverse (provides read_csv and the dplyr verbs used below)
library(tidyverse)
# Read in the data from my files
# This should be changed for replication
scfp200 <- read_csv('/Users/Katelyn/OneDrive/Documents/Adv Data Modeling/scfp200.csv')
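The dendrogram below comes from scaling the data, fitting single linkage hierarchical clustering, and plotting the result. The chunk below is a minimal sketch of those steps: the exact calls are assumptions, but the object names scfp200_scaled and hc.out.single match the ones used later in this report.
# Scale the variables so large dollar-denominated variables do not dominate the distances
scfp200_scaled <- scale(scfp200)
# Hierarchical clustering with single linkage on Euclidean distances
hc.out.single <- hclust(dist(scfp200_scaled), method = "single")
# Plot the dendrogram of the single linkage hierarchical clustering
plot(hc.out.single, main = "single")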
[Figure: single linkage dendrogram; x-axis: dist(scfp200_scaled), hclust (*, "single"); y-axis: Height]
This is a very large, complicated dendrogram: single linkage merges most observations at very low heights, so the tree is crowded near the bottom of the plot. I would like to cut it into 10 clusters to further simplify it.
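A minimal sketch of this cut, assuming base R's cutree() and the object names used later in the report (cut.sing, and scfp200_df as a plain data frame copy of the data):
# Cut the single linkage tree into 10 clusters
cut.sing <- cutree(hc.out.single, k = 10)
# Data frame copy of the data for the mutate()/count() steps below (name assumed)
scfp200_df <- as.data.frame(scfp200)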
# Fit complete linkage on the same scaled distances, then plot its dendrogram
hc.out.complete <- hclust(dist(scfp200_scaled), method = "complete")
plot(hc.out.complete, main = "complete")
[Figure: complete linkage dendrogram; x-axis: dist(scfp200_scaled); y-axis: Height]
This dendrogram looks very similar to the single linkage dendrogram. I would like to cut it into 10 clusters to compare; a sketch of that cut follows.
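As above, the cut is a sketch assuming cutree() and the cut.com name used in the later code:
# Cut the complete linkage tree into 10 clusters to match
cut.com <- cutree(hc.out.complete, k = 10)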
# Attach the single linkage cluster labels, then count the observations in each cluster
scfp200_single <- mutate(scfp200_df, cluster = cut.sing)
count(scfp200_single, cluster)
## cluster n
## 1 1 191
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 1
## 8 8 1
## 9 9 1
## 10 10 1
# Attach the complete linkage cluster labels, then count the observations in each cluster
scfp200_complete <- mutate(scfp200_df, cluster = cut.com)
count(scfp200_complete, cluster)
## cluster n
## 1 1 14
## 2 2 156
## 3 3 19
## 4 4 2
## 5 5 1
## 6 6 3
## 7 7 1
## 8 8 2
## 9 9 1
## 10 10 1
Looking at the number of observations in each cluster, single linkage is clearly not the preferred method for this Consumer Finance data set. Because single linkage tends to chain observations together, it has placed 191 of the 200 observations in the first cluster and exactly one observation in each of the other nine clusters. The complete linkage counts look more balanced: the first three clusters still contain the majority of the observations, but clusters four through ten are not all singletons, making complete linkage the better option for clustering this Consumer Finance data.
K-Means Clustering
# Run K-means clustering with k = 10 to match the hierarchical clustering
km.out <- kmeans(scfp200, 10, nstart = 20)
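The per-cluster counts shown below can be produced the same way as for the hierarchical cuts; a sketch, assuming the same scfp200_df data frame as above:
# Attach the k-means cluster assignments and count observations per cluster
scfp200_km <- mutate(scfp200_df, cluster = km.out$cluster)
count(scfp200_km, cluster)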
## cluster n
## 1 1 2
## 2 2 9
## 3 3 4
## 4 4 1
## 5 5 177
## 6 6 2
## 7 7 2
## 8 8 1
## 9 9 1
## 10 10 1
From the counts, k-means places even more observations into a single cluster (177 in cluster 5) than complete linkage placed in its largest cluster (156), so the spread across clusters is less even overall.
Comparing K-Means and Complete Hierarchical Clustering
# Cross-tabulate k-means assignments (rows) against complete linkage assignments (columns)
table(km.out$cluster, cut.com)
## cut.com
## 1 2 3 4 5 6 7 8 9 10
## 1 0 0 2 0 0 0 0 0 0 0
## 2 1 2 4 0 0 2 0 0 0 0
## 3 0 0 3 0 0 1 0 0 0 0
## 4 0 0 0 0 1 0 0 0 0 0
## 5 13 153 6 2 0 0 0 2 0 1
## 6 0 0 2 0 0 0 0 0 0 0
## 7 0 1 1 0 0 0 0 0 0 0
## 8 0 0 1 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 1 0
## 10 0 0 0 0 0 0 1 0 0 0
From this confusion matrix, we can see that the clusters obtained from complete hierarchical clustering and k-means clustering are somewhat different. A few singleton clusters match exactly: k-means cluster 4 is identical to complete linkage cluster 5, cluster 9 of each method contains the same single observation, and k-means cluster 10 is identical to complete linkage cluster 7. The other clusters differ considerably; in particular, k-means cluster 5 absorbs 153 of the 156 observations in complete linkage cluster 2 along with observations from five other complete linkage clusters. Overall, complete hierarchical clustering seems to give the most balanced spread for this data.
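One way to quantify this agreement beyond inspecting the table is the adjusted Rand index, where 1 means identical partitions and values near 0 mean chance-level agreement. This goes beyond the assignment; a sketch, assuming the mclust package is installed:
# Compare the two partitions with the adjusted Rand index
library(mclust)
adjustedRandIndex(km.out$cluster, cut.com)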