
Clustering Lab Notebook - Consumer Finances

Katelyn Smith

December 2, 2022

Purpose
To demonstrate the application of K-means and hierarchical clustering.

Directions
1. Read about the 2019 Survey of Consumer Finances (SCF) dataset in this file. The variable names and
descriptions are here along with directions on how to upload and read the data.
2. Create an RMD file with minimal code to accomplish the following
• Use hierarchical clustering to cluster the scfp200 dataset using complete and single linkage, and plot
the associated dendrograms.

• Choose a sensible value for K (the number of clusters) based on your dendrograms.
• Use K-means clustering to cluster the scfp200 dataset.
• Retrieve the cluster assignments for each of your clustering methods, and comment on the differences in
the results (particularly the differences between K-means and complete linkage hierarchical clustering
results).
3. Complete the RMarkdown file to report on the work and results. The RMarkdown should include the
following.
• An overview of the question to be addressed and the method that is used.
• A discussion of the data, including variables and source.
• Explanation of the process used.
• Summary and discussion of the results.
4. Upload the RMarkdown file AND the pdf or html output file here.

The data
This dataset was obtained from the Survey of Consumer Finances (SCF) page of the Federal Reserve website.
More information and a link to download the “Summary Extract Public Data” csv file are available on that
page.
For the purposes of this assignment, we will use a random sample of 200 individuals from the SCF 2019
Summary Extract Public Data file (the original file contains almost 29000 observations). I have omitted the
first 3 columns, which were random identifier variables and a weighting variable (i.e. those variables did not
encode actual information about the consumer surveyed).

Getting the data


To use the SCFP 2019 data set, first download the scfp200.csv file from Moodle.
In RStudioCloud, go to the Files pane in the lower-right-hand corner. Choose Upload and upload the
scfp200.csv file.

In an RMarkdown file, run the following code. This loads the dataset. No other preprocessing is needed, as
all variables are either numeric or already encoded as dummy variables.
# Load the tidyverse for read_csv() and the dplyr verbs used later
library(tidyverse)

# Read in the data from my files
# This path should be changed for replication
scfp200 <- read_csv('/Users/Katelyn/OneDrive/Documents/Adv Data Modeling/scfp200.csv')

## Rows: 200 Columns: 348


## -- Column specification --------------------------------------------------------
## Delimiter: ","
## dbl (348): HHSEX, AGE, AGECL, EDUC, EDCL, MARRIED, KIDS, LF, LIFECL, FAMSTRU...
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
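
Since the absolute path above is specific to one machine, a more portable alternative (a sketch, assuming scfp200.csv has been uploaded to the project's working directory as described above) is:

# Hypothetical portable alternative: read from the working directory
# instead of a machine-specific absolute path
library(readr)
scfp200 <- read_csv("scfp200.csv", show_col_types = FALSE)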

Scaling the data and omitting NAs


# Omit all rows with NA
scfp200 <- na.omit(scfp200)

# Scale the dataset to standardize and prepare for hierarchical clustering


scfp200_scaled <- scale(scfp200)
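
As a quick sanity check (optional, beyond what the assignment requires), each scaled column should now have mean 0 and standard deviation 1; a constant column would instead show NaN and should be dropped before clustering:

# Spot-check the first few scaled columns: means ~0, standard deviations ~1
# (a zero-variance column would produce NaN after scale())
round(colMeans(scfp200_scaled[, 1:5]), 10)
apply(scfp200_scaled[, 1:5], 2, sd)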

Perform hierarchical clustering with single linkage


# Create the hierarchical clustering object using the single linkage method
hc.out.single <- hclust(dist(scfp200_scaled), method = "single")

# Print out the summary of the clustering


summary(hc.out.single)

## Length Class Mode


## merge 398 -none- numeric
## height 199 -none- numeric
## order 200 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
# Plot the dendrogram of the single linkage hierarchical clustering
plot(hc.out.single, main = "single")

[Figure: dendrogram from single-linkage hierarchical clustering, titled "single"; y-axis: Height; x-axis: dist(scfp200_scaled); caption: hclust (*, "single")]
This is a very large, complicated dendrogram in which most merges happen at very low heights, the
chaining pattern that single linkage is prone to. I would like to cut it into 10 clusters to further simplify it.

Cutting the tree


# Cut the tree using k=10 indicating that there will be 10 clusters
# This k-value was chosen based upon the dendrogram
cut.sing <- cutree(hc.out.single, k = 10)
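
One optional way to see where this cut falls on the tree is to outline the resulting clusters directly on the dendrogram (a sketch using base R's rect.hclust(), beyond what the assignment requires):

# Redraw the dendrogram and outline the k = 10 clusters
plot(hc.out.single, main = "single")
rect.hclust(hc.out.single, k = 10, border = "red")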

Perform hierarchical clustering with complete linkage


# Create the hierarchical clustering object using the complete linkage method
hc.out.complete <- hclust(dist(scfp200_scaled), method = "complete")

# Print out the summary of the clustering


summary(hc.out.complete)

## Length Class Mode


## merge 398 -none- numeric
## height 199 -none- numeric
## order 200 -none- numeric
## labels 0 -none- NULL
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character

# Plot the dendrogram of the complete linkage hierarchical clustering
plot(hc.out.complete, main = "complete")

[Figure: dendrogram from complete-linkage hierarchical clustering, titled "complete"; y-axis: Height; x-axis: dist(scfp200_scaled)]
To compare the two linkage methods directly, I would like to cut this dendrogram into 10 clusters as well.

Cutting the tree


# Cut the complete linkage tree into the same number of clusters as the single linkage tree to best compare the two methods
# This k-value was chosen based upon the dendrogram
cut.com <- cutree(hc.out.complete, k = 10)

Counting the observations in each group


# Convert the tibble to a plain data frame before adding cluster labels
scfp200_df <- data.frame(scfp200)

# Mutate the data to count the number of observations in each single linkage cluster
scfp200_single <- mutate(scfp200_df, cluster = cut.sing)
count(scfp200_single, cluster)

## cluster n
## 1 1 191
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
## 7 7 1
## 8 8 1
## 9 9 1
## 10 10 1
# Mutate the data to count the number of observations in each complete linkage cluster
scfp200_complete <- mutate(scfp200_df, cluster = cut.com)
count(scfp200_complete, cluster)

## cluster n
## 1 1 14
## 2 2 156
## 3 3 19
## 4 4 2
## 5 5 1
## 6 6 3
## 7 7 1
## 8 8 2
## 9 9 1
## 10 10 1
Looking at the breakdown of the number of observations in each cluster, it is clear that single linkage is
not the preferred method for this Consumer Finance data set: it places 191 of the 200 observations in the
first cluster and exactly one observation in each of the other nine, the chaining behavior single linkage is
known for. The complete linkage counts look more balanced. The first three clusters still contain the
majority of the observations, but clusters four through ten are no longer all singletons, making complete
linkage the more balanced option for clustering this Consumer Finance data.
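
Average linkage is a possible middle ground: it is usually less prone to chaining than single linkage while being less dominated by extreme observations than complete linkage. A sketch of how that comparison could be run (not part of the assignment's required methods):

# Average linkage for comparison with the single and complete results
hc.out.average <- hclust(dist(scfp200_scaled), method = "average")
cut.avg <- cutree(hc.out.average, k = 10)
table(cut.avg)  # cluster sizes to compare against cut.sing and cut.com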

K-Means Clustering
# Run K-means with k = 10 clusters to match the hierarchical clustering
# Note: this runs on the unscaled data, while the hierarchical methods used scfp200_scaled
km.out <- kmeans(scfp200, 10, nstart = 20)

# Count the number of observations in each cluster


scfp200_km <- mutate(scfp200_df, cluster = km.out$cluster)
count(scfp200_km, cluster)

## cluster n
## 1 1 2
## 2 2 9
## 3 3 4
## 4 4 1
## 5 5 177
## 6 6 2
## 7 7 2
## 8 8 1
## 9 9 1
## 10 10 1
From the counts, K-means places even more observations into a single cluster (177 in cluster 5) than
complete linkage does (156 in its largest cluster), showing an overall less even spread amongst the clusters.
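
Part of this difference may come from preprocessing: kmeans() above was run on the unscaled data, while both hierarchical methods used scfp200_scaled. A sketch of the scaled version, for a more direct comparison (its counts would differ from those shown above):

# K-means on the scaled data, matching the preprocessing used for hclust
set.seed(1)  # for reproducible assignments; the seed value is arbitrary
km.out.scaled <- kmeans(scfp200_scaled, centers = 10, nstart = 20)
table(km.out.scaled$cluster)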

Comparing K-Means and Complete Hierarchical Clustering
table(km.out$cluster,cut.com)

## cut.com
## 1 2 3 4 5 6 7 8 9 10
## 1 0 0 2 0 0 0 0 0 0 0
## 2 1 2 4 0 0 2 0 0 0 0
## 3 0 0 3 0 0 1 0 0 0 0
## 4 0 0 0 0 1 0 0 0 0 0
## 5 13 153 6 2 0 0 0 2 0 1
## 6 0 0 2 0 0 0 0 0 0 0
## 7 0 1 1 0 0 0 0 0 0 0
## 8 0 0 1 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0 0 1 0
## 10 0 0 0 0 0 0 1 0 0 0
From this cross-tabulation, we can see that the clusters obtained from complete hierarchical clustering and
K-means clustering are quite different. Only the single-observation clusters match exactly: K-means cluster
4 is identical to cluster 5 of the complete hierarchical clustering, K-means cluster 9 to complete cluster 9,
and K-means cluster 10 to complete cluster 7. The large K-means cluster 5 absorbs all of complete clusters
4, 8, and 10 and nearly all of clusters 1 and 2, while the remaining clusters differ vastly between the two
methods. Overall, complete hierarchical clustering seems to show the best spread for this data.
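
Beyond eyeballing the cross-tabulation, agreement between two partitions can be summarized with a single number such as the adjusted Rand index (1 means identical partitions, values near 0 mean agreement at chance level). One implementation is in the mclust package; this is an optional addition beyond the assignment:

# Adjusted Rand index between the K-means and complete linkage partitions
library(mclust)
adjustedRandIndex(km.out$cluster, cut.com)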
