Week 4 - Lecture Slides - K-Means, Mixture Models, & EM
Summer 2024
Today’s Agenda
• Group Quiz
• Unsupervised Learning
• Clustering
• K-Means
• Gaussian Mixture Model & EM
• Hierarchical Clustering
Performance Evaluation in Predictive Modeling
Warm-up exercise
Scenario #1
Scenario #2
Quizzes
Dana Zhan
Jun Tan Peinan Wang
Supervised vs. Unsupervised
Figure from: Goyal, Anil. (2018). Learning a Multiview Weighted
Majority Vote Classifier: Using PAC-Bayesian Theory and Boosting.
https://miro.medium.com/max/4096/0*p3zNp9D8YFmX32am.jpg
Classical Machine Learning Methods
What is clustering?
Goal: Grouping items that “belong together” (i.e. have similar features)
Why do we need clustering?
Outlier Detection
• Clustering can help detect anomalies and outliers in the data
Privacy Preservation
• Clustering can help preserve privacy by associating user data with cluster IDs instead of specific users.
Clustering Applications (Customer segmentation)
Who:
• is your most valued customer?
• is at risk of churn?
• will be interested in a particular promotion/product?
• …
Unsupervised Learning (clustering)
[Scatter plot: Spend ($) vs. Number of Purchases]
Unsupervised Learning (clustering)
Create clusters using World Economic Indicators data
What makes a GOOD clustering?
The Silhouette Score is a single metric that combines both inter-cluster separation (how far apart clusters are) and intra-cluster cohesion (how compact each cluster is). It provides a comprehensive evaluation of cluster quality.
Interpretation:
• High Silhouette Score (closer to 1): Indicates that the data point is well-matched to its own cluster
and poorly matched to neighboring clusters. This suggests good clustering quality.
• Low Silhouette Score (closer to -1): Suggests that the data point may have been assigned to the
wrong cluster, as it is more similar to points in a neighboring cluster.
• Silhouette Score around 0: Indicates that the data point is on or very close to the decision boundary
between two neighboring clusters.
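As a quick illustration, here is a minimal sketch of computing the Silhouette Score for a k-means clustering. It assumes scikit-learn is available; the random data and the choice of 3 clusters are illustrative placeholders, not values from the slides.

```python
# Minimal sketch: Silhouette Score for a k-means clustering.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))          # placeholder data; use your own features

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)    # mean silhouette over all points, in [-1, 1]
print(f"Silhouette score: {score:.3f}")
```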
Types of Clustering Methods
for i = 1 to m:
    c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
for k = 1 to K:
    μ_k := average (mean) of the points assigned to cluster k
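A minimal NumPy sketch of these two alternating steps; the function name, random initialization, and convergence check are illustrative assumptions rather than the exact code from the lecture.

```python
# Minimal sketch of the two alternating k-means steps described above.
# X: (m, d) data matrix; K: number of clusters.
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]      # random init
    for _ in range(n_iters):
        # Assignment step: c^(i) := index of the closest centroid to x^(i)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        c = dists.argmin(axis=1)
        # Update step: mu_k := mean of the points assigned to cluster k
        new_centroids = np.array([X[c == k].mean(axis=0) if np.any(c == k)
                                  else centroids[k] for k in range(K)])
        if np.allclose(new_centroids, centroids):                 # converged
            break
        centroids = new_centroids
    return c, centroids
```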
Complexity of K-Means
The time complexity of the K-means algorithm is generally
expressed as:
𝑶(𝒌 ∗ 𝒏 ∗ 𝒅 ∗ 𝒊)
where k is the number of clusters, n is the number of data points, d is the number of dimensions (features), and i is the number of iterations.
Making sure the initialized centroids are “good” is critical to finding
quality local optima. Our purely random approach was wasteful
since it’s very possible that initial centroids start close together.
Getting smarter with initialization
Idea: Try to select a set of initial centroids that are far away from each other.
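One common way to realize this idea is k-means++-style seeding. A minimal sketch, assuming a data matrix X and K clusters as above (the function name is an illustrative choice):

```python
# Sketch of distance-aware seeding (k-means++ style): each new centroid is
# sampled with probability proportional to its squared distance from the
# centroids chosen so far, so far-away points are more likely to be picked.
import numpy as np

def smart_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]        # first centroid: uniform at random
    for _ in range(K - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()                    # squared-distance weighting
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Note that scikit-learn's KMeans uses this kind of seeding by default (init="k-means++").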
Failure Modes of k-means
• If we don't meet the assumption of spherical clusters, we will get unexpected results
Quiz Time!
How to choose k?
[Plot: lowest possible cluster heterogeneity vs. number of clusters k]
Note: You will usually have to run k-means multiple times for each k
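A minimal sketch of this procedure with scikit-learn: n_init handles the multiple restarts per k, and inertia_ serves as the within-cluster heterogeneity measure. The function name and range of k are illustrative.

```python
# Sketch: run k-means for several values of k (with multiple restarts via
# n_init) and record the total within-cluster heterogeneity (inertia).
from sklearn.cluster import KMeans

def heterogeneity_curve(X, k_values, n_init=10):
    # inertia_ = sum of squared distances of points to their closest centroid
    return {k: KMeans(n_clusters=k, n_init=n_init, random_state=0).fit(X).inertia_
            for k in k_values}

# Example usage: curve = heterogeneity_curve(X, range(1, 11)); look for the "elbow".
```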
Gaussian Mixture Model (GMM)
Gaussian Mixture Model (GMM)
It turns out the univariate (one-dimensional) Gaussian can be extended to the multivariate (multi-dimensional) case. The form of a d-dimensional Gaussian:
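Written out, the standard form of this density is:

$$\mathcal{N}(x \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) \;=\; \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}}\exp\!\left(-\tfrac{1}{2}(x-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(x-\boldsymbol{\mu})\right)$$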
In higher dimensions, a Gaussian is fully specified by a mean vector 𝝁 and a d-by-d covariance matrix 𝚺, where |Σ| refers to the determinant of the covariance matrix.
Probability Density Function (PDF), or density of a continuous random variable, is a function whose value at
any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be
interpreted as providing a relative likelihood that the value of the random variable would be close to that sample.*
* https://en.wikipedia.org/wiki/Probability_density_function
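As a small, hypothetical example (the mean and covariance values below are made up), a multivariate Gaussian density can be evaluated at a point with SciPy:

```python
# Sketch: evaluating a 2-D Gaussian density at a point with SciPy.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])                  # illustrative mean vector
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])             # illustrative covariance matrix
density = multivariate_normal(mean=mu, cov=Sigma).pdf([1.0, -0.5])
print(density)   # relative likelihood of a value near (1.0, -0.5)
```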
Expectation Maximization (EM)
Initialization:
• Randomly select a mean and variance for each cluster.
E step:
• Compute the likelihood of each data point 𝑥𝑖 coming from each Gaussian (its soft assignment).
M step:
• Use the probability assignments from the E step to re-estimate each Gaussian's mean and variance to better fit the data points.
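A minimal sketch of fitting a GMM with scikit-learn, whose GaussianMixture estimator alternates these E and M steps until convergence; the placeholder data and the choice of 3 components are illustrative assumptions.

```python
# Sketch: fitting a Gaussian Mixture Model with scikit-learn (EM under the hood).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))              # placeholder data; use your own features

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                                 # EM: E and M steps alternate internally
resp = gmm.predict_proba(X)                # E-step output: soft assignments (responsibilities)
print(gmm.means_)                          # M-step output: per-component means
print(gmm.covariances_)                    # M-step output: per-component covariances
```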
Expectation Maximization (EM) Algorithm
Motivation
Mixture Models
Discovering Shapes
Hierarchical Clustering
Divisive Methods (a.k.a. top-down)
• Start with all the data in one big cluster and then recursively split the data into smaller clusters
• Example: recursive k-means
Start with all the data in one cluster, and then run k-means to divide the data into smaller clusters.
Repeatedly run k-means on each cluster to make sub-clusters.
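A minimal sketch of this recursive k-means idea, assuming scikit-learn and a NumPy feature matrix; the function name, two-way splits, and depth-based stopping rule are illustrative assumptions.

```python
# Sketch of divisive (top-down) clustering via recursive k-means: split the
# data into sub-clusters, then recurse on each sub-cluster.
import numpy as np
from sklearn.cluster import KMeans

def recursive_kmeans(X, depth, n_splits=2):
    if depth == 0 or len(X) < n_splits:
        return [X]                                   # leaf cluster
    labels = KMeans(n_clusters=n_splits, n_init=10, random_state=0).fit_predict(X)
    clusters = []
    for k in range(n_splits):
        clusters.extend(recursive_kmeans(X[labels == k], depth - 1, n_splits))
    return clusters
```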
Divisive Clustering - Example
Wikipedia
• Athletes
  – Baseball
  – Soccer / Ice hockey
• Non-athletes
  – Musicians, artists, actors
  – Scholars, politicians, government officials
Choices to Make
Step 1
Single Linkage
$\mathrm{distance}(C_1, C_2) = \min_{x_i \in C_1,\, x_j \in C_2} d(x_i, x_j)$
This formula defines the distance between two clusters as the smallest distance between any pair of points, one drawn from each cluster.
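A minimal sketch of single-linkage clustering with SciPy; the placeholder data is an illustrative assumption. linkage(method="single") implements exactly this minimum-distance rule, and dendrogram draws the merge tree.

```python
# Sketch: single-linkage agglomerative clustering with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))            # placeholder data; use your own features

Z = linkage(X, method="single")         # "single" = min distance across clusters
dendrogram(Z)                           # taller links = merges at larger distances
plt.show()
```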
Step 3 • Merge closest pair of clusters
Repeat
4. Repeat step 3 until all points are in one cluster
Repeat
4. Repeat step 3 until all points are in one cluster
Notice that the height of the dendrogram is growing as we group points that are farther from each other.
Repeat
The tall links in the dendrogram show us we are merging clusters that are far away from each other.
Repeat
4. Repeat step 3 until all points are in one cluster
Final result after merging all clusters
Final Result
Cut Dendrogram
Cut Dendrogram
• Every branch that crosses D* becomes its own cluster
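A minimal sketch of this cut, reusing the SciPy linkage matrix Z from the earlier single-linkage sketch; the threshold value D_star is an illustrative assumption.

```python
# Sketch: cutting the dendrogram at a height D* so that every branch crossing
# that height becomes its own cluster.
from scipy.cluster.hierarchy import fcluster

D_star = 2.5                                     # assumed cut height, not from the slides
labels = fcluster(Z, t=D_star, criterion="distance")
print(labels)                                    # cluster ID for each original point
```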
Coming up Next…
• Week 5 – Nearest Neighbor Methods & Dimensionality Reduction (PCA)