
CS6140: Machine Learning

Week 4 – Clustering, Mixture Models, & Expectation Maximization

Dr. Ryan Rad

Summer 2024
Today’s Agenda

• Group Quiz
• Unsupervised Learning
• Clustering
• K-Means
• Gaussian Mixture Model & EM
• Hierarchical Clustering

Performance Evaluation
in predictive modeling
Warm-up exercise

Scenario #1

• Classification with 2 classes – Score: 52%

Scenario #2

• Classification with 8 classes – Score: 24%

Which model is better?


Quizzes

• Ji Chen, Jun Tan, Liyao Zhang
• Dana Zhan, Jun Tan, Peinan Wang
Supervised vs. Unsupervised

Figure from: Goyal, Anil. (2018). Learning a Multiview Weighted Majority Vote Classifier: Using PAC-Bayesian Theory and Boosting.
https://miro.medium.com/max/4096/0*p3zNp9D8YFmX32am.jpg
Classical Machine Learning Methods
What is clustering?

Goal: Grouping items that “belong together” (i.e. have similar features)

Constraint: We only use the features X, not the labels Y (unsupervised)

Grouping unlabeled examples is called clustering

Why do we need clustering?

Lack of Labeled Data


• Data labeling (annotation) is not only expensive and labor-intensive, but sometimes impossible.

Summarize and Understand the Data


• Clustering helps in understanding the natural grouping in a dataset; the goal is to partition the data into logically meaningful groups.

Outlier Detection
• Clustering can help detect anomalies and outliers in the data

Privacy Preservation
• Clustering can help preserve privacy by associating user data with cluster IDs instead of specific users.
Clustering Applications (Customer segmentation)

Who is your:
• Most valued customer?
• Customer most at risk of churn?
• Customer most likely to be interested in a particular promotion/product?
• …
Unsupervised Learning (clustering)

[Scatter plot: customer Spend ($) vs. Number of Purchases, with clusters identifying the targeted group of customers to sell specific products to]
Unsupervised Learning (clustering)
Create clusters using World Economic Indicators data
What makes a GOOD clustering?

Which one demonstrates a “better” clustering?


What makes a good clustering?

• Organizing data into clusters such that there is:


• high intra-cluster similarity (better association)
• low inter-cluster similarity (more distinct clusters)
Silhouette Score

The Silhouette Score is a single metric that combines both intra-cluster similarity (compactness within clusters) and inter-cluster similarity (separation between clusters). It provides a comprehensive evaluation of cluster quality.

How the Silhouette Score Works:


• For each data point, calculate:
  • The average distance to all other points within the same cluster (intra-cluster distance, a)
  • The average distance to all points in the nearest neighboring cluster (inter-cluster distance, b)
• The point's silhouette value is then s = (b − a) / max(a, b), and the Silhouette Score is the average of s over all points.
Interpretation:
• High Silhouette Score (closer to 1): Indicates that the data point is well-matched to its own cluster
and poorly matched to neighboring clusters. This suggests good clustering quality.
• Low Silhouette Score (closer to -1): Suggests that the data point may have been assigned to the
wrong cluster, as it is more similar to points in a neighboring cluster.
• Silhouette Score around 0: Indicates that the data point is on or very close to the decision boundary
between two neighboring clusters.
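
As a quick illustration (not from the slides), here is how the Silhouette Score can be computed with scikit-learn; the blob dataset and the choice of k = 3 are assumptions made purely for this example:

```python
# Hypothetical example: computing the Silhouette Score with scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data: 300 points drawn from 3 well-separated blobs (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# Cluster with k-means, then score the resulting labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# silhouette_score averages (b - a) / max(a, b) over all points: close to 1 is good.
print("Silhouette Score:", silhouette_score(X, labels))
```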
Types of Clustering Methods

• Centroid-based Clustering: k-means is the most widely-used centroid-based clustering algorithm.

• Density-based Clustering: Density-based clustering connects areas of high example density into clusters.

• Distribution-based Clustering: This clustering approach assumes data is composed of distributions, such as Gaussian distributions.

• Hierarchical Clustering: Hierarchical clustering creates a tree of clusters. Hierarchical clustering, not surprisingly, is well suited to hierarchical data, such as taxonomies.
Photos taken from: https://developers.google.com/machine-learning/clustering/clustering-algorithms


Slide credit: Andrew Ng
K-means algorithm
• Input:
  • $K$ (number of clusters)
  • Training set $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots, x^{(m)}\}$
  • $x^{(i)} \in \mathbb{R}^n$ (note: drop the $x_0 = 1$ convention)
Slide credit: Andrew Ng


K-means algorithm
• Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$

Repeat {
  for $i = 1$ to $m$
    $c^{(i)} :=$ index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$

  for $k = 1$ to $K$
    $\mu_k :=$ average (mean) of points assigned to cluster $k$
}

Slide credit: Andrew Ng
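
A minimal NumPy sketch of the loop above (my own illustration, not the course's reference code); `X` is assumed to be an m-by-n data matrix and `K` the desired number of clusters:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Plain Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Randomly initialize K centroids by picking K distinct training points.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: c[i] = index of the closest centroid to x[i].
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (m, K)
        c = dists.argmin(axis=1)
        # Centroid update step: mu[k] = mean of the points assigned to cluster k.
        new_mu = np.array([X[c == k].mean(axis=0) if np.any(c == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # assignments stopped changing: local optimum
            break
        mu = new_mu
    return c, mu
```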


K-means Objective Function
• $c^{(i)}$ = index of the cluster (1, 2, ..., $K$) to which example $x^{(i)}$ is currently assigned
• $\mu_k$ = cluster centroid $k$ ($\mu_k \in \mathbb{R}^n$)
• $\mu_{c^{(i)}}$ = cluster centroid of the cluster to which example $x^{(i)}$ has been assigned
• Optimization objective ("inertia"):

$$J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2$$

$$\min_{\substack{c^{(1)}, \ldots, c^{(m)} \\ \mu_1, \ldots, \mu_K}} J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right)$$

Slide credit: Andrew Ng
K-means Objective Function
Randomly initialize $K$ cluster centroids $\mu_1, \mu_2, \ldots, \mu_K \in \mathbb{R}^n$

Repeat {

  Cluster assignment step: minimize $J\left(c^{(1)}, \ldots, c^{(m)}, \mu_1, \ldots, \mu_K\right) = \frac{1}{m} \sum_{i=1}^{m} \left\lVert x^{(i)} - \mu_{c^{(i)}} \right\rVert^2$ with respect to the assignments $c^{(1)}, \ldots, c^{(m)}$ (centroids held fixed)

  for $i = 1$ to $m$
    $c^{(i)} :=$ index (from 1 to $K$) of cluster centroid closest to $x^{(i)}$

  Centroid update step: minimize $J$ with respect to the centroids $\mu_1, \ldots, \mu_K$ (assignments held fixed)

  for $k = 1$ to $K$
    $\mu_k :=$ average (mean) of points assigned to cluster $k$

}

Slide credit: Andrew Ng
What does it mean for something to converge to a local optimum?

Initial settings will greatly impact results!
Complexity of K-Means

The time complexity of the K-means algorithm is generally expressed as:

$$O(k \cdot n \cdot d \cdot i)$$

where:
• k is the number of clusters
• n is the number of data points
• d is the number of dimensions of each data point
• i is the number of iterations required for convergence
Getting smarter with initialization

Making sure the initialized centroids are "good" is critical to finding a quality local optimum. Our purely random approach was wasteful, since it's very possible that initial centroids start close together.

Idea: Try to select a set of points farther away from each other.

K-Means++
k-means++ does a slightly smarter random initialization (a sketch in code follows the steps below):

1. Choose the first centroid $\mu_1$ from the data uniformly at random
2. For the current set of centroids (starting with just $\mu_1$), compute the distance between each data point and its closest centroid
3. Choose a new centroid from the remaining data points, with the probability of $x_i$ being chosen proportional to $d(x_i)^2$
4. Repeat 2 and 3 until we have selected $k$ centroids
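
A sketch of this initialization in NumPy (an assumed, simplified implementation; in practice scikit-learn's `KMeans(init="k-means++")` handles this for you):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++ initialization: spread the initial centroids apart."""
    rng = np.random.default_rng(seed)
    # 1. Choose the first centroid uniformly at random from the data.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        # 2. Squared distance from each point to its closest already-chosen centroid.
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # 3. Sample the next centroid with probability proportional to d(x)^2.
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    # 4. Repeat until K centroids have been selected.
    return np.array(centroids)
```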
K-Means++ Pros / Cons

Pros
• Improves quality of local minima
• Faster convergence to local minima

Cons
• Computationally more expensive at the beginning when compared to simple random initialization
Problems with k-means

• In real life, cluster assignments are not always clear cut


• E.g. The moon landing: Science? World News? Conspiracy?

• Because we minimize Euclidean distance, k-means


assumes all the clusters are spherical

• We can change this with weighted Euclidean distance


• Still assumes every cluster is the same shape/orientation

Failure Modes of k-means

• If we don't meet the assumption of spherical clusters, we will get unexpected results:
  • disparate cluster sizes
  • overlapping clusters
  • different shaped/oriented clusters
Quiz Time!
How to choose k?

• No right answer! Depends on your application.
• In general, look for the "elbow" in the graph of cluster heterogeneity versus the number of clusters k (see the sketch below).

[Elbow plot: cluster heterogeneity (y-axis, aiming for the lowest possible value) vs. # of clusters k (x-axis)]

Note: You will usually have to run k-means multiple times for each k
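
A small sketch of the elbow heuristic using scikit-learn's inertia (the objective J from earlier); the toy data and the range of k values are arbitrary choices for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data with a few underlying groups (assumed for illustration).
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=0)

inertias = []
for k in range(1, 11):
    # n_init=10 re-runs k-means with different initializations and keeps the best,
    # since each individual run only converges to a local optimum.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, j in zip(range(1, 11), inertias):
    print(f"k={k:2d}  inertia={j:,.1f}")
# Look for the "elbow": the k after which inertia stops dropping sharply.
```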
Gaussian Mixture Model (GMM)
Gaussian Mixture Model (GMM)

It turns out the univariate (one-dimensional) Gaussian can be extended to the multivariate (multi-dimensional) case. The form of a d-dimensional Gaussian is:

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right)$$

In higher dimensions, a Gaussian is fully specified by a mean vector $\mu$ and a d-by-d covariance matrix $\Sigma$. Here $|\Sigma|$ refers to the determinant of the covariance matrix.

In two dimensions, the Gaussian's parameters might look like this:

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \sigma_1^2 & \sigma_{12} \\ \sigma_{12} & \sigma_2^2 \end{bmatrix}$$


Mixture of Gaussians
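
For reference (standard notation, consistent with the multivariate Gaussian above but not shown in the extracted slide text), a mixture of K Gaussians has density:

$$p(x) = \sum_{k=1}^{K} \pi_k \,\mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$

where the $\pi_k$ are the mixing weights and each component has its own mean $\mu_k$ and covariance $\Sigma_k$.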

For example, consider the following GMM:


Clarification (Probability vs Likelihood)

Probability Density Function (PDF), or density of a continuous random variable, is a function whose value at
any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be
interpreted as providing a relative likelihood that the value of the random variable would be close to that sample.*

𝒇 𝒙 : a (continuous) probability density function

If X is continuous, the probability that X takes on any specific value x is 0.

In a nutshell: Probability refers to the chance that a particular


outcome occurs based on the values of parameters in a model.
Likelihood refers to how well a sample provides support for
particular values of a parameter in a model.

* https://en.wikipedia.org/wiki/Probability_density_function
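
A tiny illustration of the distinction (my own example using SciPy; the sample values and candidate means are arbitrary assumptions):

```python
from scipy.stats import norm

# Probability: fix the model (mu=0, sigma=1) and ask about outcomes.
# For a continuous X, P(X = 1.0) is 0, but the density f(1.0) is positive,
# and P(0.5 <= X <= 1.5) is a genuine probability.
print("density f(1.0)     =", norm(loc=0, scale=1).pdf(1.0))
print("P(0.5 <= X <= 1.5) =", norm(loc=0, scale=1).cdf(1.5) - norm(loc=0, scale=1).cdf(0.5))

# Likelihood: fix the observed sample and ask how well different parameter
# values (here, two candidate means) explain it.
sample = [0.9, 1.1, 1.3]
for mu in (0.0, 1.0):
    L = 1.0
    for x in sample:
        L *= norm(loc=mu, scale=1).pdf(x)   # product of densities = likelihood
    print(f"likelihood of sample under mu={mu}: {L:.4f}")
```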
Expectation Maximization (EM)

E step:
• After randomly initializing a mean and variance for each cluster, compute the likelihood (responsibility) of each data point $x_i$ coming from each Gaussian.

M step:
• Use the probability assignments from the previous E step to re-estimate each Gaussian's mean and variance to better fit the data points.
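
In practice you rarely hand-code EM for GMMs; scikit-learn's `GaussianMixture` runs the E and M steps internally until convergence. A minimal sketch (the toy data and number of components are assumptions for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data with 3 underlying groups (assumed for illustration).
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.2, random_state=0)

# GaussianMixture alternates E steps (compute responsibilities) and
# M steps (re-estimate means, covariances, mixing weights) until convergence.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0).fit(X)

hard_labels = gmm.predict(X)          # most likely component per point
soft_labels = gmm.predict_proba(X)    # responsibilities: soft cluster assignments

print("component means:\n", gmm.means_)
print("first point's responsibilities:", soft_labels[0].round(3))
```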
Expectation Maximization (EM) Algorithm
Motivation

If we try to learn clusters in hierarchies, we can


• Avoid choosing the # of clusters beforehand
• Use dendrograms to help visualize different granularities of
clusters
• Allow us to use any distance metric
o K-means requires Euclidean distance
• Can often find more complex shapes than k-means
Discovering Shapes

[Figure: cluster shapes discovered by k-means vs. mixture models]
Hierarchical Clustering Methods

Divisive, a.k.a. top-down
• Start with all the data in one big cluster and then recursively split the data into smaller clusters
• Example: recursive k-means

Agglomerative, a.k.a. bottom-up
• Start with each data point in its own cluster. Merge clusters until all points are in one big cluster.
• Example: single linkage
Divisive Clustering

Start with all the data in one cluster, and then run k-means to divide the data into smaller clusters.
Repeatedly run k-means on each cluster to make sub-clusters.
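
A rough sketch of that recursive idea (my own illustration, not the course's code; it bisects each cluster with 2-means and the `min_size` and `max_depth` stopping rules are arbitrary assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_kmeans(X, min_size=20, depth=0, max_depth=3):
    """Recursively split X with 2-means; stop on small clusters or max depth."""
    if len(X) <= min_size or depth >= max_depth:
        return [X]                      # leaf cluster: stop splitting
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    clusters = []
    for k in (0, 1):
        clusters += divisive_kmeans(X[labels == k], min_size, depth + 1, max_depth)
    return clusters

# Example: split 200 random 2-D points into a small hierarchy of clusters.
X = np.random.default_rng(0).normal(size=(200, 2))
leaves = divisive_kmeans(X)
print("leaf cluster sizes:", [len(c) for c in leaves])
```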
Divisive Clustering - Example

Wikipedia
├── Athletes
│   ├── Baseball
│   └── Soccer / Ice hockey
└── Non-athletes
    ├── Musicians, artists, actors
    └── Scholars, politicians, government officials
Choices to Make

For divisive clustering, you need to make the following choices:

• Which algorithm to use
• How many clusters per split
• When to split vs. when to stop:
  • Max cluster size: stop when the number of points in a cluster falls below a threshold
  • Max cluster radius: stop when the distance to the furthest point falls below a threshold
  • Specified # of clusters: split until a pre-specified # of clusters is reached
Agglomerative Clustering: Algorithm at a glance

• Initialize each point in its own cluster
• Define a distance metric between clusters
• While there is more than one cluster:
  • Merge the two closest clusters
Step 1

1. Initialize each point to be its own cluster


Step 2

2. Define a distance metric between clusters

Single Linkage

$$\mathrm{distance}(C_1, C_2) = \min_{x_i \in C_1,\; x_j \in C_2} d(x_i, x_j)$$
This formula means we are defining the distance between two clusters as the smallest distance between any pair of points, one drawn from each cluster.
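
A brief sketch using SciPy's hierarchical clustering tools (single linkage as defined above; the toy data is an assumption made for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Toy data: 30 two-dimensional points (assumed for illustration).
X = np.random.default_rng(0).normal(size=(30, 2))

# 'single' linkage merges, at each step, the two clusters whose closest
# pair of points (one from each cluster) has the smallest distance.
Z = linkage(X, method="single", metric="euclidean")

dendrogram(Z)                      # tree of merges; height = merge distance
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```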
Step 3

3. Merge the two closest clusters (then repeat for the next closest pair)
Repeat

4. Repeat step 3 until all points are in one cluster

• Notice that the height of the dendrogram grows as we merge groups of points that are farther from each other.
• Looking at the dendrogram, we can see there is a bit of an outlier! You can tell by seeing a point join a cluster at a really large distance.
• The tall links in the dendrogram show us we are merging clusters that are far away from each other.

Final Result

• Final result after merging all clusters.
Cut Dendrogram

• Choose a distance $D^*$ to "cut" the dendrogram
• Use the largest clusters with distance $< D^*$
• Usually ignore the idea of the nested clusters after cutting
• Every branch that crosses $D^*$ becomes its own cluster (see the SciPy sketch below)
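
Cutting the tree at a distance D* can be done with SciPy's `fcluster` (continuing the earlier linkage sketch; the threshold value is an arbitrary assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

X = np.random.default_rng(0).normal(size=(30, 2))   # same toy data as before
Z = linkage(X, method="single")

D_star = 0.5                                         # assumed cut height D*
# Every branch of the dendrogram that crosses D* becomes its own flat cluster.
labels = fcluster(Z, t=D_star, criterion="distance")
print("number of clusters:", labels.max())
```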
Coming up Next…

• Week 5 – Nearest Neighbor Methods & Dimensionality Reduction (PCA)
• Homework #2 due May 31 (@ 7pm Pacific Time)
• I'll introduce course projects in Week 6
• Team Formation due June 14
• Course Survey due June 14
Lab Session:
Clustering with K-Means,
GMM, Agglomerative
Questions?
