Unsupervised Learning Notes
Often, the number of clusters (K) is unclear, and we need an approach to select
it. In order to do so, we need evaluation metrics.
o Inertia: Sum of squared distances from each point x_i to the centroid c_k of its cluster.
Smaller values correspond to tighter clusters.
Drawback: its value is sensitive to the number of points in the cluster.
o Distortion: Average of squared distances from each point x_i to the centroid c_k of its cluster.
Smaller values correspond to tighter clusters.
Doesn’t generally increase as more points are added.
When the similarity of points in the cluster is more important, you should use
distortion. And if you are more concerned that clusters have similar numbers of
points, then you should use inertia.
Always pick the model with the lowest inertia value. Use the elbow method to find the best value of k; it works for both inertia and distortion.
Inertia penalizes differing numbers of points within clusters and leads to more balanced cluster sizes, whereas distortion penalizes the average distance and leads to more similar clusters.
If K-Means is slow, we can use MiniBatchKMeans to speed up the process.
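For example, a minimal sketch (assuming scikit-learn and synthetic blob data, with k=5 and batch_size=1024 as example values) where MiniBatchKMeans acts as a faster drop-in for KMeans:
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
# MiniBatchKMeans fits on small random batches, trading a little accuracy for speed
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=0).fit(X)
print(km.inertia_, mbk.inertia_)   # the two inertia values should be comparable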
If the elbow in the elbow plot is not clear, we can use the silhouette method to find the value of k.
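A minimal sketch of both approaches, assuming scikit-learn and synthetic blob data: the inertia printed per k gives the elbow curve, and silhouette_score gives the silhouette method.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # elbow method: look for the k where inertia stops dropping sharply
    # silhouette method: pick the k with the highest average silhouette score
    print(k, km.inertia_, silhouette_score(X, km.labels_))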
Summary of K-Means
K-Means is helpful for finding structure within our data set and for partitioning the data set into smaller pieces.
Types of Unsupervised Learning:
[Table: Type of Unsupervised Learning | Data | Example Algorithms]
- Some features can be spurious correlations: they correlate within your data set but do not hold outside it as new data comes in.
- Algorithms find it hard to sort through non-meaningful features if you have too many features.
Similar to feature selection, you can use Unsupervised Machine Learning models such as Principal Component Analysis (PCA) to reduce the number of features.
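For instance, a minimal sketch (scikit-learn, synthetic data with 50 features, n_components=5 chosen arbitrarily) that uses PCA to compress many features into a few components:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
# synthetic data set with 50 features, only 5 of which carry information
X, _ = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)
pca = PCA(n_components=5)             # keep 5 principal components (example value)
X_reduced = pca.fit_transform(X)      # X_reduced has shape (500, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained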
1. Anomaly detection
Suspicious fraud patterns, such as small clusters of credit card transactions with a high volume of attempts, small amounts, and new merchants. These transactions form a new cluster that is flagged as an anomaly, suggesting that fraudulent transactions may be happening.
2. Customer segmentation
You could segment customers by recency, frequency, and average amount of visits in the last 3 months. Another common type of segmentation is by demographics and level of engagement, for example: single customers, new parents, empty nesters, etc., combined with each segment's preferred marketing channel, so you can use these insights for future marketing campaigns.
Logistic regressions per cluster: this means training one model for each segment of your data to try to improve classification.
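A minimal sketch of that idea, assuming scikit-learn and a synthetic labeled data set (3 segments chosen arbitrarily): cluster the rows with K-Means, then fit one LogisticRegression per segment.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# segment the rows with K-Means, then train one classifier per segment
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
models = {}
for seg in set(segments):
    mask = segments == seg
    # note: each segment needs examples of both classes for the fit to succeed
    models[seg] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])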
1. Image compression
This means arriving at a reduced, more compact version of the images that still contains most of the data needed to tell what each image is about.
2. Image tracking
Reduce the noise down to the primary factors that are relevant in a video capture. Reducing the data set can greatly improve the computational efficiency of the detection algorithms.
K-means Clustering
Next, each point is assigned to its nearest centroid, and the centroid of each cluster is recalculated as the mean of all points assigned to that cluster. We keep repeating this process until no example is reassigned to a different cluster. The k centroids are the means of their clusters, hence the name k-means. The algorithm converges when the centroids no longer move.
We can also run the algorithm multiple times and get multiple solutions. By multiple solutions we mean that the clusters have converged (the centroids no longer move), but different runs can converge in different places, ending with different centroids.
To overcome this, there is a smarter initialization of the K-means centroids called K-means++, which helps to avoid getting stuck in local optima. It is the default initialization in scikit-learn's K-means implementation.
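A minimal sketch, assuming scikit-learn and synthetic blob data, contrasting the default 'k-means++' initialization with a purely random one:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
# 'k-means++' (the default) spreads the initial centroids out
km_pp = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)
# a single purely random initialization is more likely to get stuck in a poor local optimum
km_rand = KMeans(n_clusters=5, init='random', n_init=1, random_state=0).fit(X)
print(km_pp.inertia_, km_rand.inertia_)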
Sometimes you want to split your data into a predetermined number of groups or
segments. Often, the number of clusters (K) is unclear, and you need an approach to
select it.
A common metric is Inertia, defined as the sum of squared distances from each point to its cluster centroid.
Smaller values of inertia correspond to tighter clusters: we are penalizing spread-out clusters and rewarding clusters that are tighter around their centroids.
The drawback of this metric is that its value is sensitive to the number of points in the clusters. The more points you add, the more the inertia of a cluster keeps growing, even if those points are closer to the centroid than the existing points.
Another metric is Distortion, defined as the average of the squared distances from each point to its cluster centroid.
An advantage is that distortion does not necessarily increase as more points are added (unlike inertia): adding points that are close to the centroid can actually decrease the average distance.
Inertia will always increase as more members are added to each cluster, while this will
not be the case with distortion.
When the similarity of the points in the cluster is most relevant, you should use distortion; if you are more concerned that clusters should have a similar number of points, then you should use inertia.
To find the clustering with the lowest metric value, you run a few k-means models with different initial configurations, compare the results, and determine which of the initial configurations leads to the lowest inertia or distortion.
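A minimal sketch of that comparison, assuming scikit-learn and synthetic blob data: five random initial configurations are fitted, and for each one the inertia and the distortion (here taken as inertia averaged over the number of points) are printed.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1500, centers=4, random_state=0)
best = None
for seed in range(5):                        # 5 different initial configurations
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    distortion = km.inertia_ / len(X)        # average squared distance to the centroid
    print(seed, km.inertia_, distortion)
    if best is None or km.inertia_ < best.inertia_:
        best = km                            # keep the run with the lowest inertia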
Distance metrics:
Clustering mainly depends on the distance metric used.
Euclidean distance or L2 distance
Manhattan distance or L1 distance (always greater than or equal to the L2 distance)
Cosine distance
Jaccard distance
Manhattan is used in business cases where there is very high dimensionality.
Cosine distance (the angle between points) is insensitive to scaling with respect to the origin. It is better for data such as text, where the location of occurrence is less important, and it is more robust to the curse of dimensionality.
Euclidean distance is useful for coordinate-based measures and is more sensitive to the curse of dimensionality.
Jaccard distance applies to sets like word occurrences.
Jaccard distance can be used to group similar text documents.
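A minimal sketch computing each of these distances with SciPy (the points and the binary word-occurrence vectors are made up for illustration):
from scipy.spatial import distance
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(distance.euclidean(a, b))    # L2: straight-line distance
print(distance.cityblock(a, b))    # L1 / Manhattan: sum of absolute differences
print(distance.cosine(a, b))       # ~0 here, since b is just a scaled copy of a
# Jaccard compares sets, here encoded as binary word-occurrence vectors
u, v = [1, 0, 1, 1], [1, 1, 0, 1]
print(distance.jaccard(u, v))      # fraction of differing positions among those set in either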
Single linkage: it takes the distance between specific points and declares that as the distance between two clusters, then, across all these pairwise linkages, it finds the minimum and combines those clusters as we move up to a higher level of the hierarchy.
Pros: it helps ensure a clear separation between clusters.
Cons: it won't be able to separate cleanly if there is noise between two different clusters.
Average linkage: takes the average of all the points in a given cluster and uses those averages (cluster centroids) to determine the distance between the different clusters.
Pros: the same as single and complete linkage.
Cons: it also tends to break apart a larger existing cluster.
Ward linkage: computes the inertia for all pairs of points and picks the pair that will ultimately minimize the value of inertia.
The pros and cons are the same as for average linkage.
# and finally, fit the instance on the data and get the cluster labels
# (n_clusters=3 and linkage='ward' are example settings)
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
# AgglomerativeClustering has no predict(); fit_predict returns the labels for X1
y_predict = agg.fit_predict(X1)
DBSCAN:
Stands for Density-Based Spatial Clustering of Applications with Noise.
It helps us to find outliers.
It clusters data rather than partitioning it.
Noise points are not part of any cluster.
If a point does not have n_clu neighbors and is not reachable from a core point, then it is a noise point.
min_samples corresponds to n_clu.
In DBSCAN, we cannot predict clusters for new data; calling db.predict is an error.
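A minimal sketch, assuming scikit-learn and synthetic two-moons data: eps and min_samples define the neighborhood, labels of -1 mark noise, and there is no predict method, only fit_predict on the data at hand.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5)    # min_samples plays the role of n_clu above
labels = db.fit_predict(X)             # no predict() exists; fit_predict labels the data at hand
print(np.unique(labels))               # any -1 labels are the noise points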
Mean Shift:
A partitioning algorithm that assigns points to the nearest cluster centroid; the algorithm ends when all points are assigned to a cluster.
For K-Means, the centroid is the mean of all points in a cluster; for Mean Shift, the centroid is the point of highest local density.
We can think of the weighted mean as assigning more weight to the points closer to the original point within our window.
The kernel used here can be an RBF kernel, similar to a Gaussian kernel.
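A minimal sketch, assuming scikit-learn and synthetic blob data: the bandwidth acts as the window size and can be estimated from the data.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
# the bandwidth sets the window size used for the weighted (kernel) mean
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)
print(len(ms.cluster_centers_))        # the number of clusters is found automatically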
Compare different clustering:
Dimensionality reduction:
There is a very common situation, especially with enterprise data sets, where the data contains many features.
Data can be represented by fewer dimensions (features).
o Reduce dimensionality by selecting a subset of the features (feature elimination).
o Combine features using linear and non-linear transformations (e.g., PCA and non-negative matrix factorization).
The main difference from PCA is that NMF can only be applied to matrices whose values are non-negative, for example:
pixels in an image matrix
positive attributes that can be zero or higher
In the case of word and vocabulary recognition, each row in the matrix can be considered a
document, while each column can be considered a topic.
There are advantages and disadvantages to only dealing with non-negative values.
An advantage is that NMF leads to features that tend to be more interpretable. For example, in facial recognition, the decomposed components map to something more interpretable, such as the nose, the eyebrows, or the mouth.
A disadvantage is that NMF truncates negative values by default to impose the added constraint of
only positive values. This truncation tends to lose more information than other decomposition
methods.
Unlike PCA, it does not have to use orthogonal latent vectors, and can end up using vectors that
point in the same direction.
In the case of Natural Language Processing, NMF works as below given these inputs, parameters to
tune, and outputs:
Inputs
Given vectorized inputs, which are usually pre-processed using a count vectorizer or a Term Frequency - Inverse Document Frequency (TF-IDF) vectorizer.
Parameters to tune
Number of Topics
Text Preprocessing (stop words, min/max document frequency, parts of speech, etc)
Output
A factorization into two non-negative matrices: one giving the topic weights per document and one giving the term weights per topic.
Syntax
The syntax consists of importing the class containing the dimensionality reduction method:
from sklearn.decomposition import NMF
creating an instance of the class:
nmf=NMF(n_components=3, init='random')
and fitting the instance to the data to create a transformed version of the data:
x_nmf=nmf.fit_transform(X)
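A slightly fuller usage sketch of the NLP pipeline described above, assuming scikit-learn and a made-up toy corpus: TF-IDF vectorization followed by NMF, with the number of topics as the main parameter to tune.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the cat sat on the mat",
        "dogs and cats are friendly pets",
        "stock markets fell sharply today",
        "investors worry about falling markets"]
tfidf = TfidfVectorizer(stop_words='english')
V = tfidf.fit_transform(docs)               # documents x terms, all values non-negative
nmf = NMF(n_components=2, init='nndsvd', random_state=0)   # 2 topics (example value)
W = nmf.fit_transform(V)                    # documents x topics
H = nmf.components_                         # topics x terms
terms = tfidf.get_feature_names_out()
for topic in H:
    print([terms[i] for i in topic.argsort()[-3:]])   # top 3 terms per topic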