
Unsupervised Learning

 Here, data points have unknown outcomes (no labels).

 There are two types of unsupervised learning:
o Clustering – identify unknown structure in the data.
o Dimensionality reduction – use structural characteristics to simplify the data.

Common clustering use cases:


 Classification
 Anomaly detection
 Customer segmentation
 Improve supervised learning.
Common dimensionality reduction use cases:
 High-resolution images
o Image processing
o Image tracking
o Image compression
Clustering:
K-Means:
 In K-means, results depend on the initial cluster assignment.
 K-means typically uses Euclidean distance between points and centroids.

 Often, the number of clusters (K) is unclear, and we need an approach to select it. In order to do so, we need evaluation metrics.
o Inertia: sum of squared distances from each point xi to its cluster centroid ck.
 Smaller values correspond to tighter clusters.
 Drawback: the value is sensitive to the number of points in the cluster.
o Distortion: average of squared distances from each point xi to its cluster centroid ck.
 Smaller values correspond to tighter clusters.
 Doesn't generally increase as more points are added.

 When the similarity of the points within a cluster is more important, you should use distortion; if you are more concerned that clusters have similar numbers of points, then you should use inertia.
 Pick the model (initialization) with the lowest inertia or distortion, and use the elbow method to find the best value of K. The elbow method works for both inertia and distortion.
 Inertia penalizes clusters with different numbers of points and leads to more balanced clusters, whereas distortion penalizes the average distance and leads to clusters of more similar points.
 If K-means is slow, we can use MiniBatchKMeans to speed up the process.
 If the elbow in the elbow method is not clear, we can use the silhouette method to find the value of K, as sketched below.
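
A minimal sketch (assuming scikit-learn; the data, the range of K values, and the variable names are illustrative) of the elbow and silhouette approaches described above:

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data; in practice X is your own feature matrix.
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

inertias, silhouettes = [], []
ks = range(2, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)                          # input to the elbow method
    silhouettes.append(silhouette_score(X, km.labels_))   # silhouette method

best_k = list(ks)[int(np.argmax(silhouettes))]
print("Best K by silhouette:", best_k)

# MiniBatchKMeans trades a little accuracy for speed on large data sets.
mbk = MiniBatchKMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)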

Summary of K-Means

Unsupervised algorithms are relevant when we don't have an outcome or labeled variable we are trying to predict.

They are helpful for finding structure within our data set and for partitioning our data set into smaller pieces.
Types of Unsupervised Learning:

 Clustering
o Data: uses unlabeled data to identify unknown structure in the data.
o Example: segmenting customers into different groups.
o Example algorithms: K-means, Hierarchical Agglomerative Clustering, DBSCAN, Mean Shift.

 Dimensionality Reduction
o Data: uses structural characteristics to simplify the data.
o Example: reducing size without losing too much information from the original data set.
o Example algorithms: Principal Component Analysis, Non-negative Matrix Factorization.

Dimensionality reduction is important in the context of large amounts of data.

The Curse of Dimensionality

In theory, a large number of features should improve performance, since models have more data to learn from. But in practice, too many features often lead to worse performance. If you have too many features, several things can go wrong, for example:

- Some features capture spurious correlations: they correlate with the outcome inside your data set but do not hold up once new data comes in.

- Too many features create more noise than signal.

- Algorithms find it hard to sort through many non-meaningful features.

- The number of training examples required increases exponentially with dimensionality.

- Higher dimensionality slows performance.

- Larger data sets are computationally more expensive.

- Higher incidence of outliers.


To fix these problems in real life, it's best to reduce the dimensionality of the data set.

Similar to feature selection, you can use unsupervised machine learning models such as Principal Component Analysis.

Common clustering use cases in the real world

1. Anomaly detection

Example: Fraudulent transactions.

Suspicious fraud patterns, such as credit card transactions with a high volume of attempts, small amounts, and new merchants, tend to form their own small cluster. When such a cluster appears, it is flagged as an anomaly that may indicate fraudulent transactions.

2. Customer segmentation

You could segment customers by recency, frequency, and average number of visits in the last 3 months. Another common type of segmentation is by demographics and level of engagement, for example: single customers, new parents, empty nesters, etc. Combining each segment with its preferred marketing channel lets you use these insights for future marketing campaigns.

3. Improve supervised learning

Logistic regressions per cluster: this means training one model for each segment of your data to try to improve classification, as sketched below.
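
A minimal sketch (assuming scikit-learn; the data X, labels y, and the number of clusters are all illustrative) of training one logistic regression per cluster:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in labeled data; in practice use your own X and y.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Segment the data first, then fit one classifier per segment.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

models = {}
for c in np.unique(clusters):
    mask = clusters == c
    if len(np.unique(y[mask])) < 2:
        continue  # skip degenerate clusters containing a single class
    models[c] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])

# At prediction time, assign each new point to its cluster and use that cluster's model.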

Common uses of Dimension Reduction in the real world

1. Turn high resolution images into compressed images.

This means producing a reduced, more compact version of those images that still contains most of the information that tells us what the image is about.

2. Image tracking

Reduce the noise down to the primary factors that are relevant in a video capture. Reducing the data set in this way can greatly speed up the detection algorithms.
K-means Clustering

K-means clustering is an iterative process in which similar observations are grouped together. The algorithm starts by taking K random points, known as centroids, calculates the distance of each observation to each centroid, and assigns each observation to the nearest centroid. After the first iteration, every point belongs to a cluster.

Next, the centroid of each cluster is recalculated as the mean (average) of all points assigned to that cluster, and every point is reassigned to the nearest updated centroid. This process is repeated until no example is reassigned to another cluster.

Because the algorithm maintains K cluster means (centroids), it is called k-means. The algorithm converges when the centroids no longer move.

We can also run the algorithm multiple times and obtain multiple solutions. By multiple solutions we mean that in each run the clusters eventually stop moving (they converged), but different runs can converge to different final centroid positions.

Advantages and Disadvantages of K-Means

The main advantage of the k-means algorithm is that it is easy to compute. One disadvantage is that it is sensitive to the choice of the initial points, so different initial configurations may yield different results.

To overcome this, there is a smarter initialization of the K-means centroids called k-means++, which helps to avoid getting stuck at local optima. This is the default initialization in scikit-learn's implementation of K-means.

Model Selection, choosing K number of clusters

Sometimes you want to split your data into a predetermined number of groups or
segments. Often, the number of clusters (K) is unclear, and you need an approach to
select it.

A common metric is inertia, defined as the sum of squared distances from each point to its cluster centroid.

Smaller values of inertia correspond to tighter clusters: we are penalizing spread-out clusters and rewarding clusters that are tight around their centroids. The drawback of this metric is that its value is sensitive to the number of points in the clusters: the more points you add, the more the inertia of a cluster keeps growing, even if those points are relatively closer to the centroid than the existing points.

Another metric is distortion, defined as the average of the squared distances from each point to its cluster centroid.

Smaller values of distortion correspond to tighter clusters.

An advantage is that distortion doesn't generally increase as more points are added (unlike inertia): adding points that are closer than average to the centroid actually decreases the average distance.

Inertia Vs. Distortion

Both are measures of entropy per cluster.

Inertia will always increase as more members are added to each cluster, while this will
not be the case with distortion.

When the similarity of the points within a cluster is most relevant, you should use distortion; if you are more concerned that clusters should have a similar number of points, then you should use inertia.
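
A minimal sketch (assuming scikit-learn and NumPy; the data is illustrative) of computing both metrics by hand from a fitted k-means model; scikit-learn only exposes inertia directly (inertia_), so distortion is computed explicitly here:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=1)
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Squared distance from each point to its assigned centroid.
sq_dist = np.sum((X - km.cluster_centers_[km.labels_]) ** 2, axis=1)

inertia = sq_dist.sum()      # sum of squared distances (matches km.inertia_)
distortion = sq_dist.mean()  # average squared distance per point

print(inertia, km.inertia_, distortion)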

Finding the right cluster

To find the clustering with a low value of the chosen metric, you run a few k-means models with different initial configurations, compare the results, and determine which of the different initial configurations leads to the lowest inertia or distortion.
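
In scikit-learn, the n_init parameter automates this: a minimal sketch (assuming scikit-learn; the data is illustrative), where the run with the lowest inertia among the initializations is kept automatically:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# 20 different k-means++ initializations; the best (lowest-inertia) run is returned.
km = KMeans(n_clusters=4, init='k-means++', n_init=20, random_state=0).fit(X)
print(km.inertia_)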

Distance metrics:
Clustering results mainly depend on the distance metric used.
 Euclidean distance, or L2 distance
 Manhattan distance, or L1 distance (always greater than or equal to the L2 distance between the same two points)
 Cosine distance
 Jaccard distance
Manhattan distance is used in business cases where there is very high dimensionality.
Cosine distance (the angle between points) is insensitive to scaling with respect to the origin. It is better for data such as text, where the location of occurrence is less important, and it is more robust to the curse of dimensionality.
Euclidean distance is useful for coordinate-based measurements and is more sensitive to the curse of dimensionality.
Jaccard distance applies to sets, such as sets of word occurrences.

 Jaccard distance can be used, for example, to group similar text documents (see the sketch below).
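
A minimal sketch (assuming SciPy; the vectors are illustrative) of the four distance metrics listed above:

from scipy.spatial.distance import cityblock, cosine, euclidean, jaccard

a = [1.0, 0.0, 2.0, 3.0]
b = [2.0, 1.0, 0.0, 3.0]

print(euclidean(a, b))   # L2 distance
print(cityblock(a, b))   # L1 / Manhattan distance (always >= the L2 distance)
print(cosine(a, b))      # 1 - cosine similarity; insensitive to scaling from the origin
print(jaccard([1, 0, 1, 1], [1, 1, 0, 1]))  # Jaccard distance on boolean occurrence vectors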

Hierarchical Agglomerative clustering (HAC):


 Hierarchical agglomerative clustering successively merges clusters until it reaches a level of convergence (a stopping criterion).
 The algorithm first identifies the pair of points with the minimal distance and turns it into the first cluster; then the pair of points with the second minimal distance forms the second cluster, and so on. If the algorithm kept merging the closest pairs, all the points would eventually end up in just one cluster, which is why HAC also needs a stopping criterion.
 There are a few linkage types, or methods to measure the distance between clusters; these are the most common:

Single linkage: minimum pairwise distance between clusters.

 It takes the distance between specific points and declares that as the distance between two clusters; among all these pairwise linkages it finds the minimum, and the corresponding clusters are combined as we move up to a higher level of the hierarchy.
 Pros: it helps ensure a clear separation between clusters.
 Cons: it won't be able to separate clusters cleanly if there is some noise between two different clusters.

Complete linkage: maximum pairwise distance between clusters.


 Instead of taking the minimum distance between the points of each pair of clusters, it takes the maximum value. Then, from those maximum distances, it decides which one is the smallest and merges those clusters as we move up the hierarchy.
 Pros: it does a much better job of separating the clusters if there is a bit of noise or overlapping points between two different clusters.
 Cons: it tends to break apart a larger existing cluster, depending on where the maximum distance between points ends up lying.

Average linkage: Average pairwise distance between clusters.

 Takes the average of all the points in a given cluster and uses those averages (cluster centroids) to determine the distance between the different clusters.
 Pros: the same as single and complete linkage.
 Cons: it also tends to break apart a larger existing cluster.

Ward linkage: Cluster merge is based on inertia.

 Computes the inertia for all pairs of clusters and picks the pair whose merge minimizes the value of inertia.
 The pros and cons are the same as for average linkage.

Syntax for Agglomerative Clusters

# First, import AgglomerativeClustering.

from sklearn.cluster import AgglomerativeClustering

# Then create an instance of the class (ward linkage uses Euclidean distance).

agg = AgglomerativeClustering(n_clusters=3, linkage='ward')

# Finally, fit the instance on the data and obtain the cluster labels.
# Note: AgglomerativeClustering has no predict method for new data,
# so fit_predict is used on the data being clustered.

y_predict = agg.fit_predict(X1)

 Agglomerative clustering is a bottom-up approach.
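
A complementary sketch (assuming SciPy; the data X1 is illustrative): the same linkage types can be explored with scipy.cluster.hierarchy, which also supports plotting the dendrogram:

from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X1, _ = make_blobs(n_samples=100, centers=3, random_state=0)  # stand-in data

Z = linkage(X1, method='ward')                   # also: 'single', 'complete', 'average'
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the hierarchy into 3 clusters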

DBSCAN:
 Stands for Density-Based Spatial Clustering of Applications with Noise.
 It helps us find outliers.
 It clusters the dense regions of the data rather than partitioning all of it; noise points are not part of any cluster.
 If a point does not have n_clu neighbors (min_samples in scikit-learn) within the eps radius and is not reachable from a core point, then it is a noise point.
 DBSCAN cannot predict cluster membership for new data: calling db.predict raises an error; use fit_predict on the data being clustered (see the sketch below).
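
A minimal sketch (assuming scikit-learn; the data and the eps / min_samples values are illustrative and should be tuned):

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

db = DBSCAN(eps=0.7, min_samples=5)
labels = db.fit_predict(X)   # no predict() for new data; fit_predict labels the fitted data

print((labels == -1).sum())  # points labeled -1 are noise / outliers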

Mean Shift:
 A partitioning algorithm that assigns points to the nearest cluster centroid.
 The algorithm ends when all points are assigned to a cluster.
 For K-means, the centroid is the mean of all points in a cluster; for mean shift, the centroid is the highest point of local density.
 Local density is estimated with a weighted mean within a window, assigning more weight to the points closer to the original point. The kernel used here can be an RBF kernel, similar to a Gaussian kernel.
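
A minimal sketch (assuming scikit-learn; the data is illustrative, and the window bandwidth is estimated automatically here):

from sklearn.cluster import MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ms = MeanShift().fit(X)
print(ms.cluster_centers_)  # modes, i.e. the highest points of local density
labels = ms.predict(X)      # unlike DBSCAN, MeanShift can assign new points to clusters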
Comparing different clustering algorithms: (comparison figure not reproduced here)
Dimensionality reduction:
A very common situation, especially with enterprise datasets, is that the data contains many features.
 Data can often be represented by fewer dimensions (features).
o Reduce dimensionality by selecting a subset of the features (feature elimination).
o Or combine features through linear and non-linear transformations (e.g., PCA and non-negative matrix factorization).

Principal Component Analysis (PCA):

PCA reduces the number of dimensions while trying not to lose important information, maintaining as much of the variance as possible. It uncovers the underlying important features.

Singular Value Decomposition (SVD) is a matrix factorization method normally used for PCA. The dataset we work with does not need to be square: the original data matrix A is an m-by-n matrix, with m and n not necessarily equal. We can decompose A into the matrices U, S and V. U and V can be thought of as rotations in space, one in the m-by-m space and one in the n-by-n space; they encode the directions of the principal vectors (V1, V2, ...) but not their lengths, so they are more auxiliary or technical matrices. The real geometric idea lies with S, which stores the actual lengths of those vectors: the longer vectors tell you which directions should be primary when deciding where to project your data. S is a diagonal matrix, meaning its only non-zero entries lie on the diagonal. To reduce dimensionality, we keep only the top k components, with k < n.
Importance of feature scaling:

 PCA and SVD (Singular Value Decomposition) seek the directions (vectors) that capture the most variance.
 Variance is sensitive to the scale of each axis.
 Therefore, the data must be scaled before applying PCA (see the sketch below).
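
A minimal sketch (assuming scikit-learn; the dataset and n_components are illustrative) of scaling the data and then projecting it onto the top principal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Scale first, because variance (and therefore PCA) is sensitive to axis scale.
pca = make_pipeline(StandardScaler(), PCA(n_components=2))
X_2d = pca.fit_transform(X)

print(pca.named_steps['pca'].explained_variance_ratio_)  # variance captured per component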
Other popular manifold learning methods exist, such as Isomap, which uses nearest neighbors and tries to maintain the nearest-neighbor ordering, or t-SNE, which tries to keep similar points close together and dissimilar points further apart and can be very good for visualization. There are several ways to do the decomposition, and generally it is worth trying a few out. A good approach is to try them and then, if you are able to come down to two or three dimensions, use EDA and visualization to see how well you were able to come up with clusters or maintain the amount of variance that was originally there.
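
A minimal sketch (assuming scikit-learn; the dataset and parameters are illustrative) of the two manifold learning methods mentioned above, reducing to two dimensions for visualization:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, Isomap

X = load_digits().data

X_iso = Isomap(n_neighbors=5, n_components=2).fit_transform(X)
X_tsne = TSNE(n_components=2, init='pca', random_state=0).fit_transform(X)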

Non-Negative Matrix Factorization (NMF):

Non-Negative Matrix Factorization is another way of reducing the number of dimensions. Similar to PCA, it is also a matrix decomposition method, in the form V = W * H.

The main difference is that it can only be applied to matrices whose entries are non-negative, for example:

 pixel intensities in an image matrix
 attributes that can be zero or higher

In the case of word and vocabulary recognition, each row of the matrix can be considered a document and each column a term (word); the factorization then describes the documents in terms of a small number of topics.

NMF has proven to be powerful for:

 word and vocabulary recognition
 image processing
 text mining
 transcribing
 encoding and decoding
 decomposition of video, music, or images

There are advantages and disadvantages to only dealing with non-negative values.

An advantage is that NMF leads to features that tend to be more interpretable. For example, in facial recognition, the decomposed components correspond to something more interpretable, such as the nose, the eyebrows, or the mouth.

A disadvantage is that NMF truncates negative values by default to impose the constraint of only non-negative values. This truncation tends to lose more information than other decomposition methods.

Unlike PCA, NMF does not have to use orthogonal latent vectors and can end up using vectors that point in the same direction.

NMF for NLP

In the case of Natural Language Processing, NMF works as described below, given these inputs, parameters to tune, and outputs:

Inputs

Vectorized text inputs, usually pre-processed with a count vectorizer or a Term Frequency - Inverse Document Frequency (TF-IDF) vectorizer.

Parameters to tune

The main two parameters are:

 Number of Topics
 Text Preprocessing (stop words, min/max document frequency, parts of speech, etc)

Output

The output of NMF is two matrices:

1. The W matrix, telling us how each document relates to the different topics.

2. The H matrix, telling us how each topic is built from the original terms; multiplying W by H (approximately) reconstructs our original documents.

Syntax

The syntax consists of importing the class containing the decomposition method:

from sklearn.decomposition import NMF

creating the instance of the class:

nmf=NMF(n_components=3, init='random')

and fit the instance and create a transformed version of the data:

x_nmf = nmf.fit_transform(X)
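
Tying the pieces together, a minimal sketch (assuming scikit-learn; the documents and the number of topics are illustrative) of TF-IDF vectorization followed by NMF topic extraction:

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about the markets",
]

tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)                 # documents x terms

nmf = NMF(n_components=2, init='random', random_state=0)
W = nmf.fit_transform(X)                      # documents x topics
H = nmf.components_                           # topics x terms

terms = tfidf.get_feature_names_out()
for topic in H:
    print([terms[i] for i in topic.argsort()[-3:]])  # top 3 terms per topic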
