Unsupervised Learning Notes
Often, the number of clusters (K) is unclear, and we need an approach to select
it. In order to do so, we need evaluation metrics.
o Inertia: Sum of squared distances from each point x_i to the centroid c_k of its cluster.
Smaller values correspond to tighter clusters.
Drawback: its value is sensitive to the number of points in the cluster.
o Distortion: Average of squared distances from each point x_i to the centroid c_k of its cluster.
Smaller values correspond to tighter clusters.
Doesn’t generally increase as more points are added.
When the similarity of points in the cluster is more important, you should use
distortion. And if you are more concerned that clusters have similar numbers of
points, then you should use inertia.
Always pick the model with the lowest inertia value. Use the elbow method to find the best value of k; it works for both inertia and distortion.
Inertia penalizes differing numbers of points within clusters and leads to more balanced cluster sizes, whereas distortion penalizes the average distance and leads to more similar clusters.
If K-Means is slow, we can use MiniBatchKMeans to speed up the process.
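For example, a minimal sketch (assuming scikit-learn and synthetic blob data, with k=5 and batch_size=1024 as example values) where MiniBatchKMeans acts as a faster drop-in for KMeans:
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=10000, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
# MiniBatchKMeans fits on small random batches, trading a little accuracy for speed
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=0).fit(X)
print(km.inertia_, mbk.inertia_)   # the two inertia values should be comparable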
If the elbow in the elbow plot is not clear, we can use the silhouette method to find the value of k.
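A minimal sketch of both approaches, assuming scikit-learn and synthetic blob data: the inertia printed per k gives the elbow curve, and silhouette_score gives the silhouette method.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
X, _ = make_blobs(n_samples=2000, centers=4, random_state=0)
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # elbow method: look for the k where inertia stops dropping sharply
    # silhouette method: pick the k with the highest average silhouette score
    print(k, km.inertia_, silhouette_score(X, km.labels_))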
Summary of K-Means
K-Means is helpful for finding structure within our data set and for partitioning the data set into smaller pieces.
Types of Unsupervised Learning:
[Table: Type of Unsupervised Learning | Data | Example Algorithms]
- Some features can be spurious correlations: they correlate within your data set but do not hold outside it as new data comes in.
- Algorithms find it hard to sort through non-meaningful features if you have too many features.
Similar to feature selection, you can use Unsupervised Machine Learning models such as Principal Component Analysis (PCA) to reduce the number of features.
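For instance, a minimal sketch (scikit-learn, synthetic data with 50 features, n_components=5 chosen arbitrarily) that uses PCA to compress many features into a few components:
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
# synthetic data set with 50 features, only 5 of which carry information
X, _ = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=0)
pca = PCA(n_components=5)             # keep 5 principal components (example value)
X_reduced = pca.fit_transform(X)      # X_reduced has shape (500, 5)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained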
1. Anomaly detection
Suspicious fraud patterns, such as small clusters of credit card transactions with a high volume of attempts, small amounts, and new merchants. These transactions form a new cluster that is flagged as an anomaly, suggesting that fraudulent transactions may be happening.
2. Customer segmentation
You could segment customers by recency, frequency, and average amount of visits in the last 3 months. Another common type of segmentation is by demographics and level of engagement, for example: single customers, new parents, empty nesters, etc., combined with each segment's preferred marketing channel, so you can use these insights for future marketing campaigns.
Logistic regressions per cluster: this means training one model for each segment of your data to try to improve classification.
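A minimal sketch of that idea, assuming scikit-learn and a synthetic labeled data set (3 segments chosen arbitrarily): cluster the rows with K-Means, then fit one LogisticRegression per segment.
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
# segment the rows with K-Means, then train one classifier per segment
segments = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
models = {}
for seg in set(segments):
    mask = segments == seg
    # note: each segment needs examples of both classes for the fit to succeed
    models[seg] = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])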
1. Image compression
This means arriving at a reduced, more compact version of the images that still contains most of the data needed to tell what each image is about.
2. Image tracking
Reduce the noise down to the primary factors that are relevant in a video capture. Reducing the data set can greatly improve the computational efficiency of the detection algorithms.
K-means Clustering
Next, each point is assigned to its nearest centroid, and the centroid of each cluster is recalculated as the mean of all points assigned to that cluster. We keep repeating this process until no example is reassigned to a different cluster. The k centroids are the means of their clusters, hence the name k-means. The algorithm converges when the centroids no longer move.
We can also run the algorithm multiple times and get multiple solutions. By multiple solutions we mean that the clusters have converged (the centroids no longer move), but different runs can converge in different places, ending with different centroids.
To overcome this, there is a smarter initialization of the K-means centroids called K-means++, which helps to avoid getting stuck in local optima. It is the default initialization in scikit-learn's K-means implementation.
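A minimal sketch, assuming scikit-learn and synthetic blob data, contrasting the default 'k-means++' initialization with a purely random one:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
# 'k-means++' (the default) spreads the initial centroids out
km_pp = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0).fit(X)
# a single purely random initialization is more likely to get stuck in a poor local optimum
km_rand = KMeans(n_clusters=5, init='random', n_init=1, random_state=0).fit(X)
print(km_pp.inertia_, km_rand.inertia_)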
Sometimes you want to split your data into a predetermined number of groups or
segments. Often, the number of clusters (K) is unclear, and you need an approach to
select it.
A common metric is Inertia, defined as the sum of squared distances from each point to its cluster centroid.
Smaller values of inertia correspond to tighter clusters: we are penalizing spread-out clusters and rewarding clusters that are tighter around their centroids.
The drawback of this metric is that its value is sensitive to the number of points in the clusters. The more points you add, the more the inertia of a cluster keeps growing, even if those points are closer to the centroid than the existing points.
Another metric is Distortion, defined as the average of the squared distances from each point to its cluster centroid.
An advantage is that distortion does not necessarily increase as more points are added (unlike inertia): adding points that are close to the centroid can actually decrease the average distance.
Inertia will always increase as more members are added to each cluster, while this will
not be the case with distortion.
When the similarity of the points in the cluster is most relevant, you should use distortion; if you are more concerned that clusters should have a similar number of points, then you should use inertia.
To find the clustering with the lowest metric value, you run a few k-means models with different initial configurations, compare the results, and determine which of the initial configurations leads to the lowest inertia or distortion.
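A minimal sketch of that comparison, assuming scikit-learn and synthetic blob data: five random initial configurations are fitted, and for each one the inertia and the distortion (here taken as inertia averaged over the number of points) are printed.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=1500, centers=4, random_state=0)
best = None
for seed in range(5):                        # 5 different initial configurations
    km = KMeans(n_clusters=4, init='random', n_init=1, random_state=seed).fit(X)
    distortion = km.inertia_ / len(X)        # average squared distance to the centroid
    print(seed, km.inertia_, distortion)
    if best is None or km.inertia_ < best.inertia_:
        best = km                            # keep the run with the lowest inertia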
Distance metrics:
Clustering mainly depends on the distance metric used.
Euclidean distance or L2 distance
Manhattan distance or L1 distance (always greater than or equal to the L2 distance)
Cosine distance
Jaccard distance
Manhattan is used in business cases where there is very high dimensionality.
Cosine distance (the angle between points) is insensitive to scaling with respect to the origin. It is better for data such as text, where the location of occurrence is less important, and it is more robust to the curse of dimensionality.
Euclidean distance is useful for coordinate-based measures and is more sensitive to the curse of dimensionality.
Jaccard distance applies to sets like word occurrences.
Jaccard distance can be used to group similar text documents.
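A minimal sketch computing each of these distances with SciPy (the points and the binary word-occurrence vectors are made up for illustration):
from scipy.spatial import distance
a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(distance.euclidean(a, b))    # L2: straight-line distance
print(distance.cityblock(a, b))    # L1 / Manhattan: sum of absolute differences
print(distance.cosine(a, b))       # ~0 here, since b is just a scaled copy of a
# Jaccard compares sets, here encoded as binary word-occurrence vectors
u, v = [1, 0, 1, 1], [1, 1, 0, 1]
print(distance.jaccard(u, v))      # fraction of differing positions among those set in either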
Single linkage: it takes the distance between specific points and declares that as the distance between two clusters, then, across all these pairwise linkages, it finds the minimum and combines those clusters as we move up to a higher level of the hierarchy.
Pros: it helps ensure a clear separation between clusters.
Cons: it won't be able to separate cleanly if there is noise between two different clusters.
Average linkage: takes the average of all the points in a given cluster and uses those averages (cluster centroids) to determine the distance between the different clusters.
Pros: the same as single and complete linkage.
Cons: it also tends to break apart a larger existing cluster.
Ward linkage: computes the inertia for all pairs of points and picks the pair that will ultimately minimize the value of inertia.
The pros and cons are the same as for average linkage.
# and finally, fit the instance on the data and get the cluster labels
# (n_clusters=3 and linkage='ward' are example settings)
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
# AgglomerativeClustering has no predict(); fit_predict returns the labels for X1
y_predict = agg.fit_predict(X1)
DBSCAN:
Stands for Density-Based Spatial Clustering of Applications with Noise.
It helps us to find outliers.
It clusters data rather than partitioning it.
Noise points are not part of any cluster.
If a point does not have n_clu neighbors and is not reachable from a core point, then it is a noise point.
min_samples corresponds to n_clu.
In DBSCAN, we cannot predict clusters for new data; calling db.predict is an error.
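A minimal sketch, assuming scikit-learn and synthetic two-moons data: eps and min_samples define the neighborhood, labels of -1 mark noise, and there is no predict method, only fit_predict on the data at hand.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5)    # min_samples plays the role of n_clu above
labels = db.fit_predict(X)             # no predict() exists; fit_predict labels the data at hand
print(np.unique(labels))               # any -1 labels are the noise points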
Mean Shift:
A partitioning algorithm that assigns points to the nearest cluster centroid; the algorithm ends when all points are assigned to a cluster.
For K-Means, the centroid is the mean of all points in a cluster; for Mean Shift, the centroid is the point of highest local density.
We can think of the weighted mean as assigning more weight to the points closer to the original point within our window.
The kernel used here can be an RBF kernel, similar to a Gaussian kernel.
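A minimal sketch, assuming scikit-learn and synthetic blob data: the bandwidth acts as the window size and can be estimated from the data.
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
# the bandwidth sets the window size used for the weighted (kernel) mean
bandwidth = estimate_bandwidth(X, quantile=0.2)
ms = MeanShift(bandwidth=bandwidth).fit(X)
print(len(ms.cluster_centers_))        # the number of clusters is found automatically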
Compare different clustering:
Dimensionality reduction:
There is a very common situation, especially with enterprise data sets, where the data contains many features.
Data can be represented by fewer dimensions (features).
o Reduce dimensionality by selecting a subset of the features (feature elimination).
o Combine features using linear and non-linear transformations (e.g., PCA and non-negative matrix factorization).
The main difference from PCA is that NMF can only be applied to matrices whose values are non-negative, for example:
pixels in an image matrix
positive attributes that can be zero or higher
In the case of word and vocabulary recognition, each row in the matrix can be considered a
document, while each column can be considered a topic.
There are advantages and disadvantages to only dealing with non-negative values.
An advantage is that NMF leads to features that tend to be more interpretable. For example, in facial recognition, the decomposed components map to something more interpretable, such as the nose, the eyebrows, or the mouth.
A disadvantage is that NMF truncates negative values by default to impose the added constraint of
only positive values. This truncation tends to lose more information than other decomposition
methods.
Unlike PCA, it does not have to use orthogonal latent vectors, and can end up using vectors that
point in the same direction.
In the case of Natural Language Processing, NMF works as below given these inputs, parameters to
tune, and outputs:
Inputs
Given vectorized inputs, which are usually pre-processed using a count vectorizer or a Term Frequency - Inverse Document Frequency (TF-IDF) vectorizer.
Parameters to tune
Number of Topics
Text Preprocessing (stop words, min/max document frequency, parts of speech, etc)
Output
A factorization into two non-negative matrices: one giving the topic weights per document and one giving the term weights per topic.
Syntax
The syntax consists of importing the class containing the dimensionality reduction method:
from sklearn.decomposition import NMF
creating an instance of the class:
nmf=NMF(n_components=3, init='random')
and fitting the instance to the data to create a transformed version of the data:
x_nmf=nmf.fit_transform(X)
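A slightly fuller usage sketch of the NLP pipeline described above, assuming scikit-learn and a made-up toy corpus: TF-IDF vectorization followed by NMF, with the number of topics as the main parameter to tune.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["the cat sat on the mat",
        "dogs and cats are friendly pets",
        "stock markets fell sharply today",
        "investors worry about falling markets"]
tfidf = TfidfVectorizer(stop_words='english')
V = tfidf.fit_transform(docs)               # documents x terms, all values non-negative
nmf = NMF(n_components=2, init='nndsvd', random_state=0)   # 2 topics (example value)
W = nmf.fit_transform(V)                    # documents x topics
H = nmf.components_                         # topics x terms
terms = tfidf.get_feature_names_out()
for topic in H:
    print([terms[i] for i in topic.argsort()[-3:]])   # top 3 terms per topic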