
Unit IV

Unsupervised Learning Techniques


Although most of the applications of Machine Learning
today are based on supervised learning (and as a result,
this is where most of the investment goes), the vast
majority of the available data is unlabeled: we have the
input features X, but we do not have the labels y.
The computer scientist Yann LeCun famously said that
“if intelligence was a cake, unsupervised learning would
be the cake, supervised learning would be the icing on
the cake, and reinforcement learning would be the
cherry on the cake.”
In other words, there is a huge potential in
unsupervised learning that we have only barely started
to sink our teeth into.
Say you want to create a system that will take a few
pictures of each item on a manufacturing production
line and detect which items are defective.
 You can fairly easily create a system that will take
pictures automatically, and this might give you thousands
of pictures every day. You can then build a
reasonably large dataset in just a few weeks.
 But wait, there are no labels! If you want to train a
regular binary classifier that will predict whether an
item is defective or not, you will need to label every
single picture as “defective” or “normal.”
This will generally require human experts to sit down
and manually go through all the pictures. This is a
long, costly, and tedious task, so it will usually only be
done on a small subset of the available pictures.
 As a result, the labeled dataset will be quite small,
and the classifier’s performance will be disappointing.
Moreover, every time the company makes any change
to its products, the whole process will need to be
started over from scratch.
Wouldn’t it be great if the algorithm could just exploit
the unlabeled data without needing humans to label
every picture? Enter unsupervised learning.
Clustering
As you enjoy a hike in the mountains, you stumble upon a
plant you have never seen before. You look around and you
notice a few more.
They are not identical, yet they are sufficiently similar for you
to know that they most likely belong to the same species (or
at least the same genus).
 You may need a botanist to tell you what species that is, but
you certainly don’t need an expert to identify groups of
similar-looking objects.
This is called clustering: it is the task of identifying similar
instances and assigning them to clusters, or groups of similar
instances.
 Just like in classification, each instance gets assigned to a group.
However, unlike classification, clustering is an unsupervised task.
 Consider the figure below: on the left is the iris dataset, where each instance's
species (i.e., its class) is represented with a different marker. It is a
labeled dataset, for which classification algorithms such as Logistic
Regression, SVMs, or Random Forest classifiers are well suited.
 On the right is the same dataset, but without the labels, so you
cannot use a classification algorithm anymore. This is where
clustering algorithms step in: many of them can easily detect the
lower-left cluster.
 It is also quite easy to see with our own eyes, but it is not so obvious
that the upper-right cluster is composed of two distinct sub-clusters.
 That said, the dataset has two additional features (sepal length and
width), not represented here, and clustering algorithms can make
good use of all features, so in fact they identify the three clusters
fairly well (e.g., using a Gaussian mixture model, only 5 instances
out of 150 are assigned to the wrong cluster).
K-Means
K-Means clustering is an unsupervised learning algorithm
that is used to solve clustering problems in machine
learning or data science.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm,
which groups the unlabeled dataset into different clusters.
Here K defines the number of pre-defined clusters that
need to be created in the process; for example, if K=2, there will be
two clusters, for K=3, there will be three clusters, and
so on.
“It is an iterative algorithm that divides the unlabeled dataset
into k different clusters in such a way that each data point
belongs to only one group that has similar properties.”
It allows us to cluster the data into different groups
and provides a convenient way to discover the categories of
groups in an unlabeled dataset on its own, without the
need for any training.
It is a centroid-based algorithm, where each cluster is
associated with a centroid. The main aim of this
algorithm is to minimize the sum of distances between
each data point and the centroid of its cluster.
The algorithm takes the unlabeled dataset as input,
divides the dataset into k clusters, and
repeats the process until the cluster assignments no
longer change (i.e., until it has found the best clusters).
The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs
two tasks:
Determines the best values for the K center points or
centroids by an iterative process.
Assigns each data point to its closest k-center. The
data points that are near a particular k-center form a cluster.
Hence each cluster has data points with some
commonalities, and it is far away from the other clusters.
The below diagram explains the working of the K-means
Clustering Algorithm:
How does the K-Means Algorithm Work?
 The working of the K-Means algorithm is explained in the steps
below; a short scikit-learn sketch follows the list:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they do not
have to come from the input dataset).
Step-3: Assign each data point to their closest centroid, which will
form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid of each
cluster.
Step-5: Repeat the third step, i.e., reassign each data point
to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to step-4 else go to
FINISH.
Step-7: The model is ready.
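The following is a minimal scikit-learn sketch of these steps; the toy data matrix X and the choice K=2 are illustrative assumptions, not the exact example worked out below.

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data standing in for two variables such as M1 and M2 (assumed).
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# n_init runs the algorithm several times from different random centroids
# and keeps the best solution (lowest WCSS).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)        # cluster index assigned to each data point
print(labels)
print(kmeans.cluster_centers_)        # final centroids (centers of gravity of each cluster)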
 Suppose we have two variables M1 and M2. The x-y axis scatter
plot of these two variables is given below:

 Let's take the number k of clusters, i.e., K=2, to identify the dataset
and to put the data points into different clusters. It means here we will try
to group this dataset into two different clusters.
 We need to choose some random k points or centroids to form the
clusters. These points can be either points from the dataset or
any other points. So, here we are selecting two points as the
k points, which are not part of our dataset. Consider the below image:
Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying the mathematics that we have studied
to calculate the distance between two points. So, we will draw a median between
both centroids. Consider the below image:
From the above image, it is clear that the points on the left side of
the line are near the K1 or blue centroid, and the points to
the right of the line are close to the yellow centroid.
Let's color them blue and yellow for clear
visualization.
As we need to find the closest cluster, we will repeat
the process by choosing new centroids. To choose
the new centroids, we will compute the center of
gravity of the points in each cluster, and will find the
new centroids as below:
Next, we will reassign each data point to its new
closest centroid. For this, we will repeat the same process of
finding a median line. The median will be as shown in the below
image:
From the above image, we can see that one yellow point is
on the left side of the line, and two blue points are to the
right of the line. So, these three points will be assigned
to the new centroids.
As reassignment has taken place, we will again go to
step-4, which is finding new centroids or K-points.
We will repeat the process by finding the center of
gravity of each cluster, so the new centroids will be as
shown in the below image:
As we have the new centroids, we will again draw the
median line and reassign the data points. So, the
image will be:
We can see in the above image that there are no dissimilar
data points on either side of the line, which means our
model is formed. Consider the below image:
As our model is ready, we can now remove the
assumed centroids, and the two final clusters will be as
shown in the below image:
How to choose the value of "K number of clusters"
in K-means Clustering?
The performance of the K-means clustering algorithm
depends upon the highly efficient clusters that it forms, but
choosing the optimal number of clusters is a big
task. There are different ways to find the optimal
number of clusters; here we discuss the most
appropriate method to find the number of clusters, or
the value of K.
The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to
find the optimal number of clusters. This method uses
the concept of the WCSS value. WCSS stands for Within-Cluster
Sum of Squares, which defines the total
variation within a cluster. The formula to calculate
the value of WCSS (for 3 clusters) is given below:
WCSS = ∑(Pi in Cluster1) distance(Pi, C1)² + ∑(Pi in Cluster2) distance(Pi, C2)² + ∑(Pi in Cluster3) distance(Pi, C3)²
Here, each term is the sum of the squared distances between the data points in a cluster and that cluster's centroid.
To find the optimal value of clusters, the elbow
method follows the below steps:
It executes K-means clustering on a given dataset for
different K values (typically ranging from 1 to 10).
For each value of K, it calculates the WCSS value.
It plots a curve between the calculated WCSS values and the
number of clusters K.
The sharp point of bend, where the plot starts to look like
an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks
like an elbow, this is known as the elbow method.
The graph for the elbow method looks like the below
image:
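A minimal sketch of the elbow method with scikit-learn, where the inertia_ attribute of a fitted model is its WCSS; the blob dataset is an assumption for illustration.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)   # assumed toy data

wcss = []
for k in range(1, 11):                                  # K values from 1 to 10
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)                            # inertia_ is the WCSS for this K

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()                                              # the bend (elbow) suggests the best K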
Limits of K-Means
Despite its many merits, most notably being fast and
scalable, K-Means is not perfect.
 As we saw, it is necessary to run the algorithm several
times to avoid suboptimal solutions, plus you need to
specify the number of clusters, which can be quite a
hassle.
Moreover, K-Means does not behave very well when
the clusters have varying sizes, different densities, or
non-spherical shapes.
Using Clustering for Image Segmentation
Image segmentation is the task of partitioning an image
into multiple segments. In semantic segmentation, all pixels
that are part of the same object type get assigned to the
same segment.
 For example, in a self-driving car’s vision system, all pixels
that are part of a pedestrian’s image might be assigned to
the “pedestrian” segment (there would be one segment
containing all the pedestrians).
In instance segmentation, all pixels that are part of the same
individual object are assigned to the same segment. In this
case there would be a different segment for each pedestrian.
The state of the art in semantic or instance
segmentation today is achieved using complex
architectures based on convolutional neural networks.
 Here, we are going to do something much simpler:
color segmentation. We will simply assign pixels to the
same segment if they have a similar color. In some
applications, this may be sufficient.
For example, if you want to analyze satellite images to
measure how much total forest area there is in a
region, color segmentation may be just fine.
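As a rough sketch of color segmentation with K-Means (the image file name and the choice of 8 color clusters are assumptions), each pixel is clustered by its RGB value and then replaced by the mean color of its cluster:

import numpy as np
from matplotlib.image import imread
from sklearn.cluster import KMeans

image = imread("satellite.png")            # hypothetical image file; assumed shape (height, width, 3)
X = image.reshape(-1, 3)                   # one row of RGB values per pixel

kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(X)
segmented = kmeans.cluster_centers_[kmeans.labels_]    # replace each pixel by its cluster's mean color
segmented_img = segmented.reshape(image.shape)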
Using Clustering for Preprocessing
Clustering can be an efficient approach to
dimensionality reduction, in particular as a
preprocessing step before a supervised learning
algorithm.
As an example of using clustering for dimensionality
reduction, let's tackle the digits dataset, which is a
simple MNIST-like dataset containing 1,797 grayscale
8 × 8 images representing the digits 0 to 9. First, load
the dataset:
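A sketch of this idea, assuming we then want to train a Logistic Regression classifier on the reduced representation (the choice of 50 clusters is an assumption):

from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X_digits, y_digits = load_digits(return_X_y=True)      # 1,797 images, 64 features each
X_train, X_test, y_train, y_test = train_test_split(X_digits, y_digits, random_state=42)

# K-Means replaces each 64-pixel image by its distances to 50 cluster centroids,
# and the classifier is then trained on that reduced representation.
pipeline = make_pipeline(
    KMeans(n_clusters=50, n_init=10, random_state=42),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)
print(pipeline.score(X_test, y_test))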
Using Clustering for Semi-Supervised
Learning
DBSCAN
Gaussian Mixtures
A Gaussian mixture model (GMM) is a probabilistic model
that assumes that the instances were generated from a
mixture of several Gaussian distributions whose parameters
are unknown.
 All the instances generated from a single Gaussian
distribution form a cluster that typically looks like an
ellipsoid.
Each cluster can have a different ellipsoidal shape, size,
density, and orientation. When you observe an instance, you
know it was generated from one of the Gaussian
distributions, but you are not told which one, and you do not
know what the parameters of these distributions are.
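A minimal scikit-learn sketch of fitting a Gaussian mixture model; the blob dataset and the choice of 3 components are assumptions.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)   # assumed toy data

gm = GaussianMixture(n_components=3, n_init=10, random_state=42).fit(X)
print(gm.means_)               # estimated center of each Gaussian
print(gm.covariances_)         # estimated covariances (ellipsoid shape, size, orientation)
labels = gm.predict(X)         # hard cluster assignment for each instance
probs = gm.predict_proba(X)    # soft (probabilistic) assignment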
Dimensionality Reduction
Many Machine Learning problems involve thousands
or even millions of features for each training instance.
Not only do all these features make training extremely
slow, but they can also make it much harder to find a
good solution, as we will see. This problem is often
referred to as the curse of dimensionality.
Fortunately, in real-world problems, it is often possible
to reduce the number of features considerably, turning
an intractable problem into a tractable one.
 For example, consider the MNIST images: the pixels
on the image borders are almost always white, so you
could completely drop these pixels from the training
set without losing much information.
What is Dimensionality Reduction?
The number of input features, variables, or columns
present in a given dataset is known as dimensionality,
and the process to reduce these features is called
dimensionality reduction.
In many cases a dataset contains a huge number of input
features, which makes the predictive modeling
task more complicated. Because it is very difficult to
visualize or make predictions for a training dataset
with a high number of features, dimensionality
reduction techniques are required in such cases.
A dimensionality reduction technique can be defined
as "a way of converting a higher-dimensional
dataset into a lower-dimensional dataset while
ensuring that it provides similar information."
These techniques are widely used in machine learning
to obtain a better-fitting predictive model while
solving classification and regression problems.
It is commonly used in the fields that deal with high-
dimensional data, such as speech recognition,
signal processing, bioinformatics, etc. It can also
be used for data visualization, noise reduction,
cluster analysis, etc.
The Curse of Dimensionality
Handling high-dimensional data is very difficult in
practice; this is commonly known as the curse of
dimensionality. As the dimensionality of the input
dataset increases, any machine learning algorithm and
model becomes more complex.
 As the number of features increases, the number of
samples needed also increases proportionally, and the
chance of overfitting also increases. If a machine
learning model is trained on high-dimensional data, it
becomes overfitted and results in poor performance.
Hence, it is often required to reduce the number of
features, which can be done with dimensionality reduction.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction
technique to the given dataset are given below:
By reducing the dimensions of the features, the space
required to store the dataset also gets reduced.
Less computation and training time are required with reduced
dimensions of features.
Reduced dimensions of features of the dataset help in
visualizing the data quickly.
It removes the redundant features (if present) by taking care
of multicollinearity.
Main Approaches for Dimensionality
Reduction
There are two ways to apply the dimension reduction
technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of
the relevant features and leaving out the irrelevant
features present in a dataset to build a model of high
accuracy. In other words, it is a way of selecting the
optimal features from the input dataset.
Three methods are used for feature selection:
1) Filter Methods
In this method, the dataset is filtered, and a subset
that contains only the relevant features is taken. Some
common techniques of the filter method are listed below,
followed by a brief sketch:
Correlation
Chi-Square Test
ANOVA
Information Gain, etc.
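For illustration, a brief sketch of a filter method using the chi-square test (the digits dataset and k=20 are assumptions):

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)                    # assumed example dataset (64 features)
X_new = SelectKBest(chi2, k=20).fit_transform(X, y)    # keep the 20 most relevant features
print(X.shape, "->", X_new.shape)                      # (1797, 64) -> (1797, 20)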
2) Wrapper Methods
The wrapper method has the same goal as the filter
method, but it uses a machine learning model for its
evaluation. In this method, some features are fed to
the ML model and its performance is evaluated. The
performance decides whether to add or remove those
features in order to increase the accuracy of the model. This
method is more accurate than the filter method but
more complex to work with. Some common techniques of
wrapper methods are listed below, followed by a brief sketch:
Forward Selection
Backward Selection
Bi-directional Elimination
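A brief sketch of forward selection with scikit-learn's SequentialFeatureSelector (the iris dataset, the k-NN model, and the target of 2 features are assumptions):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                      # assumed example dataset
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),               # the ML model used to evaluate feature subsets
    n_features_to_select=2,
    direction="forward",                               # forward selection; "backward" is also available
)
selector.fit(X, y)
print(selector.get_support())                          # boolean mask of the selected features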
3) Embedded Methods: Embedded methods check the
different training iterations of the machine learning
model and evaluate the importance of each feature.
Some common techniques of embedded methods are listed
below, followed by a brief sketch:
LASSO
Elastic Net
Ridge Regression, etc.
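A brief sketch of an embedded method: LASSO's L1 penalty drives some coefficients to zero, and SelectFromModel keeps only the features with non-zero weights (the diabetes dataset and alpha=0.1 are assumptions):

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)                  # assumed example dataset
lasso = Lasso(alpha=0.1).fit(X, y)                     # L1 regularization zeroes out weak features
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)                     # keep only features with non-zero coefficients
print(X.shape, "->", X_selected.shape)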
Feature Extraction:
Feature extraction is the process of transforming the
space containing many dimensions into space with
fewer dimensions. This approach is useful when we
want to keep the whole information but use fewer
resources while processing the information.
Some common feature extraction techniques are:
a) Principal Component Analysis
b) Linear Discriminant Analysis
c) Kernel PCA
d) Quadratic Discriminant Analysis
Principal Component Analysis (PCA)
Principal Component Analysis, or PCA, is a
dimensionality-reduction method that is often used to
reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one
that still contains most of the information in the large
set.
Reducing the number of variables of a data set
naturally comes at the expense of accuracy, but the
trick in dimensionality reduction is to trade a little
accuracy for simplicity, because smaller data sets are
easier to explore and visualize, and they make analyzing
data much easier and faster for machine learning
algorithms without extraneous variables to process.
So to sum up, the idea of PCA is simple — reduce the
number of variables of a data set, while preserving as
much information as possible.
STEP BY STEP EXPLANATION OF PCA
STEP 1: STANDARDIZATION:
The aim of this step is to standardize the range of the
continuous initial variables so that each one of them
contributes equally to the analysis.
More specifically, the reason why it is critical to perform
standardization prior to PCA is that the latter is quite
sensitive to the variances of the initial variables.
That is, if there are large differences between the ranges
of initial variables, those variables with larger ranges
will dominate over those with small ranges
(For example, a variable that ranges between 0 and 100
will dominate over a variable that ranges between 0
and 1), which will lead to biased results. So,
transforming the data to comparable scales can
prevent this problem.
Mathematically, this can be done by subtracting the
mean and dividing by the standard deviation for each
value of each variable: z = (value − mean) / standard deviation.
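A small sketch of this standardization step (the toy data is an assumption); scikit-learn's StandardScaler applies exactly this formula column by column:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[170.0, 60.0], [180.0, 80.0], [160.0, 55.0]])   # assumed toy data (two variables)

X_std = StandardScaler().fit_transform(X)    # z = (value - mean) / standard deviation, per variable
print(X_std.mean(axis=0))                    # approximately 0 for every variable
print(X_std.std(axis=0))                     # approximately 1 for every variable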
STEP 2: COVARIANCE MATRIX COMPUTATION:
The aim of this step is to understand how the variables
of the input data set are varying from the mean with
respect to each other, or in other words, to see if there
is any relationship between them.
Because sometimes, variables are highly correlated in
such a way that they contain redundant information.
So, in order to identify these correlations, we compute
the covariance matrix.
The covariance matrix is a p × p symmetric matrix
(where p is the number of dimensions) that has as
entries the covariances associated with all possible
pairs of the initial variables. For example, for a 3-
dimensional data set with 3 variables x, y, and z, the
covariance matrix is a 3×3 matrix of this form:
Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)
Since the covariance of a variable with itself is its
variance (Cov(a,a)=Var(a)), in the main diagonal (Top
left to bottom right) we actually have the variances of
each initial variable.
And since the covariance is commutative
(Cov(a,b)=Cov(b,a)), the entries of the covariance
matrix are symmetric with respect to the main
diagonal, which means that the upper and the lower
triangular portions are equal.
STEP 3: COMPUTE THE EIGENVECTORS AND
EIGENVALUES OF THE COVARIANCE MATRIX TO
IDENTIFY THE PRINCIPAL COMPONENTS
Eigenvectors and eigenvalues are the linear algebra
concepts that we need to compute from the covariance
matrix in order to determine the principal
components of the data.
Before getting to the explanation of these concepts, let's
first understand what we mean by principal
components.
Principal components are new variables that are
constructed as linear combinations or mixtures of the
initial variables.
These combinations are done in such a way that the new
variables (i.e., principal components) are uncorrelated
and most of the information within the initial variables is
squeezed or compressed into the first components.
 So, the idea is that 10-dimensional data gives you 10 principal
components, but PCA tries to put the maximum possible
information in the first component, then the maximum
remaining information in the second, and so on, until
having something like what is shown in the scree plot below.
Organizing information in principal components this
way will allow you to reduce dimensionality without
losing much information, by discarding the
components with low information and considering the
remaining components as your new variables.
An important thing to realize here is that, the
principal components are less interpretable and don’t
have any real meaning since they are constructed as
linear combinations of the initial variables.
Geometrically speaking, principal components
represent the directions of the data that explain
a maximal amount of variance, that is to say, the
lines that capture most of the information in the data.
The relationship between variance and information
here is that the larger the variance carried by a line,
the larger the dispersion of the data points along it;
and the larger the dispersion along a line, the more
information it carries. To put all this simply, just think of
principal components as new axes that provide the
best angle to see and evaluate the data, so that the
differences between the observations are better visible.
HOW PCA CONSTRUCTS THE PRINCIPAL
COMPONENTS
As there are as many principal components as there are
variables in the data, principal components are
constructed in such a manner that the first principal
component accounts for the largest possible
variance in the data set.
For example, let’s assume that the scatter plot of our
data set is as shown below. Can we guess the first
principal component? Yes, it’s approximately the line
that matches the purple marks, because it goes through
the origin and it’s the line along which the projection of
the points (red dots) is the most spread out.
Or, mathematically speaking, it’s the line that
maximizes the variance (the average of the squared
distances from the projected points (red dots) to the
origin).
The second principal component is calculated in the
same way, with the condition that it is uncorrelated
with (i.e., perpendicular to) the first principal
component and that it accounts for the next highest
variance.
This continues until a total of p principal components
have been calculated, equal to the original number of
variables.
Now that we understand what we mean by principal
components, let’s go back to eigenvectors and
eigenvalues.
 What you first need to know about them is that they
always come in pairs, so that every eigenvector has an
eigenvalue.
And their number is equal to the number of dimensions
of the data. For example, for a 3-dimensional data set,
there are 3 variables, therefore there are 3 eigenvectors
with 3 corresponding eigenvalues.
By ranking your eigenvectors in order of their
eigenvalues, highest to lowest, you get the principal
components in order of significance.
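A small NumPy sketch of steps 2 and 3 together, assuming the data has already been standardized (the toy data is an assumption):

import numpy as np

rng = np.random.default_rng(42)
X_std = rng.standard_normal((100, 3))             # assumed standardized data with 3 variables (x, y, z)

cov = np.cov(X_std, rowvar=False)                 # 3 x 3 symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # 3 eigenvalue/eigenvector pairs

order = np.argsort(eigenvalues)[::-1]             # rank highest to lowest
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]             # columns are the principal component directions
print(eigenvalues / eigenvalues.sum())            # share of the variance carried by each component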
STEP 4: FEATURE VECTOR:
As we saw in the previous step, computing the
eigenvectors and ordering them by their eigenvalues in
descending order allows us to find the principal
components in order of significance.
In this step, what we do is choose whether to keep all
these components or discard those of lesser
significance (those with low eigenvalues), and form with the
remaining ones a matrix of vectors that we call the feature
vector.
So, the feature vector is simply a matrix that has as
columns the eigenvectors of the components that we
decide to keep.
 This makes it the first step towards dimensionality
reduction, because if we choose to keep
only p eigenvectors (components) out of n, the final
data set will have only p dimensions.
LAST STEP: RECAST THE DATA ALONG THE
PRINCIPAL COMPONENTS AXES:
In the previous steps, apart from standardization, you
do not make any changes to the data; you just select
the principal components and form the feature vector,
but the input data set always remains in terms of the
original axes (i.e., in terms of the initial variables).
In this step, which is the last one, the aim is to use the
feature vector formed using the eigenvectors of the
covariance matrix to reorient the data from the
original axes to the ones represented by the principal
components (hence the name Principal Component
Analysis). This can be done by multiplying the
transpose of the original data set by the transpose of
the feature vector:
FinalDataSet = FeatureVector^T × StandardizedOriginalDataSet^T
Implementation and demonstration
A pseudo code of PCA is given in the following.
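A minimal NumPy sketch of the five steps described above, assuming a toy data matrix X whose rows are instances and whose columns are variables:

import numpy as np

rng = np.random.default_rng(42)
X = rng.standard_normal((200, 5))                     # assumed toy data: 200 instances, 5 variables

# Step 1: standardization.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix computation.
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors and eigenvalues, ranked by eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Step 4: feature vector (keep the top p components).
p = 2
feature_vector = eigenvectors[:, :p]

# Step 5: recast the data along the principal component axes.
X_pca = X_std @ feature_vector                        # same as (FeatureVector^T . X_std^T)^T
print(X_pca.shape)                                    # (200, 2)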
Using Scikit-Learn
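A minimal Scikit-Learn sketch (the digits dataset and the 95% variance target are illustrative assumptions):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)              # assumed example dataset (64 features)

pca = PCA(n_components=0.95)                     # keep enough components to preserve 95% of the variance
X_reduced = pca.fit_transform(X)
print(pca.n_components_)                         # number of components actually kept
print(pca.explained_variance_ratio_.sum())       # total variance ratio preserved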
Randomized PCA
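If you set the svd_solver hyperparameter to "randomized", Scikit-Learn uses a stochastic algorithm that quickly finds an approximation of the first d principal components; this is much faster than full SVD when d is much smaller than the number of features. A brief sketch (the dataset and the number of components are assumptions):

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)                        # assumed example dataset (64 features)

rnd_pca = PCA(n_components=20, svd_solver="randomized", random_state=42)
X_reduced = rnd_pca.fit_transform(X)                       # approximate projection onto 20 components
print(X_reduced.shape)                                     # (1797, 20)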
Kernel PCA.

In a previous chapter we discussed the kernel trick, a
mathematical technique that implicitly maps
instances into a very high-dimensional space (called
the feature space), enabling nonlinear classification
and regression with Support Vector Machines.
 Recall that a linear decision boundary in the high-
dimensional feature space corresponds to a complex
nonlinear decision boundary in the original space.
It turns out that the same trick can be applied to PCA,
making it possible to perform complex nonlinear
projections for dimensionality reduction. This is called
Kernel PCA (kPCA).
 It is often good at preserving clusters of instances
after projection, or sometimes even unrolling datasets
that lie close to a twisted manifold.
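A brief scikit-learn sketch of kPCA with an RBF kernel (the Swiss roll dataset and the gamma value are assumptions):

from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA

X, _ = make_swiss_roll(n_samples=1000, random_state=42)    # a twisted-manifold toy dataset (assumed)

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)                       # nonlinear projection from 3-D down to 2-D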
