Machine Learning-4
B.Tech. (CSE)
Books
● T1: T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning – Data Mining, Inference and Prediction, 2nd Edition, Springer
Overview of supervised learning T1/Ch. 2
K-nearest neighbour T1/Ch. 2.3.2, R2/Ch. 8.2
Multiple linear regression T1/Ch. 3.2.3
Shrinkage methods (Ridge regression, Lasso regression) T1/Ch. 3.4
Logistic regression T1/Ch. 4.3
Linear Discriminant Analysis T1/Ch. 4.4
Feature selection T1/Ch. 5.3
Module II
● Clustering involves automatically discovering natural groupings in data.
● Unlike supervised learning (e.g. predictive modeling), clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
● The dataset is given without labels.
Clustering
● Clustering techniques are applied when there is no class to be predicted but rather when the instances are to be divided into natural groups.
● A cluster is often an area of density in the feature space where examples from the domain (observations or rows of data) are closer to one another than to examples in other clusters.
● Clustering can be helpful as a data analysis activity in order to learn more about the problem domain, so-called pattern discovery or knowledge discovery; it is often used for data exploration.
Clustering Problem
● Given a dataset D = {xi | i = 1, 2, ..., N}, xi ∈ Rp, and a positive integer k.
● The goal of clustering is to partition D into k disjoint subsets such that the objects within the same subset are similar and the objects in different subsets are dissimilar.
● Similarity / dissimilarity is defined with respect to some measure: if the distance between two elements is small, they are similar; if the distance is large, they are dissimilar.
Clustering Applications
● Applications:
– Market segmentation
– Document grouping
– Finding crime-prone localities
– Image segmentation
● Example:
– Analyzing satellite images to measure how much total forest area there is in a region.
● Clustering is also used as a precursor for other machine learning algorithms.
k-Means Algorithm
● The k-Means clustering algorithm was proposed by J. Hartigan and M. A. Wong [1979].
● Given a set of n distinct objects, the k-Means clustering algorithm partitions the objects into k clusters such that intra-cluster similarity is high but inter-cluster similarity is low.
● k-Means is an iterative algorithm that tries to partition the dataset into k pre-defined, distinct, non-overlapping subgroups (clusters) where each data point belongs to only one group.
k-Means Algorithm
● The user has to specify k, the number of clusters. The objects are assumed to be described by numeric attributes, so any distance metric can be used to demarcate the clusters.
● The algorithm assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is minimized.
● The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
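● In symbols (with Cj denoting the j-th cluster and μj its centroid, notation introduced here for compactness), the quantity being minimized is the within-cluster sum of squares
J = Σ_{j=1..k} Σ_{xi ∈ Cj} ‖xi − μj‖²
which is the same quantity J referred to in the properties listed later.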
How does k-Means work?
● Specify the number of clusters k.
● Initialize centroids by first shuffling the dataset and then randomly selecting k data points as the centroids, without replacement.
● Keep iterating until there is no change to the centroids, i.e. the assignment of data points to clusters is no longer changing. In each iteration:
● Compute the sum of the squared distances between data points and all centroids.
● Assign each data point to the closest cluster (centroid).
● Compute the centroid of each cluster by taking the average of all the data points that belong to it.
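A minimal sketch of these steps in Python, using only NumPy (the function name and interface are illustrative assumptions, not a prescribed library API):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, p) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by randomly picking k distinct data points (without replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distance from every point to every centroid: shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid.
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of the points assigned to it
        # (keep the old centroid if a cluster happens to become empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped changing: converged
        centroids = new_centroids
    return labels, centroids
```

For example, labels, centroids = k_means(X, k=3) assigns every row of X to one of 3 clusters.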
k-Means Algorithm
Illustration of k-Means Algorithm
Method for computing the value of k
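One widely used heuristic for choosing k is the elbow method (used here as an illustrative example): run k-means for a range of k values and plot the within-cluster variation J against k; the "elbow" where J stops decreasing sharply is taken as the value of k. A minimal sketch, assuming the k_means function above and matplotlib are available:

```python
import matplotlib.pyplot as plt

def elbow_plot(X, k_values):
    """Plot within-cluster variation J against k; look for the 'elbow'."""
    costs = []
    for k in k_values:
        labels, centroids = k_means(X, k)         # k_means as sketched earlier
        J = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
        costs.append(J)
    plt.plot(list(k_values), costs, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("within-cluster variation J")
    plt.show()
```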
Properties of k-Means Algorithm
● The within-cluster variation (J) decreases with each iteration of the algorithm.
● The algorithm always converges, no matter how the initial cluster centres are chosen.
● The final clustering depends on the initial cluster centres. So one typically runs k-means multiple times (e.g. 10 times), randomly initializing the cluster centres for each run, and then chooses the centres that yield the smallest within-cluster variation.
● The algorithm is not guaranteed to deliver the clustering that globally minimizes the within-cluster variation.
● Scaling the input features before k-means is applied improves the result.
Limitations of k-Means Algorithm
Practice Question - k-Means Algorithm
1 How are the centres initialized in the k-means algorithm?
2 What are the properties of the k-means algorithm?
3 What are the limitations of the k-means algorithm?
4 Describe a method for choosing the number of clusters for the k-means algorithm.
5 Cluster the following points into 3 clusters using the k-means algorithm, taking A1, B1 and C1 as the cluster centres and the data set as {A1(2, 9), A2(2, 4), B1(5, 7), B2(6, 4), C1(1, 2), C2(4, 8)}. Carry out two iterations of this algorithm.
6 Write down the k-means algorithm and explain it by taking a suitable example.
Spectral Clustering
Graph Notation
Different Similarity Graphs
The unnormalized graph Laplacian
The Normalized graph Laplacian
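For reference, using the usual conventions (W denotes the weight/adjacency matrix of the similarity graph and D the diagonal degree matrix with di = Σj wij), the standard definitions are:
L = D - W (unnormalized graph Laplacian)
Lsym = D^(-1/2) L D^(-1/2) = I - D^(-1/2) W D^(-1/2) (symmetric normalized graph Laplacian)
Lrw = D^(-1) L = I - D^(-1) W (random-walk normalized graph Laplacian)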
Unnormalized Spectral Clustering
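A generic sketch of standard unnormalized spectral clustering, assuming a fully connected Gaussian-similarity graph for illustration and reusing the k_means sketch above: build the weight matrix W, form L = D - W, represent each point by the eigenvectors of the k smallest eigenvalues of L, and cluster those representations with k-means.

```python
import numpy as np

def unnormalized_spectral_clustering(X, k, sigma=1.0):
    """Generic sketch: Gaussian similarity graph -> L = D - W -> embed -> k-means."""
    # Fully connected similarity graph with Gaussian (RBF) edge weights.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))
    L = D - W  # unnormalized graph Laplacian
    # Eigenvectors belonging to the k smallest eigenvalues (L is symmetric, so eigh).
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, np.argsort(eigvals)[:k]]
    # Each row of U is the new representation of a point; cluster the rows with k-means.
    labels, _ = k_means(U, k)
    return labels
```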
Normalized Spectral Clustering
Practice Question – Spectral Clustering
1 Define the normalised graph Laplacian.
2 Define the unnormalised graph Laplacian.
3 Distinguish between normalized and unnormalized spectral clustering.
4 Write down the properties of the unnormalized graph Laplacian.
5 Considering a triangle having coordinates of its vertices as A(0, 0), B(0, 1), C(1, 0) as a graph G with vertices A, B, C, its sides as edges, and the edge weights as the distance between the vertices incident on each edge, compute the unnormalised Laplacian of G.
6 Define spectral clustering. What is its significance? Write down the algorithm of unnormalized spectral clustering.
7 Write down the algorithm of normalized spectral clustering.
Principal Component Analysis (PCA)
Intuition of PCA
Goal of PCA
Maximizing Variance of the projected data
PCA Algorithm
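A minimal NumPy sketch of the standard PCA steps (centre the data, form the covariance matrix, keep the eigenvectors with the largest eigenvalues, and project); the function name and interface are illustrative assumptions:

```python
import numpy as np

def pca(X, n_components):
    """Standard PCA sketch: project X (n samples x p features) onto its top principal components."""
    # Centre the data.
    mean = X.mean(axis=0)
    Xc = X - mean
    # Covariance matrix of the features (p x p).
    C = np.cov(Xc, rowvar=False)
    # Eigen-decomposition; eigh is used because C is symmetric.
    eigvals, eigvecs = np.linalg.eigh(C)
    # Sort eigenvectors by decreasing eigenvalue and keep the top n_components.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]   # p x n_components matrix of principal directions
    # Project the centred data onto the principal components.
    Z = Xc @ components
    return Z, components, mean
```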
Properties of PCA
Kernel Based PCA (KPCA)
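A minimal sketch of standard kernel PCA, assuming an RBF (Gaussian) kernel purely for illustration: compute the kernel (Gram) matrix K, centre it, eigendecompose the centred matrix, and form the projections of the training points.

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Standard kernel PCA sketch with an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    n = X.shape[0]
    # Kernel (Gram) matrix.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-gamma * d2)
    # Centre the kernel matrix: Kc = K - 1n K - K 1n + 1n K 1n, with 1n = (1/n) * ones.
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Eigen-decomposition of the centred kernel matrix.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:n_components]
    alphas = eigvecs[:, order]
    lambdas = eigvals[order]
    # Projections of the training points onto the kernel principal components.
    Z = Kc @ alphas / np.sqrt(np.maximum(lambdas, 1e-12))
    return Z
```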
KPCA Properties
Practice Question – PCA, KPCA
1 Why is maximizing the variance of the projected data required in PCA?
2 Define the principal components of a feature space.
3 In the feature space, which transformation is used by the KPCA algorithm and why?
4 Define a kernel. Which kernels are suitable for the KPCA algorithm? Briefly explain the role of a kernel in KPCA.
5 Mention the properties of the KPCA algorithm.
6 Write down the expression for the covariance matrix used in PCA. What is its order?
7 For the data set D = {x1, x2, x3}, where x1 = (5 4)T, x2 = (4 6)T and x3 = (3 5)T, compute the principal components of D using the PCA algorithm and transform x2.
8 Consider the data set D = {(2 1)T, (1 2)T, (3 1)T}. Compute the kernel matrix K for the above data using a polynomial kernel of degree 2. Then compute the centred kernel matrix and one eigenvalue and one corresponding eigenvector of the centred kernel matrix.
9 Write down the PCA algorithm and explain the steps.
Independent Component Analysis (ICA)
Motivation - Cocktail party problem
● Let n speakers be present in a room for a cocktail party.
● They are speaking simultaneously at the party.
● There are n microphones placed at different distances from the speakers, and each records the n speakers' voices.
● Each microphone records the voice signal coming from each speaker at a different intensity, due to the different distances between them.
● Objective – Decompose the mixed signal of each microphone's recording into the independent sources' speech signals.
Independent Component Analysis (ICA)
● Independent component analysis attempts to decompose a multivariate signal into independent non-Gaussian signals.
● ICA assumption:
– The observed data is a linear combination of independent, non-Gaussian source signals; independence means the joint density factorizes, i.e. p(x, y) = p(x) p(y).
● Goal:
– Find a linear transformation of the data that yields a set of independent components.
Independent Component Analysis(ICA)
● Let m1, m2, m3 be 3 persons at a party simultaneously talking to each other.
● Let s1(t), s2(t), s3(t) be the original speech signals of persons m1, m2, m3 respectively at a particular time t.
● Let x1(t), x2(t), x3(t) be the recordings of these 3 persons respectively at a particular instant t.
● Then these recorded mixtures can be represented as shown below.
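Using the standard ICA mixing model (the coefficients aij, introduced here, are unknown and depend, for example, on the distances between the speakers and the microphones):
x1(t) = a11 s1(t) + a12 s2(t) + a13 s3(t)
x2(t) = a21 s1(t) + a22 s2(t) + a23 s3(t)
x3(t) = a31 s1(t) + a32 s2(t) + a33 s3(t)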
Independent Component Analysis(ICA)
● The time index t can be dropped in the ICA model, since we assume that each mixture and each individual component is a random variable instead of a proper time signal. The equations can then be rewritten as xi = ai1 s1 + ai2 s2 + ai3 s3, for i = 1, 2, 3.
● The above equations can be expressed in vector-matrix notation as
x = A s -----(1)
where A is the matrix of mixing coefficients aij.
Independent Component Analysis(ICA)
● Where
– x is a vector whose elements are the mixtures x1, x2, x3
– s is a vector whose elements are the source signals s1, s2, s3
– A is the (unknown) mixing matrix
● Where W is the unmixing matrix used to recover the sources, s = W x (ideally W = A⁻¹)
– Objective: Determine W
– Notation: wi is the i-th row of W
Independent Component Analysis(ICA)
● sj = wj x is the recovered source signal of the j-th speaker, where wj is the j-th row of W.
● Note: The aim is to transform the set of observed mixture vectors into a maximally independent set of components.
Independent Component Analysis(ICA)
● Algorithm: Given X, solve for W such that the si are maximally independent.
● For each training pattern xi, i = 1, 2, 3, ..., n, an update rule for W is applied (one common form is sketched below).
● The function g appearing in the update rule may be, for example, the logistic sigmoid (tanh is another common choice).
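One commonly used form of this update is the maximum-likelihood gradient-ascent rule with a sigmoid nonlinearity g (an assumed choice here for illustration): W ← W + α((1 − 2 g(W xi)) xiᵀ + (Wᵀ)⁻¹), where α is a learning rate and g is applied element-wise. A NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_epochs=50, seed=0):
    """Maximum-likelihood ICA sketch with a sigmoid nonlinearity g (one common choice)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d, d))    # unmixing matrix, initialised randomly
    for _ in range(n_epochs):
        for x in X:                           # stochastic update, one mixture vector at a time
            W += alpha * (np.outer(1.0 - 2.0 * sigmoid(W @ x), x) + np.linalg.inv(W.T))
    return W                                  # estimated sources: s = W x for each pattern
```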
Application of ICA
● Image denoising
● Medical signal processing – fMRI, ECG, EEG
● Feature extraction, face recognition
● Compression, redundancy reduction
● Watermarking
● Clustering
● Time series analysis
● Topic extraction
● Scientific data mining
PCA vs ICA
● PCA reduces the dimensions to avoid the problem of overfitting, whereas ICA decomposes the mixed signal into its independent sources' signals.
● PCA deals with the Principal Components, whereas ICA deals with the Independent Components.
● PCA is commonly used in data visualization and exploratory data analysis, whereas ICA is often used in applications such as image processing and analyzing data from EEG signals.
Non-Negative Matrix Factorization(NMF)
● NMF is an unsupervised ML technique.
● It is used for feature reduction.
● It is an alternative approach to PCA, in which the data and its components are assumed to be non-negative.
● It is useful for modelling non-negative data such as images.
Non-Negative Matrix Factorization(NMF)
● For a matrix V of dimension m x n where each element vij ≥ 0, NMF decomposes it into two matrices W and H of dimensions m x r and r x n respectively, where each element wij ≥ 0 of W and hij ≥ 0 of H, and r < min(m, n), such that
V ≈ W * H
● The problem, however, cannot be solved analytically, so it is generally approximated numerically.
Non-Negative Matrix Factorization(NMF)
● So V is decomposed into a tall matrix W and a short, wide matrix H.
● The user can specify r, the inner dimension of W and H, as long as r < min(m, n).
● Each column of V, vi, can then be approximated as vi ≈ W * hi, where hi is the corresponding column of H.
● That is, each column of V is (approximately) a weighted sum of the columns of W, with the weights given by the entries of the corresponding column hi.
Non-Negative Matrix Factorization(NMF)
● Problem formulation: minimize ||V - WH|| with respect to W and H, subject to W, H ≥ 0.
● This is an NP-hard problem, so the matrices are approximated numerically rather than solved for analytically.
● There are several ways in which W and H may be found.
● One method, proposed by Lee and Seung, is the multiplicative update rule, which is based on gradient descent.
Non-Negative Matrix Factorization(NMF)
● Algorithm:
– Randomly initialize W and H with non-negative values.
– Repeat until W and H are stable (i.e. have converged):
● Fix W and update H using the multiplicative update rule (see the sketch below).
● Fix H and update W using the multiplicative update rule.
● Note: the multiplication by the current values and the division in these updates are done element by element, not as matrix multiplication.
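A minimal NumPy sketch of the Lee-Seung multiplicative updates for the Frobenius-norm objective ||V - WH||² (assumed here, since this is the most common form). The products WᵀV, WᵀWH, VHᵀ and WHHᵀ are ordinary matrix products, while the final multiplication and division are element-wise, consistent with the note above:

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates for min ||V - W H||_F^2 with W, H >= 0."""
    m, n = V.shape
    rng = np.random.default_rng(seed)
    # Random non-negative initialization.
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Fix W, update H (element-wise multiply/divide; eps avoids division by zero).
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Fix H, update W.
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because both factors stay non-negative at every step, the element-wise updates never violate the constraints W, H ≥ 0.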