
Machine Learning

B.Tech. (CSE)

Nayan Ranjan Paul


Department of CSE
Silicon Institute of Technology
Syllabus

Module I
Overview of supervised learning - T1/Ch. 2
K-nearest neighbour - T1/Ch. 2.3.2, R2/Ch. 8.2
Multiple linear regression - T1/Ch. 3.2.3
Shrinkage methods (Ridge regression, Lasso regression) - T1/Ch. 3.4
Logistic regression - T1/Ch. 4.3
Linear Discriminant Analysis - T1/Ch. 4.4
Feature selection - T1/Ch. 5.3

Module II
Bias, variance, and model complexity - T1/Ch. 7.2
Bias-variance trade-off - T1/Ch. 7.3
Bayesian approach and BIC - T1/Ch. 7.7
Cross-validation - T1/Ch. 7.10
Bootstrap methods - T1/Ch. 7.11
Performance of classification algorithms (confusion matrix, precision, recall and ROC curve) - R7

Module III
Generative model for discrete data
Bayesian concept learning - R2/Ch. 6
Naive Bayes classifier - T1/Ch. 6.6.3, R2/Ch. 6
SVM for classification - T1/Ch. 12.3.1, R3/Ch. 1.5, T2/Ch. 6
Reproducing kernels - R5, R6, T2/Ch. 6
SVM for regression - T1/Ch. 12.3.6, R3/Ch. 1.6, T2/Ch. 6
Regression and classification trees - T1/Ch. 9.2.2, 9.2.3
Random forest - T1/Ch. 15

Module IV
Clustering (k-means, spectral clustering) - T1/Ch. 13.2.1
Feature extraction: Principal Component Analysis (PCA) - R1/Ch. 10.2
Kernel-based PCA - T1/Ch. 14.5.4
Independent Component Analysis (ICA) - R4/Ch. 12.6, CS229
Non-negative matrix factorization - T1/Ch. 14.6
Mixture of Gaussians - R4/Ch. 11.2.1, CS229
Expectation Maximization (EM) algorithm - R4/Ch. 11.4, CS229

Module V
Boosting methods: exponential loss and AdaBoost - T1/Ch. 10.4
Numerical optimization via gradient boosting - T1/Ch. 10.10
Introduction to reinforcement learning - T3/Ch. 18.1
Elements of reinforcement learning - T3/Ch. 18.3
Single state case: k-armed bandit - T3/Ch. 18.2
Model-based learning (value iteration, policy iteration) - T3/Ch. 18.4

Books
T1: T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd Edition, Springer, 2009.
T2: S. Haykin, Neural Networks and Learning Machines, 3rd Edition, Pearson Education, 2009.
T3: E. Alpaydin, Introduction to Machine Learning, 2nd Edition, Prentice Hall of India, 2010.
R1: G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R, 2nd Edition, Springer, 2013.
R2: T. M. Mitchell, Machine Learning, 1st Edition, McGraw-Hill Education, 2013.
R3: B. Scholkopf and A. J. Smola, Learning with Kernels, MIT Press, 2002.
R4: K. Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
R5: N. Aronszajn, "Theory of Reproducing Kernels," Transactions of the American Mathematical Society, 68 (1950): 337–404.
R6: S. Saitoh, Theory of Reproducing Kernels and its Applications, Longman Scientific & Technical, 1988.
R7: https://www.kdnuggets.com/2020/01/guide-precision-recall-confusion-matrix.html
Module - IV
Clustering

Cluster analysis, or clustering, is an unsupervised machine learning task.


It involves automatically discovering natural groupings in data.


Unlike supervised learning (like predictive modeling), clustering algorithms only
interpret the input data and find natural groups or clusters in feature space.


The dataset is given without labels.
Clustering

Clustering techniques are applied when there is no class to be predicted but rather
when the instances are to be divided into natural groups.


A cluster is often an area of density in the feature space where examples from the
domain (observations or rows of data) are closer to one another than to examples in other clusters.


Clustering can be helpful as a data analysis activity for learning more about
the problem domain, so-called pattern discovery or knowledge discovery. It is used for
data exploration problems.
Clustering Problem

Given a dataset D = {xi | i = 1, 2, ..., N}, where each xi ∈ R^p, and a positive integer k.


The goal of clustering is to partition D into k disjoint subsets such that the objects
within the same subset are similar and the objects in different subsets are dissimilar.


Similarity/dissimilarity is measured with respect to some distance measure: if the distance between
two elements is small, they are similar; if the distance is large, they are dissimilar. A sketch of a common choice of distance follows.
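As an illustration, a minimal Python sketch of one common choice, the Euclidean distance in R^p; the function name and example points are my own.

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two points of R^p."""
    return np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))

# Example with two points from the k-means practice question in this module.
print(euclidean((2, 9), (5, 7)))  # sqrt(13), approximately 3.61
```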
Clustering Applications

Applications:
– Market segmentation
– Document grouping
– Finding crime-prone localities
– Image segmentation


Example:
– Analyzing satellite images to measure the total forest area in a
region.


Clustering is also used as a precursor for other machine learning algorithms.
k-Means Algorithm

The k-means clustering algorithm was proposed by J. Hartigan and M. A. Wong
[1979].


Given a set of n distinct objects, the k-means clustering algorithm partitions the
objects into k clusters such that intra-cluster similarity is high but
inter-cluster similarity is low.


k-Means algorithm is an iterative algorithm that tries to partition the dataset into k
pre-defined distinct non-overlapping subgroups (clusters) where each data point
belongs to only one group.
k-Means Algorithm

In this algorithm, the user has to specify k, the number of clusters. The
objects are assumed to be described by numeric attributes, so any one of the distance
metrics can be used to demarcate the clusters.


It assigns data points to a cluster such that the sum of the squared distance between
the data points and the cluster’s centroid (arithmetic mean of all the data points
that belong to that cluster) is at the minimum.


The less variation we have within clusters, the more homogeneous (similar) the
data points are within the same cluster.
How does k-Means work?

Specify the number of clusters, k.

Initialize the centroids by first shuffling the dataset and then randomly selecting k data
points for the centroids, without replacement.

Keep iterating until there is no change to the centroids, i.e. until the assignment of data
points to clusters stops changing. In each iteration:

Compute the sum of the squared distances between the data points and all centroids.

Assign each data point to the closest cluster (centroid).

Compute the centroids of the clusters by taking the average of all the data points
that belong to each cluster.

A minimal sketch of these steps follows.
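A minimal NumPy sketch of the steps above; an illustrative implementation under my own naming and defaults, not the course's reference code.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal k-means: X is an (n, p) data matrix, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen without replacement.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Squared distance of every point to every centroid, shape (n, k).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid.
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # keep the old centroid if a cluster happens to become empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # assignments stabilized
            break
        centroids = new_centroids
    return labels, centroids
```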
k-Means Algorithm
Illustration of k-Means Algorithm
Method for computing the value of k
Properties of k-Means Algorithm

Within-cluster variation (J) decreases with each iteration of the algorithm.

The algorithm always converges, no matter how the initial cluster centres are
chosen.

The final clustering depends on the initial cluster centres. So one typically runs k-means
multiple times (e.g. 10 times), randomly initializing the cluster centres for each
run, and then chooses the centres that yield the smallest within-cluster variation (see the sketch below).

The algorithm is not guaranteed to deliver the clustering that globally minimizes the
within-cluster variation.

Scaling the input features before k-means is applied improves the result.
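A sketch of this multiple-restart strategy, reusing the kmeans function from the sketch above; J is the within-cluster variation.

```python
import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    """Run kmeans (defined above) several times; keep the run with smallest J."""
    best_J, best = np.inf, None
    for seed in range(n_restarts):
        labels, centroids = kmeans(X, k, seed=seed)
        # J: sum of squared distances of each point to its assigned centroid.
        J = ((X - centroids[labels]) ** 2).sum()
        if J < best_J:
            best_J, best = J, (labels, centroids)
    return best
```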
Limitations of k-Means Algorithm
Practice Question - k-Means Algorithm
1 How are the centres initialized in the k-means algorithm?
2 What are the properties of the k-means algorithm?
3 What are the limitations of the k-means algorithm?
4 Describe a method for choosing the number of clusters for the k-means algorithm.
5 Cluster the following points into 3 clusters using the k-means algorithm, taking A1,
B1 and C1 as the cluster centres and the data set as {A1(2, 9), A2(2, 4),
B1(5, 7), B2(6, 4), C1(1, 2), C2(4, 8)}. Carry out two iterations of this
algorithm.
6 Write down the k-means algorithm and explain it by taking a suitable example.
Spectral Clustering
Graph Notation
Different Similarity Graphs
The Unnormalized Graph Laplacian
The Normalized Graph Laplacian
Unnormalized Spectral Clustering
Normalized Spectral Clustering
Practice Question – Spectral Clustering
1 Define the normalized graph Laplacian.
2 Define the unnormalized graph Laplacian.
3 Distinguish between normalized and unnormalized spectral clustering.
4 Write down the properties of the unnormalized graph Laplacian.
5 Consider a triangle having the coordinates of its vertices at A(0, 0), B(0, 1),
C(1, 0) as a graph G with vertices A, B, C, its sides as edges, and the edge weights
as the distance between the vertices incident on each edge. Compute the
unnormalized Laplacian of G.
6 Define spectral clustering. What is its significance? Write down the algorithm of
unnormalized spectral clustering.
7 Write down the algorithm of normalized spectral clustering.
Principal Component Analysis (PCA)
Intuition of PCA
Goal of PCA
Maximizing Variance of the projected data
PCA Algorithm
Properties of PCA
Kernel Based PCA (KPCA)
KPCA Properties
Practice Question – PCA, KPCA
1 Why is maximizing the variance of the projected data required in PCA?
2 Define the principal component of a feature space.
3 In the feature space, which transformation is used by the KPCA algorithm, and why?
4 Define a kernel. Which kernels are suitable for the KPCA algorithm? Briefly explain the role of a kernel
in KPCA.
5 Mention the properties of the KPCA algorithm.
6 Write down the expression for the covariance matrix used in PCA. What is its order?
7 For the data set D = {x1, x2, x3}, where x1 = (5, 4)ᵀ, x2 = (4, 6)ᵀ and x3 = (3, 5)ᵀ, compute the principal
components using the PCA algorithm and transform x2.
8 Consider the data set D = {(2, 1)ᵀ, (1, 2)ᵀ, (3, 1)ᵀ}. Compute the kernel matrix K for the above data
using a polynomial kernel of degree 2. Then compute the centred kernel matrix and one eigenvalue
and one corresponding eigenvector of it.
9 Write down the PCA algorithm and explain the steps.
Independent Component Analysis (ICA)
Motivation - Cocktail party problem

Let there be n speakers present in a room at a cocktail party.

They are speaking simultaneously at the party.

There are n microphones placed at different distances from the speakers, which
record the n speakers' voices.

Each microphone records the voice signal coming from each speaker at a
different intensity, due to the different distances between them.

Objective – Decompose the mixed signal of each microphone's recording into the
independent sources' speech signals.
Independent Component Analysis (ICA)

Independent component analysis attempts to decompose a multivariate signal
into independent non-Gaussian signals.

ICA Assumption
– The observed data is a linear combination of independent non-Gaussian
signals; independence means the joint density factorizes, i.e. p(x, y) = p(x)p(y)

Goal:
– Find a linear transformation of the data that yields a set of independent
components.
Independent Component Analysis (ICA)

Let m1, m2, m3 be 3 persons at a party, simultaneously talking to each other.
● Let s1(t), s2(t), s3(t) be the original speech signals of persons m1, m2, m3, respectively,
at a particular time t.
● Let x1(t), x2(t), x3(t) be the recorded signals from the 3 microphones, respectively, at a particular
instant t.

Then these recorded mixtures can be represented as follows, with unknown mixing coefficients aij:

x1(t) = a11 s1(t) + a12 s2(t) + a13 s3(t)
x2(t) = a21 s1(t) + a22 s2(t) + a23 s3(t)
x3(t) = a31 s1(t) + a32 s2(t) + a33 s3(t)
Independent Component Analysis (ICA)

The time index t can be dropped in the ICA model, since we assume that each
mixture and each individual component is a random variable instead of a proper time
signal. The equations can be rewritten as:

x1 = a11 s1 + a12 s2 + a13 s3
x2 = a21 s1 + a22 s2 + a23 s3
x3 = a31 s1 + a32 s2 + a33 s3

The above equations can be expressed in vector-matrix notation as:

x = A s -----(1)
Independent Component Analysis (ICA)

Where
– x is a vector whose elements are the mixtures

– s is a vector whose elements are the original source signals

– A is called the mixing matrix, with elements aij

A toy numerical illustration of this mixing model follows.
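This toy illustration of generating x = A s uses source waveforms and a mixing matrix that are made-up choices of my own.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 500)

# Three non-Gaussian source signals s1(t), s2(t), s3(t), one per row.
S = np.vstack([np.sin(40 * t),             # speaker 1: sinusoid
               np.sign(np.sin(13 * t)),    # speaker 2: square wave
               rng.laplace(size=t.size)])  # speaker 3: Laplacian noise

A = rng.uniform(0.5, 1.5, size=(3, 3))  # unknown mixing matrix
X = A @ S                               # microphone recordings: x = A s
```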


Independent Component Analysis (ICA)

Problem formulation: Given the observations x, determine the source signals s.
– So equation (1) can be written as s = W x, where W = A⁻¹

Where
– Objective: determine W
– Notation: wiᵀ is the i-th row of W
Independent Component Analysis (ICA)

sj = wjᵀ x is the source signal of the j-th speaker.


Note: The transformation W maps the set of observed mixture vectors into a maximally independent set of components.
Independent Component Analysis (ICA)
● Algorithm: Given X, solve for W such that the si are maximally independent.
● For each training pattern xi, i = 1, 2, 3, ..., n, an update rule is applied to W;
the function g appearing in the update may be taken, for example, as the sigmoid. A sketch follows.
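Since the syllabus cites the CS229 notes for ICA, the following is a sketch of the maximum-likelihood update rule in the style of those notes, assuming a sigmoid source CDF; the function name, learning rate, and epoch count are my own choices.

```python
import numpy as np

def ica(X, alpha=0.01, n_epochs=50, seed=0):
    """Maximum-likelihood ICA sketch. X: (n, d) array, one mixture per row.
    Returns the unmixing matrix W, so that s = W x for each mixture x."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, d))
    g = lambda s: 1.0 / (1.0 + np.exp(-s))  # sigmoid, used as the source CDF
    for _ in range(n_epochs):
        for x in X:
            x = x.reshape(d, 1)
            # Stochastic gradient ascent on the log-likelihood of W:
            # W := W + alpha * ((1 - 2 g(W x)) x^T + (W^T)^-1)
            W += alpha * ((1.0 - 2.0 * g(W @ x)) @ x.T + np.linalg.inv(W.T))
    return W

# Usage with the toy mixtures generated earlier (columns are samples there):
# W = ica(X.T); S_hat = W @ X
```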
Application of ICA

Image denoising

Medical signal processing – fMRI, ECG, EEG

Feature extraction, face recognition

Compression, redundancy reduction

Watermarking

Clustering

Time series analysis

Topic extraction

Scientific Data Mining
PCA vs ICA

PCA:
– It reduces the dimensions to avoid the problem of overfitting.
– It focuses on maximizing the variance.
– It deals with the Principal Components.
– It is commonly used in data visualization and exploratory data analysis.

ICA:
– It decomposes the mixed signal into its independent sources' signals.
– It does not focus on the issue of variance among the data points.
– It deals with the Independent Components.
– It is often used in applications such as image processing and analyzing data from EEG signals.
Non-Negative Matrix Factorization (NMF)

It is an unsupervised ML technique.

It is used for feature reduction.

This is an alternative approach to PCA, in which the data and its components are
assumed to be non-negative.

It is useful for modelling non-negative data such as images.
Non-Negative Matrix Factorization (NMF)

For a matrix V of dimension m × n where each element vij ≥ 0, NMF
decomposes it into two matrices W and H of dimension m × r and r × n
respectively, where each element wij ≥ 0 of W and hij ≥ 0 of H, and
r < min(m, n), such that

V ≈ W × H

The problem, however, cannot be solved analytically, so it is generally
approximated numerically.
Non-Negative Matrix Factorization (NMF)

So V is decomposed into a tall matrix W and a short, wide matrix H.

The user can specify r, the inner dimension of W and H, as long as
r < min(m, n).

Each column of V, vi, can be approximated as vi ≈ W hi, where hi is the corresponding column of H.

That is, each column of V is approximately a weighted sum of the columns of W,
weighted by the entries of hi. A small numeric check follows.
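A tiny numeric check of this column view, with illustrative values of my own:

```python
import numpy as np

W = np.array([[1.0, 0.0],
              [2.0, 1.0],
              [0.0, 3.0]])  # m x r = 3 x 2, non-negative
H = np.array([[0.5, 2.0],
              [1.0, 0.0]])  # r x n = 2 x 2, non-negative
V = W @ H

# Column 0 of V equals the columns of W weighted by the entries of H[:, 0].
print(np.allclose(V[:, 0], W @ H[:, 0]))                    # True
print(np.allclose(V[:, 0], 0.5 * W[:, 0] + 1.0 * W[:, 1]))  # True
```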
Non-Negative Matrix Factorization (NMF)

Problem formulation: Minimize ||V − WH|| with respect to W and H, subject to
W, H ≥ 0.

This is an NP-hard problem, so we approximate the matrices numerically, as the problem
cannot be solved analytically.


There are several ways in which W and H may be formed.

One method, proposed by Lee and Seung, is the multiplicative update rule, which is
based on gradient descent.
Non-Negative Matrix Factorization (NMF)

Algorithm:
– Randomly initialize W and H with non-negative values
– Repeat until W and H stabilize:

Fix W and update H using the following update rule:

H ← H ∘ (Wᵀ V) / (Wᵀ W H)

Fix H and update W using the following update rule:

W ← W ∘ (V Hᵀ) / (W H Hᵀ)

Note: The multiplication ∘ and the division above are done element by element, not as
matrix operations; the products inside the parentheses are ordinary matrix products.
A sketch of these updates in code follows.
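A minimal sketch of this algorithm, assuming the standard Lee and Seung multiplicative updates; the eps guard, iteration count, and function name are my own choices.

```python
import numpy as np

def nmf(V, r, n_iters=200, eps=1e-9, seed=0):
    """NMF sketch: factor a non-negative (m, n) matrix V as V ~ W @ H,
    with W (m, r) and H (r, n) non-negative and r < min(m, n)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))  # random non-negative initialization
    H = rng.random((r, n))
    for _ in range(n_iters):
        # Element-wise multiplicative updates; eps avoids division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```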
