ML Mid Syllabus


Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture - 1



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


What is Machine Learning?

Making predictions or decisions from Data

Goal: Computers should view and perceive things as humans do.

Machine Learning is arguably the greatest export from computing to other scientific fields.
Machine Learning Applications

Motivation:
Machine vision over human vision – which one is better?
Back to Machine Learning.

Main perception:
Build a machine that recognizes patterns.
What is Pattern Recognition?
Theory, algorithms, and systems to put patterns into categories.
Relate a perceived pattern to previously perceived patterns.
What is Machine Learning?

Make the machine ‘learn’ something. Evaluate how good the machine has ‘learned’.
Machine Learning

Learning = improving with experience over some task.

A machine is said to learn if its performance P, at task T, improves with experience E.
Learning Problems – Examples
• Handwriting recognition learning problem
  • Task T: recognizing handwritten words within images
  • Performance measure P: percent of words correctly recognized
  • Training experience E: a database of handwritten words with given classifications
Machine Learning Pipelining

Figure: Training – training data passes through feature extraction and, together with the training labels, into classifier training, which produces a trained classifier.
Testing – test data passes through feature extraction and the trained classifier to produce a prediction.
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture - 2



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


Machine Learning Pipelining (recap)

Figure: training data → features → classifier training (with training labels) → trained classifier; test data → features → trained classifier → prediction.
How Does a Machine Decide?

Differentiating between low-risk and high-risk customers from their income and savings.

Discriminant: IF income > θ1 AND savings > θ2
              THEN low-risk
              ELSE high-risk
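The discriminant above is just a pair of threshold tests. A minimal sketch in Python, assuming hypothetical threshold values for θ1 and θ2 chosen only for illustration:

# Sketch of the income/savings discriminant; theta1 and theta2 are
# hypothetical thresholds, not values given in the lecture.
def classify_customer(income, savings, theta1=40_000, theta2=10_000):
    """Return 'low-risk' if both thresholds are exceeded, else 'high-risk'."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

print(classify_customer(55_000, 15_000))  # low-risk
print(classify_customer(25_000, 15_000))  # high-risk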
Machine Learning applications?
• Speech Recognition
• Face Recognition
• Handwriting Character Recognition
• Autonomous Driving
Face recognition
Training examples of a person

Test images

AT&T Laboratories, Cambridge UK


http://www.uk.research.att.com/facedatabase.html

OCR & Handwriting recognition
Speech Recognition
Machine Learning Pipelining

Figure: training data → features → classifier training (with training labels) → trained classifier; test image → features → trained classifier → prediction.
Features
• Features are the individual measurable properties of the signal being observed.

• The set of features used for learning/recognition is called the feature vector.

• The number of features used is the dimensionality of the feature vector.
Features

Example (figure): height and weight measured from a person form the feature vector x = [x1, x2]ᵀ.
Feature Extraction
• Feature extraction aims to create discriminative features good for learning
• Good features:
  • Objects from the same class have similar feature values.
  • Objects from different classes have different feature values.

Figure: “Good” features vs. “Bad” features


Contents

• Supervised learning
–Classification

• Unsupervised learning

• Reinforcement learning
CLASSIFICATION
Supervised learning - Classification
• Objective
• Make Nicolas recognize what is an apple and
what is an orange
Classification

Apples Oranges
Classification
• You had some training examples or ‘training data’   (“What is this???”)

• The examples were ‘labeled’

• You used those examples to make the kid ‘learn’ the difference between an apple and an orange

(“It’s an apple!!!”)
Classification
Apple

Pear

Tomato

Cow

Dog

Horse

Given: training images and their categories.
What are the categories of these test images?
Classification
• Cancer diagnosis – generally more than one variable

Figure: scatter plot of patients by Age vs. Tumor Size, each labeled Malignant or Benign; a new patient to predict is marked.

Why supervised – the algorithm is given a number of patients with the RIGHT ANSWER and we want the algorithm to learn to predict for new patients.

We want the algorithm to learn the separation line. Once a new patient arrives with a given age and tumor size, predict Malignant or Benign.
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture – 3-4



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


Contents

• Supervised learning
–Classification

• Unsupervised learning

– Clustering

• Reinforcement learning
CLUSTERING
UNSUPERVISED LEARNING – CLUSTERING

There are two types of fruit in the basket; separate them into two ‘groups’.

• The data was not ‘labeled’: you did not tell Nicolas which are apples and which are oranges.

• Maybe the kid used the idea that things in the same group should be similar to one another as compared to things in the other group.   (Figure: separate groups or clusters)

• Groups = Clusters
Classification vs Clustering

Challenges ?
Classification vs Clustering

• Challenges
– Intra-class variability
– Inter-class similarity
Intra Class Variability
Objects belong to the same class but look different.

Examples (figures): the letter “T” in different typefaces; the same face under different expression, pose, and illumination.

Inter Class Similarity
Objects belong to different classes but look alike.

Examples (figures): identical twins; characters that look similar.


Task …
• Separation of different coins using a robotic arm
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture – 5-6



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


Contents

• Supervised learning
– Classification

• Unsupervised learning

• Reinforcement learning
Reinforcement Learning
• In RL, the computer is simply given a goal to
achieve.
• The computer then learns how to achieve that
goal by trial-and-error interactions with its
environment

The system learns from success and failure, reward and punishment.
Reinforcement Learning
• Objective: training a pet dog
• Need to make a sequence of good decisions to train it.
• Every time the dog does something good, you pat him and say ‘good dog’.

• Every time the dog does something bad, you scold him, saying ‘bad dog’.

• Over time the dog will learn to do good things.


A Fancy Problem
Sorting incoming fish on a conveyor according to
species (salmon or sea bass)

Salmon or sea bass?


(2 categories or classes)

It is a classification
problem. How to
solve it?
Approach
• Data collection
  How to use it?

• Preprocessing: use a segmentation operation to isolate fishes from one another and from the background
  Image processing?

• Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features
  But which features to extract?

• The features are passed to a classifier that evaluates the evidence and then takes a final decision
  How to design and realize a classifier?
Approach
• Set up a camera and take some sample images to extract features
  • Length
  • Lightness
  • Width
  • Number and shape of fins
  • Position of the mouth, etc…
• This is the set of all suggested features to explore for use in our classifier!
• Challenges:
  • Variations in images – lighting, occlusion, camera view angle
  • Position of the fish on the conveyor belt, etc…
How data is collected and used
• Data can be raw signals (e.g. images) or features extracted from images – data is usually
expensive
• The data is divided into three parts (exact percentage of each portion depends (partially) on
data sample size)

Train Validation Test

• Train data: It is used to build a prediction model or learner (classifier)

• Validation data: It is used to estimate the prediction error (classification error) and adjust the learner
parameters

• Test data: It is used to estimate the classification error of the chosen learner on unseen data, called the generalization error. The test data must be kept inside a ‘vault’ and brought out only at the end of the data analysis
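As a rough illustration of the three-way split described above, here is a minimal sketch in Python; the 60/20/20 proportions are only an assumption, since the lecture notes that the exact percentages depend on the sample size.

import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train/validation/test portions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]                   # kept in the 'vault' until the end
    val = data[n_test:n_test + n_val]      # used to tune learner parameters
    train = data[n_test + n_val:]          # used to build the learner
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))     # 60 20 20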
Feature extraction
• Feature extraction: use domain knowledge

– The sea bass is generally longer than a salmon


– The average lightness of sea bass scales is greater than that of salmon

• We will use training data in order to learn a classification rule based on these features (length of a fish and average lightness)

• Length of fish and average lightness may not be sufficient features i.e. they
may not guarantee 100% classification results
Classification
Select the length of the fish as a possible feature for discrimination between the two classes.

Figure: histograms of the length feature for the two categories, with the decision boundary marked.


Cost of Taking a Decision
• A fish-packaging industry uses the system to pack fish in cans.
• Two facts:
  • People do not want to find sea bass in cans labeled salmon.
  • People occasionally accept finding salmon in cans labeled sea bass.
• So the decision is taken in favor of sea bass.
Evaluation of a classifier
• How to evaluate a certain classifier?

• Classification error: the percentage of patterns (e.g. fish) that are assigned to the wrong category
• Choose a classifier that gives the minimum classification error

Two performance measures used for the evaluation of a classifier:

1) Precision
2) Recall

Precision = number of relevant retrieved samples / total retrieved samples

Recall = number of relevant retrieved samples / total number of relevant samples


Example: Suppose you have 5 documents, say
Doc 1= excel
Doc 2= word
Doc 3= excel
Doc 4= word
Doc 5= word
You wish to design an information retrieval system. Suppose your system retrieves the documents:
Doc 1= excel
Doc 2= word
Doc 4= word
Doc 5= word
Find the precision and recall of the system for excel document retrieval.
Precision = number of relevant retrieved samples / total retrieved samples = 1/4

Recall = number of relevant retrieved samples / total number of relevant samples = 1/2
Example :

A database contains 80 records on a particular topic . A search was

conducted on that topic and 60 records were retrieved. Of the 60

records retrieved, 45 were relevant. Find precision and recall.


Precision = number of relevant retrieved samples / total retrieved samples = 45/60 = 0.75

Recall = number of relevant retrieved samples / total number of relevant samples = 45/80 = 0.5625
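A small sketch of these two formulas in Python, checked against both examples above; the document sets and counts are taken from the text, while the function itself is only illustrative.

def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / total retrieved;
    recall = relevant retrieved / total relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Document example: Doc 1 and Doc 3 are 'excel'; the system returned Docs 1, 2, 4, 5.
print(precision_recall(retrieved={1, 2, 4, 5}, relevant={1, 3}))  # (0.25, 0.5)

# Database example: 60 records retrieved, 45 of them relevant, 80 relevant in total.
print(45 / 60, 45 / 80)  # 0.75 0.5625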
Lecture – 7-8

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Bayesian Decision Theory
Probability Theory:
A random experiment is one whose outcome is not predictable.

Sample Space (S): the set of all possible outcomes.

A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes; otherwise it is continuous.

Example: Consider the experiment where two coins are simultaneously tossed. The elementary events are

S = {HH, HT, TH, TT}

How to calculate probability?

P(E) = (number of outcomes favorable to E) / (total number of possible outcomes)
Probability that at least one head occurs = P(A) = 3/4

Probability that two heads occur simultaneously = P(B) = 1/4

Probability that two tails occur simultaneously = 1/4

Note that P(S) = 1


Consider a box with n white and m red balls. In this case, there are two elementary outcomes: white ball or red ball.

Probability of “selecting a white ball” = n / (n + m)
If A1 is an event that cannot possibly occur, then P(A1) = 0. If A2 is sure to occur, P(A2) = 1.

0 ≤ P(A) ≤ 1

If A and B are mutually exclusive events, then the probability of their union is the sum of their probabilities:

If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

If A and B are not mutually exclusive, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Conditional probability:

P(A | B) = P(AB) / P(B)

P(A | B) = probability of “the event A, given that B has occurred”

If two events A and B are independent:

P(A and B) = P(A)P(B)

If A and B are mutually exclusive:

P(A and B) = 0

P(A | B) = P(AB) / P(B)  ⟹  P(AB) = P(A | B) P(B)

P(B | A) = P(AB) / P(A)  ⟹  P(AB) = P(B | A) P(A)

Thus

P(A | B) P(B) = P(B | A) P(A)

P(A | B) = P(B | A) P(A) / P(B)    (Bayes Theorem)
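A quick numeric sketch of Bayes' theorem; the probabilities below are hypothetical numbers chosen only to show the arithmetic.

p_a = 0.3          # prior P(A), hypothetical
p_b_given_a = 0.8  # likelihood P(B | A), hypothetical
p_b = 0.5          # evidence P(B), hypothetical

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
print(p_a_given_b)                      # 0.48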
What is Bayesian decision theory?
• Mathematical foundation for decision making.

• Uses a probabilistic approach to help make decisions (classification) so as to minimize the risk (cost).

• The decision problem is viewed in terms of probabilities and it is assumed that all of the relevant probabilities are known.

• It makes a compromise between the decisions and their costs and finds an optimal solution.
Fish sorting - revisited

How to design it?


Notations
ωi ∈ {ω1, ω2, …, ωc} : a state of nature
P(ωi) : prior probability
x : feature vector
p(x | ωi) : class-conditional density
P(ωi | x) : posterior probability


Decision rule
• In our example, the fish on the conveyor belt
may be salmon or sea bass and we are not
certain about it. Thus it can be described in
terms of probability as the outcome of the
experiment is not certain
• So, we assign a random variable ω to describe
the type of fish
State of nature

ω = ω1 for sea bass


ω = ω2 for salmon
Decision Rule based on Prior Information
• Prior probability: It is based on our knowledge without doing any
experimentation

• For example: if fishermen catch as much sea bass as salmon in a season, then the catch and their appearance on the conveyor belt are equiprobable:

P(ω1) = P(ω2) = 0.5 and P(ω1) + P(ω2) = 1

• In general, sea bass and salmon may appear with any non-zero probabilities:

P(ω1) > 0 and P(ω2) > 0, but P(ω1) + P(ω2) = 1
Decision Rule based on Prior Information
• An optimal decision rule based on only prior
probabilities (without seeing the fish) is
Decide 1 if P(1 )  P(2 );otherwise decide 2

• This rule makes sense if we wish to judge one fish


but it is odd if we are judging many fish
• If P(ω 1) >> P(ω 2) then our decision will be right
most of the time
• The probability of error is the minimum of both
probabilities
P (error )  min  P (1 ), P (2 ) 
Decision Rule based on Data
• Data can be images or features extracted (e.g.
lightness of fish) from images
• In our case, different fish yield different
lightness readings, and we express this
variability in probabilistic terms
• It is expressed as a conditional probability also
called class-conditional probability function i.e.

p( x | 1 )  The pdf of x given that state of nature is 1


p( x | 2 )  The pdf of x given that state of nature is 2
Decision Rule based on Data
• We can combine the new knowledge, i.e. the class-conditional probability function, with the prior knowledge to compute the posterior probability using Bayes’ rule:

P(ωi | x) = p(x | ωi) P(ωi) / p(x)

where p(x) = Σ_{i=1}^{2} p(x | ωi) P(ωi)
Decision Rule based on Data
• Bayes’ rule can also be expressed as

posterior = (likelihood × prior) / evidence

• The evidence can be considered as a scale factor, unimportant in making the decision:

P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Decision Rule based on Data
• An optimal decision rule based on posterior
probabilities is
Decide 1 if P(1|x) > P(2|x); otherwise decide 2

p( x | 1 ) P(1 ) p( x | 2 ) P(2 )
P(1 | x)  , P(2 | x) 
p( x) p( x)
Decision rule becomes
p( x | 1 ) P(1 ) p( x | 2 ) P(2 )

p( x) p( x)
p( x | 1 ) P(1 )  p ( x | 2 ) P (2 )
Decision Rule based on Data
Decide 1 if p(x|1)P(1) > p(x|2)P(2); otherwise decide 2

Special cases:
1. P(1)=P(2)
Decide 1 if p(x|1) > p(x|2); otherwise decide 2
2. p(x|1)=p(x|2)
Decide 1 if P(1) > P(2); otherwise decide 2
Classification error
Decide 1 if P(1 | x)  P(2 | x);otherwise decide 2
 The probability of error is the minimum of both probabilities

 P (1 | x) if we decide 2
p (error | x)  
 P (2 | x) if we decide 1
p (error | x)  min  P(1 | x), P(2 | x) 
Confusion Matrix
• To evaluate the classifier, we make a table of the following type:

                                      Predicted class
                                      Sea bass    Salmon
State of nature       Sea bass          N1          N2
(ground truth)        Salmon            N3          N4

N1 + N2 = N_seabass
N3 + N4 = N_salmon

Classification rate  = correct classifications / total number of examples   = (N1 + N4) / (N_seabass + N_salmon)
Classification error = incorrect classifications / total number of examples = (N2 + N3) / (N_seabass + N_salmon)
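The two formulas above translate directly into code; the counts used in the example call are hypothetical.

def rates(n1, n2, n3, n4):
    """Classification rate and error from the 2x2 confusion matrix
    (rows = ground truth sea bass / salmon, columns = predicted class)."""
    total = n1 + n2 + n3 + n4
    rate = (n1 + n4) / total    # correct classifications
    error = (n2 + n3) / total   # incorrect classifications
    return rate, error

print(rates(n1=40, n2=10, n3=5, n4=45))  # hypothetical counts -> (0.85, 0.15)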
Lecture – 9-10

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Non Parametric Classification
K-Nearest Neighbor (KNN)
Unsupervised Learning
K-means Clustering
The assumption that the probability density function (pdf) that generated the
data has some parametric form may not hold true in many cases

Non-parametric classifiers assume that the pdf does not have any parametric
form

It is based on finding the similarity between the data samples


Non Parametric Classification
• Given a training dataset with classes i.e. data and their labels
are known

{ (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) },  x ∈ ℝᵈ

• Objective: Decide the class (or label) of a test vector xt


• Method:
• Compute the distance between the vector xt and all examples of the
training dataset
• Decision: assign the class of the nearest neighbor (closest point) from
training dataset to the vector xt
Non Parametric Classification

Figure: feature space containing vectors of the blue class and vectors of the red class, with a test vector xt.

Class(xt) = class of the nearest neighbor


Problem Statement
Can we LEARN to recognise a rugby player?

What are the “features” of a rugby player?


Features
Rugby players = short + heavy?   (Figure: examples plotted on height vs. weight axes)

Features
Ballet dancers = tall + skinny?   (Figure: examples plotted on height vs. weight axes)

Feature Space
Rugby players “cluster” separately in the space.   (Figure: height vs. weight scatter plot)
The K-Nearest Neighbour Algorithm
Who’s this?   (Figure: a new point on the height vs. weight plot)

1. Measure distance to all points
2. Find closest “k” points (here k=3, but it could be more)
3. Assign majority class
Distance Measure
“Euclidean distance” between points (w, h) and (w1, h1):

d = √( (w − w1)² + (h − h1)² )
The K-Nearest Neighbour Algorithm

for each testing point
    measure distance to every training point
    find the k closest points
    identify the most common class among those k
    predict that class
end

• Advantage: Surprisingly good classifier!


• Disadvantage: Have to store the entire training set in memory
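A runnable sketch of the pseudocode above, using Euclidean distance and a majority vote; the (weight, height) training examples are made-up values in the spirit of the rugby/ballet illustration.

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, test_point, k=3):
    """train is a list of (feature_vector, label) pairs. Return the majority
    label among the k training points closest to test_point."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (weight kg, height cm) examples.
train = [((95, 180), "rugby"), ((100, 175), "rugby"), ((90, 185), "rugby"),
         ((55, 170), "ballet"), ((50, 165), "ballet"), ((60, 172), "ballet")]
print(knn_predict(train, (92, 178)))  # rugby
print(knn_predict(train, (58, 168)))  # ballet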
K-Nearest Neighbor Rule
• Given a training dataset with classes i.e. data and their labels are known
{ (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) },  x ∈ ℝᵈ

• Objective: Decide the class (or label) of a test vector xt


• Method:
• Compute the distances between the vector xt and all examples of the training dataset
• Determine k nearest neighbors of xt
• Decision: class(xt) = majority vote of k nearest neighbors
• The choice of distance is very important
K-Nearest Neighbor Rule
• Example (figures): the same test point classified with 1-NN and with 3-NN, for two classes (Class 1 and Class 2).
K-Nearest Neighbor Rule

The test sample (green circle) should be classified either to the first class of blue
squares or to the second class of red triangles. If k = 3 it is assigned to the second
class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it
is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

Image Source: Wikipedia


Decision Regions
• Voronoi Diagram
Given a set of points (referred to as sites or nodes) a Voronoi
diagram is a partition of space into regions, within which all
points are closer to some particular node than to any other
node
Voronoi Editing
• Each cell contains one sample, and every location within
the cell is closer to that sample than to any other sample.
• Every query point will be assigned the classification of
the sample within that cell.
Voronoi Editing
• Knowledge of this boundary is sufficient to classify new points.
• The boundary itself is rarely computed; many algorithms seek to retain
only those points necessary to generate an identical boundary.
Voronoi Editing
• The boundaries of the Voronoi regions separating
those regions whose nodes are of different class
contribute to a portion of the decision boundary

The decision boundary


Voronoi Editing
• The nodes of the Voronoi regions whose boundaries
did not contribute to the decision boundary are
redundant and can be safely deleted from the training
set

The edited training set.


Voronoi Editing
• A drawback of Voronoi editing is that it still leaves too many points
in the edited set by preserving many points which are well
separated from other classes and which should intuitively be
deleted

Points A1 and B1 are well separated. They are preserved by the Voronoi editing to maintain a
portion of the decision boundary. However, assuming that new points will come from the same
distribution as the training set, the portions of the decision boundary remote from the
concentration of training set points is of lesser importance.
Gabriel Graph
Two points A and B are said to be Gabriel neighbours if
their diametral sphere (i.e. the sphere such that AB is
its diameter) doesn't contain any other points

Points A and B are Gabriel neighbours, whereas B and C are not.

A graph where all pairs of Gabriel neighbours are


connected with an edge is called the Gabriel graph.
Unsupervised Learning
K mean clustering
CLUSTERING
Clustering is the partitioning of a data set into
subsets (clusters), so that the data in each
subset (ideally) share some common trait - often
according to some defined distance measure.

Clustering is unsupervised classification


CLUSTERING
• There is no explicit teacher and the system forms
clusters or “natural groupings” or structure in the input
pattern
CLUSTERING
• Data WITHOUT classes or labels
{ x1, x2, x3, …, xn },  x ∈ ℝᵈ

• Deals with finding a structure in a collection of unlabeled data.

• The process of organizing objects into groups whose members are similar
in some way

• A cluster is therefore a collection of objects which are “similar” between


them and are “dissimilar” to the objects belonging to other clusters.
CLUSTERING

• In this case we easily identify the 4 clusters into which the data can be divided.

• The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance.
Types of Clustering
• Hierarchical algorithms
– These find successive clusters using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative
algorithms begin with each element as a separate cluster
and merge them into successively larger clusters.

2. Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed to divide it successively into smaller clusters.
Types of Clustering
• Partitional clustering
• Construct a partition of a data set to produce several clusters – At once
• The process is repeated iteratively – Termination condition
• Examples
• K-means clustering
• Fuzzy c-means clustering
K MEANS CLUSTERING
K MEANS – example 1   (sequence of figures illustrating the iterations)
K MEANS – Example 2
• Suppose we have 4 medicines and each has two
attributes (pH and weight index). Our goal is to group
these objects into K=2 groups of medicine

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
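A plain k-means sketch applied to the four medicines above. Initializing the centroids with the first k points is one simple choice among many, so this is an illustrative run rather than the only valid one.

import math

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k=2, iters=10):
    """Alternate the assignment step and the centroid-update step."""
    centroids = list(points[:k])                 # simple initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to its nearest centroid
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [mean(c) if c else centroids[j]   # recompute the centroids
                     for j, c in enumerate(clusters)]
    return clusters, centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]        # (weight, pH-index) of medicines A-D
clusters, centroids = kmeans(points, k=2)
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]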
Kmeans - Examples
• Data Points – RGB Values of pixels
• Can be used for Image Segmentation

D. Comaniciu and P. Meer, Robust Analysis of Feature Spaces: Color Image Segmentation, 1997.
Kmeans - Examples

Original K=5 K=11


Kmeans - Examples
• Quantization of colors
Kmeans - Examples
• Extraction of text in degraded documents

Original Image Kmeans with k=3


Kmeans - Examples
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture – 11-12



Dimensionality Reduction
Why dimensionality Reduction?
• Generally, it is easy and convenient to collect data
• An experiment

• Data accumulates at an unparalleled speed


• Data preprocessing is an important part for effective
machine learning
• Dimensionality reduction is an effective approach to
downsizing data

Why dimensionality Reduction?
• Most machine learning techniques may not
be effective for high-dimensional data
• Curse of Dimensionality
• Accuracy and efficiency may degrade rapidly as
the dimension increases.

Why dimensionality Reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.

• Data compression: efficient storage and retrieval.

• Noise removal: positive effect on query accuracy.
Other examples

Figures: face images; handwritten digits
Dimensionality Reduction
• Reduces time complexity: Less computation

• Reduces space complexity: fewer parameters

• Saves the cost of observing/computing the feature
Dimensionality Reduction

Key methods of dimensionality reduction

Feature Selection Feature Extraction

Feature selection vs extraction
• Feature selection:

Choosing k < d important features, ignoring the remaining d − k. These are subset selection algorithms.

• Feature extraction:

Project the original xi, i = 1, ..., d dimensions to new k < d dimensions, zj, j = 1, ..., k.
Feature extraction
• Given a set of data points of d variables {x1, x2, …, xn}

• Compute their low-dimensional representation:

xi ∈ ℝᵈ  →  yi ∈ ℝᵖ   (p ≪ d)
Feature Selection

Contents: Feature Selection
• Introduction

• Feature subset search

• Models for Feature Selection


• Principle component analysis

Introduction
You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer).

The data has 10,000 fields (features).

You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?

This process of choosing the 1,000 fields to use is an example of Feature Selection.
Feature Selection: why?
• Quite easy to find lots more cases from papers,
where experiments show that accuracy reduces
when you use more features

• Questions?
• Why does accuracy reduce with more features?
• How does it depend on the specific choice of features?
• What else changes if we use more features?
• So, how do we choose the right features?

Why does accuracy reduce?
Note: Suppose the best feature set has 20
features. If you add another 5 features,
typically the accuracy of machine learning
may reduce. But you still have the original
20 features!! Why does this happen???

Noise/explosion
• The additional features typically add noise

• Machine learning will pick up on spurious correlations that might be true in the training set, but not in the test set

• For some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc…) – the increased space of possibilities is more difficult to search
Overfitting
A statistical model is said to be overfitted when it learns the training data in too much detail (just like fitting ourselves into oversized pants!). When a model gets trained this way, it starts learning from the noise and inaccurate data entries in our data set. The model then does not categorize new data correctly, because of too much detail and noise.
Principal component Analysis (PCA)
• PCA is one of the most common feature
selection /extraction techniques
• Reduce the dimensionality of a data set by
finding a new set of variables, smaller than
the original set of variables
• Allows us to combine much of the
information contained in n features into m
features where m < n

PCA – Introduction

• The 1st PC z1 is a minimum distance fit to a line in X space

• The 2nd PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC

PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous.
Principal component Analysis (PCA)

PCA – Introduction
Transform n-dimensional data to a new n-
dimensions
The new dimension with the most variance is the
first principal component
The next is the second principal component, etc.

Recap

Variance and Covariance


• Variance is a measure of data spread in one
dimension (feature)
• Covariance measures how two dimensions
(features) vary with respect to each other

Recap

Covariance
• Focus on the sign (rather than exact value)
of covariance
• Positive value means that as one feature
increases or decreases the other does also
(positively correlated)
• Negative value means that as one feature
increases the other decreases and vice versa
(negatively correlated)
• A value close to zero means the features are uncorrelated

Recap

Covariance Matrix
• Covariance matrix is an n × n matrix containing
the covariance values for all pairs of features in a
data set with n features (dimensions)

• The diagonal contains the covariance of a feature with itself, which is the variance (the square of the standard deviation)

• The matrix is symmetric
PCA – Main Steps
1. Center the data around 0.

2. Form the covariance matrix S.

3. Compute its eigenvectors {a_i}, i = 1, …, d.

4. The first p eigenvectors {a_i}, i = 1, …, p, form the p PCs.

5. The transformation G consists of the p PCs:  G = [a1, a2, …, ap]

A test point x ∈ ℝᵈ  →  Gᵀ x ∈ ℝᵖ
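The steps above map almost one-to-one onto NumPy. This is a minimal sketch (eigh is used because the covariance matrix is symmetric), with hypothetical random data in the usage example.

import numpy as np

def pca(X, p):
    """Center the data, form the covariance matrix S, keep the p eigenvectors
    with the largest eigenvalues as the PCs, and project the data with G."""
    Xc = X - X.mean(axis=0)               # 1. center data around 0
    S = np.cov(Xc, rowvar=False)          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # 3. eigen-decomposition of symmetric S
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    G = eigvecs[:, order[:p]]             # 4./5. first p eigenvectors form G
    return G, Xc @ G                      # transformation G and projected data (G^T x per row)

rng = np.random.default_rng(0)            # hypothetical data: 100 samples, 5 features
X = rng.normal(size=(100, 5))
G, Z = pca(X, p=2)
print(G.shape, Z.shape)                   # (5, 2) (100, 2)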
PCA – Worked Example

Data:

X     Y
2.5   2.4
0.5   0.7
2.2   2.9
1.9   2.2
3.1   3.0
2.3   2.7
2.0   1.6
1.0   1.1
1.5   1.6
1.1   0.9
PCA – Worked Example: Step 1

• First step is to center the original data around 0
• Subtract the mean from each value (X̄ = 1.81, Ȳ = 1.91)

X     Y       X′      Y′
2.5   2.4     0.69    0.49
0.5   0.7    −1.31   −1.21
2.2   2.9     0.39    0.99
1.9   2.2     0.09    0.29
3.1   3.0     1.29    1.09
2.3   2.7     0.49    0.79
2.0   1.6     0.19   −0.31
1.0   1.1    −0.81   −0.81
1.5   1.6    −0.31   −0.31
1.1   0.9    −0.71   −1.01
PCA – Worked Example: Step 2

• Calculate the covariance matrix of the centered data – only 2 × 2 for this case

cov(X, Y) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

cov = [ 0.616555556   0.615444444
        0.615444444   0.716555556 ]
PCA – Worked Example: Step 3

• Calculate the eigenvectors and eigenvalues of the covariance matrix (remember linear algebra)
• The covariance matrix is square n × n; n eigenvalues will exist
• All eigenvectors (principal components/dimensions) are orthogonal to each other and will make a new set of dimensions for the data
• The magnitude of each eigenvalue corresponds to the variance along that new dimension – just what we wanted!
• We can sort the principal components according to their eigenvalues
• Just keep those dimensions with the largest eigenvalues

eigenvalues = [ 0.0490833989
                1.28402771  ]

eigenvectors = [ −0.735178656   −0.677873399
                  0.677873399   −0.735178656 ]

(each column is the eigenvector for the eigenvalue in the same position; the column paired with the larger eigenvalue, 1.28402771, is the first principal component)
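These numbers can be checked with a small NumPy sketch, using the data above (last sample (1.1, 0.9), consistent with the mean and centered values). The eigenvector signs may come out flipped, which does not change the principal components.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
S = np.cov(X, rowvar=False)               # covariance matrix of the worked example
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues returned in ascending order
print(S)        # [[0.61655556 0.61544444]
                #  [0.61544444 0.71655556]]
print(eigvals)  # [0.0490834  1.28402771]
# eigvecs holds the corresponding eigenvectors in its columns (unique up to sign)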
