ML Mid Syllabus


Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture - 1



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


What is Machine Learning?

Making predictions or decisions from Data

Goal: Computers should view and perceive things as humans do.

Machine Learning is arguably the greatest export from computing to other scientific fields.
Machine Learning Applications

Motivation:
Machine vision over human vision – which one is better?
Back to Machine Learning.

Main perception:
Build a machine that recognizes patterns.
What is Pattern Recognition?
Theory, algorithms, and systems to put patterns into categories.
Relate a perceived pattern to previously perceived patterns.
What is Machine Learning?

Make the machine ‘learn’ something. Evaluate how good the machine has ‘learned’.
Machine Learning

Learning = improving with experience over some task.

A machine is said to learn if its performance P, at task T, improves with experience E.
Learning Problems – Examples
• Handwriting recognition learning problem
  • Task T: recognizing handwritten words within images
  • Performance measure P: percent of words correctly recognized
  • Training experience E: a database of handwritten words with given classifications
Machine Learning Pipelining

Figure: Training – training data passes through feature extraction and, together with the training labels, into classifier training, which produces a trained classifier.
Testing – test data passes through feature extraction and the trained classifier to produce a prediction.
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture - 2



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


Machine Learning Pipelining (recap)

Figure: training data → features → classifier training (with training labels) → trained classifier; test data → features → trained classifier → prediction.
How Does a Machine Decide?

Differentiating between low-risk and high-risk customers from their income and savings.

Discriminant: IF income > θ1 AND savings > θ2
              THEN low-risk
              ELSE high-risk
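The discriminant above is just a pair of threshold tests. A minimal sketch in Python, assuming hypothetical threshold values for θ1 and θ2 chosen only for illustration:

# Sketch of the income/savings discriminant; theta1 and theta2 are
# hypothetical thresholds, not values given in the lecture.
def classify_customer(income, savings, theta1=40_000, theta2=10_000):
    """Return 'low-risk' if both thresholds are exceeded, else 'high-risk'."""
    if income > theta1 and savings > theta2:
        return "low-risk"
    return "high-risk"

print(classify_customer(55_000, 15_000))  # low-risk
print(classify_customer(25_000, 15_000))  # high-risk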
Machine Learning applications?
• Speech Recognition
• Face Recognition
• Handwriting Character Recognition
• Autonomous Driving
Face recognition
Training examples of a person

Test images

AT&T Laboratories, Cambridge UK


http://www.uk.research.att.com/facedatabase.html

OCR & Handwriting recognition
Speech Recognition
Machine Learning Pipelining

Figure: training data → features → classifier training (with training labels) → trained classifier; test image → features → trained classifier → prediction.
Features
• Features are the individual measurable properties of the signal being observed.

• The set of features used for learning/recognition is called the feature vector.

• The number of features used is the dimensionality of the feature vector.
Features

Example (figure): height and weight measured from a person form the feature vector x = [x1, x2]ᵀ.
Feature Extraction
• Feature extraction aims to create discriminative features good for learning
• Good features:
  • Objects from the same class have similar feature values.
  • Objects from different classes have different feature values.

Figure: “Good” features vs. “Bad” features


Contents

• Supervised learning
–Classification

• Unsupervised learning

• Reinforcement learning
CLASSIFICATION
Supervised learning - Classification
• Objective
• Make Nicolas recognize what is an apple and
what is an orange
Classification

Apples Oranges
Classification
• You had some training examples or ‘training data’   (“What is this???”)

• The examples were ‘labeled’

• You used those examples to make the kid ‘learn’ the difference between an apple and an orange

(“It’s an apple!!!”)
Classification
Apple

Pear

Tomato

Cow

Dog

Horse

Given: training images and their categories.
What are the categories of these test images?
Classification
• Cancer diagnosis – generally more than one variable

Figure: scatter plot of patients by Age vs. Tumor Size, each labeled Malignant or Benign; a new patient to predict is marked.

Why supervised – the algorithm is given a number of patients with the RIGHT ANSWER and we want the algorithm to learn to predict for new patients.

We want the algorithm to learn the separation line. Once a new patient arrives with a given age and tumor size, predict Malignant or Benign.
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture – 3-4



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


Contents

• Supervised learning
–Classification

• Unsupervised learning

– Clustering

• Reinforcement learning
CLUSTERING
UNSUPERVISED LEARNING – CLUSTERING

There are two types of fruit in the basket; separate them into two ‘groups’.

• The data was not ‘labeled’: you did not tell Nicolas which are apples and which are oranges.

• Maybe the kid used the idea that things in the same group should be similar to one another as compared to things in the other group.   (Figure: separate groups or clusters)

• Groups = Clusters
Classification vs Clustering

Challenges ?
Classification vs Clustering

• Challenges
– Intra-class variability
– Inter-class similarity
Intra Class Variability
Objects belong to the same class but look different.

Examples (figures): the letter “T” in different typefaces; the same face under different expression, pose, and illumination.

Inter Class Similarity
Objects belong to different classes but look alike.

Examples (figures): identical twins; characters that look similar.


Task …
• Separation of different coins using a robotic arm
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture – 5-6



Reference Book:
Pattern Classification by Richard O. Duda

Pattern Recognition & Machine Learning, 1st Edition, Chris Bishop


Contents

• Supervised learning
– Classification

• Unsupervised learning

• Reinforcement learning
Reinforcement Learning
• In RL, the computer is simply given a goal to
achieve.
• The computer then learns how to achieve that
goal by trial-and-error interactions with its
environment

The system learns from success and failure, reward and punishment.
Reinforcement Learning
• Objective: training a pet dog
• Need to make a sequence of good decisions to train it.
• Every time the dog does something good, you pat him and say ‘good dog’.

• Every time the dog does something bad, you scold him, saying ‘bad dog’.

• Over time the dog will learn to do good things.


A Fancy Problem
Sorting incoming fish on a conveyor according to
species (salmon or sea bass)

Salmon or sea bass?


(2 categories or classes)

It is a classification
problem. How to
solve it?
Approach
• Data collection
  How to use it?

• Preprocessing: use a segmentation operation to isolate fishes from one another and from the background
  Image processing?

• Information from a single fish is sent to a feature extractor whose purpose is to reduce the data by measuring certain features
  But which features to extract?

• The features are passed to a classifier that evaluates the evidence and then takes a final decision
  How to design and realize a classifier?
Approach
• Set up a camera and take some sample images to extract features
  • Length
  • Lightness
  • Width
  • Number and shape of fins
  • Position of the mouth, etc…
• This is the set of all suggested features to explore for use in our classifier!
• Challenges:
  • Variations in images – lighting, occlusion, camera view angle
  • Position of the fish on the conveyor belt, etc…
How data is collected and used
• Data can be raw signals (e.g. images) or features extracted from images – data is usually
expensive
• The data is divided into three parts (exact percentage of each portion depends (partially) on
data sample size)

Train Validation Test

• Train data: It is used to build a prediction model or learner (classifier)

• Validation data: It is used to estimate the prediction error (classification error) and adjust the learner
parameters

• Test data: It is used to estimate the classification error of the chosen learner on unseen data, called the generalization error. The test data must be kept inside a ‘vault’ and brought out only at the end of the data analysis
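As a rough illustration of the three-way split described above, here is a minimal sketch in Python; the 60/20/20 proportions are only an assumption, since the lecture notes that the exact percentages depend on the sample size.

import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the samples and cut them into train/validation/test portions."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]                   # kept in the 'vault' until the end
    val = data[n_test:n_test + n_val]      # used to tune learner parameters
    train = data[n_test + n_val:]          # used to build the learner
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))     # 60 20 20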
Feature extraction
• Feature extraction: use domain knowledge

– The sea bass is generally longer than a salmon


– The average lightness of sea bass scales is greater than that of salmon

• We will use training data in order to learn a classification rule based on these features (length of a fish and average lightness)

• Length of fish and average lightness may not be sufficient features i.e. they
may not guarantee 100% classification results
Classification
Select the length of the fish as a possible feature for discrimination between the two classes.

Figure: histograms of the length feature for the two categories, with the decision boundary marked.


Cost of Taking a Decision
• A fish-packaging industry uses the system to pack fish in cans.
• Two facts:
  • People do not want to find sea bass in cans labeled salmon.
  • People occasionally accept finding salmon in cans labeled sea bass.
• So the decision is taken in favor of sea bass.
Evaluation of a classifier
• How to evaluate a certain classifier?

• Classification error: the percentage of patterns (e.g. fish) that are assigned to the wrong category
• Choose a classifier that gives the minimum classification error

Two performance measures used for the evaluation of a classifier:

1) Precision
2) Recall

Precision = number of relevant retrieved samples / total retrieved samples

Recall = number of relevant retrieved samples / total number of relevant samples


Example: Suppose you have 5 documents, say
Doc 1= excel
Doc 2= word
Doc 3= excel
Doc 4= word
Doc 5= word
You wish to design an information retrieval system. Suppose your system retrieves the documents:
Doc 1= excel
Doc 2= word
Doc 4= word
Doc 5= word
Find the precision and recall of the system for excel document retrieval.
Precision = number of relevant retrieved samples / total retrieved samples = 1/4

Recall = number of relevant retrieved samples / total number of relevant samples = 1/2
Example :

A database contains 80 records on a particular topic . A search was

conducted on that topic and 60 records were retrieved. Of the 60

records retrieved, 45 were relevant. Find precision and recall.


Precision = number of relevant retrieved samples / total retrieved samples = 45/60 = 0.75

Recall = number of relevant retrieved samples / total number of relevant samples = 45/80 = 0.5625
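A small sketch of these two formulas in Python, checked against both examples above; the document sets and counts are taken from the text, while the function itself is only illustrative.

def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / total retrieved;
    recall = relevant retrieved / total relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Document example: Doc 1 and Doc 3 are 'excel'; the system returned Docs 1, 2, 4, 5.
print(precision_recall(retrieved={1, 2, 4, 5}, relevant={1, 3}))  # (0.25, 0.5)

# Database example: 60 records retrieved, 45 of them relevant, 80 relevant in total.
print(45 / 60, 45 / 80)  # 0.75 0.5625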
Lecture – 7-8

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Bayesian Decision Theory
Probability Theory:
A random experiment is one whose outcome is not predictable.

Sample Space (S): the set of all possible outcomes.

A sample space is discrete if it consists of a finite (or countably infinite) set of outcomes; otherwise it is continuous.

Example: Consider the experiment where two coins are simultaneously tossed. The elementary events are

S = {HH, HT, TH, TT}

How to calculate probability?

P(E) = (number of outcomes favorable to E) / (total number of possible outcomes)
Probability that at least one head occurs = P(A) = 3/4

Probability that two heads occur simultaneously = P(B) = 1/4

Probability that two tails occur simultaneously = 1/4

Note that P(S) = 1


Consider a box with n white and m red balls. In this case, there are two elementary outcomes: white ball or red ball.

Probability of “selecting a white ball” = n / (n + m)
If A1 is an event that cannot possibly occur, then P(A1) = 0. If A2 is sure to occur, P(A2) = 1.

0 ≤ P(A) ≤ 1

If A and B are mutually exclusive events, then the probability of their union is the sum of their probabilities:

If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

If A and B are not mutually exclusive, then

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

Conditional probability:

P(A | B) = P(AB) / P(B)

P(A | B) = probability of “the event A, given that B has occurred”

If two events A and B are independent:

P(A and B) = P(A)P(B)

If A and B are mutually exclusive:

P(A and B) = 0

P(A | B) = P(AB) / P(B)  ⟹  P(AB) = P(A | B) P(B)

P(B | A) = P(AB) / P(A)  ⟹  P(AB) = P(B | A) P(A)

Thus

P(A | B) P(B) = P(B | A) P(A)

P(A | B) = P(B | A) P(A) / P(B)    (Bayes Theorem)
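A quick numeric sketch of Bayes' theorem; the probabilities below are hypothetical numbers chosen only to show the arithmetic.

p_a = 0.3          # prior P(A), hypothetical
p_b_given_a = 0.8  # likelihood P(B | A), hypothetical
p_b = 0.5          # evidence P(B), hypothetical

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes theorem: P(A|B) = P(B|A) P(A) / P(B)
print(p_a_given_b)                      # 0.48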
What is Bayesian decision theory?
• Mathematical foundation for decision making.

• Uses a probabilistic approach to help make decisions (classification) so as to minimize the risk (cost).

• The decision problem is viewed in terms of probabilities and it is assumed that all of the relevant probabilities are known.

• It makes a compromise between the decisions and their costs and finds an optimal solution.
Fish sorting - revisited

How to design it?


Notations
ωi ∈ {ω1, ω2, …, ωc} : a state of nature
P(ωi) : prior probability
x : feature vector
p(x | ωi) : class-conditional density
P(ωi | x) : posterior probability


Decision rule
• In our example, the fish on the conveyor belt
may be salmon or sea bass and we are not
certain about it. Thus it can be described in
terms of probability as the outcome of the
experiment is not certain
• So, we assign a random variable ω to describe
the type of fish
State of nature

ω = ω1 for sea bass


ω = ω2 for salmon
Decision Rule based on Prior Information
• Prior probability: It is based on our knowledge without doing any
experimentation

• For example: if fishermen catch as much sea bass as salmon in a season, then the catch and their appearance on the conveyor belt are equiprobable:

P(ω1) = P(ω2) = 0.5 and P(ω1) + P(ω2) = 1

• In general, sea bass and salmon may appear with any non-zero probabilities:

P(ω1) > 0 and P(ω2) > 0, but P(ω1) + P(ω2) = 1
Decision Rule based on Prior Information
• An optimal decision rule based on only prior
probabilities (without seeing the fish) is
Decide 1 if P(1 )  P(2 );otherwise decide 2

• This rule makes sense if we wish to judge one fish


but it is odd if we are judging many fish
• If P(ω 1) >> P(ω 2) then our decision will be right
most of the time
• The probability of error is the minimum of both
probabilities
P (error )  min  P (1 ), P (2 ) 
Decision Rule based on Data
• Data can be images or features extracted (e.g.
lightness of fish) from images
• In our case, different fish yield different
lightness readings, and we express this
variability in probabilistic terms
• It is expressed as a conditional probability also
called class-conditional probability function i.e.

p( x | 1 )  The pdf of x given that state of nature is 1


p( x | 2 )  The pdf of x given that state of nature is 2
Decision Rule based on Data
• We can combine the new knowledge, i.e. the class-conditional probability function, with the prior knowledge to compute the posterior probability using Bayes’ rule:

P(ωi | x) = p(x | ωi) P(ωi) / p(x)

where p(x) = Σ_{i=1}^{2} p(x | ωi) P(ωi)
Decision Rule based on Data
• Bayes’ rule can also be expressed as

posterior = (likelihood × prior) / evidence

• The evidence can be considered as a scale factor, unimportant in making the decision:

P(ωi | x) = p(x | ωi) P(ωi) / p(x)
Decision Rule based on Data
• An optimal decision rule based on posterior
probabilities is
Decide 1 if P(1|x) > P(2|x); otherwise decide 2

p( x | 1 ) P(1 ) p( x | 2 ) P(2 )
P(1 | x)  , P(2 | x) 
p( x) p( x)
Decision rule becomes
p( x | 1 ) P(1 ) p( x | 2 ) P(2 )

p( x) p( x)
p( x | 1 ) P(1 )  p ( x | 2 ) P (2 )
Decision Rule based on Data
Decide 1 if p(x|1)P(1) > p(x|2)P(2); otherwise decide 2

Special cases:
1. P(1)=P(2)
Decide 1 if p(x|1) > p(x|2); otherwise decide 2
2. p(x|1)=p(x|2)
Decide 1 if P(1) > P(2); otherwise decide 2
Classification error
Decide 1 if P(1 | x)  P(2 | x);otherwise decide 2
 The probability of error is the minimum of both probabilities

 P (1 | x) if we decide 2
p (error | x)  
 P (2 | x) if we decide 1
p (error | x)  min  P(1 | x), P(2 | x) 
Confusion Matrix
• To evaluate the classifier, we make a table of the following type:

                                      Predicted class
                                      Sea bass    Salmon
State of nature       Sea bass          N1          N2
(ground truth)        Salmon            N3          N4

N1 + N2 = N_seabass
N3 + N4 = N_salmon

Classification rate  = correct classifications / total number of examples   = (N1 + N4) / (N_seabass + N_salmon)
Classification error = incorrect classifications / total number of examples = (N2 + N3) / (N_seabass + N_salmon)
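The two formulas above translate directly into code; the counts used in the example call are hypothetical.

def rates(n1, n2, n3, n4):
    """Classification rate and error from the 2x2 confusion matrix
    (rows = ground truth sea bass / salmon, columns = predicted class)."""
    total = n1 + n2 + n3 + n4
    rate = (n1 + n4) / total    # correct classifications
    error = (n2 + n3) / total   # incorrect classifications
    return rate, error

print(rates(n1=40, n2=10, n3=5, n4=45))  # hypothetical counts -> (0.85, 0.15)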
Lecture – 9-10

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Non Parametric Classification
K-Nearest Neighbor (KNN)
Unsupervised Learning
K-means Clustering
The assumption that the probability density function (pdf) that generated the
data has some parametric form may not hold true in many cases

Non-parametric classifiers assume that the pdf does not have any parametric
form

It is based on finding the similarity between the data samples


Non Parametric Classification
• Given a training dataset with classes i.e. data and their labels
are known

{ (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) },  x ∈ ℝᵈ

• Objective: Decide the class (or label) of a test vector xt


• Method:
• Compute the distance between the vector xt and all examples of the
training dataset
• Decision: assign the class of the nearest neighbor (closest point) from
training dataset to the vector xt
Non Parametric Classification

Figure: feature space containing vectors of the blue class and vectors of the red class, with a test vector xt.

Class(xt) = class of the nearest neighbor


Problem Statement
Can we LEARN to recognise a rugby player?

What are the “features” of a rugby player?


Features
Rugby players = short + heavy?   (Figure: examples plotted on height vs. weight axes)

Features
Ballet dancers = tall + skinny?   (Figure: examples plotted on height vs. weight axes)

Feature Space
Rugby players “cluster” separately in the space.   (Figure: height vs. weight scatter plot)
The K-Nearest Neighbour Algorithm
Who’s this?   (Figure: a new point on the height vs. weight plot)

1. Measure distance to all points
2. Find closest “k” points (here k=3, but it could be more)
3. Assign majority class
Distance Measure
“Euclidean distance” between points (w, h) and (w1, h1):

d = √( (w − w1)² + (h − h1)² )
The K-Nearest Neighbour Algorithm

for each testing point
    measure distance to every training point
    find the k closest points
    identify the most common class among those k
    predict that class
end

• Advantage: Surprisingly good classifier!


• Disadvantage: Have to store the entire training set in memory
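A runnable sketch of the pseudocode above, using Euclidean distance and a majority vote; the (weight, height) training examples are made-up values in the spirit of the rugby/ballet illustration.

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(train, test_point, k=3):
    """train is a list of (feature_vector, label) pairs. Return the majority
    label among the k training points closest to test_point."""
    neighbours = sorted(train, key=lambda ex: euclidean(ex[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical (weight kg, height cm) examples.
train = [((95, 180), "rugby"), ((100, 175), "rugby"), ((90, 185), "rugby"),
         ((55, 170), "ballet"), ((50, 165), "ballet"), ((60, 172), "ballet")]
print(knn_predict(train, (92, 178)))  # rugby
print(knn_predict(train, (58, 168)))  # ballet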
K-Nearest Neighbor Rule
• Given a training dataset with classes i.e. data and their labels are known
{ (x1, y1), (x2, y2), (x3, y3), …, (xn, yn) },  x ∈ ℝᵈ

• Objective: Decide the class (or label) of a test vector xt


• Method:
• Compute the distances between the vector xt and all examples of the training dataset
• Determine k nearest neighbors of xt
• Decision: class(xt) = majority vote of k nearest neighbors
• The choice of distance is very important
K-Nearest Neighbor Rule
• Example (figures): the same test point classified with 1-NN and with 3-NN, for two classes (Class 1 and Class 2).
K-Nearest Neighbor Rule

The test sample (green circle) should be classified either to the first class of blue
squares or to the second class of red triangles. If k = 3 it is assigned to the second
class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it
is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).

Image Source: Wikipedia


Decision Regions
• Voronoi Diagram
Given a set of points (referred to as sites or nodes) a Voronoi
diagram is a partition of space into regions, within which all
points are closer to some particular node than to any other
node
Voronoi Editing
• Each cell contains one sample, and every location within
the cell is closer to that sample than to any other sample.
• Every query point will be assigned the classification of
the sample within that cell.
Voronoi Editing
• Knowledge of this boundary is sufficient to classify new points.
• The boundary itself is rarely computed; many algorithms seek to retain
only those points necessary to generate an identical boundary.
Voronoi Editing
• The boundaries of the Voronoi regions separating
those regions whose nodes are of different class
contribute to a portion of the decision boundary

The decision boundary


Voronoi Editing
• The nodes of the Voronoi regions whose boundaries
did not contribute to the decision boundary are
redundant and can be safely deleted from the training
set

The edited training set.


Voronoi Editing
• A drawback of Voronoi editing is that it still leaves too many points
in the edited set by preserving many points which are well
separated from other classes and which should intuitively be
deleted

Points A1 and B1 are well separated. They are preserved by the Voronoi editing to maintain a
portion of the decision boundary. However, assuming that new points will come from the same
distribution as the training set, the portions of the decision boundary remote from the
concentration of training set points is of lesser importance.
Gabriel Graph
Two points A and B are said to be Gabriel neighbours if
their diametral sphere (i.e. the sphere such that AB is
its diameter) doesn't contain any other points

Points A and B are Gabriel neighbours, whereas B and C are not.

A graph where all pairs of Gabriel neighbours are


connected with an edge is called the Gabriel graph.
Unsupervised Learning
K mean clustering
CLUSTERING
Clustering is the partitioning of a data set into
subsets (clusters), so that the data in each
subset (ideally) share some common trait - often
according to some defined distance measure.

Clustering is unsupervised classification


CLUSTERING
• There is no explicit teacher and the system forms
clusters or “natural groupings” or structure in the input
pattern
CLUSTERING
• Data WITHOUT classes or labels
{ x1, x2, x3, …, xn },  x ∈ ℝᵈ

• Deals with finding a structure in a collection of unlabeled data.

• The process of organizing objects into groups whose members are similar
in some way

• A cluster is therefore a collection of objects which are “similar” between


them and are “dissimilar” to the objects belonging to other clusters.
CLUSTERING

• In this case we easily identify the 4 clusters into which the data can be divided.

• The similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance.
Types of Clustering
• Hierarchical algorithms
– These find successive clusters using previously established clusters.
1. Agglomerative ("bottom-up"): Agglomerative
algorithms begin with each element as a separate cluster
and merge them into successively larger clusters.

2. Divisive ("top-down"): Divisive algorithms begin with the whole set and proceed to divide it successively into smaller clusters.
Types of Clustering
• Partitional clustering
• Construct a partition of a data set to produce several clusters – At once
• The process is repeated iteratively – Termination condition
• Examples
• K-means clustering
• Fuzzy c-means clustering
K MEANS CLUSTERING
K MEANS – example 1   (sequence of figures illustrating the iterations)
K MEANS – Example 2
• Suppose we have 4 medicines and each has two
attributes (pH and weight index). Our goal is to group
these objects into K=2 groups of medicine

Medicine   Weight   pH-Index
A          1        1
B          2        1
C          4        3
D          5        4
http://home.deib.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
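A plain k-means sketch applied to the four medicines above. Initializing the centroids with the first k points is one simple choice among many, so this is an illustrative run rather than the only valid one.

import math

def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def kmeans(points, k=2, iters=10):
    """Alternate the assignment step and the centroid-update step."""
    centroids = list(points[:k])                 # simple initialization: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign each point to its nearest centroid
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [mean(c) if c else centroids[j]   # recompute the centroids
                     for j, c in enumerate(clusters)]
    return clusters, centroids

points = [(1, 1), (2, 1), (4, 3), (5, 4)]        # (weight, pH-index) of medicines A-D
clusters, centroids = kmeans(points, k=2)
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]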
Kmeans - Examples
• Data Points – RGB Values of pixels
• Can be used for Image Segmentation

D. Comaniciu and P. Meer, Robust Analysis of Feature Spaces: Color Image Segmentation, 1997.
Kmeans - Examples

Original K=5 K=11


Kmeans - Examples
• Quantization of colors
Kmeans - Examples
• Extraction of text in degraded documents

Original Image Kmeans with k=3


Kmeans - Examples
Machine Learning

Course Instructor: Mirza Adnan Baig

Email ID: adnanbaig4cs@gmail.com


Lecture – 11-12



Dimensionality Reduction
Why dimensionality Reduction?
• Generally, it is easy and convenient to collect data
• An experiment

• Data accumulates at an unparalleled speed


• Data preprocessing is an important part for effective
machine learning
• Dimensionality reduction is an effective approach to
downsizing data

Why dimensionality Reduction?
• Most machine learning techniques may not
be effective for high-dimensional data
• Curse of Dimensionality
• Accuracy and efficiency may degrade rapidly as
the dimension increases.

Why dimensionality Reduction?
• Visualization: projection of high-dimensional data onto 2D or 3D.

• Data compression: efficient storage and retrieval.

• Noise removal: positive effect on query accuracy.
Other examples

Figures: face images; handwritten digits
Dimensionality Reduction
• Reduces time complexity: Less computation

• Reduces space complexity: fewer parameters

• Saves the cost of observing/computing the feature
Dimensionality Reduction

Key methods of dimensionality reduction

Feature Selection Feature Extraction

Feature selection vs extraction
• Feature selection:

Choosing k < d important features, ignoring the remaining d − k. These are subset selection algorithms.

• Feature extraction:

Project the original xi, i = 1, ..., d dimensions to new k < d dimensions, zj, j = 1, ..., k.
Feature extraction
• Given a set of data points of d variables {x1, x2, …, xn}

• Compute their low-dimensional representation:

xi ∈ ℝᵈ  →  yi ∈ ℝᵖ   (p ≪ d)
Feature Selection

Contents: Feature Selection
• Introduction

• Feature subset search

• Models for Feature Selection


• Principle component analysis

Introduction
You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer).

The data has 10,000 fields (features).

You need to cut it down to 1,000 fields before you try machine learning. Which 1,000?

This process of choosing the 1,000 fields to use is an example of Feature Selection.
Feature Selection: why?
• Quite easy to find lots more cases from papers,
where experiments show that accuracy reduces
when you use more features

• Questions?
• Why does accuracy reduce with more features?
• How does it depend on the specific choice of features?
• What else changes if we use more features?
• So, how do we choose the right features?

Why does accuracy reduce?
Note: Suppose the best feature set has 20
features. If you add another 5 features,
typically the accuracy of machine learning
may reduce. But you still have the original
20 features!! Why does this happen???

Noise/explosion
• The additional features typically add noise

• Machine learning will pick up on spurious correlations that might be true in the training set, but not in the test set

• For some ML methods, more features means more parameters to learn (more NN weights, more decision tree nodes, etc…) – the increased space of possibilities is more difficult to search
Overfitting
A statistical model is said to be overfitted when it learns the training data in too much detail (just like fitting ourselves into oversized pants!). When a model gets trained this way, it starts learning from the noise and inaccurate data entries in our data set. The model then does not categorize new data correctly, because of too much detail and noise.
Principal component Analysis (PCA)
• PCA is one of the most common feature
selection /extraction techniques
• Reduce the dimensionality of a data set by
finding a new set of variables, smaller than
the original set of variables
• Allows us to combine much of the
information contained in n features into m
features where m < n

PCA – Introduction

• The 1st PC z1 is a minimum distance fit to a line in X space

• The 2nd PC z2 is a minimum distance fit to a line in the plane perpendicular to the 1st PC

PCs are a series of linear least squares fits to a sample, each orthogonal to all the previous.
Principal component Analysis (PCA)

PCA – Introduction
Transform n-dimensional data to a new n-
dimensions
The new dimension with the most variance is the
first principal component
The next is the second principal component, etc.

Recap

Variance and Covariance


• Variance is a measure of data spread in one
dimension (feature)
• Covariance measures how two dimensions
(features) vary with respect to each other

Recap

Covariance
• Focus on the sign (rather than exact value)
of covariance
• Positive value means that as one feature
increases or decreases the other does also
(positively correlated)
• Negative value means that as one feature
increases the other decreases and vice versa
(negatively correlated)
• A value close to zero means the features are uncorrelated

Recap

Covariance Matrix
• Covariance matrix is an n × n matrix containing
the covariance values for all pairs of features in a
data set with n features (dimensions)

• The diagonal contains the covariance of a feature with itself, which is the variance (the square of the standard deviation)

• The matrix is symmetric
PCA – Main Steps
1. Center the data around 0.

2. Form the covariance matrix S.

3. Compute its eigenvectors {a_i}, i = 1, …, d.

4. The first p eigenvectors {a_i}, i = 1, …, p, form the p PCs.

5. The transformation G consists of the p PCs:  G = [a1, a2, …, ap]

A test point x ∈ ℝᵈ  →  Gᵀ x ∈ ℝᵖ
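The steps above map almost one-to-one onto NumPy. This is a minimal sketch (eigh is used because the covariance matrix is symmetric), with hypothetical random data in the usage example.

import numpy as np

def pca(X, p):
    """Center the data, form the covariance matrix S, keep the p eigenvectors
    with the largest eigenvalues as the PCs, and project the data with G."""
    Xc = X - X.mean(axis=0)               # 1. center data around 0
    S = np.cov(Xc, rowvar=False)          # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # 3. eigen-decomposition of symmetric S
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
    G = eigvecs[:, order[:p]]             # 4./5. first p eigenvectors form G
    return G, Xc @ G                      # transformation G and projected data (G^T x per row)

rng = np.random.default_rng(0)            # hypothetical data: 100 samples, 5 features
X = rng.normal(size=(100, 5))
G, Z = pca(X, p=2)
print(G.shape, Z.shape)                   # (5, 2) (100, 2)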
PCA – Worked Example

Data:

X     Y
2.5   2.4
0.5   0.7
2.2   2.9
1.9   2.2
3.1   3.0
2.3   2.7
2.0   1.6
1.0   1.1
1.5   1.6
1.1   0.9
PCA – Worked Example: Step 1

• First step is to center the original data around 0
• Subtract the mean from each value (X̄ = 1.81, Ȳ = 1.91)

X     Y       X′      Y′
2.5   2.4     0.69    0.49
0.5   0.7    −1.31   −1.21
2.2   2.9     0.39    0.99
1.9   2.2     0.09    0.29
3.1   3.0     1.29    1.09
2.3   2.7     0.49    0.79
2.0   1.6     0.19   −0.31
1.0   1.1    −0.81   −0.81
1.5   1.6    −0.31   −0.31
1.1   0.9    −0.71   −1.01
PCA – Worked Example: Step 2

• Calculate the covariance matrix of the centered data – only 2 × 2 for this case

cov(X, Y) = Σᵢ (Xᵢ − X̄)(Yᵢ − Ȳ) / (n − 1)

cov = [ 0.616555556   0.615444444
        0.615444444   0.716555556 ]
PCA – Worked Example: Step 3

• Calculate the eigenvectors and eigenvalues of the covariance matrix (remember linear algebra)
• The covariance matrix is square n × n; n eigenvalues will exist
• All eigenvectors (principal components/dimensions) are orthogonal to each other and will make a new set of dimensions for the data
• The magnitude of each eigenvalue corresponds to the variance along that new dimension – just what we wanted!
• We can sort the principal components according to their eigenvalues
• Just keep those dimensions with the largest eigenvalues

eigenvalues = [ 0.0490833989
                1.28402771  ]

eigenvectors = [ −0.735178656   −0.677873399
                  0.677873399   −0.735178656 ]

(each column is the eigenvector for the eigenvalue in the same position; the column paired with the larger eigenvalue, 1.28402771, is the first principal component)
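These numbers can be checked with a small NumPy sketch, using the data above (last sample (1.1, 0.9), consistent with the mean and centered values). The eigenvector signs may come out flipped, which does not change the principal components.

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
S = np.cov(X, rowvar=False)               # covariance matrix of the worked example
eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues returned in ascending order
print(S)        # [[0.61655556 0.61544444]
                #  [0.61544444 0.71655556]]
print(eigvals)  # [0.0490834  1.28402771]
# eigvecs holds the corresponding eigenvectors in its columns (unique up to sign)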
