ML Mid Syllabus
Main perception:
Build a machine that recognizes patterns.
What is Pattern Recognition?
Theory, algorithms, and systems to put patterns into categories.
Relate a perceived pattern to previously perceived patterns.
What is Machine Learning?
[Figure: training pipeline – training images + training labels → image features → training → trained classifier]
Machine Learning Pipeline
[Figure: training phase – training data + training labels → data features → training → trained classifier;
testing phase – test data → data features → trained classifier → prediction]
How does the machine decide?
[Figure: test images of handwritten digits, e.g. the digit 8]
OCR & Handwriting recognition
Speech Recognition
Features
• Features are the individual measurable properties of
the signal being observed.
[Figure: example feature space with features x1 and x2, e.g. weight]
Feature Extraction
• Feature extraction aims to create discriminative features
good for learning
• Good Features
• Objects from the same class have similar feature
values.
• Objects from different classes have different values.
• Supervised learning
– Classification
• Unsupervised learning
• Reinforcement learning
CLASSIFICATION
Supervised learning - Classification
• Objective
• Make Nicolas recognize what is an apple and
what is an orange
Classification
[Figure: example images of apples and oranges]
Classification
• You had some training examples, or 'training data'
What is this?
It's an apple!
Classification
[Figure: classification examples – fruits (apple, pear, tomato) and animals (cow, dog, horse)]
[Figure: scatter plot of tumor size vs. age, with malignant and benign cases]
Why supervised? The algorithm is given a number of patients with the RIGHT ANSWER (the label), and we want it to learn to predict the label for new patients.
Classification
• Cancer diagnosis – generally more than one variable is used
[Figure: scatter plot of tumor size vs. age with malignant and benign cases; predict the class of a new patient]
• Supervised learning
– Classification
• Unsupervised learning
– Clustering
• Reinforcement learning
UNSUPERVISED LEARNING – CLUSTERING
Groups – Clusters
Classification vs Clustering
• Challenges
– Intra-class variability
– Inter-class similarity
Intra-Class Variability
Objects belong to the same class but look different.
• Supervised learning
– Classification
• Unsupervised learning
• Reinforcement learning
Reinforcement Learning
• In RL, the computer is simply given a goal to
achieve.
• The computer then learns how to achieve that
goal by trial-and-error interactions with its
environment
It is a classification
problem. How to
solve it?
Approach
• Data collection
How to use it?
• This is the set of all suggested features to explore for use in our
classifier!
• Challenges:
• Variations in images – lighting, occlusion, camera view angle
• Position of the fish on the conveyor belt, etc.
How data is collected and used
• Data can be raw signals (e.g. images) or features extracted from images – data is usually
expensive
• The data is divided into three parts (the exact percentage of each portion depends, in part, on the sample size)
• Training data: it is used to fit the parameters of the learner
• Validation data: it is used to estimate the prediction error (classification error) and adjust the learner's parameters
• Test data: it is used to estimate the classification error of the chosen learner on unseen data, called the generalization error. The test set must be kept inside a 'vault' and brought out only at the end of the data analysis
Feature extraction
• Feature extraction: use domain knowledge
• Length of fish and average lightness may not be sufficient features i.e. they
may not guarantee 100% classification results
Classification
Select the length of the fish as a possible feature
for discrimination between two classes
[Figure: distribution of fish length for the two classes, separated by a decision boundary]
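To make the idea concrete, here is a minimal sketch (not from the lecture) of such a single-feature classifier: the length of a fish is compared against a decision boundary and a class is assigned. The threshold value and class names are assumptions for illustration.

```python
# Minimal single-feature classifier: compare fish length to a decision boundary.
# The threshold (40 cm) and the two class names are illustrative assumptions.

def classify_by_length(length_cm, threshold=40.0):
    """Return 'salmon' if the fish is shorter than the threshold, else 'sea bass'."""
    return "salmon" if length_cm < threshold else "sea bass"

if __name__ == "__main__":
    for length in (25.0, 38.5, 55.0):
        print(length, "->", classify_by_length(length))
```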
1) Precision
2) Recall
Precision = number of relevant retrieved samples / total number of retrieved samples
Recall = number of relevant retrieved samples / total number of relevant samples
Example: 60 samples retrieved, 80 relevant samples in total, 45 of the retrieved samples are relevant:
Precision = 45/60, Recall = 45/80
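A small sketch of how the two measures can be computed from the counts; the function name is an illustrative assumption, and the numbers reproduce the 45/60 and 45/80 example above.

```python
# Precision and recall from retrieval counts (illustrative sketch).

def precision_recall(relevant_retrieved, total_retrieved, total_relevant):
    precision = relevant_retrieved / total_retrieved   # fraction of retrieved samples that are relevant
    recall = relevant_retrieved / total_relevant       # fraction of relevant samples that were retrieved
    return precision, recall

# Example from the slide: 60 samples retrieved, 80 relevant overall, 45 of them retrieved.
p, r = precision_recall(45, 60, 80)
print(f"precision = {p:.3f}, recall = {r:.3f}")   # precision = 0.750, recall = 0.562
```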
Lecture – 7-8
S = {HH, HT, TH, TT}
Probability of "selecting a white ball" = n / (n + m)
If A1 is an event that cannot possibly occur, then P(A1) = 0. If A2 is sure to occur, P(A2) = 1.
0 ≤ P(A) ≤ 1
If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
If A and B are not mutually exclusive, then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Conditional probability:
P(A | B) = P(AB) / P(B)
(if A and B are mutually exclusive, P(AB) = 0 and hence P(A | B) = 0)
P(A | B) = P(AB) / P(B)  ⟹  P(AB) = P(A | B) P(B)
P(B | A) = P(AB) / P(A)  ⟹  P(AB) = P(B | A) P(A)
Thus
P(A | B) P(B) = P(B | A) P(A)
P(A | B) = P(B | A) P(A) / P(B)      (Bayes' Theorem)
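As a worked illustration of Bayes' theorem (all probability values below are invented for the example, not taken from the lecture):

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with P(B) expanded by total probability.
# The prior and likelihood values below are illustrative assumptions.

p_A = 0.01            # prior P(A), e.g. probability of a disease
p_B_given_A = 0.95    # likelihood P(B|A), e.g. test positive given disease
p_B_given_notA = 0.05 # false-positive rate P(B|not A)

# Evidence P(B) by the law of total probability.
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

p_A_given_B = p_B_given_A * p_A / p_B
print(f"P(A|B) = {p_A_given_B:.4f}")   # about 0.1610
```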
What is Bayesian decision theory?
• A mathematical foundation for decision making.
• In general, sea bass and salmon may appear with any non-zero probabilities:
P(ω1) > 0 and P(ω2) > 0, but P(ω1) + P(ω2) = 1
Decision Rule based on Prior Information
• An optimal decision rule based on only prior
probabilities (without seeing the fish) is
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
Bayes' rule gives the posterior probability:
P(ωi | x) = p(x | ωi) P(ωi) / p(x),   where p(x) = Σ_{i=1}^{2} p(x | ωi) P(ωi)
Decision Rule based on Data
• Bayes' rule can also be expressed as
posterior = (likelihood × prior) / evidence
P(ωi | x) = p(x | ωi) P(ωi) / p(x)
(the evidence p(x) is unimportant in making the decision)
Decision Rule based on Data
• An optimal decision rule based on posterior
probabilities is
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Since
P(ω1 | x) = p(x | ω1) P(ω1) / p(x),   P(ω2 | x) = p(x | ω2) P(ω2) / p(x)
the decision rule becomes
p(x | ω1) P(ω1) / p(x) > p(x | ω2) P(ω2) / p(x)
i.e.  p(x | ω1) P(ω1) > p(x | ω2) P(ω2)
Decision Rule based on Data
Decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2
Special cases:
1. P(ω1) = P(ω2):
Decide ω1 if p(x | ω1) > p(x | ω2); otherwise decide ω2
2. p(x | ω1) = p(x | ω2):
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
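A minimal sketch of the rule "decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2)", assuming Gaussian class-conditional densities for a single feature x; the means, standard deviations and priors are illustrative assumptions, not values from the lecture.

```python
import math

# Bayes decision rule with assumed Gaussian class-conditional densities p(x|w_i).
# Means, standard deviations and priors below are illustrative assumptions.

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def decide(x, prior1=0.6, prior2=0.4):
    g1 = gaussian_pdf(x, mean=40.0, std=5.0) * prior1   # p(x|w1) P(w1)
    g2 = gaussian_pdf(x, mean=55.0, std=8.0) * prior2   # p(x|w2) P(w2)
    return "omega_1" if g1 > g2 else "omega_2"

for x in (38.0, 47.0, 60.0):
    print(x, "->", decide(x))
```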
Classification error
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
The probability of error is the minimum of the two posterior probabilities:
p(error | x) = P(ω1 | x) if we decide ω2
p(error | x) = P(ω2 | x) if we decide ω1
p(error | x) = min[ P(ω1 | x), P(ω2 | x) ]
Confusion Matrix
• To evaluate the classifier, we make a table of the following type:

                                     Predicted class
                                     Sea bass    Salmon
State of nature       Sea bass       N1          N2
(ground truth)        Salmon         N3          N4

N1 + N2 = N_seabass
N3 + N4 = N_salmon

Classification rate = correct classifications / total number of examples = (N1 + N4) / (N_seabass + N_salmon)
Classification error = incorrect classifications / total number of examples = (N2 + N3) / (N_seabass + N_salmon)
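A short sketch computing the classification rate and error directly from the N1–N4 entries of such a confusion matrix; the example counts are assumptions.

```python
# Classification rate and error from a 2x2 confusion matrix.
# Rows: ground truth (sea bass, salmon); columns: predicted (sea bass, salmon).
# The example counts N1..N4 are illustrative assumptions.

N1, N2 = 85, 15   # sea bass classified as sea bass / as salmon
N3, N4 = 10, 90   # salmon classified as sea bass / as salmon

total = N1 + N2 + N3 + N4                  # N_seabass + N_salmon
classification_rate = (N1 + N4) / total    # correct classifications / total examples
classification_error = (N2 + N3) / total   # incorrect classifications / total examples

print(f"rate = {classification_rate:.3f}, error = {classification_error:.3f}")
```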
Lecture – 9-10
Non-parametric classifiers assume that the pdf does not have any parametric
form
{(x1, y1), (x2, y2), (x3, y3), …, (xn, yn)},   x ∈ R^d
Features – Feature Space
[Figure: scatter plot of height (e.g. 130 cm) vs. weight (e.g. 60 kg, 90 kg); rugby players "cluster" separately in the space]
The K-Nearest Neighbour Algorithm
[Figure: height vs. weight scatter plot with an unlabelled query point – who's this?]
1. Measure the distance to all points
2. Find the closest "k" points (here k = 3, but it could be more)
3. Assign the majority class
Distance Measure
"Euclidean distance"
d = √((w − w1)² + (h − h1)²)
[Figure: distance d between points (w, h) and (w1, h1) in the weight–height plane]
The K-Nearest Neighbour Algorithm
[Figure: decision regions produced by 1-NN vs. 3-NN]
K-Nearest Neighbor Rule
The test sample (green circle) should be classified either to the first class of blue
squares or to the second class of red triangles. If k = 3 it is assigned to the second
class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it
is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
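A compact sketch of the k-NN procedure on (weight, height) features, following the three steps listed above; the training points, class labels and query point are invented for illustration.

```python
from collections import Counter
import math

# k-nearest-neighbour classification on (weight, height) features.
# Training points, labels and the query point are illustrative assumptions.

train = [((60, 170), "ballet dancer"), ((63, 168), "ballet dancer"),
         ((95, 185), "rugby player"), ((102, 190), "rugby player"),
         ((98, 180), "rugby player")]

def knn_classify(query, train, k=3):
    # 1. Measure the Euclidean distance to all points.
    dists = [(math.dist(query, x), label) for x, label in train]
    # 2. Find the closest k points.
    dists.sort(key=lambda t: t[0])
    nearest = [label for _, label in dists[:k]]
    # 3. Assign the majority class.
    return Counter(nearest).most_common(1)[0][0]

print(knn_classify((90, 182), train, k=3))   # -> rugby player
```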
Points A1 and B1 are well separated. They are preserved by the Voronoi editing to maintain a portion of the decision boundary. However, assuming that new points will come from the same distribution as the training set, the portions of the decision boundary remote from the concentration of training set points are of lesser importance.
Gabriel Graph
Two points A and B are said to be Gabriel neighbours if their diametral sphere (i.e. the sphere that has AB as its diameter) does not contain any other point.
• The process of organizing objects into groups whose members are similar
in some way
In this case we easily identify the 4 clusters into which the data
can be divided
K MEANS – Example 1
K MEANS – Example 2
• Suppose we have 4 medicines and each has two
attributes (pH and weight index). Our goal is to group
these objects into K=2 groups of medicine
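A minimal sketch of the k-means procedure for this kind of example; the (pH, weight index) values of the four medicines are illustrative assumptions. The loop alternates the assignment step and the centroid-update step until the centroids stop changing.

```python
import math

# k-means with K=2 on four medicines described by (pH, weight index).
# The attribute values below are illustrative assumptions.

points = [(1.0, 1.0), (2.0, 1.0), (4.0, 3.0), (5.0, 4.0)]
centroids = [points[0], points[1]]          # simple initialisation: first two points

for _ in range(100):
    # Assignment step: each point goes to its nearest centroid.
    clusters = [[], []]
    for p in points:
        j = min(range(2), key=lambda c: math.dist(p, centroids[c]))
        clusters[j].append(p)
    # Update step: recompute each centroid as the mean of its cluster.
    new_centroids = [
        tuple(sum(coords) / len(cl) for coords in zip(*cl)) if cl else centroids[j]
        for j, cl in enumerate(clusters)
    ]
    if new_centroids == centroids:
        break
    centroids = new_centroids

print("centroids:", centroids)
print("clusters:", clusters)
```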
D. Comaniciu and P. Meer, Robust Analysis of Feature Spaces: Color Image Segmentation, 1997.
Kmeans - Examples
Why dimensionality Reduction?
• Most machine learning techniques may not
be effective for high-dimensional data
• Curse of Dimensionality
• Accuracy and efficiency may degrade rapidly as
the dimension increases.
Why dimensionality Reduction?
Visualization: projection of high-dimensional
data onto 2D or 3D.
Other examples
Dimensionality Reduction
Feature selection vs extraction
• Feature selection:
• Feature extraction:
Feature extraction
Given a set of n data points of p variables each: x1, x2, …, xn
Compute their low-dimensional representation:
xi ∈ R^p  →  yi ∈ R^d   (d ≪ p)
Feature Selection
Contents: Feature Selection
• Introduction
Introduction
You have some data, and you want to use it to build a classifier, so that you can predict something (e.g. likelihood of cancer).
Feature Selection: why?
• Quite easy to find lots more cases from papers,
where experiments show that accuracy reduces
when you use more features
• Questions?
• Why does accuracy reduce with more features?
• How does it depend on the specific choice of features?
• What else changes if we use more features?
• So, how do we choose the right features?
Why does accuracy reduce?
Note: Suppose the best feature set has 20 features. If you add another 5 features, the accuracy of the learned classifier typically drops. But you still have the original 20 features! Why does this happen?
Noise/explosion
• The additional features typically add noise
Overfitting
A statistical model is said to be overfitted when it learns the training data too closely (just like fitting ourselves into oversized pants!): it starts learning the noise and inaccurate data entries in the data set rather than the underlying pattern. The model then fails to categorize new data correctly, because it has captured too much detail and noise.
Principal component Analysis (PCA)
• PCA is one of the most common feature selection/extraction techniques
• Reduce the dimensionality of a data set by
finding a new set of variables, smaller than
the original set of variables
• Allows us to combine much of the
information contained in n features into m
features where m < n
PCA – Introduction
[Figure: data projected onto the first principal component axis z1]
PCA – Introduction
Transforms the n-dimensional data into a new set of n dimensions (the principal components).
The new dimension with the most variance is the first principal component.
The next is the second principal component, etc.
Recap
Covariance
• Focus on the sign (rather than exact value)
of covariance
• Positive value means that as one feature
increases or decreases the other does also
(positively correlated)
• Negative value means that as one feature
increases the other decreases and vice versa
(negatively correlated)
• A value close to zero means the features are uncorrelated (no linear relationship)
Recap
Covariance Matrix
• Covariance matrix is an n × n matrix containing
the covariance values for all pairs of features in a
data set with n features (dimensions)
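A quick sketch of building such a covariance matrix with NumPy for a small two-feature data set; the data values are invented.

```python
import numpy as np

# Covariance matrix of a small data set with 2 features (columns).
# The data values are illustrative assumptions.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# rowvar=False -> each column is a feature, each row an observation.
cov = np.cov(X, rowvar=False)
print(cov)          # 2 x 2 matrix of pairwise covariances
```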
PCA – Main Steps
Center data around 0
A test point x ∈ R^p is mapped to Gᵀ x ∈ R^d
Data
Step 1
Step 3
eigenvalues:   0.490833989,  1.28402771
eigenvectors:
  0.735178656   0.677873399
  0.677873399   0.735178656
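A minimal NumPy sketch of the PCA steps above (centre the data, compute the covariance matrix, take its eigenvectors, and project onto the top d components); the data values are invented, so the numbers will not reproduce the eigenvalues shown on the slide.

```python
import numpy as np

# PCA sketch: centre the data, eigen-decompose the covariance matrix,
# and project onto the top d principal components.
# The data below are illustrative assumptions.

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# Step 1: centre the data around 0.
X_centred = X - X.mean(axis=0)

# Step 2: covariance matrix of the centred data.
cov = np.cov(X_centred, rowvar=False)

# Step 3: eigenvalues and eigenvectors (eigh, since cov is symmetric).
eigvals, eigvecs = np.linalg.eigh(cov)

# Keep the d eigenvectors with the largest eigenvalues (here d = 1).
d = 1
order = np.argsort(eigvals)[::-1]
G = eigvecs[:, order[:d]]          # p x d projection matrix

# Project each (centred) point x as G^T x.
Y = X_centred @ G
print("eigenvalues (descending):", eigvals[order])
print("projected data shape:", Y.shape)
```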