BIM488 Introduction to Pattern Recognition
Classification Algorithms - Part II
Outline
• Introduction
• Linear Discriminant Functions
• The Perceptron Algorithm
• Performance Assessment
Introduction
• Previously, our major concern was to design classifiers
based on probability density functions.
• Now, we will focus on the design of linear classifiers,
regardless of the underlying distributions describing the
training data.
• The major advantage of linear classifiers is their simplicity
and computational attractiveness.
• Here, our assumption is that all feature vectors from the
available classes can be classified correctly using a linear
classifier, and we will develop techniques for the
computation of the corresponding linear functions.
Introduction
Figure: The solid and empty dots can be correctly classified by any
number of linear classifiers. H1 (blue) classifies them correctly, as
does H2 (red). H2 could be considered "better" in the sense that
it is also farthest from both groups. H3 (green) fails to correctly
classify the dots.
Linear Discriminant Functions
• A classifier that uses discriminant functions assigns a
feature vector x to class ωi if
gi(x) > gj(x) for all j≠i
where gi(x), i = 1, . . . , c, are the discriminant functions for c
classes.
• A discriminant function that is a linear combination of the
components of x is called a linear discriminant function and
can be written as
g(x) = wTx + w0 = w1x1 + w2x2 + ... + wdxd + w0
where w is the weight vector and w0 is the bias (or
threshold weight).
Linear Discriminant Functions
• For the two-category case, the decision rule can be written
as
Decide : ω1 if g(x) > 0
ω2 otherwise
• The equation g(x) = 0 defines the decision boundary that
separates points assigned to ω1 from points assigned to ω2.
• When g(x) is linear, the decision surface is a hyperplane
whose orientation is determined by the normal vector w and
whose location is determined by the bias w0.
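The two-category rule above can be sketched in a few lines; the weight values here are arbitrary, chosen only for illustration:

```python
import numpy as np

# Hypothetical weights for a 2-D linear discriminant g(x) = w^T x + w0.
w = np.array([2.0, -1.0])   # weight vector (normal to the hyperplane)
w0 = -1.0                   # bias (threshold weight)

def g(x):
    """Linear discriminant function g(x) = w^T x + w0."""
    return w @ x + w0

def classify(x):
    """Decide omega_1 if g(x) > 0, omega_2 otherwise."""
    return "omega_1" if g(x) > 0 else "omega_2"

print(classify(np.array([2.0, 1.0])))   # g = 4 - 1 - 1 = 2 > 0  -> omega_1
print(classify(np.array([0.0, 2.0])))   # g = 0 - 2 - 1 = -3     -> omega_2
```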
Linear Discriminant Functions
Multicategory Case:
• There is more than one way to devise multicategory
classifiers with linear discriminant functions.
• One against all: we can pose the problem as c two-class
problems, where the i-th problem is solved by a linear
discriminant that separates points assigned to ωi from those
not assigned to ωi.
• One against one: Alternatively, we can use c(c-1)/2 linear
discriminants, one for every pair of classes.
• Also, we can use c linear discriminants, one for each class,
and assign x to ωi if gi(x) > gj(x) for all j≠i.
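The third option (one linear discriminant per class, assign x to the class with the largest value) can be sketched as follows; the weight values are arbitrary illustrations, not taken from the lecture:

```python
import numpy as np

# Hypothetical weights for c = 3 classes in 2-D:
# row i holds the weight vector w_i, and w0[i] holds the bias w_i0.
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
w0 = np.array([0.0, 0.5, -0.5])

def linear_machine(x):
    """Assign x to omega_i with the largest g_i(x) = w_i^T x + w_i0."""
    return int(np.argmax(W @ x + w0))

print(linear_machine(np.array([2.0, 0.0])))   # discriminants [2.0, -1.5, -0.5] -> class 0
```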
Linear Discriminant Functions
Figure: Linear decision boundaries for a 4-class problem devised as
(a) four two-class problems, (b) six pairwise problems. The pink regions
have ambiguous category assignments.
Linear Discriminant Functions
• To avoid the problem of ambiguous regions:
– Define c linear discriminant functions
– Assign x to ωi if gi(x) > gj(x) for all j≠i.
• The resulting classifier is called a linear machine.
Linear Discriminant Functions
Figure: Linear decision boundaries produced by using one linear
discriminant for each class.
Linear Discriminant Functions
• The boundary between two regions Ri and Rj is a portion of
the hyperplane given by:
gi(x) = gj(x), or
(wi − wj)Tx + (wi0 − wj0) = 0
• The decision regions for a linear machine are convex.
The Perceptron Algorithm
• The perceptron algorithm is appropriate for the 2-class
problem and for classes that are linearly separable.
• The perceptron algorithm computes the values of the
weights w of a linear classifier, which separates the two
classes.
• The algorithm is iterative. It starts with an initial estimate in
the extended (d +1)-dimensional space and converges to a
solution in a finite number of iteration steps.
• The solution w correctly classifies all the training points
assuming linearly separable classes.
• Note that the perceptron algorithm converges to one of
infinitely many possible solutions.
• Starting from different initial conditions, different
hyperplanes result.
The Perceptron Algorithm
• The update at the t-th iteration step has the simple form
w(t + 1) = w(t) − ρt Σx∈Y δx x
where
• Y is the set of samples wrongly classified by the current estimate w(t),
• δx is −1 if x ∈ ω1, and +1 if x ∈ ω2,
• ρt is a user-defined parameter that controls the convergence speed and
must obey certain requirements to guarantee convergence (for
example, ρt can be chosen to be constant, ρt = ρ).
• The algorithm converges when Y becomes empty.
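A minimal sketch of this batch update, assuming a constant ρt = 1 and a made-up toy data set; with the δx convention above, a sample is in Y exactly when δx · wTx ≥ 0:

```python
import numpy as np

def perceptron_train(X, deltas, rho=1.0, max_epochs=1000):
    """Batch perceptron in the extended (d+1)-dimensional space.

    deltas: -1 for samples of omega_1, +1 for samples of omega_2
            (the delta_x convention of the update rule).
    """
    Xe = np.hstack([X, np.ones((len(X), 1))])  # append 1 to absorb the bias w0
    w = np.zeros(Xe.shape[1])                  # initial estimate w(0)
    for _ in range(max_epochs):
        # Y: samples wrongly classified by the current estimate w(t)
        wrong = deltas * (Xe @ w) >= 0
        if not wrong.any():                    # Y is empty: converged
            break
        # w(t+1) = w(t) - rho_t * sum over x in Y of delta_x * x
        w = w - rho * (deltas[wrong, None] * Xe[wrong]).sum(axis=0)
    return w

# Toy linearly separable data: two points per class
X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 3.0], [4.0, 2.0]])
deltas = np.array([-1.0, -1.0, 1.0, 1.0])
w = perceptron_train(X, deltas)
# After convergence, w^T x > 0 for omega_1 samples and < 0 for omega_2 samples.
```

Starting from a different initial w (or a different ρ) would yield a different separating hyperplane, as noted above.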
The Perceptron Algorithm
• Move the hyperplane so that training samples are on its
positive side.
The Perceptron Algorithm
• Once the classifier has been computed, a point, x, is
classified to either of the two classes depending on the
outcome of the following operation:
f(wTx + w0) = f(w1x1 + w2x2 + ··· + wdxd + w0)
• The function f(·) in its simplest form is the step or sign
function (f(z) = 1 if z > 0; f(z) = −1 if z < 0).
• However, it may have other forms; for example, the output
may be either 1 or 0 for z > 0 and z < 0, respectively.
• In general, it is known as the activation function.
The Perceptron Algorithm
• The basic network model, known as perceptron or neuron,
that implements the classification operation is shown
below:
The Perceptron Algorithm
Some important points related to perceptron:
• For a fixed learning parameter, the number of iterations (in
general) increases as the classes move closer to each
other (i.e., as the problem becomes more difficult).
• The algorithm fails to converge for a data set that is not
linearly separable. Then, what should we do?
• Different initial estimates for w may lead to different final
estimates for it (although all of them are optimal in the
sense that they separate the training data of the two
classes).
Performance Assessment
• We can use accuracy or error rate to assess performance
of classifiers.
• Accuracy is the fraction of samples classified correctly.
• Error rate is the fraction of samples classified incorrectly.
• Accuracy = 1 - Error rate.
• Example:
10 images belonging to the same class
Number of correctly classified images = 8
Number of incorrectly classified images = 2
Accuracy = 8 / 10 = 0.8 = 80%
Error Rate = 2 / 10 = 0.2 = 20%
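The same computation in code; the label names are made up to mirror the example above:

```python
def accuracy(true_labels, predicted_labels):
    """Ratio of correct classifications."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# 10 images of the same class: 8 classified correctly, 2 incorrectly
true_labels = ["car"] * 10                      # "car"/"bus" are illustrative labels
predicted_labels = ["car"] * 8 + ["bus"] * 2
acc = accuracy(true_labels, predicted_labels)
error_rate = 1 - acc
print(acc)  # 0.8
```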
Performance Assessment
• Performance is evaluated on a testing set.
• Therefore, the entire dataset should be divided into
– training set
– testing set
• The classification model is obtained using the training set.
• Classification performance is then assessed using the
testing set.
Performance Assessment
• For objective evaluation, the k-fold cross-validation technique
is used: every sample is used for testing exactly once, so the
result does not depend on a single train/test split.
• Example: k = 3
Run 1: Fold 1 = Training, Fold 2 = Training, Fold 3 = Testing → Accuracy1
Run 2: Fold 1 = Training, Fold 2 = Testing, Fold 3 = Training → Accuracy2
Run 3: Fold 1 = Testing, Fold 2 = Training, Fold 3 = Training → Accuracy3
Overall accuracy = (Accuracy1 + Accuracy2 + Accuracy3) / 3
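A sketch of the k-fold procedure; `train_and_score` is a placeholder for any routine that fits a classifier on the training split and returns its accuracy on the testing split:

```python
import numpy as np

def cross_validate(X, y, k, train_and_score):
    """k-fold cross validation: each fold serves as the testing set once."""
    folds = np.array_split(np.arange(len(X)), k)   # k roughly equal folds
    accuracies = []
    for i, test_idx in enumerate(folds):
        # All folds except fold i form the training set
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        accuracies.append(train_and_score(X[train_idx], y[train_idx],
                                          X[test_idx], y[test_idx]))
    return sum(accuracies) / k                     # overall accuracy
```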
Performance Assessment
• We can also use a confusion matrix during assessment: its
entry (i, j) counts the samples of true class i that are
predicted as class j.
• The example below shows predicted and true class labels
for a 10-class recognition problem.
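A confusion matrix can be built directly from the two label lists; the small 3-class labels below are illustrative, not the 10-class example from the slide:

```python
import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    """cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(true_labels, predicted_labels):
        cm[t, p] += 1
    return cm

true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(true_labels, predicted_labels, 3)
# Correct classifications lie on the diagonal, so
# overall accuracy = trace / total number of samples.
print(cm.trace() / cm.sum())  # 4 correct out of 6
```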
Summary
• Introduction
• Linear Discriminant Functions
• The Perceptron Algorithm
• Performance Assessment
References
• S. Theodoridis, A. Pikrakis, K. Koutroumbas, D. Cavouras, Introduction
to Pattern Recognition: A MATLAB Approach, Academic Press, 2010.
• S. Theodoridis and K. Koutroumbas, Pattern Recognition (4th Edition),
Academic Press, 2009.
• R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification (2nd Edition),
Wiley, 2001.