07 Pattern Recognition


Pattern Recognition
Perceptron ("single-layer perceptron")

• simple NN for linearly separable patterns only
• error-correcting learning:

$y(n) = \mathrm{sgn}\left(\mathbf{w}^{T}(n)\,\mathbf{x}(n)\right)$

$\mathbf{w}(n+1) = \mathbf{w}(n) + \left[d(n) - y(n)\right]\mathbf{x}(n)$

$d(n) = \begin{cases} +1 & \text{if } \mathbf{x}(n) \in C_{1} \\ -1 & \text{if } \mathbf{x}(n) \in C_{2} \end{cases}$
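A minimal NumPy sketch of this error-correcting rule (the toy data, the number of epochs, and the learning-rate factor eta are illustrative assumptions; the slide's update corresponds to eta = 1):

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=50):
    """Error-correcting rule: w(n+1) = w(n) + eta*[d(n) - y(n)]*x(n)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a constant 1 so the threshold is learned as a bias weight
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x_n, d_n in zip(Xb, d):
            y_n = 1.0 if w @ x_n >= 0 else -1.0     # y(n) = sgn(w^T(n) x(n)), with sgn(0) taken as +1
            w += eta * (d_n - y_n) * x_n            # no change when the response is already correct
    return w

# Toy linearly separable problem: class C1 labelled +1, class C2 labelled -1
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([1.0, 1.0, -1.0, -1.0])
w = train_perceptron(X, d)
Xb = np.hstack([X, np.ones((len(X), 1))])
print(np.sign(Xb @ w))                               # expected: [ 1.  1. -1. -1.]
```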
Linear and nonlinear separability

• 2 regions separated by a hyperplane:

$\sum_{i=1}^{p} w_{i} x_{i} - \theta = 0$

(Figure: linear separability – 2D inputs, 2 classes; the separating boundary is a hyperplane between class C1 and class C2.)



Linear and nonlinear separability

(Figure, two panels with classes C1 and C2: left – linearly separable patterns, where the decision surfaces are hyperplanes; right – nonlinearly separable patterns.)



Linear and nonlinear separability

• Minsky and Papert, 1969

• The perceptron is not able to separate even the XOR (exclusive or) logical function (see the sketch below)!

  A  B  A XOR B
  0  0     0
  0  1     1
  1  0     1
  1  1     0

• NN were dead…

• 1986: PDP group, MLP, backpropagation algorithm
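A minimal NumPy sketch of this point (the hard-limiting units and the hand-picked hidden-layer weights are illustrative assumptions, not taken from the slides): no single threshold unit reproduces the XOR column of the table above, but two layers of threshold units do — which is exactly the kind of mapping an MLP trained with backpropagation can learn.

```python
import numpy as np

step = lambda v: (v > 0).astype(int)            # hard-limiting (threshold) activation

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# A single threshold unit: no weights (w1, w2, theta) reproduce XOR = [0, 1, 1, 0].
# Two layers do: hidden unit h1 computes OR, h2 computes AND, output = h1 AND NOT h2.
h1 = step(X @ np.array([1, 1]) - 0.5)           # OR  -> [0, 1, 1, 1]
h2 = step(X @ np.array([1, 1]) - 1.5)           # AND -> [0, 0, 0, 1]
y = step(h1 - h2 - 0.5)                          # XOR -> [0, 1, 1, 0]
print(y)
```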



• Good illustration of an ANN for data
classification

• http://playground.tensorflow.org



Typical application areas

• Machine vision
• Character recognition (OCR)
• Computer-aided diagnosis
• Speech recognition
• Face recognition
• Biometrics
• Image database retrieval
• Data mining
• Bioinformatics

The task:

• Assign unknown objects – patterns – into the correct class. This is known as classification.



o Features: These are measurable quantities obtained from
the patterns, and the classification task is based on their
respective values.

o Feature vectors: A number of features

$x_{1}, \ldots, x_{p}$

constitute the feature vector

$\mathbf{x} = \left[x_{1}, \ldots, x_{p}\right]^{T} \in \mathbb{R}^{p}$

Feature vectors are treated as random vectors.



An example: medical image classification

• a plot of the mean value of the object intensities versus their standard deviation
(Figure: training patterns of class A and class B in the feature plane; mean and standard deviation are the features.)

• To which class does a new image (*) belong? With greater probability to A
• classifier – with its decision boundary it divides the feature space into regions of classes A and B
• here it is a linear decision boundary (a straight line) obtained from the training patterns
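A minimal sketch of this example (the synthetic "images" and scikit-learn's LogisticRegression are illustrative assumptions; the slide only fixes mean and standard deviation as the features and a linear decision boundary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # fits a linear decision boundary

rng = np.random.default_rng(0)

def features(img):
    """Mean and standard deviation of pixel intensities: a 2-D feature vector."""
    return np.array([img.mean(), img.std()])

# Synthetic training images: class A is darker and smoother than class B.
imgs_A = [rng.normal(loc=60, scale=5,  size=(32, 32)) for _ in range(50)]
imgs_B = [rng.normal(loc=90, scale=15, size=(32, 32)) for _ in range(50)]

X = np.array([features(im) for im in imgs_A + imgs_B])
y = np.array([0] * 50 + [1] * 50)             # 0 = class A, 1 = class B

clf = LogisticRegression().fit(X, y)           # learns a straight-line boundary in the (mean, std) plane

new_img = rng.normal(loc=65, scale=6, size=(32, 32))
print(clf.predict_proba(features(new_img)[None, :]))   # probabilities of A vs. B for the new image (*)
```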



• The classifier consists of a set of functions, whose values,
computed at x , determine the class to which the corresponding
pattern belongs

• Classification system overview (see the pipeline sketch below):

  patterns → sensor → feature generation → feature selection → classifier design → system evaluation
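A hedged sketch of such a pipeline with scikit-learn (the Iris data standing in for the sensor output, the ANOVA-based feature selection, and the k-NN classifier are illustrative assumptions; the slide only fixes the order of the stages):

```python
from sklearn.datasets import load_iris                        # stands in for the "sensor" output
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler              # feature generation / preprocessing
from sklearn.feature_selection import SelectKBest, f_classif  # feature selection
from sklearn.neighbors import KNeighborsClassifier            # classifier design
from sklearn.model_selection import cross_val_score           # system evaluation

X, y = load_iris(return_X_y=True)

pipe = Pipeline([
    ("scale",  StandardScaler()),
    ("select", SelectKBest(f_classif, k=2)),
    ("clf",    KNeighborsClassifier(n_neighbors=5)),
])

# "System evaluation" stage: estimate the generalization accuracy by cross-validation.
print(cross_val_score(pipe, X, y, cv=5).mean())
```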
Pattern, pattern recognition

Watanabe defines a pattern as opposite of a chaos; it is an entity, vaguely defined, that could be given a name.

• For example, a pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal.

Given a pattern, its recognition/classification may consist of one of the following two tasks:

• 1) supervised classification (e.g., discriminant analysis), in which the input pattern is identified as a member of a predefined class,
• 2) unsupervised classification (e.g., clustering), in which the pattern is assigned to a hitherto unknown class.



Design of a pattern recognition system

The design of a pattern recognition system essentially involves the following three aspects:

1) data acquisition and preprocessing,
2) data representation,
3) decision making.

Learning from a set of examples (training set) is an important and desired attribute of most pattern recognition systems.



Approaches for pattern recognition

• The four best known approaches for pattern recognition are:
1) template matching,
2) statistical classification,
3) syntactic or structural matching,
4) neural networks.

These models are not necessarily independent, and sometimes the same pattern recognition method exists with different interpretations. Attempts have been made to design hybrid systems.



1) Template Matching

Matching is a generic operation in pattern recognition which is used to determine


the similarity between two entities (points, curves, or shapes) of the same type.

In template matching, a template (typically, a 2D shape) or a prototype of the


pattern to be recognized is available.
• The pattern to be recognized is matched against the stored template while taking into account all
allowable pose (translation and rotation) and scale changes.
• The similarity measure, often a correlation, may be optimized based on the available training set.
Often, the template itself is learned from the training set.

Disadvantages

• performance degrades if the patterns are distorted due to the imaging process, viewpoint change, or large intraclass variations among the patterns
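A minimal NumPy sketch of template matching by normalized cross-correlation (the random image, the template size, and plain NumPy instead of, e.g., OpenCV are illustrative assumptions). Note that this sketch searches over translation only; the rotation and scale changes mentioned above would require searching over those transformations as well.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation between two equally sized patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def match_template(image, template):
    """Slide the template over the image and return the best-matching top-left corner."""
    H, W = image.shape
    h, w = template.shape
    best_score, best_pos = -1.0, (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            score = ncc(image[r:r + h, c:c + w], template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

rng = np.random.default_rng(1)
image = rng.random((64, 64))
template = image[20:30, 40:50].copy()         # template cut from a known location
print(match_template(image, template))         # expected: ((20, 40), ~1.0)
```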



2) Statistical Classification

Each pattern is represented in terms of p features or measurements and is viewed as


a point in a p-dimensional space.

The goal is to choose those features that allow pattern vectors belonging to different
categories to occupy compact and disjoint regions in a p-dimensional feature space.

Given a set of training patterns from each class, the objective is to establish decision
boundaries in the feature space which separate patterns belonging to different
classes.
• In the statistical decision theoretic approach, the decision boundaries are determined by the probability
distributions of the patterns belonging to each class, which must either be specified or learned
• One can also take a discriminant analysis-based approach to classification: First a parametric form of the
decision boundary (e.g., linear or quadratic) is specified; then the best decision boundary of the specified
form is found based on the classification of training patterns.
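A hedged sketch of the discriminant-analysis route just described (the synthetic Gaussian classes and scikit-learn's LinearDiscriminantAnalysis are illustrative assumptions; the slide only specifies choosing a parametric form of the boundary and fitting it to the training patterns):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)

# Two classes as points in a p-dimensional feature space (here p = 2).
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(100, 2))
X = np.vstack([X1, X2])
y = np.array([0] * 100 + [1] * 100)

# Parametric form of the boundary: linear. Fitting finds the best such boundary.
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)               # parameters of the separating hyperplane
print(lda.score(X, y))                          # classification accuracy on the training patterns
```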



3) Syntactic (Structural) Matching

• hierarchical perspective where a pattern is viewed as being composed of simple subpatterns which
are themselves built from yet simpler subpatterns
• The simplest/elementary subpatterns to be recognized are called primitives and the given complex
pattern is represented in terms of the interrelationships between these primitives.
• In syntactic pattern recognition, a formal analogy is drawn between the structure of patterns and
the syntax of a language.
– The patterns are viewed as sentences belonging to a language, primitives are viewed as the
alphabet of the language, and the sentences are generated according to a grammar. Thus, a
large collection of complex patterns can be described by a small number of primitives and
grammatical rules. The grammar for each pattern class must be inferred from the available
training samples.
– Suitable, if the patterns have a definite structure which can be captured in terms of a set of
rules,
• such as EKG waveforms, textured images, and shape analysis of contours
– Difficulties arise primarily with the segmentation of noisy patterns (to detect the primitives) and with the inference of the grammar from the training data.
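A toy sketch of the language analogy above (the primitive alphabet u/r/d/l and the regular "grammar" for a rectangle contour are invented for illustration; real syntactic recognizers use richer grammars inferred from training samples):

```python
import re

# Primitives: unit moves along a contour – u (up), r (right), d (down), l (left).
# Toy regular "grammar" for an axis-aligned rectangle traced clockwise from its top-left
# corner: a run of r's, then d's, then l's, then u's. (It checks only the ordering of the
# primitive runs, not that the contour actually closes.)
rectangle = re.compile(r"^r+d+l+u+$")

print(bool(rectangle.match("rrrdddllluuu")))   # True  – generated by the grammar (a 3x3 square contour)
print(bool(rectangle.match("rrdrdlluu")))       # False – not a sentence of this grammar
```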



4) Neural Networks

• massively parallel computing systems


• huge number of simple processors (neurons)
• huge number of connections
• learning, generalization, adaptability, fault tolerance, distributed representation
• artificial neuron
• the ability to learn the complex relationships between inputs and outputs
• nonlinearity

• simple learning algorithms


• nonlinear algorithms for
– Feature extraction
– classification
• well suited to hardware implementation
• boom of deep networks
Pattern Recognition Models



STATISTICAL PATTERN RECOGNITION



Statistical pattern recognition
• input pattern: p-dimensional input vector
• pattern represented by m features (m-dimensional feature
vector)
• determination of decision boundaries uses known methods
of statistical decision theory

• Two modes of operation of the pattern recognition system:


– training (learning)
– classification (testing)



Model of statistical pattern recognition

• preprocessing
• dimensionality reduction
– "the curse of dimensionality" – the number of training points needed is an exponential function of the feature dimension
– "peaking phenomenon" – for a fixed training set, adding features can degrade the performance of the classifier (the number of training points is small relative to the number of features)
• feature extraction
• feature selection
• classification
Pattern recognition systems

• One-stage system, direct classification:

  input (x1, x2, …, xp) → classifier → decision

• Two-stage system:

  input (x1, x2, …, xp) → feature extraction → (y1, …, ym) → classifier → decision



Pattern recognition systems

• One-stage system, direct classification:

  pattern (dim = p) → preprocessing (dim = p) → classification → class

  (many methods from the 1st part of the semester ☺)

• Two-stage system (feature extraction + classification):

  pattern (dim = p) → preprocessing (dim = p) → feature extraction (dim = p) → feature selection (dim = m, m < p) → classification → class

  (many methods from the 1st part of the semester ☺)
FEATURE EXTRACTION



Feature extraction

• feature – the most important property of the data
• a transformation maps the data space (dim: p) into a feature space:
  – dim: m = p → decorrelation
  – dim: m < p → dimensionality reduction



❖ Principal Components Analysis (the Karhunen–Loève transform):

➢ The goal: Given an original set of m measurements $x \in \mathbb{R}^{m}$, compute $y \in \mathbb{R}^{m}$,

$y = A^{T} x$

for an orthogonal A, so that the elements of y are optimally mutually uncorrelated. That is,

$E\left[y(i)\,y(j)\right] = 0, \quad i \neq j$

➢ Sketch of the proof:

$R_{y} = E\left[y\,y^{T}\right] = E\left[A^{T} x\, x^{T} A\right] = A^{T} R_{x} A$

• If A is chosen so that its columns $a_{i}$ are the orthogonal eigenvectors of $R_{x}$, then

$R_{y} = A^{T} R_{x} A = \Lambda$

where $\Lambda$ is diagonal with elements the respective eigenvalues $\lambda_{i}$.

• Observe that this is a sufficient condition but not necessary. It imposes a specific orthogonal structure on A.

➢ Properties of the solution

• Mean Square Error approximation. Due to the orthogonality of A:

$x = \sum_{i=0}^{m-1} y(i)\, a_{i}, \qquad y(i) = a_{i}^{T} x$



− Define

$\hat{x} = \sum_{i=0}^{\ell-1} y(i)\, a_{i}$

− The Karhunen–Loève transform minimizes the square error:

$E\left[\left\| x - \hat{x} \right\|^{2}\right] = E\left[\left\| \sum_{i=\ell}^{m-1} y(i)\, a_{i} \right\|^{2}\right]$

− The error is:

$E\left[\left\| x - \hat{x} \right\|^{2}\right] = \sum_{i=\ell}^{m-1} \lambda_{i}$

It can also be shown that this is the minimum mean square error compared to any other representation of x by an ℓ-dimensional vector.



− In other words, x̂ is the projection of x into the
subspace spanned by the principal ℓ eigenvectors.
However, for Pattern Recognition this is not always the best solution.



Example-computation of PCA (KLT)
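A minimal NumPy sketch of the PCA/KLT computation described above (the zero-mean synthetic data, m = 3, and ℓ = 2 are illustrative assumptions): estimate R_x, take its eigenvectors as the columns of A, project, and check that the mean-square reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)

# N zero-mean samples of an m-dimensional random vector x (m = 3 here), made correlated on purpose.
N, m, l = 10000, 3, 2
X = rng.normal(size=(N, m)) @ np.array([[2.0, 0.5, 0.0],
                                        [0.0, 1.0, 0.3],
                                        [0.0, 0.0, 0.2]])
X -= X.mean(axis=0)                    # center, so the sample R_x below is the correlation of zero-mean data

Rx = X.T @ X / N                       # sample estimate of R_x
lam, A = np.linalg.eigh(Rx)            # columns of A: orthogonal eigenvectors; lam: eigenvalues (ascending)
lam, A = lam[::-1], A[:, ::-1]         # reorder so the principal (largest) eigenvalues come first

Y = X @ A                              # y = A^T x for every sample
print(np.round(Y.T @ Y / N, 3))        # R_y = A^T R_x A = Lambda: (approximately) diagonal, i.e. uncorrelated components

X_hat = Y[:, :l] @ A[:, :l].T          # x_hat: reconstruction from the l principal components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[l:].sum())              # the mean-square error equals the sum of the discarded eigenvalues
```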



PCA (KLT)

• Lindsay I Smith: A tutorial on Principal Components Analysis, 2002, http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf


CLASSIFICATION



Statistical pattern recognition – decision making
process
• classifier design – classification according to what information we have about the class-conditional densities

– if all densities are known (but this is rare)


• Bayes' optimal rule

– if the densities are known, but some of their parameters are


not known
• parametric decision problem - unknown parameters are estimated

– if the densities are not known


• nonparametric problem
• either the probability densities are estimated (e.g., Parzen window)
• or direct construction of the decision boundary from the training data
• e.g. k-nearest neighbor rule
• MLP is also a nonparametric method for constructing a decision boundary



Various approaches to statistical pattern recognition



Frequently used classifiers



Illustration of decision boundaries

• Data:



Illustration of decision boundaries





K-NN RULE

k-Nearest Neighbour Rule



❖ The Nearest Neighbor Rule

➢ Choose k out of the N training vectors, identify the k nearest ones to x

➢ Out of these k, identify the number $k_{i}$ that belong to class $\omega_{i}$

➢ Assign $x \rightarrow \omega_{i}: \ k_{i} > k_{j} \ \forall\, i \neq j$

➢ The simplest version: k = 1 !!!

➢ For large N this is not bad. It can be shown that, if $P_{B}$ is the optimal Bayesian error probability, then:

$P_{B} \leq P_{NN} \leq P_{B}\left(2 - \frac{M}{M-1} P_{B}\right) \leq 2 P_{B}$
$P_{B}$ – probability of error of the optimal Bayes classifier; M – the number of classes
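A minimal NumPy sketch of the k-NN rule just described (the Euclidean distance, the toy training set, and k = 3 are illustrative assumptions):

```python
import numpy as np

def knn_classify(x, X_train, y_train, k=3):
    """Assign x to the class that owns the majority of its k nearest training vectors."""
    dists = np.linalg.norm(X_train - x, axis=1)        # distances to all N training vectors
    nearest = np.argsort(dists)[:k]                     # indices of the k nearest ones
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                    # class omega_i with the largest k_i

# Toy 2-class training set
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y_train = np.array([1, 1, 1, 2, 2, 2])

print(knn_classify(np.array([0.15, 0.2]), X_train, y_train, k=3))   # -> 1
print(knn_classify(np.array([0.95, 1.0]), X_train, y_train, k=1))   # k = 1, the simplest version -> 2
```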

➢ $P_{B} \leq P_{kNN} \leq P_{B} + \sqrt{\dfrac{2 P_{NN}}{k}}$

➢ $k \rightarrow \infty \;\Rightarrow\; P_{kNN} \rightarrow P_{B}$

➢ For small $P_{B}$:

$P_{NN} \approx 2 P_{B}$

$P_{3NN} \approx P_{B} + 3 \left(P_{B}\right)^{2}$



❖ Voronoi tessellation

$R_{i} = \left\{ x : d(x, x_{i}) < d(x, x_{j}), \ \forall\, i \neq j \right\}$
Voronoi tessellation

◼ 2D case ◼ 3D case



kNN
◼ nice presentation:
http://courses.cs.tamu.edu/rgutier/cs790_w02/l8.pdf



Other nice solutions:

◼ ensemble learning
◼ SVM (support vector machine)
◼ neural networks



Other nice solutions: ensemble learning





Other nice solutions: SVM
◼ SVM
❑ https://www.youtube.com/watch?v=9NrALgHFwTo
❑ https://www.youtube.com/watch?v=ndNE8he7Nnk



Other nice solutions: NN
◼ Once again: good illustration of an ANN for data
classification ☺
❑ http://playground.tensorflow.org

