Aprenentatge Automàtic 1

GCED

Lluís A. Belanche
belanche@cs.upc.edu

Soft Computing Research Group


Dept. de Ciències de la Computació (Computer Science)
Universitat Politècnica de Catalunya

2018-2019

LECTURE 5: Theory for classification and linear models (I)


Bayesian decision theory
Introduction: Bayes’ formula

Thomas Bayes: an 18th-century English clergyman. His work on the celebrated formula

P(A|B) = P(B|A) P(A) / P(B)

was found upon his death.

Discrete random variables. Let A be a discrete r.v. with pmf P_A. We use the shorthand notation P(a) to mean P_A(A = a). Similarly, we write P(b|a) to mean P_{B|A}(B = b | A = a), etc., where

P(b|a) = P(b, a) / P(a),   provided P(a) > 0

(prior, joint and conditional probabilities)
Bayesian decision theory
Introduction: Bayes’ formula

Let {a_1, ..., a_n}, {b_1, ..., b_m} be the possible values that A, B can take. Then, for any a ∈ {a_1, ..., a_n}:

P(a) = Σ_{i=1}^m P(a, b_i) = Σ_{i=1}^m P(a|b_i) P(b_i)

Since P(a, b) = P(b, a), it follows that, for any a_k, b_j:

P(b_j|a_k) = P(a_k|b_j) P(b_j) / Σ_{i=1}^m P(a_k|b_i) P(b_i),   with Σ_{j=1}^m P(b_j|a_k) = 1

(posterior probabilities)
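As a small illustration, here is a sketch (with made-up numbers) of how the formula turns priors P(b_j) and conditional probabilities P(a_k|b_i) into posteriors P(b_j|a_k):

```python
import numpy as np

prior = np.array([0.3, 0.7])            # hypothetical priors P(b_1), P(b_2)
likelihood = np.array([[0.9, 0.2],      # likelihood[k, j] = P(a_k | b_j)
                       [0.1, 0.8]])

def posterior(k):
    """P(b_j | a_k) for every j: joint P(a_k|b_j) P(b_j) normalised by P(a_k)."""
    joint = likelihood[k] * prior
    return joint / joint.sum()          # denominator is sum_i P(a_k|b_i) P(b_i)

print(posterior(0))                     # an array that sums to 1 over j
```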
Bayesian decision theory
Introduction: Bayes’ formula

Continuous random variables. Let X, Y be two continuous r.v. with pdfs p_X, p_Y and joint density p_XY. We use the shorthand notation p(x) to mean p_X(X = x), etc.

p(x) = ∫_R p(x, y) dy;   p(y) = ∫_R p(x, y) dx

Therefore:

p(y|x) = p(x|y) p(y) / ∫_R p(x|y) p(y) dy,   with ∫_R p(y|x) dy = 1
Bayesian decision theory
Introduction: Bayes’ formula

Mixed random variables. Suppose X is a continuous r.v. and Y is a discrete r.v. with values in {y_1, ..., y_m}.

In this case, each p(·|y_i) is a conditional density and P(·|x) is a conditional pmf. Moreover,

P(y_j|x) = p(x|y_j) P(y_j) / Σ_{i=1}^m p(x|y_i) P(y_i),   with Σ_{j=1}^m P(y_j|x) = 1
Bayesian decision theory
Decision rules
We are interested in determining the class or category of objects of nature according to Ω, a discrete r.v. with values {ω_1, ω_2} that represent the two possible classes.

The prior probabilities are P(ω_1), P(ω_2). How should we classify objects?

rule 1:

if P(ω_1) > P(ω_2) then the class of the object is ω_1, else it is ω_2

This rule classifies all objects into the same class; therefore it makes errors!

P_e(rule 1) = min{P(ω_1), P(ω_2)}

It is useful only if P(ω_1) ≫ P(ω_2) or P(ω_1) ≪ P(ω_2).
Bayesian decision theory
Decision rules
Suppose now that X is a discrete r.v. taking values in {x_1, ..., x_d} that measures a feature of the objects.

Now P(ω_i|x) = P(x|ω_i) P(ω_i) / P(x) is the posterior probability that an object with measured feature x belongs to class ω_i, i = 1, 2.

Moreover, P(x) = P(x|ω_1) P(ω_1) + P(x|ω_2) P(ω_2).

Upon observing x, Bayes' formula converts the prior class probabilities P(ω_i) into posterior probabilities P(ω_i|x). How should we classify objects now?

rule 2:

if P(ω_1|x) > P(ω_2|x) then the class of the object is ω_1, else it is ω_2

P_e(rule 2) = Σ_{i=1}^d min{P(ω_1|x_i), P(ω_2|x_i)} P(x_i)

(this rule is known as the Bayes rule or the Bayes classifier)
Bayesian decision theory
Decision rules
Lemma. For all a, b, c, d ∈ R, min(a, b) + min(c, d) ≤ min(a + c, b + d).

Proposition 1. P_e(rule 2) ≤ P_e(rule 1)

Proof.

Σ_{i=1}^d min{P(ω_1|x_i), P(ω_2|x_i)} P(x_i)
  = Σ_{i=1}^d min{P(x_i) P(ω_1|x_i), P(x_i) P(ω_2|x_i)}
  = Σ_{i=1}^d min{P(x_i|ω_1) P(ω_1), P(x_i|ω_2) P(ω_2)}               (Bayes' formula)
  ≤ min{ Σ_{i=1}^d P(x_i|ω_1) P(ω_1), Σ_{i=1}^d P(x_i|ω_2) P(ω_2) }   (iterated lemma)
  = min{ P(ω_1) Σ_{i=1}^d P(x_i|ω_1), P(ω_2) Σ_{i=1}^d P(x_i|ω_2) }
  = min{P(ω_1), P(ω_2)}

The two probabilities of error are equal only if P(x_i|ω_1) = P(x_i|ω_2) for all i.
Bayesian decision theory
Example 1. We have a conveyor belt carrying two classes of pills, suitable for two different diseases (ω_1 and ω_2). These pills come in two colors {yellow, white}, with probabilities:

P(ω_1) = 1/3,  P(ω_2) = 2/3

P(yellow|ω_1) = 1/5,  P(white|ω_1) = 4/5;   P(yellow|ω_2) = 2/3,  P(white|ω_2) = 1/3

P(yellow) = P(ω_1) P(yellow|ω_1) + P(ω_2) P(yellow|ω_2) = (1/3)·(1/5) + (2/3)·(2/3) = 23/45

P(white) = P(ω_1) P(white|ω_1) + P(ω_2) P(white|ω_2) = (1/3)·(4/5) + (2/3)·(1/3) = 22/45

P(ω_1|yellow) = P(yellow|ω_1) P(ω_1) / P(yellow) = (1/5)·(1/3) / (23/45) = 3/23;   P(ω_2|yellow) = 1 − P(ω_1|yellow) = 20/23

P(ω_1|white) = P(white|ω_1) P(ω_1) / P(white) = (4/5)·(1/3) / (22/45) = 6/11;   P(ω_2|white) = 1 − P(ω_1|white) = 5/11

⇒ P_e = (23/45)·(3/23) + (22/45)·(5/11) = 13/45 < 1/3 = min{1/3, 2/3}
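A small numerical check of Example 1 (a sketch; it re-derives the posteriors with exact fractions and evaluates the error of rule 2):

```python
from fractions import Fraction as F

prior = {'w1': F(1, 3), 'w2': F(2, 3)}
lik = {('yellow', 'w1'): F(1, 5), ('white', 'w1'): F(4, 5),
       ('yellow', 'w2'): F(2, 3), ('white', 'w2'): F(1, 3)}

p_err = F(0)
for color in ('yellow', 'white'):
    evidence = sum(lik[(color, w)] * prior[w] for w in prior)        # P(color)
    post = {w: lik[(color, w)] * prior[w] / evidence for w in prior} # P(w | color)
    p_err += min(post.values()) * evidence                           # error contributed by this color

print(p_err)   # 13/45, strictly below the prior-only error min{1/3, 2/3} = 1/3
```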
Bayesian decision theory
Continuous variables
The next step is to consider a r.v. X with pdf p(x) that measures a continuous feature of the objects. Let P be the support of p, i.e. P = {x ∈ R | p(x) > 0}.

In this setting, p(x|ω_1) and p(x|ω_2) are the class-conditional densities of x.

Proposition 2. P_e(rule 2) ≤ P_e(rule 1)

Proof.

∫_P min{P(ω_1|x), P(ω_2|x)} p(x) dx
  = ∫_P min{p(x) P(ω_1|x), p(x) P(ω_2|x)} dx
  = ∫_P min{p(x|ω_1) P(ω_1), p(x|ω_2) P(ω_2)} dx                     (Bayes' formula)
  ≤ min{ ∫_P p(x|ω_1) P(ω_1) dx, ∫_P p(x|ω_2) P(ω_2) dx }            (standard result for integrals)
  = min{ P(ω_1) ∫_P p(x|ω_1) dx, P(ω_2) ∫_P p(x|ω_2) dx }
  = min{P(ω_1), P(ω_2)}

The two probabilities of error are equal only if p(·|ω_1) = p(·|ω_2).
Bayesian decision theory
Example 2. We have a conveyor belt carrying two classes of pills, suitable for two different diseases (ω_1 and ω_2). This time the pill colors are shades measured on a continuous scale x ∈ [0, 2], with probabilities:

P(ω_1) = 1/3,  P(ω_2) = 2/3,  p(x|ω_1) = (2 − x)/2,  p(x|ω_2) = x/2

p(x) = P(ω_1) p(x|ω_1) + P(ω_2) p(x|ω_2) = (1/3)·(2 − x)/2 + (2/3)·(x/2) = (2 + x)/6

P(ω_1|x) = p(x|ω_1) P(ω_1) / p(x) = [(2 − x)/2]·(1/3) / [(2 + x)/6] = (2 − x)/(2 + x);   P(ω_2|x) = 1 − P(ω_1|x) = 2x/(2 + x)

⇒ P_e = P(ω_2) ∫_0^{2/3} p(x|ω_2) dx + P(ω_1) ∫_{2/3}^2 p(x|ω_1) dx = 2/9 < 1/3 = min{1/3, 2/3}

(the decision boundary x = 2/3 is the solution of (2 − x)/(2 + x) = 2x/(2 + x))
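A quick numerical check of Example 2 (a sketch; it integrates min_i {p(x|ω_i) P(ω_i)} over [0, 2] with a simple Riemann sum):

```python
import numpy as np

x = np.linspace(0.0, 2.0, 200_001)
dx = x[1] - x[0]
p1, p2 = 1 / 3, 2 / 3                    # priors P(w1), P(w2)
px_w1 = (2 - x) / 2                      # p(x | w1)
px_w2 = x / 2                            # p(x | w2)

pe = np.sum(np.minimum(p1 * px_w1, p2 * px_w2)) * dx
print(pe)                                # about 2/9 = 0.2222...

# the two posteriors cross where (2-x)/(2+x) = 2x/(2+x), i.e. at the boundary x = 2/3
```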
Bayesian decision theory
The Bayes classifier
The Bayes classifier can be extended in two ways:

1. Consider a vector X = (X_1, ..., X_d)ᵀ of continuous r.v. with pdf p(x) = p(x_1, ..., x_d) that measures d continuous features

2. Consider a finite number of classes: Ω is a discrete r.v. with values ω_1, ..., ω_K that represent the possible classes (K ≥ 2)

Therefore we now have the probabilities p(x|ω_i), P(ω_i|x), 1 ≤ i ≤ K.

The new Bayes rule says:

the class ω̂(x) of object x is ω_k, where k = arg max_{i=1,...,K} P(ω_i|x)

The sets R_k = {x : ω̂(x) = ω_k} are called decision regions (and depend on the specific classifier).
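A minimal sketch of the K-class Bayes rule, assuming the priors and class-conditional densities are known and passed in as plain Python callables (the names and the reuse of Example 2 are illustrative):

```python
import numpy as np

def bayes_classify(x, priors, cond_densities):
    """Return the index k maximizing P(w_k) p(x|w_k), i.e. the maximum-posterior class."""
    scores = np.array([priors[k] * cond_densities[k](x) for k in range(len(priors))])
    return int(np.argmax(scores))        # dividing by p(x) would not change the argmax

priors = np.array([1 / 3, 2 / 3])                    # Example 2 priors
densities = [lambda x: (2 - x) / 2, lambda x: x / 2]
print(bayes_classify(0.5, priors, densities))        # 0, i.e. w1 (0.5 is below the boundary 2/3)
```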
Bayesian decision theory
Illustration of the optimal classifier (two-class case)

Let us assume a classifier with decision regions R_1, R_2:

P_e = P(x ∈ R_2, ω_1) + P(x ∈ R_1, ω_2)
    = P(x ∈ R_2|ω_1) P(ω_1) + P(x ∈ R_1|ω_2) P(ω_2)
    = ∫_{R_2} p(x|ω_1) P(ω_1) dx + ∫_{R_1} p(x|ω_2) P(ω_2) dx
    ≥ ∫_P min{ p(x|ω_1) P(ω_1), p(x|ω_2) P(ω_2) } dx
    = P_e(Bayes)

so no classifier can achieve a smaller error than the Bayes classifier.
Bayesian decision theory
Graphical illustration

[Three figure slides illustrating the two-class decision regions and the Bayes error; figures not reproduced.]
Bayesian decision theory
The Bayes classifier

The Bayes classifier can also have a rejection class (illustrated here for two classes); fix ε ∈ (0, 1):

if P(ω_1|x) − P(ω_2|x) > ε then the class of the object is ω_1

else if P(ω_2|x) − P(ω_1|x) > ε then the class of the object is ω_2

else do not classify

For every feature vector x we take one of three possible actions.
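A sketch of the rejection rule as code (hypothetical posterior values; eps plays the role of ε above):

```python
def classify_with_reject(post1, post2, eps=0.2):
    """Two-class Bayes rule with a reject option; post1, post2 are P(w1|x), P(w2|x)."""
    if post1 - post2 > eps:
        return "w1"
    if post2 - post1 > eps:
        return "w2"
    return "reject"                        # posteriors too close to call

print(classify_with_reject(0.55, 0.45))    # 'reject' (margin 0.1 is below eps)
print(classify_with_reject(0.90, 0.10))    # 'w1'
```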
Bayesian decision theory
The Bayes classifier

Consider a finite set of actions A = {a_1, ..., a_m}. For each a_i ∈ A, denote by l(a_i|ω_j) the loss incurred for choosing a_i when x is known to be in ω_j.
(this is a simplified setting in which the loss does not depend on x)

Example 3. Let m = K + 1 and let a_i stand for "classify x into class ω_i" for 1 ≤ i ≤ K; let a_{K+1} stand for "do not classify x". A possible set of losses is:

l(a_i|ω_j) = 1   for 1 ≤ i, j ≤ K, i ≠ j
l(a_i|ω_i) = 0   for 1 ≤ i ≤ K
l(a_{K+1}|ω_j) = 1/2   for 1 ≤ j ≤ K

... which suggests that a decision not to classify is less costly than a misclassification.
Bayesian decision theory
The notion of risk

For a given feature vector x, define the conditional risk of an action as:

r(a_i|x) := Σ_{j=1}^K l(a_i|ω_j) P(ω_j|x)

A decision rule is any function a : P ⊆ R^d → A that assigns an action a(x) to every x s.t. p(x) > 0. Define the total risk of a decision rule as:

R(a) := ∫_P r(a(x)|x) p(x) dx
Bayesian decision theory
The notion of risk

We are interested in the decision rule that minimizes the total risk. Consider the rule

â(x) = arg min_{1 ≤ j ≤ m} r(a_j|x)

(you may recognize it as the Bayes rule!)

Given that this rule minimizes the integrand of the total risk for every possible x, it follows that the Bayes rule has the lowest possible risk.

The value R(â) is called the Bayes risk.
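A sketch of the minimum-conditional-risk rule for a generic loss matrix (the matrix and posteriors below are hypothetical; a reject action with cost 0.25 is included to show when rejection wins):

```python
import numpy as np

# loss[i, j] = l(a_i | w_j); rows are actions, columns are true classes
loss = np.array([[0.00, 1.00],    # a1: classify as w1
                 [1.00, 0.00],    # a2: classify as w2
                 [0.25, 0.25]])   # a3: do not classify (cheaper than an error)

def min_risk_action(posteriors, loss):
    """Pick the action minimizing r(a_i|x) = sum_j l(a_i|w_j) P(w_j|x)."""
    risks = loss @ posteriors     # conditional risk of every action
    return int(np.argmin(risks)), risks

print(min_risk_action(np.array([0.90, 0.10]), loss))  # a1 wins: risks [0.10, 0.90, 0.25]
print(min_risk_action(np.array([0.60, 0.40]), loss))  # reject wins: risks [0.40, 0.60, 0.25]
```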
Bayesian decision theory
The notion of risk
Example 4. Yet again the conveyor belt carrying two classes of pills whose color is a shade between yellow and white, i.e. a scalar feature x ∈ [0, 2], with probabilities:

P(ω_1) = 2/3,  P(ω_2) = 1/3,  p(x|ω_1) = (2 − x)/2,  p(x|ω_2) = 1/2

and three possible actions:

a_1 − classify as ω_1
a_2 − classify as ω_2
a_3 − do not classify

Let the loss matrix l_ij ≡ l(a_i|ω_j) be:

        ω_1    ω_2
a_1     0      1
a_2     1      0
a_3     1/4    1/4

1. Compute the optimal decision rule and its associated risk

2. Give the probability that an object is not classified
Bayesian decision theory
The notion of risk
We have

p(x) = (2/3)·(2 − x)/2 + (1/3)·(1/2) = (5 − 2x)/6

P(ω_1|x) = (2/3)·[(2 − x)/2] / [(5 − 2x)/6] = (4 − 2x)/(5 − 2x);   P(ω_2|x) = 1 − P(ω_1|x) = 1/(5 − 2x)

The conditional risks are:

r_1(x) ≡ r(a_1|x) = 0·P(ω_1|x) + 1·P(ω_2|x) = 1/(5 − 2x)
r_2(x) ≡ r(a_2|x) = 1·P(ω_1|x) + 0·P(ω_2|x) = (4 − 2x)/(5 − 2x)
r_3(x) ≡ r(a_3|x) = (1/4)·P(ω_1|x) + (1/4)·P(ω_2|x) = 1/4

Now the Bayes rule chooses for each x the action with minimum conditional risk.
Bayesian decision theory
The notion of risk

[Figure: the conditional risks r_1(x), r_2(x), r_3(x) plotted over x ∈ [0, 2]; the curves cross at x = 1/2 and x = 11/6.]

0 ≤ x ≤ 1/2    ⇒ take action a_1 ⇒ choose ω_1
1/2 ≤ x ≤ 11/6 ⇒ take action a_3 ⇒ do not classify
11/6 ≤ x ≤ 2   ⇒ take action a_2 ⇒ choose ω_2
Bayesian decision theory
The notion of risk

The total Bayes risk is:

R = ∫_0^{1/2} r_1(x) p(x) dx + ∫_{1/2}^{11/6} r_3(x) p(x) dx + ∫_{11/6}^2 r_2(x) p(x) dx
  = 1/12 + 4/27 + 1/216 = 17/72 ≈ 0.24

The probability that an object is not classified is:

∫_{1/2}^{11/6} (5 − 2x)/6 dx = 16/27 ≈ 0.59
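A numerical check of Example 4 (a sketch using a Riemann sum; it recovers the two thresholds, the Bayes risk, and the rejection probability):

```python
import numpy as np

x = np.linspace(0.0, 2.0, 200_001)
dx = x[1] - x[0]
p = (5 - 2 * x) / 6                           # evidence p(x)
post = np.vstack([(4 - 2 * x) / (5 - 2 * x),  # P(w1|x)
                  1 / (5 - 2 * x)])           # P(w2|x)

loss = np.array([[0.0, 1.0], [1.0, 0.0], [0.25, 0.25]])
risks = loss @ post                           # shape (3, len(x)): r_i(x) for each action
action = risks.argmin(axis=0)                 # optimal action at every x

print(np.sum(risks.min(axis=0) * p) * dx)     # Bayes risk, about 17/72 = 0.236
print(np.sum(p[action == 2]) * dx)            # P(no classification), about 16/27 = 0.593
print(x[action == 2][0], x[action == 2][-1])  # rejection region, about 0.5 to 11/6 = 1.833
```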
Bayesian decision theory
0/1 losses

In many applications the 0/1 loss is used (usually in the absence of more precise information):

l_ij = 0 if i = j,   l_ij = 1 if i ≠ j

Consider K classes and the actions a_i = "classify x into ω_i". Then:

r(a_i|x) = Σ_{j=1}^K l_ij P(ω_j|x) = Σ_{j≠i} P(ω_j|x) = 1 − P(ω_i|x)

so minimizing the conditional risk amounts to maximizing the posterior probability.
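A two-line check that, under the 0/1 loss, the minimum-risk action coincides with the maximum-posterior class (posteriors are hypothetical):

```python
import numpy as np

posteriors = np.array([0.2, 0.5, 0.3])       # hypothetical P(w_j|x), K = 3
loss01 = 1.0 - np.eye(3)                     # 0 on the diagonal, 1 elsewhere
risks = loss01 @ posteriors                  # r(a_i|x) = 1 - P(w_i|x)
print(risks.argmin(), posteriors.argmax())   # both select the same class (index 1)
```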
Bayesian decision theory
Discriminant functions
Functions of the form g_k : P → R are a useful tool to build an abstract classifier.

An object x is assigned to class ω_i when g_i(x) is the highest among the values g_1(x), ..., g_K(x). Examples:
• g_k(x) = P(ω_k|x)
• g_k(x) = P(ω_k) p(x|ω_k)
• g_k(x) = −r(a_k|x)

If g_k is a discriminant function, then so is h ∘ g_k, for any strictly increasing function h.

For two classes we can use a single discriminant function, called a dichotomizer:
1. define g(x) := g_1(x) − g_2(x)
2. assign x to class ω_1 if g(x) > 0 and to class ω_2 if g(x) < 0
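A sketch of a dichotomizer built from the second discriminant above, composed with the (strictly increasing) log, using the densities of Example 2:

```python
import numpy as np

def g(x, eps=1e-12):
    """g(x) = log[P(w1) p(x|w1)] - log[P(w2) p(x|w2)] for the densities of Example 2."""
    g1 = np.log(1 / 3 * (2 - x) / 2 + eps)   # eps only guards log(0) at the endpoints
    g2 = np.log(2 / 3 * x / 2 + eps)
    return g1 - g2

print("w1" if g(0.4) > 0 else "w2")          # w1: 0.4 lies below the decision boundary 2/3
```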
Bayesian decision theory
The Gaussian Distribution
A normally distributed d-variate random vector X = (X_1, ..., X_d)ᵀ has pdf:

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

where µ is the mean vector and Σ_{d×d} = (σ_{ij}²) is the real, symmetric and positive definite (p.d.) covariance matrix.

E[X] = µ and E[(X − µ)(X − µ)ᵀ] = Σ

Cov[X_i, X_j] = σ_{ij}² and Var[X_i] = σ_{ii}²

If X ∼ N(µ, Σ), then X_i, X_j are statistically independent ⇐⇒ Cov[X_i, X_j] = 0
(in general, only the left-to-right implication holds)

Notation: it is convenient to write p(x) = N(x; µ, Σ) or X ∼ N(µ, Σ).
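A sketch of evaluating this density with NumPy (µ and Σ below are hypothetical; the quadratic form in the exponent is the squared Mahalanobis distance introduced below):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density N(x; mu, sigma) for a d-dimensional point x."""
    d = len(mu)
    diff = x - mu
    maha2 = diff @ np.linalg.solve(sigma, diff)                   # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha2) / norm

mu = np.array([0.0, 1.0])                                         # hypothetical parameters
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, sigma))
```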
Bayesian decision theory
The Gaussian Distribution

The quantity d(x, µ) = √( (x − µ)ᵀ Σ⁻¹ (x − µ) ) is called the Mahalanobis distance.

What is behind the choice of a multivariate Gaussian for a class?

−→ examples from the class are noisy versions of a prototype:

• Prototype: modeled by the mean vector

• Noise: modeled by the covariance matrix

Very important: the number of parameters is d(d + 1)/2 + d
Bayesian decision theory
The Gaussian Distribution

The surfaces of equal probability d(x, µ) = const. are hyperellipsoids.

The principal directions or components (PC) of the hyperellipsoids are given by the eigenvectors u_i of Σ, which satisfy Σ u_i = λ_i u_i, λ_i > 0.

The lengths of the hyperellipsoid axes are proportional to √λ_i, where λ_i is the eigenvalue associated with u_i (note that Σ is p.d. and therefore all λ_i > 0).
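A sketch of recovering the principal directions and relative axis lengths from a (hypothetical) covariance matrix:

```python
import numpy as np

sigma = np.array([[2.0, 0.3],              # hypothetical covariance matrix (symmetric, p.d.)
                  [0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(sigma)   # eigh is meant for symmetric matrices
print(eigvals)                             # all positive because sigma is p.d.
print(eigvecs)                             # columns u_i: principal directions of the ellipsoids
print(np.sqrt(eigvals))                    # semi-axis lengths are proportional to sqrt(lambda_i)
```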
Bayesian decision theory
Properties of the Gaussian Distribution

• Simplified forms: if the X_i are statistically independent, then p(x) = Π_{i=1}^d p(x_i), Σ is diagonal and the PCs are axis-aligned
(marginal densities [integrating out some variables] and conditional densities [setting some variables to fixed values] are also normal)

• Analytical properties: e.g. any moment E[X^p] can be expressed as a function of µ and Σ

• Central limit theorem: the (standardized) mean of n i.i.d. random variables with finite variance tends to a normal distribution as n → ∞

• Linear transformation invariance: a linear transformation of a normally distributed vector (e.g. a change of the coordinate system) is again normally distributed
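A quick sampling check of the last property (a sketch; A, b, µ and Σ are made up): if X ∼ N(µ, Σ), then Y = AX + b should be normal with mean Aµ + b and covariance AΣAᵀ.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.0, 1.0])
sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
A = np.array([[1.0, 2.0], [0.5, -1.0]])
b = np.array([3.0, -2.0])

x = rng.multivariate_normal(mu, sigma, size=200_000)   # samples of X
y = x @ A.T + b                                        # samples of Y = A X + b

print(y.mean(axis=0), A @ mu + b)                      # empirical vs. theoretical mean
print(np.cov(y.T), A @ sigma @ A.T)                    # empirical vs. theoretical covariance
```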
Bayesian decision theory
The Gaussian Distribution

[Figure: observations from a bivariate (d = 2) normal distribution, a contour ellipse, the two marginal distributions, and their histograms (images from Wikipedia).]
