Theory For Classification and Linear Models (I)
GCED
Lluı́s A. Belanche
belanche@cs.upc.edu
2018-2019
Discrete random variables. Let A be a discrete r.v. with pmf P_A. We use
the shorthand notation P(a) to mean P_A(A = a). Similarly we write
P(b|a) to mean P_{B|A}(B = b | A = a), etc., where

P(b|a) = P(b, a) / P(a),   provided P(a) > 0
Bayesian decision theory
Introduction: Bayes’ formula
Let {a1, . . . , an}, {b1, . . . , bm} be the possible values that A, B can take.
Then, for any a ∈ {a1, . . . , an}:
P(a) = Σ_{i=1..m} P(a, bi) = Σ_{i=1..m} P(a|bi) P(bi)

and, by Bayes' formula,

P(bj|a) = P(a|bj) P(bj) / P(a),   j = 1, . . . , m   (posterior probabilities)
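As an illustrative sketch, the two formulas can be checked numerically; the priors and likelihoods below are made-up values:

```python
# Minimal sketch: total probability and Bayes' formula for discrete variables.
# The priors and likelihoods are illustrative values only.
priors = {"b1": 0.5, "b2": 0.3, "b3": 0.2}          # P(b_i)
likelihood_a = {"b1": 0.9, "b2": 0.4, "b3": 0.1}     # P(a | b_i) for one fixed value a

# Law of total probability: P(a) = sum_i P(a | b_i) P(b_i)
p_a = sum(likelihood_a[b] * priors[b] for b in priors)

# Bayes' formula: P(b_i | a) = P(a | b_i) P(b_i) / P(a)
posteriors = {b: likelihood_a[b] * priors[b] / p_a for b in priors}

print(p_a)          # marginal probability of observing a
print(posteriors)   # posterior probabilities; they sum to 1
```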
Bayesian decision theory
Introduction: Bayes’ formula
For continuous r.v.'s X, Y with joint pdf p(x, y):

p(x) = ∫_R p(x, y) dy;   p(y) = ∫_R p(x, y) dx

Therefore:

p(y|x) = p(x|y) p(y) / ∫_R p(x|y) p(y) dy,   with ∫_R p(y|x) dy = 1
Bayesian decision theory
Introduction: Bayes’ formula
When X is continuous and Y is discrete with values {y1, . . . , ym}:

P(yj|x) = p(x|yj) P(yj) / Σ_{i=1..m} p(x|yi) P(yi),   with Σ_{j=1..m} P(yj|x) = 1
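As an illustrative sketch, these posteriors can be computed for any class-conditional densities; here they are assumed Gaussian, with made-up parameters:

```python
import math

# Posterior class probabilities for a continuous feature x and m discrete classes,
# assuming Gaussian class-conditional densities p(x | y_i). All numbers are made up.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [0.3, 0.7]                   # P(y_1), P(y_2)
params = [(0.0, 1.0), (2.0, 1.5)]     # (mean, std) of p(x | y_i)

def posteriors(x):
    joint = [gaussian_pdf(x, mu, s) * p for (mu, s), p in zip(params, priors)]
    evidence = sum(joint)             # denominator: sum_i p(x | y_i) P(y_i)
    return [j / evidence for j in joint]

print(posteriors(1.0))                # the two posteriors sum to 1
```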
Bayesian decision theory
Decision rules
We are interested in determining the class or category (the state of nature) of objects according
to Ω, a discrete r.v. with values {ω1, ω2} that represent the two possible classes.
The prior probabilities are P(ω1), P(ω2). How should we classify objects?

rule 1: if P(ω1) > P(ω2) then class of object is ω1 else class is ω2

This rule classifies all objects into the same class; therefore it makes
errors! Its probability of error is Pe(rule 1) = min{P(ω1), P(ω2)}.

Now P(ωi|x) = P(x|ωi)P(ωi)/P(x) is the posterior probability that an object with
measured feature x belongs to class ωi, i = 1, 2.

Upon observing x, the Bayes formula converts prior class probabilities P(ωi) into
posterior probabilities P(ωi|x). How should we classify objects now?

rule 2: if P(ω1|x) > P(ω2|x) then class of object is ω1 else class is ω2
If the feature X is discrete and takes the values x1, . . . , xd, the probability of error of rule 2 is:

Pe(rule 2) = Σ_{i=1..d} min{P(ω1|xi), P(ω2|xi)} P(xi)
Bayesian decision theory
Decision rules
Lemma. For all a, b, c, d ∈ R, min(a, b) + min(c, d) ≤ min(a + c, b + d).

Proposition 1. Pe(rule 2) ≤ Pe(rule 1).

Proof.

Pe(rule 2) = Σ_{i=1..d} min{P(ω1|xi), P(ω2|xi)} P(xi)

= Σ_{i=1..d} min{P(xi) P(ω1|xi), P(xi) P(ω2|xi)}

= Σ_{i=1..d} min{P(xi|ω1) P(ω1), P(xi|ω2) P(ω2)}   (Bayes' formula)

≤ min{ Σ_{i=1..d} P(xi|ω1) P(ω1), Σ_{i=1..d} P(xi|ω2) P(ω2) }   (iterated application of the lemma)

= min{ P(ω1) Σ_{i=1..d} P(xi|ω1), P(ω2) Σ_{i=1..d} P(xi|ω2) }

= min{P(ω1), P(ω2)} = Pe(rule 1),

since Σ_{i=1..d} P(xi|ωj) = 1 for j = 1, 2.

The two error probabilities coincide when P(xi|ω1) = P(xi|ω2) for all i (the feature then carries no class information) and, more generally, exactly when one of the two classes attains the minimum of P(xi|ωj)P(ωj) for every i, i.e. when rule 2 assigns all objects to the same class.
Bayesian decision theory
Example 1 We have a conveyor belt carrying two classes of pills, suitable for two
different diseases (ω1 and ω2 ). These pills go in two colors {yellow, white}.
P(ω1) = 1/3, P(ω2) = 2/3, P(yellow|ω1) = 1/5, P(white|ω1) = 4/5, P(yellow|ω2) = 2/3, P(white|ω2) = 1/3.

P(yellow) = P(ω1)P(yellow|ω1) + P(ω2)P(yellow|ω2) = 1/3 · 1/5 + 2/3 · 2/3 = 23/45

P(white) = P(ω1)P(white|ω1) + P(ω2)P(white|ω2) = 1/3 · 4/5 + 2/3 · 1/3 = 22/45

P(ω1|yellow) = P(yellow|ω1)P(ω1) / P(yellow) = (1/5 · 1/3) / (23/45) = 3/23;   P(ω2|yellow) = 1 − P(ω1|yellow) = 20/23

P(ω1|white) = P(white|ω1)P(ω1) / P(white) = (4/5 · 1/3) / (22/45) = 6/11;   P(ω2|white) = 1 − P(ω1|white) = 5/11

⇒ Pe = 23/45 · 3/23 + 22/45 · 5/11 = 13/45 < 1/3 = min{1/3, 2/3}
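The arithmetic can be cross-checked with exact fractions:

```python
from fractions import Fraction as F

# Example 1 data: priors and color probabilities within each class.
prior = {"w1": F(1, 3), "w2": F(2, 3)}
p_color = {("yellow", "w1"): F(1, 5), ("white", "w1"): F(4, 5),
           ("yellow", "w2"): F(2, 3), ("white", "w2"): F(1, 3)}

for color in ("yellow", "white"):
    evidence = sum(p_color[(color, w)] * prior[w] for w in prior)   # P(color)
    post = {w: p_color[(color, w)] * prior[w] / evidence for w in prior}
    print(color, evidence, post)

# Error of rule 2: sum over colors of the smaller joint probability.
pe = sum(min(p_color[(c, w)] * prior[w] for w in prior) for c in ("yellow", "white"))
print(pe)   # 13/45, smaller than the smaller prior 1/3
```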
Bayesian decision theory
Continuous variables
The next step is to consider a r.v. X with pdf p(x) that measures a continuous feature
of the objects. Let P be the support of p, i.e. P = {x ∈ R| p(x) > 0}.
In this setting, p(x|ωi ), i = 1, 2 are the conditional densities of x for every class.
Proposition 2. Pe(rule 2) ≤ Pe(rule 1).

Proof.

Pe(rule 2) = ∫_P min{P(ω1|x), P(ω2|x)} p(x) dx

= ∫_P min{p(x) P(ω1|x), p(x) P(ω2|x)} dx

= ∫_P min{p(x|ω1) P(ω1), p(x|ω2) P(ω2)} dx   (Bayes' formula)

≤ min{ ∫_P p(x|ω1) P(ω1) dx, ∫_P p(x|ω2) P(ω2) dx }   (standard result for integrals)

= min{ P(ω1) ∫_P p(x|ω1) dx, P(ω2) ∫_P p(x|ω2) dx }

= min{P(ω1), P(ω2)} = Pe(rule 1).
Bayesian decision theory
Example 2 We have a conveyor belt carrying two classes of pills, suitable for two
different diseases (ω1 and ω2). This time the pill color is a continuous shade x ∈ [0, 2],
with probabilities:

P(ω1) = 1/3, P(ω2) = 2/3, p(x|ω1) = (2 − x)/2, p(x|ω2) = x/2

p(x) = P(ω1)p(x|ω1) + P(ω2)p(x|ω2) = 1/3 · (2 − x)/2 + 2/3 · x/2 = (2 + x)/6

⇒ Pe = P(ω2) ∫_0^{2/3} p(x|ω2) dx + P(ω1) ∫_{2/3}^{2} p(x|ω1) dx = 2/9 < 1/3 = min{1/3, 2/3}

where x = 2/3 is the decision boundary, i.e. the solution of P(ω1|x) = P(ω2|x), or (2 − x)/(2 + x) = 2x/(2 + x).
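A quick numerical sanity check of this error probability, using a simple midpoint Riemann sum:

```python
# Numerical check of Example 2: Pe = integral of min{P(w1)p(x|w1), P(w2)p(x|w2)} dx.
N = 100_000
h = 2.0 / N
xs = [(i + 0.5) * h for i in range(N)]   # midpoints of [0, 2]

def p1(x): return (2 - x) / 2            # p(x | w1)
def p2(x): return x / 2                  # p(x | w2)

pe = sum(min(p1(x) / 3, 2 * p2(x) / 3) for x in xs) * h
print(pe)   # ~0.2222 = 2/9
```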
Bayesian decision theory
The Bayes classifier
The Bayes classifier can be extended in two ways: to more than two classes ω1, . . . , ωK and to
vector-valued features x ∈ R^d. In all cases it assigns each x to the class with the largest posterior probability:

ω̂(x) := arg max_{k=1..K} P(ωk|x)

The sets Rk = {x | ω̂(x) = k} are called decision regions (and depend on the specific classifier)
Bayesian decision theory
Illustration of the optimal classifier (two-class case)
[Figures: graphical illustration of the optimal classifier; not reproduced]
Bayesian decision theory
The Bayes classifier
The Bayes classifier can also have a rejection class (illustrated here for
two classes); fix a threshold in (0, 1): an object is assigned to its most probable class only
when the largest posterior probability exceeds this threshold; otherwise it is rejected (not classified).
Bayesian decision theory
The Bayes classifier
Example 3 Let m = K + 1 and let ai stand for “classify x into class ωi”
for 1 ≤ i ≤ K; let aK+1 stand for “do not classify x”. A possible set of
losses is:

l(ai|ωj) = 1 for 1 ≤ i, j ≤ K, i ≠ j
l(ai|ωi) = 0 for 1 ≤ i ≤ K
l(aK+1|ωj) = 1/2 for 1 ≤ j ≤ K

... which suggests that a decision not to classify is less costly than a misclassification.
Bayesian decision theory
The notion of risk
For a given feature vector x, define the conditional risk of an action as:

r(ai|x) := Σ_{j=1..K} l(ai|ωj) P(ωj|x)

and the (total) risk of a decision rule a, which maps every x ∈ P to an action a(x), as:

R(a) := ∫_P r(a(x)|x) p(x) dx
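As an illustrative sketch, the conditional risk is just a loss-weighted sum of posteriors; the loss matrix below is the one of Example 3 for K = 2, and the posteriors are made-up values:

```python
# Conditional risk r(a_i | x) = sum_j l(a_i | w_j) P(w_j | x), for K = 2 classes
# with the losses of Example 3 (rejection costs 1/2).
loss = [
    [0.0, 1.0],   # a1: classify as w1
    [1.0, 0.0],   # a2: classify as w2
    [0.5, 0.5],   # a3: do not classify
]

def conditional_risks(posterior):
    """posterior = [P(w1|x), P(w2|x)]; returns [r(a1|x), r(a2|x), r(a3|x)]."""
    return [sum(l * p for l, p in zip(row, posterior)) for row in loss]

post = [0.6, 0.4]                        # assumed posteriors for some observed x
risks = conditional_risks(post)
best = min(range(len(risks)), key=lambda i: risks[i])
print(risks, "-> action a%d" % (best + 1))
```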
Bayesian decision theory
The notion of risk
We are interested in the decision rule that minimizes the total risk.
Consider the rule (the Bayes rule) that picks, for every x, the action with minimum conditional risk:

a*(x) := arg min_{i=1..m} r(ai|x)

Given that this rule minimizes the argument of the integral for every
possible x, it follows that the Bayes rule has the lowest possible risk.
Bayesian decision theory
The notion of risk
Example 4 Yet again the conveyor belt carrying two classes of pills that go in shaded
colors between yellow and white, i.e. a scalar feature x ∈ [0, 2], with probabilities:

P(ω1) = 2/3, P(ω2) = 1/3, p(x|ω1) = (2 − x)/2, p(x|ω2) = 1/2,

and three possible actions:

a1: classify as ω1
a2: classify as ω2
a3: do not classify

Let the loss matrix lij ≡ l(ai|ωj) be:

        ω1     ω2
a1       0      1
a2       1      0
a3      1/4    1/4
Bayesian decision theory
The notion of risk
We have
p(x) = 2/3 · (2 − x)/2 + 1/3 · 1/2 = (5 − 2x)/6

P(ω1|x) = [2/3 · (2 − x)/2] / [(5 − 2x)/6] = (4 − 2x)/(5 − 2x);   P(ω2|x) = 1 − P(ω1|x) = 1/(5 − 2x)

r1(x) ≡ r(a1|x) = 0 · P(ω1|x) + 1 · P(ω2|x) = 1/(5 − 2x)

r2(x) ≡ r(a2|x) = 1 · P(ω1|x) + 0 · P(ω2|x) = (4 − 2x)/(5 − 2x)

r3(x) ≡ r(a3|x) = 1/4 · P(ω1|x) + 1/4 · P(ω2|x) = 1/4

Now the Bayes rule chooses for each x the action with minimum conditional risk.
Bayesian decision theory
The notion of risk
[Figure: the three conditional risks r1(x), r2(x), r3(x) plotted over [0, 2]]

0 ≤ x ≤ 1/2   ⇒ take action a1 ⇒ choose ω1
1/2 ≤ x ≤ 11/6 ⇒ take action a3 ⇒ do not classify
11/6 ≤ x ≤ 2   ⇒ take action a2 ⇒ choose ω2
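A small numerical confirmation of these regions: evaluating the three conditional risks on a grid and taking the action of minimum risk reproduces the switch points at x = 1/2 and x = 11/6:

```python
# Conditional risks of Example 4; the optimal action switches at x = 1/2 and x = 11/6.
def risks(x):
    p1 = (4 - 2 * x) / (5 - 2 * x)       # P(w1 | x)
    p2 = 1 - p1                          # P(w2 | x) = 1/(5 - 2x)
    return [p2, p1, 0.25]                # r1, r2, r3 (rejecting costs 1/4)

prev = None
for i in range(2001):
    x = 2 * i / 2000                     # grid over [0, 2]
    r = risks(x)
    best = r.index(min(r))
    if best != prev:
        print("from x = %.4f the best action is a%d" % (x, best + 1))
        prev = best
# prints switches near x = 0.5 (a1 -> a3) and x = 1.8333 (a3 -> a2)
```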
Bayesian decision theory
The notion of risk
R = ∫_0^{1/2} r1(x) p(x) dx + ∫_{1/2}^{11/6} r3(x) p(x) dx + ∫_{11/6}^{2} r2(x) p(x) dx

= 1/12 + 4/27 + 1/216 = 17/72 ≈ 0.24

The probability that a pill is left unclassified is

∫_{1/2}^{11/6} p(x) dx = ∫_{1/2}^{11/6} (5 − 2x)/6 dx = 16/27 ≈ 0.59
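The same midpoint-rule integration confirms these values numerically:

```python
# Numerical check of the total risk in Example 4.
def p(x):  return (5 - 2 * x) / 6                 # evidence p(x)
def r1(x): return 1 / (5 - 2 * x)                 # risk of deciding w1
def r2(x): return (4 - 2 * x) / (5 - 2 * x)       # risk of deciding w2
def r3(x): return 0.25                            # risk of rejecting

def integrate(f, a, b, n=100_000):
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

R = (integrate(lambda x: r1(x) * p(x), 0, 0.5)
     + integrate(lambda x: r3(x) * p(x), 0.5, 11 / 6)
     + integrate(lambda x: r2(x) * p(x), 11 / 6, 2))
print(R)                                          # ~0.236 = 17/72
print(integrate(p, 0.5, 11 / 6))                  # rejection probability ~0.593 = 16/27
```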
Bayesian decision theory
0/1 losses
lij = 0 if i = j, and lij = 1 if i ≠ j. Then

r(ai|x) = Σ_{j=1..K} lij P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)

so minimizing the conditional risk amounts to choosing the class with the largest posterior probability.
Bayesian decision theory
Discriminant functions
Functions of the form gk : P → R, k = 1, . . . , K, are a useful tool to build an abstract classifier.
An object x is assigned to class ωi when gi (x) is the highest among the values
g1 (x), . . . , gK (x). Examples:
• gk (x) = P (ωk |x)
• gk (x) = P (ωk )p(x|ωk )
• gk (x) = −r(ak |x)
For two classes, we can use a single discriminant function, called a dichotomizer:
1. define g(x) := g1 (x) − g2 (x)
2. assign x to class ω1 if g(x) > 0 and to class ω2 if g(x) < 0
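As an illustrative sketch, a dichotomizer built from the discriminants gk(x) = P(ωk) p(x|ωk), with made-up Gaussian class-conditionals:

```python
import math

# A two-class dichotomizer g(x) = g1(x) - g2(x) with g_k(x) = P(w_k) p(x | w_k).
# Priors, means and standard deviations are illustrative assumptions.
def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def g1(x): return 0.4 * gaussian_pdf(x, mu=-1.0, sigma=1.0)
def g2(x): return 0.6 * gaussian_pdf(x, mu=+1.0, sigma=1.0)

def classify(x):
    g = g1(x) - g2(x)                 # dichotomizer
    return "w1" if g > 0 else "w2"    # ties (g == 0) resolved arbitrarily as w2

print([classify(x) for x in (-2.0, 0.0, 2.0)])
```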
Bayesian decision theory
The Gaussian Distribution
A normally distributed d-variate random vector X = (X1, . . . , Xd)ᵀ has pdf:

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) )

where µ is the mean vector and Σ the covariance matrix, with Cov[Xi, Xj] = σ²_ij and Var[Xi] = σ²_ii.
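As an illustrative sketch, the density can be evaluated directly from this formula (mean and covariance below are arbitrary example values):

```python
import numpy as np

# Density of a d-variate Gaussian, evaluated directly from the formula above.
def gaussian_density(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_density(np.array([0.5, 0.5]), mu, Sigma))
```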
Bayesian decision theory
The Gaussian Distribution
The quantity d(x, µ) = √( (x − µ)ᵀ Σ⁻¹ (x − µ) ) is called the Mahalanobis distance.
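Correspondingly, a short sketch of the Mahalanobis distance with the same example values:

```python
import numpy as np

# Mahalanobis distance of a point x from a mean mu under covariance Sigma.
def mahalanobis(x, mu, Sigma):
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(Sigma) @ diff))

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
# equivalently: scipy.spatial.distance.mahalanobis(x, mu, np.linalg.inv(Sigma))
print(mahalanobis(np.array([2.0, 2.0]), mu, Sigma))
```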
Bayesian decision theory
Properties of the Gaussian Distribution
[Figures illustrating the Gaussian distribution and its properties; not reproduced]