Supervised Learning: Linear Methods (1/2) : Applied Multivariate Statistics - Spring 2012

Overview
• Review: Conditional Probability
• LDA / QDA: Theory
• Fisher’s Discriminant Analysis
• LDA: Example
• Quality control: Test set and cross-validation
• Case study: Text recognition
Conditional Probability
Sample space with two events:
• T: medical test is positive
• C: patient has cancer
(Marginal) probability: P(T), P(C)
Conditional probability: P(T|C), P(C|T)
• P(T|C): the new sample space is people with cancer; P(T|C) is large
• P(C|T): the new sample space is people with a positive test; P(C|T) is small
Bayes’ theorem:
$P(C \mid T) = \frac{P(T \mid C)\, P(C)}{P(T)}$
where P(C|T) is the posterior, P(C) the prior, and P(T|C) the class conditional probability.
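To make the gap between P(T|C) and P(C|T) concrete, here is a minimal sketch of Bayes’ theorem for the medical test. The prevalence, sensitivity and false positive rate are hypothetical numbers chosen only for illustration, and the function name `posterior` is not from the slides.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(C|T) from P(C), P(T|C) and P(T|not C) via Bayes' theorem."""
    # Law of total probability: P(T) = P(T|C) P(C) + P(T|not C) P(not C)
    p_t = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_t


# Hypothetical values: 1% prevalence, P(T|C) = 0.9, P(T|not C) = 0.05.
p_c_given_t = posterior(prior=0.01, sensitivity=0.9, false_positive_rate=0.05)
print(round(p_c_given_t, 3))  # 0.154
```

With these assumed numbers P(T|C) is large (0.9) while P(C|T) stays small, because the prior P(C) is small.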
One approach to supervised learning
$P(C \mid X) = \frac{P(C)\, P(X \mid C)}{P(X)} \propto P(C)\, P(X \mid C)$
Find some estimate of the two factors:
• Prior / prevalence P(C): fraction of samples in that class
• P(X|C): assume $X \mid C \sim N(\mu_C, \Sigma_C)$
Bayes rule:
Choose the class where P(C|X) is maximal
(the rule is “optimal” if all types of error are equally costly)
Special case: Two classes (0/1)
- choose c=1 if P(C=1|X) > 0.5 or
- choose c=1 if posterior odds P(C=1|X)/P(C=0|X) > 1
In practice: estimate $P(C)$, $\mu_C$, $\Sigma_C$
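The estimation step can be sketched for a single predictor, where the covariance matrix reduces to a variance. The data, the function names and the use of the sample variance are assumptions made for this illustration; they are not taken from the slides.

```python
import math
from statistics import mean


def estimate_class_parameters(xs, labels):
    """Estimate prior, mean and variance per class for one predictor."""
    params = {}
    n = len(xs)
    for c in sorted(set(labels)):
        xc = [x for x, y in zip(xs, labels) if y == c]
        mu = mean(xc)
        var = sum((x - mu) ** 2 for x in xc) / (len(xc) - 1)  # sample variance
        params[c] = {"prior": len(xc) / n, "mean": mu, "var": var}
    return params


def gaussian_density(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def posterior_probs(x, params):
    """P(C|X=x) for every class: prior times density, then normalised."""
    unnorm = {c: p["prior"] * gaussian_density(x, p["mean"], p["var"])
              for c, p in params.items()}
    total = sum(unnorm.values())  # plays the role of P(X)
    return {c: v / total for c, v in unnorm.items()}


# Made-up training data: four samples of class 0, six of class 1.
xs = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

params = estimate_class_parameters(xs, labels)
post = posterior_probs(2.2, params)
predicted = max(post, key=post.get)  # Bayes rule: class with maximal posterior
print(predicted)
```

Dividing by the sum over classes is what turns the proportionality into actual posterior probabilities; for choosing the class it could be skipped.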
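QDA: Doing the math…
Class conditional density:
$P(X \mid C) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_C|}} \exp\left(-\frac{1}{2} (x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C)\right)$
• $P(C \mid X) \propto P(C)\, P(X \mid C)$
• Use the fact: $\max P(C \mid X) \Leftrightarrow \max \log P(C \mid X)$
• $\delta_c(x) = \log P(C \mid X) = \log P(C) + \log P(X \mid C) =$
  $= \log P(C) - \frac{1}{2} \log |\Sigma_C| - \frac{1}{2} (x - \mu_C)^T \Sigma_C^{-1} (x - \mu_C) + c$
  The three terms are the prior, an additional term, and the squared Mahalanobis distance; c is a constant.
• Choose the class where $\delta_c(x)$ is maximal
• Special case: Two classes
  Decision boundary: values of x where $\delta_0(x) = \delta_1(x)$; it is quadratic in x
• Quadratic Discriminant Analysis (QDA)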
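The discriminant function can be sketched for two predictors, using the closed-form inverse and determinant of a 2x2 matrix. The class parameters below are invented for illustration, and the names `qda_score` and `classify` are hypothetical.

```python
import math


def qda_score(x, prior, mu, cov):
    """log P(C) - 0.5*log|Sigma_C| - 0.5*(x-mu)^T Sigma_C^{-1} (x-mu), 2D only."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha_sq = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * maha_sq


# Assumed parameters: class 1 has a much larger spread than class 0.
classes = {
    0: {"prior": 0.5, "mu": (0.0, 0.0), "cov": ((1.0, 0.0), (0.0, 1.0))},
    1: {"prior": 0.5, "mu": (3.0, 0.0), "cov": ((4.0, 0.0), (0.0, 4.0))},
}


def classify(x):
    """Choose the class whose discriminant score is maximal."""
    return max(classes, key=lambda k: qda_score(x, **classes[k]))


for point in [(1.0, 0.0), (2.0, 0.0), (-5.0, 0.0)]:
    print(point, classify(point))
```

In this setup the point (-5, 0) goes to class 1 although it lies on the far side of class 0. That is a consequence of the quadratic boundary: the class with the larger covariance wins on both sides once a point is far enough away.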
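Simplification
• Assume the same covariance matrix in all classes, i.e.
  $X \mid C \sim N(\mu_C, \Sigma)$ (Σ is fixed for all classes)
• $\delta_c(x) = \log P(C) - \frac{1}{2} \log |\Sigma| - \frac{1}{2} (x - \mu_C)^T \Sigma^{-1} (x - \mu_C) + c =$
  $= \log P(C) - \frac{1}{2} (x - \mu_C)^T \Sigma^{-1} (x - \mu_C) + d =$
  $(= \log P(C) + x^T \Sigma^{-1} \mu_C - \frac{1}{2} \mu_C^T \Sigma^{-1} \mu_C)$
  The first term is the prior, the second the squared Mahalanobis distance. The constant d absorbs $-\frac{1}{2} \log |\Sigma|$, which no longer depends on the class. The last form also drops d and $-\frac{1}{2} x^T \Sigma^{-1} x$, which are the same for all classes.
• Decision boundary is linear in x
• Linear Discriminant Analysis (LDA)
[Figure: a new point between class 0 and class 1]
Classify to which class (assume equal prior)?
• Physical distance in space is equal
• Classify to class 0, since the Mahalanobis distance is smaller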
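The situation in the figure can be reproduced with a small sketch: a point at equal physical distance from both class means, but with different Mahalanobis distances under a shared covariance matrix. The means, the covariance matrix and the helper names are assumptions made for this example.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as nested tuples."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def bilinear(u, m, v):
    """u^T m v for 2-vectors u, v and a 2x2 matrix m."""
    return sum(u[i] * m[i][j] * v[j] for i in range(2) for j in range(2))


def mahalanobis_sq(x, mu, cov_inv):
    diff = (x[0] - mu[0], x[1] - mu[1])
    return bilinear(diff, cov_inv, diff)


def lda_score(x, prior, mu, cov_inv):
    """Linear form: log P(C) + x^T S^{-1} mu - 0.5 * mu^T S^{-1} mu."""
    return math.log(prior) + bilinear(x, cov_inv, mu) - 0.5 * bilinear(mu, cov_inv, mu)


# Assumed setup: shared covariance with more spread along the first axis.
cov_inv = inv2(((4.0, 0.0), (0.0, 1.0)))
mu0, mu1 = (2.0, 0.0), (0.0, 2.0)
x = (0.0, 0.0)

print(math.dist(x, mu0), math.dist(x, mu1))  # equal physical distance
print(mahalanobis_sq(x, mu0, cov_inv), mahalanobis_sq(x, mu1, cov_inv))
scores = {0: lda_score(x, 0.5, mu0, cov_inv), 1: lda_score(x, 0.5, mu1, cov_inv)}
print(max(scores, key=scores.get))  # class with the larger linear score
```

With equal priors, the class with the smaller squared Mahalanobis distance is also the class with the larger linear score, so both views give class 0 here.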
LDA vs. QDA
LDA (linear decision boundary):
+ Only few parameters to estimate; accurate estimates
- Inflexible
QDA (quadratic decision boundary):
- Many parameters to estimate; less accurate
+ More flexible
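A rough count shows how quickly the number of parameters grows. The sketch counts class priors, class means and the free entries of the symmetric covariance matrices. The 4 predictors and 3 classes match the iris example; the choice of 10 classes for the 256 grey values of the digit example is an assumption, as is the function name.

```python
def n_parameters(d, k, shared_cov):
    """Free parameters for k Gaussian classes in d dimensions."""
    priors = k - 1                  # priors sum to one
    means = k * d                   # one mean vector per class
    cov = d * (d + 1) // 2          # a symmetric d x d matrix
    return priors + means + (cov if shared_cov else k * cov)


for d, k in [(4, 3), (256, 10)]:
    lda = n_parameters(d, k, shared_cov=True)
    qda = n_parameters(d, k, shared_cov=False)
    print(f"d={d}, k={k}: LDA {lda}, QDA {qda}")
```

For a few predictors the difference is small, but with 256 predictors QDA needs roughly ten times as many parameters as LDA in this count.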
Fisher’s Discriminant Analysis: Idea
Find direction(s) in which groups are separated best
• Class Y, predictors $X = (X_1, \ldots, X_d)$
  $U = w^T X$
• Find w so that groups are separated along U best
• Measure of separation: Rayleigh coefficient
  $J(w) = \frac{D(U)}{\mathrm{Var}(U)}$
  where $D(U) = \left(E[U \mid Y = 0] - E[U \mid Y = 1]\right)^2$
• $E[X \mid Y = j] = \mu_j$, $\mathrm{Var}[X \mid Y = j] = \Sigma$
  $E[U \mid Y = j] = w^T \mu_j$, $\mathrm{Var}(U) = w^T \Sigma w$
• Concept extendable to many groups
[Figure: 1st principal component compared with the 1st linear discriminant (= 1st canonical variable)]
[Figure: two panels showing D(U) and Var(U) for a direction with large J(w) and for a direction with small J(w)]
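For two groups, a standard result is that J(w) is maximised by any w proportional to $\Sigma^{-1}(\mu_0 - \mu_1)$. The sketch below checks this numerically against a grid of directions in two dimensions. The means, the covariance matrix and the function names are assumptions chosen for illustration.

```python
import math


def rayleigh(w, mu0, mu1, cov):
    """J(w) = (w^T mu0 - w^T mu1)^2 / (w^T Sigma w)."""
    d_u = sum(wi * (a - b) for wi, a, b in zip(w, mu0, mu1)) ** 2
    var_u = sum(w[i] * cov[i][j] * w[j] for i in range(2) for j in range(2))
    return d_u / var_u


def fisher_direction(mu0, mu1, cov):
    """Solve Sigma w = mu0 - mu1 for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = mu0[0] - mu1[0], mu0[1] - mu1[1]
    return ((d * dx - b * dy) / det, (-c * dx + a * dy) / det)


# Assumed group means and shared covariance with correlated predictors.
mu0, mu1 = (0.0, 0.0), (2.0, 1.0)
cov = ((2.0, 1.2), (1.2, 1.0))

j_fisher = rayleigh(fisher_direction(mu0, mu1, cov), mu0, mu1, cov)

# J along the plain difference of the means, for comparison.
j_mean_diff = rayleigh((2.0, 1.0), mu0, mu1, cov)

# J over a grid of directions, one per degree.
j_grid = [rayleigh((math.cos(math.radians(k)), math.sin(math.radians(k))),
                   mu0, mu1, cov) for k in range(180)]

print(round(j_fisher, 4), round(j_mean_diff, 4), round(max(j_grid), 4))
```

In this setup the direction connecting the two means gives a smaller J(w) than the Fisher direction, because it ignores the correlation between the predictors.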
LDA and Linear Discriminants
• Direction with largest J(w): 1st linear discriminant (LD 1)
  - orthogonal to LD 1, again largest J(w): LD 2
  - etc.
• At most min(number of dimensions, number of groups - 1) LDs
  e.g.: 3 groups in 10 dimensions – need 2 LDs
• Computed using eigenvalue decomposition or singular value decomposition
  Proportion of trace: captured % of variance between group means for each LD
• R: function «lda» in package MASS does LDA and computes linear discriminants (also «qda» available)
Example: Classification of Iris flowers
Classify according to sepal/petal length/width
Species: Iris setosa, Iris versicolor, Iris virginica
Quality of classification
• Use training data also as test data: overfitting
  Too optimistic for error on new data
• Separate test data
  [Figure: data split into a training part and a test part]
• Cross-validation (CV; e.g. “leave-one-out” cross-validation):
  Every row is the test case once, the rest is the training data
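Leave-one-out cross-validation can be sketched in a few lines. To keep the example short, a nearest-class-mean rule on one predictor stands in for the classifier; this stand-in, the data and the function names are assumptions and not part of the slides.

```python
def fit_class_means(xs, ys):
    """Stand-in classifier: store the mean of each class."""
    return {c: sum(x for x, y in zip(xs, ys) if y == c) / ys.count(c)
            for c in sorted(set(ys))}


def predict(model, x):
    """Assign x to the class with the nearest mean."""
    return min(model, key=lambda c: abs(x - model[c]))


def loocv_error(xs, ys):
    """Every row is the test case once; the rest is the training data."""
    errors = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        model = fit_class_means(train_x, train_y)  # fitted without row i
        errors += predict(model, xs[i]) != ys[i]
    return errors / len(xs)


# Made-up data with some overlap between the two classes.
xs = [1.0, 1.2, 1.9, 3.4, 2.6, 4.0, 4.3, 5.1]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

print(loocv_error(xs, ys))  # 0.25
```

The important detail is that the model is refitted inside the loop, so the held-out row never influences the model that predicts it.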
Measures for prediction error
• Confusion matrix (e.g. 100 samples)

               Truth = 0   Truth = 1   Truth = 2
Estimate = 0       23           7           6
Estimate = 1        3          27           4
Estimate = 2        3           1          26

• Error rate:
  1 - sum(diagonal entries) / (number of samples) = 1 - 76/100 = 0.24
• We expect that our classifier predicts 24% of new observations incorrectly (this is just a rough estimate)
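The error rate calculation can be reproduced from the confusion matrix above. The variable names are chosen for this sketch; rows are the estimated classes and columns the true classes, as in the table.

```python
# Rows: estimate = 0, 1, 2; columns: truth = 0, 1, 2.
confusion = [
    [23, 7, 6],
    [3, 27, 4],
    [3, 1, 26],
]

n_samples = sum(sum(row) for row in confusion)
n_correct = sum(confusion[i][i] for i in range(len(confusion)))
error_rate = 1 - n_correct / n_samples

print(n_samples, n_correct, round(error_rate, 2))  # 100 76 0.24
```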
Example: Digit recognition
• 7129 hand-written digits
• Each (centered) digit was put in a 16*16 grid
• Measure grey value in each part of the grid, i.e. 256 grey values
[Figure: sample of digits; example with 8*8 grid]

Concepts to know
• Idea of LDA / QDA
• Meaning of Linear Discriminants
• Cross Validation
• Confusion matrix, error rate
R functions to know
• lda
