
Module 4: Classification

TMA4268 Statistical Learning V2025

Stefanie Muff, Department of Mathematical Sciences, NTNU

January 30 and 31, 2025

1 / 102
Introduction

Learning material for this module


• (James et al. 2021) : An Introduction to Statistical Learning.
Chapter 4 + Chapter 2.2.3.

• All the material presented on these module slides.

2 / 102
What will you learn?

• Classification and discrimination


• Logistic regression
• Bayes classifier, Bayes risk
• KNN - majority vote or estimate posterior class probability?
• Linear discriminant analysis: model, method, results.
• Quadratic discriminant analysis: model, method, results.
• Naive Bayes - when and why?
• Sensitivity, specificity and ROC curves

3 / 102
What is classification?

• Up to now, our response Y was assumed to be continuous, while covariates were allowed to be categorical.
• Now we allow the response to be categorical.

• This is even more common than continuous responses. Examples:

• Spam filter: email → {spam, ham}.
• Eye color → {blue, brown, green}.
• Medical condition → {disease1, disease2, disease3}.

4 / 102
• Suppose we have a qualitative response value that can be a member of one of K classes C = {c1, c2, ..., cK}.
• In classification we build a function f(X) that takes a vector of input variables X and predicts its class membership, such that Y ∈ C.
• We also want to assess the uncertainty in this classification. Sometimes the role of the different predictors may be of main interest.
• We often build models that predict probabilities of categories, given certain covariates X.

5 / 102
Example: The credit card data
The Default dataset is available from the ISLR package.
Aim: to predict whether an individual will default on his or her credit card payment, given the annual income and credit card balance.
Orange: default=yes, blue: default=no.

6 / 102
General classification setup

Set-up: Training observations {(x1 , y1 ), ..., (xn , yn )} where the


response variable Y is categorical, e.g. Y ∈ C = {0, 1, ..., 9} or Y ∈ C = {dog, cat, ..., horse}.

Aim: To build a classifier f (X) that assigns a class label from C to a


future unlabelled observation x and to assess the uncertainty in this classification.

Performance measure: Most popular is the misclassification error


rate (training and test version).

7 / 102
What are the methods?
Three methods for classification are discussed here:

• Logistic regression
• K-nearest neighbours
• Linear and quadratic discriminant analysis

8 / 102
Linear regression for binary classification?
Suppose we have a binary outcome, for example whether a credit card
user defaults Y = yes or no, given covariates X to predict Y . We
could use dummy encoding for Y like
$$Y = \begin{cases} 0 & \text{if no}, \\ 1 & \text{if yes}. \end{cases}$$
Can we simply perform a linear regression of Y on X and classify as
yes if Ŷ > 0.5?
• In this case of a binary outcome, linear regression does a good job
as a classifier.
• However, linear regression might produce probabilities less than
zero or bigger than one.

⇒ We need to use logistic regression.

9 / 102
Linear vs. logistic regression

Let’s anticipate a bit, to see why linear regression does not do so well. We
estimate the probability that someone defaults, given the credit card
balance as predictor:

10 / 102
Linear regression for categorical classification?

What when there are more than two possible outcomes? For example,
a medical diagnosis Y, given predictors X, can be categorized as
$$Y = \begin{cases} 1 & \text{if stroke}, \\ 2 & \text{if drug overdose}, \\ 3 & \text{if epileptic seizure}. \end{cases}$$

This suggests an ordering, but this is artificial.

• Linear and logistic regression are not appropriate here.


• We need Multiclass logistic regression and Discriminant Analysis.

11 / 102
However:
• It is still possible to use linear regression for classification problems
with two classes. It is actually not even a bad idea, and works well
under some conditions. Under some standard assumptions, this
linear regression (with 0 and 1 response) will in fact give the same
classification as linear discriminant analysis (LDA).
• For categorical outcomes with more than two levels, it requires
some extra work (multivariate Y due to the dummy variable
coding).
• We leave linear regression for now.

12 / 102
Logistic regression

• In logistic regression we consider a classification problem with two


classes.
• Assume that Y is coded (C = {1, 0} or {success, failure}), and we
focus on success (Y = 1).
• We may assume that $Y_i$ follows a Bernoulli distribution with probability of success $p_i$:
$$Y_i = \begin{cases} 1 & \text{with probability } p_i, \\ 0 & \text{with probability } 1-p_i. \end{cases}$$

• Aim: For covariates (X1 , . . . , Xp ), we want to estimate


pi = Pr(Yi = 1 | X1 , . . . , Xp ).

13 / 102
• We need a clever way to link our covariates X1 , . . . , Xp with this
probability pi . Aim: want to relate the linear predictor

$$\eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$$

to pi . How?
• The idea is to use a so-called link-function to link pi to the linear
predictor.
• In logistic regression, we use the logistic link function

$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}. \qquad (1)$$

• Q: What is the rationale behind this?

15 / 102
Logistic regression with one covariate
• Equation (1) can be rearranged and solved for pi . Let’s look at
this for only one covariate:

$$p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}.$$

• Important: These values pi will always lie in the interval


between 0 and 1, with an S-shaped curve.
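For illustration, a minimal R sketch of this S-shaped (sigmoid) curve; the values of $\beta_0$ and $\beta_1$ below are arbitrary and only chosen to make the shape visible:

# Sketch of the sigmoid p(x) = exp(b0 + b1*x)/(1 + exp(b0 + b1*x));
# b0 and b1 are arbitrary illustrative values
b0 <- -5
b1 <- 1
curve(exp(b0 + b1 * x) / (1 + exp(b0 + b1 * x)), from = -5, to = 15,
      xlab = "x", ylab = "p(x)")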

16 / 102
Example: Default credit card data

• The parameters are estimated using the method of maximum


likelihood - we will look at that soon.
• Let’s first do it! In R, this works with the glm() function, where
we specify family="binomial".

library(ISLR)
data(Default)
Default$default <- as.numeric(Default$default)-1
glm_default = glm(default~balance, data=Default, family="binomial")

summary(glm_default)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance 0.005498917 0.0002203702 24.95309 1.976602e-137

17 / 102
Plotting the fitted line (in blue):

[Figure: fitted logistic regression curve (blue) of default probability against balance]

Default data: here $\hat\beta_0 = -10.65$ and $\hat\beta_1 = 0.005$.

18 / 102
Estimating the regression coe!cients with ML

• The coefficients $\beta_0, \beta_1, \dots$ are estimated with maximum likelihood (ML).
• Given n independent observation pairs $\{x_i, y_i\}$, the likelihood function of a logistic regression model can be written as
$$L(\beta) = \prod_{i=1}^{n} L_i(\beta) = \prod_{i=1}^{n} f(y_i; \beta) = \prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i},$$
where $\beta = (\beta_0, \beta_1, \beta_2, \dots, \beta_p)^T$ enters into $p_i$ via
$$p_i = \frac{\exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})}{1 + \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip})}.$$

19 / 102
• The maximum likelihood estimates are found by maximizing the
likelihood.
• To make the math easier, we usually work with the log-likelihood
(the log is a monotone transform, thus it will give the same result
as maximizing the likelihood).

$$\log(L(\beta)) = \ell(\beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1-y_i)\log(1-p_i) \right]$$
$$= \sum_{i=1}^{n} \left[ y_i \log\left(\frac{p_i}{1-p_i}\right) + \log(1-p_i) \right]$$
$$= \sum_{i=1}^{n} \left[ y_i(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}) - \log\left(1 + e^{\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}}\right) \right].$$
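As a sanity check, a small R sketch that evaluates this log-likelihood by hand for the Default fit from above and compares it with R's logLik(); the code is illustrative only:

# Evaluate the log-likelihood of glm_default by hand (illustrative sketch)
eta <- predict(glm_default, type = "link")   # linear predictor for each observation
p <- 1 / (1 + exp(-eta))                     # fitted probabilities p_i
y <- Default$default                         # 0/1 response (recoded above)
sum(y * log(p) + (1 - y) * log(1 - p))       # should agree with logLik(glm_default)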

20 / 102
• To maximize the log-likelihood function we find the p + 1 partial derivatives, and set them equal to 0.
• This gives us a set of p + 1 non-linear equations in the βs.
• This set of equations does not have a closed form solution.
• The equation system is therefore solved numerically using the
Newton-Raphson algorithm (or Fisher Scoring).

21 / 102
Qualitative interpretation of the coefficients

Let’s again look at the regression output:

summary(glm_default)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance 0.005498917 0.0002203702 24.95309 1.976602e-137

• The z-statistic is equal to $\hat\beta / \mathrm{SE}(\hat\beta)$, and is approximately $N(0,1)$ distributed.¹
• The p-value is $\Pr(|Z| > |z|)$ for a $Z \sim N(0,1)$ random variable.
• Check the p-value for balance. Conclusion?

1 With this knowledge we can construct confidence intervals and test hypotheses about the βs, with the aim to understand which covariate(s) contribute to our posterior probabilities and classification.
22 / 102
Quantitative interpretation of the coefficients
Remember from equation (1) that
$$\log\left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip},$$
thus
$$\frac{p_i}{1-p_i} = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}} = e^{\eta_i}.$$

The quantity $p_i/(1-p_i)$ is called the odds. Odds represent chances (e.g. in betting).

Q: Answer in mentimeter:
1. You think your football team will win tonight with a probability
p = 80%. What is the odds that it will win?
2. The odds for the best horse in a race is 5 : 1. What is the
probability that this horse will win?

23 / 102
Why are the odds relevant?
Let’s again rearrange the odds in the logistic regression model:
$$\mathrm{odds}(X = x_i) = \frac{p_i}{1-p_i} = \frac{P(Y_i = 1 \mid X = x_i)}{P(Y_i = 0 \mid X = x_i)} = \exp(\beta_0) \cdot \exp(\beta_1 x_{i1}) \cdot \ldots \cdot \exp(\beta_p x_{ip}).$$

⇒ We have a multiplicative model for the odds, which can help us to interpret our βs.

24 / 102
The odds ratio
To understand the effect of a regression coefficient $\beta_j$, let’s see what happens if we increase $x_{ij}$ to $x_{ij} + 1$, while all other covariates are kept fixed.
Using simple algebra and the formula on the previous slide, you will see that
$$\frac{\mathrm{odds}(X_j = x_{ij} + 1)}{\mathrm{odds}(X_j = x_{ij})} = \exp(\beta_j). \qquad (2)$$

Interpretation:
By increasing covariate $x_{ij}$ by one unit, we change the odds for $Y_i = 1$ by a factor $\exp(\beta_j)$.

Moreover:
Taking the log of equation (2), it follows that $\beta_j$ can be interpreted as a log odds-ratio.
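In R, estimated odds ratios are obtained by exponentiating the fitted coefficients. A small sketch for the glm_default fit from earlier (the 100-dollar scaling is only for illustration):

# Odds ratios from the fitted Default model (glm_default from earlier)
exp(coef(glm_default)["balance"])        # odds ratio per 1-dollar increase in balance
exp(100 * coef(glm_default)["balance"])  # odds ratio per 100-dollar increase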

25 / 102
Let’s now fit the logistic regression model for default, given balance,
income and the binary variable student as predictors:

glm_default2 = glm(default~balance + income + student, data=Default, family="binomial")

summary(glm_default2)$coef

## Estimate Std. Error z value Pr(>|z|)


## (Intercept) -1.086905e+01 4.922555e-01 -22.080088 4.911280e-108
## balance 5.736505e-03 2.318945e-04 24.737563 4.219578e-135
## income 3.033450e-06 8.202615e-06 0.369815 7.115203e-01
## studentYes -6.467758e-01 2.362525e-01 -2.737646 6.188063e-03

Questions:
• What happens with the odds to default when income increases by
10’000 dollars?
• What happens with the odds to default when balance increases
by 100 dollars?

26 / 102
Predictions
Let’s catch up. Why were we doing all this in the first place?
Answer: We wanted to build a model that predicts probabilities of categories Y, given certain covariates $X_1, \dots, X_p$.
• For given parameter estimates $\hat\beta_0, \hat\beta_1, \dots, \hat\beta_p$ and a new observation $x_0$, we can estimate the probability $\hat p(x_0)$ that the new observation belongs to the class defined by $Y = 1$:
$$\hat p(x_0) = \frac{e^{\hat\eta_0}}{1 + e^{\hat\eta_0}},$$
with linear predictor
$$\hat\eta_0 = \hat\beta_0 + \hat\beta_1 x_{01} + \dots + \hat\beta_p x_{0p}.$$

In the case of qualitative covariates, a dummy variable needs to be


introduced. This can be done as for linear regression.

27 / 102
So in the Default example, we can predict the probability that
someone defaults.
For example: “What is the estimated probability that a student
defaults with a balance of 2000, and an income of 40000?”

$$\hat p(x) = \frac{e^{\hat\beta_0 + 2000\cdot\hat\beta_1 + 40000\cdot\hat\beta_2 + 1\cdot\hat\beta_3}}{1 + e^{\hat\beta_0 + 2000\cdot\hat\beta_1 + 40000\cdot\hat\beta_2 + 1\cdot\hat\beta_3}} = 0.5196$$

Using R:
eta <- summary(glm_default2)$coef[,1] %*% c(1,2000,40000,1)
exp(eta)/(1+exp(eta))

## [,1]
## [1,] 0.5196218

(Or via the predict() function in R.)
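A sketch of the same calculation using predict(); the newdata values simply mirror the example above:

# Same prediction via predict(); values mirror the example above
new_obs <- data.frame(balance = 2000, income = 40000, student = "Yes")
predict(glm_default2, newdata = new_obs, type = "response")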

28 / 102
Example: South African heart disease data set

• SAheart data set from the ElemStatLearn package, a


retrospective sample of males in a heart-disease high-risk region in
South Africa.
• 462 observations on 10 variables.
• All subjects are male in the age range 15-64.
• 160 cases (individuals who have suffered from a coronary heart disease) and 302 controls (individuals who have not suffered from a coronary heart disease).

29 / 102
The response value (chd) and covariates
• chd : coronary heart disease {yes, no} coded by the numbers {1, 0}
• sbp : systolic blood pressure
• tobacco : cumulative tobacco (kg)
• ldl : low density lipoprotein cholesterol
• famhist : family history of heart disease. Categorical variable
with two levels: {Absent, Present}.
• obesity : a numerical value
• alcohol : current alcohol consumption
• age : age at onset

The goal is to identify important risk factors. We start by loading and


looking at the data:

30 / 102
library(ElemStatLearn)

d.heart <- SAheart

d.heart$chd <- as.factor(d.heart$chd)
d.heart <- d.heart[, c("sbp","tobacco","ldl","famhist","obesity","alcohol","age","chd")]
head(d.heart)

## sbp tobacco ldl famhist obesity alcohol age chd


## 1 160 12.00 5.73 Present 25.30 97.20 52 1
## 2 144 0.01 4.41 Absent 28.87 2.06 63 1
## 3 118 0.08 3.48 Present 29.14 3.81 46 0
## 4 170 7.50 6.41 Present 31.99 24.26 58 1
## 5 134 13.60 3.50 Present 25.99 57.34 49 1
## 6 132 6.20 6.47 Present 30.77 14.14 45 0

31 / 102
pairs() plot with Y = 1 (case) in green and Y = 0 (control) in blue:

[Figure: pairs plot of the covariates sbp, tobacco, ldl, famhist, obesity, alcohol and age]

32 / 102
Fitting the model using all predictors in R:

glm_heart = glm(chd~., data=d.heart, family="binomial")


summary(glm_heart)

##
## Call:
## glm(formula = chd ~ ., family = "binomial", data = d.heart)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1295997 0.9641558 -4.283 1.84e-05 ***
## sbp 0.0057607 0.0056326 1.023 0.30643
## tobacco 0.0795256 0.0262150 3.034 0.00242 **
## ldl 0.1847793 0.0574115 3.219 0.00129 **
## famhistPresent 0.9391855 0.2248691 4.177 2.96e-05 ***
## obesity -0.0345434 0.0291053 -1.187 0.23529
## alcohol 0.0006065 0.0044550 0.136 0.89171
## age 0.0425412 0.0101749 4.181 2.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 483.17 on 454 degrees of freedom
## AIC: 499.17
##
## Number of Fisher Scoring iterations: 4

33 / 102
• For which predictors do we have evidence that they are associated
with CHD (www.menti.com)?
• How would you now calculate p̂(X) that someone will develop
CHD, given covariates X?

R-hint: The predict() function can be used to get predicted


probabilities using
predict(glm_heart, newdata=... , type="response")
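For illustration, a sketch predicting the CHD probability for one hypothetical individual; the covariate values below are made up:

# Hypothetical new individual (covariate values are made up for illustration)
new_patient <- data.frame(sbp = 140, tobacco = 5, ldl = 4.5, famhist = "Present",
                          obesity = 28, alcohol = 10, age = 50)
predict(glm_heart, newdata = new_patient, type = "response")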

34 / 102
The Bayes classifier
• We were going for classification, but now estimated
p̂(X) = Pr(Y | X) using logistic regression. Idea: probability can
be used for classification.
• Assume that we know or can estimate the probability that a new
observation x0 belongs to class k, for K classes
C = {c1 , c2 , . . . , cK }, with elements numbered as 1, 2, ..., K

pk (x0 ) = Pr(Y = k|X = x0 ), k = 1, 2, ...K .

This is the probability that Y = k given the observation x0 .


• The Bayes classifier assigns an observation to the most likely class,
given its predictor values.
• Example for two groups {A, B}. A new observation x0 will be
classified to A if Pr(Y = A|X = x0 ) > 0.5 and to class B
otherwise.

35 / 102
Properties of the Bayes classifier

• It has the smallest test error rate.


• The class boundaries using the Bayes classifier are called the Bayes decision boundary.
• The overall Bayes error rate is given as
$$1 - E\left(\max_j \Pr(Y = j \mid X)\right),$$
where the expectation is over X.


• The Bayes error rate is comparable to the irreducible error in the
regression setting.
• Caveat: We usually don’t know the true conditional distribution
Pr(Y |X) for real data.

36 / 102
Some terminology

Training set: Independent observations {(x1 , y1 ), ..., (xn , yn )} with


qualitative response variable Y ∈ {1, 2, ..., K}, used to construct the classification rule (by estimating parameters in class densities or posterior probabilities).

Test set: Independent observations of the same format as the training


set, used to evaluate the classification rule.

Loss function: The misclassifications are given the loss 1 and the
correct classifications loss 0 - this is called 0/1-loss.

37 / 102
Training error

Training error rate: The proportion of mistakes that are made if we apply our estimator $\hat f$ to the training observations, i.e. $\hat y_i = \hat f(x_i)$:
$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat y_i),$$
with indicator function I, which is defined as
$$I(a \neq \hat a) = \begin{cases} 1 & \text{if } a \neq \hat a, \\ 0 & \text{else.} \end{cases}$$
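In R, this error rate is just the mean of the 0/1 losses; a tiny sketch with made-up class vectors:

# Misclassification rate as the mean 0/1 loss (toy vectors for illustration)
y <- c(0, 1, 1, 0, 1)      # true classes (made up)
y_hat <- c(0, 1, 0, 0, 1)  # predicted classes (made up)
mean(y != y_hat)           # here 1/5 = 0.2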

38 / 102
Test error
Test error rate: The fraction of misclassifications when our model is
applied on a test set

$$\frac{1}{n_0}\sum_{i=1}^{n_0} I(y_{0i} \neq \hat y_{0i}),$$

where the average is over all the test observations (x0 , y0 ).

Remember: a good classifier is a classifier that has a low test error


(why?).

39 / 102
The K-nearest neighbour (KNN) classifier

• We have discussed how to estimate Pr(Y |X) with logistic


regression (for two categories).

• Alternative: K-nearest neighbor (KNN) classifier estimates this


conditional distribution non-parametrically and chooses the most likely category (Bayes classifier).²

2 Attention!!
K refers to the number of neighbours used for the classifier, and not
to the number of classes!! The latter is assumed known.
40 / 102
A synthetic example

• Simulate 2 × 100 observations from a bivariate normal distribution with mean vectors $\mu_A = (1, 1)^T$, $\mu_B = (3, 3)^T$, and covariance matrix $\Sigma_A = \Sigma_B = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$.
[Figure: scatterplot of the simulated data, X1 vs. X2, colored by class A/B]

• Aim: Find a rule to classify a new observation to A or B, given


only the data points (not the knowledge about the true
parameters).
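A minimal R sketch of how such data could be simulated with MASS::mvrnorm(); the seed and object names are arbitrary choices, not taken from the slides:

# Simulate the two-class data (sketch; seed and names are arbitrary)
library(MASS)
set.seed(4268)
Sigma <- matrix(c(2, 0, 0, 2), nrow = 2)
classA <- mvrnorm(100, mu = c(1, 1), Sigma = Sigma)
classB <- mvrnorm(100, mu = c(3, 3), Sigma = Sigma)
df <- data.frame(X1 = c(classA[, 1], classB[, 1]),
                 X2 = c(classA[, 2], classB[, 2]),
                 class = factor(rep(c("A", "B"), each = 100)))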
41 / 102
The K-nearest neighbour classifier (KNN) works in the following way:
• Given a new observation x0 it searches for the K points in our
training data that are closest to it (Euclidean distance).
• These points make up the neighborhood of x0 , N0 .

• Classification is done by a majority vote: $x_0$ is classified to the most occurring class among its neighbors,
$$\Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in N_0} I(y_i = j).$$
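A minimal sketch of this majority vote using the class package (assuming the simulated data frame df from the sketch above; K = 3 and the test point are arbitrary):

# KNN majority vote with class::knn (df assumed from the simulation sketch above)
library(class)
new_point <- data.frame(X1 = 2, X2 = 2)   # hypothetical new observation
knn(train = df[, c("X1", "X2")], test = new_point,
    cl = df$class, k = 3, prob = TRUE)    # predicted class + proportion of winning votes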

42 / 102
Illustration:
[Figure: KNN class predictions and decision boundaries for k = 1, 3, 10 and 150]

• The small colored dots show the predicted classes for an evenly-spaced grid.
• The lines show the decision boundaries.

43 / 102
How to choose K?
• K = 1: the classification is made to the same class as the one
nearest neighbor.
• K large: the decision boundary tends towards a straight line
(which is the Bayes boundary in this set-up).

Discuss:
• Depending on the choice of K, when is the bias large, when is the
variance large?
• How to find the optimal K?
• Would you rather choose K even or odd? Why?

44 / 102
Bias-variance trade-off in a classification setting

Finding the optimal value of K: Test the predictive power for different K, for example by cross-validation (Module 5).

[Figure: cross-validated test error rate for KNN as a function of the number of neighbors K]

45 / 102
The curse of dimensionality
• KNN can be quite good if the number of predictors p is small and
the number of observations n is large. We need enough close
neighbors to make a good classification.
• The effectiveness of the KNN classifier falls quickly when the dimension of the predictor space is high.
• Why? Because the nearest neighbors tend to be far away in high
dimensions and the method is no longer local. This is referred to
as the curse of dimensionality.

46 / 102
Bayes decision rule - two paradigms
Two approaches to estimate Pr(Y = k | X = x):

Diagnostic paradigm
This is what we did so far: the focus was on directly estimating the posterior distribution for the classes

Pr(Y = k | X = x) .

Examples: Logistic regression. KNN classification.

Sampling paradigm
• Indirect approach: Model the conditional distribution of the predictors, $f_k(x) = \Pr(X = x \mid Y = k)$, for each class, and the prior probabilities $\pi_k = \Pr(Y = k)$.
• Then classify to the class with the maximal product $\pi_k f_k(x)$. In this paradigm, we need to model the pdf for each class.

47 / 102
Given a continuous X and categorical Y , and
• the probability density function $f_k(x) = \Pr(X = x \mid Y = k)$ for X in class k,
• the prior probability for class k, $\pi_k = \Pr(Y = k)$.
How do we get Pr(Y = k|X = x0 )? That is, how can we “flip” the
conditioning around?

Bayes theorem
$$p_k(x) = \Pr(Y = k \mid X = x) = \frac{\Pr(X = x \cap Y = k)}{f(x)} = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}. \qquad (3)$$

49 / 102
Discriminant Analysis

• Discriminant analysis is relying on the sampling paradigm.


• The approach is to model the distribution of X in each of the
classes separately, and then use Bayes theorem to flip things
around and obtain Pr(Y | X).

Example
Suppose we have observations coming from two classes:
{green, orange}

$$X_{\text{green}} \sim N(-2, 1.5^2) \quad \text{and} \quad X_{\text{orange}} \sim N(2, 1.5^2)$$

Assume the prior probabilities to be equal, $\pi_1 = \pi_2 = 0.5$.

50 / 102
We plot $\pi_k f_k(x)$ for the two classes:

[Figure: the two class densities $\pi_k f_k(x)$ for $\pi_1 = \pi_2 = 0.5$]

The decision boundary lies at the intersection point of the two curves, because there $\pi_1 f_1(x) = \pi_2 f_2(x)$.

51 / 102
For different priors $\pi_1 = 0.3$ and $\pi_2 = 0.7$, the decision boundary shifts to the left:

[Figure: the two class densities $\pi_k f_k(x)$ for $\pi_1 = 0.3$ and $\pi_2 = 0.7$]

52 / 102
Linear discriminant analysis (LDA) when p = 1

• Class-conditional distributions $f_k(x)$ are assumed normal (Gaussian) for $k = 1, \dots, K$, that is
$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma_k}\right)^2}$$
with parameters $\mu_k$ (mean) and $\sigma_k$ (standard deviation).
• With LDA we assume that all of the classes have the same standard deviation, $\sigma_k = \sigma$.
• In addition we have prior class probabilities $\pi_k = \Pr(Y = k)$, so that $\sum_{k=1}^{K} \pi_k = 1$.

53 / 102
We can insert the expression for each class distribution into the Bayes formula to obtain the posterior probability $p_k(x) = \Pr(Y = k \mid X = x)$:
$$p_k(x) = \frac{f_k(x)\pi_k}{f(x)} = \frac{\pi_k \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_l}{\sigma}\right)^2}}.$$
Our rule is to classify to the class for which $p_k(x)$ is largest.

54 / 102
Taking logs, and discarding terms that do not depend on k, we see that this is equivalent to assigning x to the class with the largest discriminant score $\delta_k(x)$:
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k).$$
• These decision boundaries between the classes are linear in x.
• For K = 2 classes and $\pi_1 = \pi_2$, the decision boundary is at
$$x = \frac{\mu_1 + \mu_2}{2}.$$
(Show this by setting $\delta_1(x) = \delta_2(x)$ and solving for x.)

55 / 102
Back to our example

$$X_{\text{green}} \sim N(-2, 1.5^2) \quad \text{and} \quad X_{\text{orange}} \sim N(2, 1.5^2)$$

[Figure: the two class densities with $\pi_1 = \pi_2 = 0.5$; the decision boundary is at x = 0]

• The Bayes decision boundary is at x = 0.
• Bayes error rate: round(pnorm(0, 2, 1.5), 2) = 0.09.
• The Bayes classifier has the lowest test error rate.

56 / 102
In the above example we knew the true class distributions and the priors $\pi_k$. But typically we don’t know these parameters, we only have
the training data.

Idea: we simply estimate the parameters and plug them into the rule.

57 / 102
Parameter estimators
• Prior probability for class k is (often) estimated by taking the fraction of observations $n_k$ (out of n) coming from class k: $\hat\pi_k = \frac{n_k}{n}$.
• The mean value for class k is simply the sample mean of all observations from class k:
$$\hat\mu_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i.$$
• “Pooled” standard deviation across all classes:
$$\hat\sigma^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i: y_i = k} (x_i - \hat\mu_k)^2 = \sum_{k=1}^{K} \frac{n_k - 1}{n-K} \cdot \hat\sigma_k^2.$$
$\hat\sigma_k$: estimated standard deviation of all observations from class k.

58 / 102
How to test the goodness of the estimator?

1. Use the training set to estimate parameters and class boundary.


2. Use the test set to estimate misclassification rate.

Simulated example:
n=1000
pi1=pi2=0.5
mu1=-2
mu2=2
sigma=1.5
set.seed(1)
n1train=rbinom(1,n,pi1)
n2train=n-n1train
n1test=rbinom(1,n,pi1)
n2test=n-n1test
train1=rnorm(n1train,mu1,sigma)
train2=rnorm(n2train,mu2,sigma)
test1=rnorm(n1test,mu1,sigma)
test2=rnorm(n2test,mu2,sigma)
var2.1=var(train1)
var2.2=var(train2)
var.pool=((n1train-1)*var2.1+(n2train-1)*var2.2)/(n-2)

59 / 102
Then set
$$\hat\delta_1(x) = \hat\delta_2(x)$$
and solve for x to obtain a decision rule (boundary).

Exercise: Verify that the following code will give you the training and
test error rates:
rule=0.5*(mean(train1)+mean(train2))+
var.pool*(log(n2train/n)-log(n1train/n))/(mean(train1)-mean(train2))

trainingError <- (sum(train1>rule)+sum(train2<rule))/n


testError <- (sum(test1>rule)+sum(test2<rule))/n

c(trainingError,testError)

## [1] 0.105 0.115

This is a rather good performance, compared to the minimal Bayes error rate. But keep in mind that the LDA classifier relies on the normality assumption, and that $\sigma_k = \sigma$ is assumed for all classes3.
3 Both of which we knew were fulfilled here.
60 / 102
The confusion matrix
• The confusion matrix is a table that can show the performance of
a classifier, given that the true values are known.
• We can make a confusion matrix from the training or test set
• The sum of the diagonal is the total number of correct classifications. The sum of all elements off the diagonal is the total number of misclassifications.

• The confusion matrix can be obtained in R by using the table


function, or directly using the caret package.
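A small sketch using the simulated training data and decision rule from the previous slides (the caret call is shown only as an alternative and assumes the package is installed):

# Confusion matrix for the simulated LDA example (objects from the code above)
pred_train <- c(ifelse(train1 < rule, 1, 2), ifelse(train2 < rule, 1, 2))
true_train <- rep(c(1, 2), times = c(n1train, n2train))
table(predicted = pred_train, true = true_train)
# alternatively: caret::confusionMatrix(factor(pred_train), factor(true_train))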
61 / 102
Multivariate LDA (p > 1)
• LDA can be generalized to situations when p > 1 covariates are
used. The decision boundaries are still linear.
• The multivariate normal distribution function:
$$f(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$

62 / 102
• Plugging this density into equation (3) gives the following expression for the discriminant function:
$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log \pi_k.$$
• Note: $\delta_k(x) = c_{k0} + c_{k1}x_1 + \dots + c_{kp}x_p$ is a linear function in $(x_1, \dots, x_p)$.

63 / 102
Back to our synthetic example
• Consider again our simulation from a bivariate normal distribution with mean vectors $\mu_A = (1, 1)^T$, $\mu_B = (3, 3)^T$, and covariance matrix $\Sigma_A = \Sigma_B = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$.
• Aim: Use LDA to classify a new observation $x_0$ to class A or B.

[Figure: scatterplot of the simulated data, X1 vs. X2, colored by class A/B]

64 / 102
• Since the truth is known here, we can calculate the Bayes
boundary and the Bayes error.
• Since we have bivariate normal class distributions with common
covariance matrix, the optimal boundary is given by LDA, with
boundary given at $\delta_A(x) = \delta_B(x)$:
$$x^T \Sigma^{-1} \mu_A - \frac{1}{2}\mu_A^T \Sigma^{-1} \mu_A + \log \pi_A = x^T \Sigma^{-1} \mu_B - \frac{1}{2}\mu_B^T \Sigma^{-1} \mu_B + \log \pi_B$$
$$x^T \Sigma^{-1} (\mu_A - \mu_B) - \frac{1}{2}\mu_A^T \Sigma^{-1} \mu_A + \frac{1}{2}\mu_B^T \Sigma^{-1} \mu_B + \log \pi_A - \log \pi_B = 0$$

Inserting numerical values gives $-x_1 - x_2 + 4 = 0$, thus a boundary with functional form
$$x_2 = 4 - x_1.$$

65 / 102
Confusion matrix for the synthetic example
We can use the Bayes boundary to find the error rate:
r.pred <- ifelse(df$X2<4-df$X1,"A","B")
table(real=df$class, r.pred)
## r.pred
## real A B
## A 82 18
## B 21 79

Of course, the Bayes boundary is usually not known, and we must


estimate it from the data.

66 / 102
Estimators for p>1:
• Prior probability for class k (unchanged from p = 1): $\hat\pi_k = \frac{n_k}{n}$.
• The mean value for class k is simply the sample mean of all observations from class k (but now these are vectors):
$$\hat\mu_k = \frac{1}{n_k}\sum_{i: y_i = k} X_i.$$
• The covariance matrices for each class:
$$\hat\Sigma_k = \frac{1}{n_k - 1}\sum_{i: y_i = k} (X_i - \hat\mu_k)(X_i - \hat\mu_k)^T$$
• Pooled version:
$$\hat\Sigma = \sum_{k=1}^{K} \frac{n_k - 1}{n-K} \cdot \hat\Sigma_k.$$

67 / 102
Analysing the synthetic example with lda()

library(MASS)  # lda() and qda() live in the MASS package
r.lda <- lda(class ~ X1 + X2, df)


r.pred <- predict(r.lda, df)$class
table(real=df$class, predicted = r.pred)
## predicted
## real A B
## A 87 13
## B 18 82

Note: The training error is smaller than for the Bayes boundary. Why?

68 / 102
Comparison of the Bayes with the estimated boundary
Solid line: Bayes boundary
Dashed line: Estimated boundary

[Figure: simulated data with the Bayes (solid) and estimated LDA (dashed) decision boundaries]

69 / 102
Posterior probabilities
• Sometimes the probability that an observation comes from a class
k is more interesting than the actual classification itself.
• These class probabilities can be estimated from the priors and
class conditional distributions, or from the discriminant functions:

ϑ̂k · 1
exp(↓ 12 (x ˆ →1
↓ µ̂k ) ! (x ↓ µ̂k ))
T
ˆ 1/2
(2ϑ)p/2 |!|
P̂ (Y = k|X = x) = ,
K
ϑ̂l (2ϑ)p/2 |!|1
exp(↓ 12 (x ↓ µ̂l T ˆ →1
) ! (x ↓ µ̂l ))
l=1 ˆ 1/2

eϱ̂k (x)
= ,K .
l=1 eϱ̂l (x)
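In R, these posterior probabilities are part of the predict() output for an LDA fit; a short sketch with the r.lda object from the synthetic example:

# Posterior class probabilities from the fitted LDA (r.lda from above)
post <- predict(r.lda, df)$posterior
head(round(post, 3))   # one row per observation, one column per class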

70 / 102
Quadratic Discriminant Analysis (QDA)

• In LDA we assumed that $\Sigma_k = \Sigma$ for all classes.
• In QDA we allow different covariance matrices $\Sigma_k$ for each class, while the predictors are still multivariate Gaussian:
$$X \sim N(\mu_k, \Sigma_k).$$
• The discriminant functions are now given by:
$$\delta_k(x) = -\frac{1}{2}(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k) - \frac{1}{2}\log|\Sigma_k| + \log \pi_k$$
$$= -\frac{1}{2}x^T \Sigma_k^{-1} x + x^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma_k^{-1} \mu_k - \frac{1}{2}\log|\Sigma_k| + \log \pi_k.$$

• These decision boundaries are quadratic functions of x.

71 / 102
LDA vs QDA
QDA is more flexible than LDA, as it allows for group-specific
covariance matrices.

Q:
• But, if the covariance matrices in theory are equal - will they not
be estimated equal?
• Should we not always prefer QDA to LDA?

72 / 102
A:

73 / 102
LDA vs QDA – Illustration
Bayes (purple dashed), LDA (black dotted) and QDA (green solid) decision boundaries for the cases where $\Sigma_1 = \Sigma_2$ (left) and $\Sigma_1 \neq \Sigma_2$ (right).

74 / 102
Example: Which type of iris species?

The iris flower data set was introduced by the British statistician and
biologist Ronald Fisher in 1936.
• Three plant species: {setosa, virginica, versicolor}.
• Four features: Sepal.Length, Sepal.Width, Petal.Length and
Petal.Width.

Figure 1: Iris plant with sepal and petal leaves


http://blog.kaggle.com/2015/04/22/scikit-learn-video-3-machine-learning-first-steps-with-the-iris-dataset/

75 / 102
Example: Classification of iris plants
We will use sepal width and sepal length to build a classifier.
We have 50 observations from each class.

attach(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa

76 / 102
[Figure: Sepal.Length vs. Sepal.Width for the iris data, colored by Species]

77 / 102
Iris: LDA
iris_lda = lda(Species~Sepal.Length+Sepal.Width, data=iris, prior=c(1,1,1)/3)

[Figure: LDA decision regions for the three iris species in the (Sepal.Width, Sepal.Length) plane]

78 / 102
Iris: QDA
iris_qda = qda(Species~Sepal.Length + Sepal.Width, data=iris, prior=c(1,1,1)/3)

[Figure: QDA decision regions for the three iris species in the (Sepal.Width, Sepal.Length) plane]

79 / 102
Iris: compare LDA and QDA
To compare the predictive performance of our two classifiers we divide
the original iris data set randomly into train and test samples of
equal size:

set.seed(1)
train = sample(1:150, 75)

iris_train = iris[train, ]
iris_test = iris[-train, ]

Run LDA and QDA on the same training set:

iris_lda2 = lda(Species~Sepal.Length + Sepal.Width,


data=iris_train,
prior=c(1,1,1)/3)

iris_qda2 = qda(Species~Sepal.Length + Sepal.Width,


data=iris_train,
prior=c(1,1,1)/3)

80 / 102
LDA training error: 14/75 = 0.19
table(predict(iris_lda2, newdata=iris_train)$class,
iris_train$Species)

##
## setosa versicolor virginica
## setosa 27 0 0
## versicolor 1 15 8
## virginica 0 5 19

LDA test error: 19/75 = 0.26.
iris_lda2_predict = predict(iris_lda2, newdata=iris_test)
table(iris_lda2_predict$class, iris$Species[-train])

##
## setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 22 11
## virginica 0 8 12

81 / 102
QDA training error: 13/75 = 0.17.
table(predict(iris_qda2, newdata=iris_train)$class, iris_train$Species)

##
## setosa versicolor virginica
## setosa 28 0 0
## versicolor 0 16 9
## virginica 0 4 18

QDA test error: 24/75 = 0.32.
iris_qda2_predict = predict(iris_qda2, newdata=iris_test)
table(iris_qda2_predict$class, iris$Species[-train])

##
## setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 18 12
## virginica 0 12 11

82 / 102
Result: The LDA classifier has given the smallest test error4 for
classifying iris plants based on sepal width and sepal length for our test
set and should be preferred in this case.

But:
1. Would another division of the data into training and test set give
the same conclusion (that LDA is better than QDA for this data
set)?
A: Not necessarily, but probably.
↑ We will look into such questions in Module 5 (cross-validation).

2. What about the other two covariates? Would adding them to the
model (4 covariates) give a better classification rule?
A: Probably. Try if you want.

4 Note that the training error is of much less interest; it could be low due to
overfitting only.
83 / 102
Different forms of discriminant analysis

• LDA
• QDA
• Naive Bayes (“Idiot’s Bayes”): Assume that each class density is
the product of marginal densities - i.e. inputs are conditionally
independent in each class
$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j).$$

This is generally not true, but it simplifies the estimation


dramatically.
• Other forms by proposing specific density models for fk (x),
including nonparametric approaches.

84 / 102
Naive Bayes
• Naive Bayes is a method that is popular when p is large.
• The original naive Bayes: univariate normal marginal distributions. Consequently
$$\delta_k(x) \propto \log\left(\pi_k \prod_{j=1}^{p} f_{kj}(x_j)\right) = -\frac{1}{2}\sum_{j=1}^{p} \frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log(\pi_k),$$
thus $\Sigma_k$ is assumed diagonal, and only the diagonal elements are estimated.
• Arbitrary generalizations can be made, for example mixed features (qualitative and quantitative predictors).
• This method often produces good results, even though the joint
pdf is not the product of the marginal pdf. This might be because
we are not focussing on estimation of class pdfs, but class
boundaries.
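As an illustration, a sketch of naive Bayes on the iris data from the example above, using the e1071 package; the package choice is an assumption and is not used elsewhere on these slides:

# Naive Bayes on the iris data (e1071 package assumed to be installed)
library(e1071)
iris_nb <- naiveBayes(Species ~ Sepal.Length + Sepal.Width, data = iris)
nb_pred <- predict(iris_nb, newdata = iris)
table(predicted = nb_pred, true = iris$Species)   # training confusion matrix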

85 / 102
Summary of Classification Methods

• Logistic regression
• KNN
• Linear discriminant analysis
• Quadratic discriminant analysis
• Naive Bayes

Remember:

• Logistic regression and KNN directly estimate Pr(Y = k | X = x)


(diagnostic paradigm).
• LDA, QDA and naive Bayes indirectly estimate $\Pr(Y = k \mid X = x) \propto f_k(x) \cdot \pi_k$ (sampling paradigm).

86 / 102
Which classification method is the best?

Advantages of discriminant analysis

• Linear discriminant analysis is more stable than logistic regression


when
• the classes are well-separated. In that case, the parameter
estimates for the logistic regression model are very unstable.
• if the number of observations n is small and the distribution of the
predictors X is approximately (multivariate) normal.
• Moreover, linear discriminant analysis is popular when we have
more than two response classes.

87 / 102
Linearity

Assume a binary classification problem with one covariate.


• Recall that logistic regression can be written:
$$\log\left(\frac{p(x)}{1-p(x)}\right) = \beta_0 + \beta_1 x.$$
• For a two-class problem, one can show that for LDA
$$\log\left(\frac{p_1(x)}{1-p_1(x)}\right) = \log\left(\frac{p_1(x)}{p_2(x)}\right) = c_0 + c_1 x_1,$$
thus the same linear form. The difference is in how the parameters are estimated.

88 / 102
LDA vs logistic regression
• In practice the results are often very similar5 , but
• LDA is “more available” in the multi-class setting.
• if the class conditional distributions are multivariate normal then
LDA (or QDA) is preferred.
• logistic regression makes no assumptions about the covariates and
is therefore to be preferred in many practical applications.
• in medicine for two-class problems logistic regression is often
preferred (for interpretability) and (always) together with ROC
and AUC (for model comparison).

and KNN?
• KNN is used when the class boundaries are non-linear.

5 Logistic regression can also fit quadratic boundaries like QDA, by explicitly including quadratic terms in the model.
89 / 102
So: Which classification method is the best?
The answer is: it depends!
• Logistic regression is very popular for classification, especially
when K = 2.
• LDA is useful when n is small, or the classes are well separated,
and Gaussian assumptions are reasonable. Also when K > 2.
• Naive Bayes is useful when p is very large.
• KNN is nonparametric, thus no assumptions about the decision
boundary nor the distribution of the variables. Expected to work
better than LDA and logistic regression when boundary is very
non-linear. Caveats:
• No interpretation of the effect of the covariates possible.
• Curse of dimensionality.

Please read yourself Section 4.5 of our coursebook (James et al. 2021).

90 / 102
Two-class problems: sensitivity, specificity

• Problems with only two classes (binary classifiers) have a special


status, e.g. in medicine or biology.
• Assume the classes (Y ) are labelled “-” (non disease, or Y = 0)
and “+” (disease, or Y = 1), and that a diagnostic test is used to
predict Y given X = x.
• Sensitivity is the proportion of correctly classified positive observations:
$$\frac{\#\text{True Positive}}{\#\text{Condition Positive}} = \frac{TP}{P}.$$
• Specificity is the proportion of correctly classified negative observations:
$$\frac{\#\text{True Negative}}{\#\text{Condition Negative}} = \frac{TN}{N}.$$

91 / 102
• We would like that a classification rule (or a diagnostic test) have
both a high sensitivity and a high specificity.

• However, in an imperfect test (i.e., an imperfect predictor) one


usually comes at the cost of the other.

A 2 × 2 table shows data from a simple diagnostic study:

                          Predicted
                    Ŷ = 0                    Ŷ = 1
True   Y = 0   True negative (TN)      False positive (FP)     N
       Y = 1   False negative (FN)     True positive (TP)      P
                    N*                       P*                Tot

92 / 102
Example Continued: South African heart disease
Go back to the multiple logistic model for the SAheart data set. We
divide the original data set randomly into a training and test dataset of
equal size:
set.seed(20)
train_ID = sample(1:nrow(d.heart), nrow(d.heart)/2)
train_SA = d.heart[train_ID, ]
test_SA = d.heart[-train_ID, ]

Fit a logistic regression model, using the training set only:


glm_SA = glm(chd~. , data=train_SA, family="binomial")
summary(glm_SA)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7674748108 1.439119254 -2.61790314 0.008847191
## sbp 0.0045154473 0.008907046 0.50695226 0.612188315
## tobacco 0.1253783872 0.043828325 2.86067029 0.004227464
## ldl 0.0484641791 0.080453775 0.60238540 0.546917627
## famhistPresent 0.8186419234 0.338067955 2.42153068 0.015455296
## obesity -0.0226110199 0.040483169 -0.55852890 0.576483276
## alcohol -0.0007383597 0.008039590 -0.09184047 0.926824793
## age 0.0446073935 0.014781896 3.01770439 0.002546972

93 / 102
• The estimated probability of a chd event (Y = 1) is then given as
$$\hat p(x) = \frac{e^{\hat\eta}}{1 + e^{\hat\eta}},$$
with $\hat\eta = \hat\beta_0 + \hat\beta_1 \cdot x_{\text{sbp}} + \hat\beta_2 \cdot x_{\text{tobacco}} + \dots + \hat\beta_7 \cdot x_{\text{age}}$.

• We are interested in the predictions for the test set.

94 / 102
• The predict function does these calculations for us. When
specifying type="response" the function returns the probabilities
for Y = 1.

probs_SA = predict(glm_SA, newdata=test_SA, type="response")

• Here we have chosen a threshold value of 0.5. By using the ifelse


function we specify that all probabilities larger than 0.5 are to be
classified as 1, while the remaining probabilities are to be classified
as 0.

95 / 102
pred_SA = ifelse(probs_SA > 0.5, 1, 0)

predictions_SA = data.frame(probs_SA, pred_SA, test_SA[,"chd"])


colnames(predictions_SA) = c("Estim. prob. of Y=1","Predicted class","True class")
head(predictions_SA)

## Estim. prob. of Y=1 Predicted class True class


## 1 0.7741048 1 1
## 4 0.7141807 1 1
## 7 0.2258178 0 0
## 11 0.5216856 1 1
## 12 0.7112496 1 1
## 15 0.6702761 1 0

96 / 102
Confusion matrix for the test set:
table(predicted=pred_SA, true=d.heart[-train_ID,"chd"])

## true
## predicted 0 1
## 0 118 45
## 1 27 41

• The logistic model has correctly classified 118+41 times, and


misclassified 27+45 times, thus
$$\text{Test error rate} = \frac{27 + 45}{27 + 45 + 118 + 41} \approx 0.31.$$
• Sensitivity: $\frac{41}{41+45} = 0.477$.
• Specificity: $\frac{118}{118+27} = 0.814$.

97 / 102
ROC curves and AUC
• The receiver operating characteristics (ROC) curve gives a graphical display of the sensitivity against (1-specificity), as the threshold value (cut-off on the probability) is moved from 0 to 1.
• An ideal classifier will give a ROC curve which hugs the top left corner (sensitivity = specificity = 1), while a straight line represents a classifier with a random guess of the outcome.
• The AUC score is the area under the ROC curve. It ranges between the values 0 and 1, where a higher value indicates a better classifier.
• AUC is useful for comparing the performance of different classifiers.
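A sketch of how such a ROC curve and its AUC could be computed with the pROC package; the package choice is an assumption, as the slides do not show which tool produced the plot on the next page:

# ROC curve and AUC for the SAheart test predictions (pROC assumed installed)
library(pROC)
roc_SA <- roc(response = test_SA$chd, predictor = probs_SA)
auc(roc_SA)                        # area under the ROC curve
plot(roc_SA, legacy.axes = TRUE)   # plots sensitivity against 1 - specificity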

98 / 102
Example Continued: South African heart disease
In order to see how our model performs for di!erent threshold values,
we can plot a ROC curve:
[Figure: ROC curve for the test set; AUC = 0.7762]

99 / 102
To check where in the plot we find the default cut-off of 0.5, we need to calculate sensitivity and specificity for this cut-off:
res=table(pred_SA, d.heart[-train_ID,"chd"])
sens=res[2,2]/sum(res[,2])
spec=res[1,1]/sum(res[,1])
sens

## [1] 0.4767442
spec

## [1] 0.8137931

Observe that the point with sensitivity 0.477 (on the y-axis) and 1-specificity 0.186 (on the x-axis) lies on our ROC curve.
The ROC curve is made up of all possible cut-offs and their associated sensitivity and specificity.

100 / 102
Further reading

• Videos on YouTube by the authors of ISL, Chapter 4.

101 / 102
References

James, G., D. Witten, T. Hastie, and R. Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. New York: Springer.

102 / 102
