Classification
1 / 102
Introduction
2 / 102
What will you learn?
3 / 102
What is classification?
4 / 102
• Suppose we have a qualitative response variable that can be a
member of one of K classes C = {c1 , c2 , . . . , cK }.
• In classification we build a function f (X) that takes a vector of
input variables X and predicts its class membership Y ∈ C.
• We also want to assess the uncertainty in this classification.
Sometimes the role of the different predictors is of main interest.
• We often build models that predict probabilities of categories,
given certain covariates X.
5 / 102
Example: The credit card data
The Default dataset is available from the ISLR package.
Aim: to predict whether an individual will default on his or her credit
card payment, given the annual income and credit card balance.
Orange: default=yes, blue: default=no.
6 / 102
General classification setup
7 / 102
What are the methods?
Three methods for classification are discussed here:
• Logistic regression
• K-nearest neighbours
• Linear and quadratic discriminant analysis
8 / 102
Linear regression for binary classification?
Suppose we have a binary outcome, for example whether a credit card
user defaults Y = yes or no, given covariates X to predict Y . We
could use dummy encoding for Y like
$$Y = \begin{cases} 0 & \text{if no}, \\ 1 & \text{if yes}. \end{cases}$$
Can we simply perform a linear regression of Y on X and classify as
yes if Ŷ > 0.5?
• In this case of a binary outcome, linear regression does a good job
as a classifier.
• However, linear regression might produce probabilities less than
zero or bigger than one.
9 / 102
Linear vs. logistic regression
Let’s anticipate a bit, to see why linear regression does not do so well. We
estimate the probability that someone defaults, given the credit card
balance as predictor:
10 / 102
Linear regression for categorical classification?
What if there are more than two possible outcomes? For example,
a medical diagnosis Y , given predictors X, can be categorized as
$$Y = \begin{cases} 1 & \text{if stroke}, \\ 2 & \text{if drug overdose}, \\ 3 & \text{if epileptic seizure}. \end{cases}$$
11 / 102
However:
• It is still possible to use linear regression for classification problems
with two classes. It is actually not even a bad idea, and works well
under some conditions. Under some standard assumptions, this
linear regression (with 0 and 1 response) will in fact give the same
classification as linear discriminant analysis (LDA).
• For categorical outcomes with more than two levels, it requires
some extra work (multivariate Y due to the dummy variable
coding).
• We leave linear regression for now.
12 / 102
Logistic regression
13 / 102
• We need a clever way to link our covariates X1 , . . . , Xp with the
probability $p_i = \Pr(Y_i = 1 \mid x_i)$. Aim: we want to relate the linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
to $p_i$. How?
• The idea is to use a so-called link function to link $p_i$ to the linear
predictor.
• In logistic regression, we use the logistic link function
$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}. \qquad (1)$$
15 / 102
Logistic regression with one covariate
• Equation (1) can be rearranged and solved for pi . Let’s look at
this for only one covariate:
$$p_i = \frac{e^{\beta_0 + \beta_1 x_i}}{1 + e^{\beta_0 + \beta_1 x_i}}.$$
16 / 102
Example: Default credit card data
library(ISLR)
data(Default)
Default$default <- as.numeric(Default$default)-1
glm_default = glm(default~balance, data=Default, family="binomial")
summary(glm_default)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance 0.005498917 0.0002203702 24.95309 1.976602e-137
17 / 102
Plotting the fitted line (in blue):
[Figure: scatter of default (0/1) against balance, with the fitted logistic curve in blue]
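A minimal sketch of how such a figure could be produced in base R (using glm_default from above; the colours and the grid of balance values are our own choices):
# scatter of the 0/1 response against balance, with the fitted probability curve
plot(Default$balance, Default$default, col = "grey", pch = 20,
     xlab = "balance", ylab = "default")
bal_grid <- seq(min(Default$balance), max(Default$balance), length.out = 200)
lines(bal_grid,
      predict(glm_default, newdata = data.frame(balance = bal_grid), type = "response"),
      col = "blue", lwd = 2)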
18 / 102
Estimating the regression coefficients with ML
19 / 102
• The maximum likelihood estimates are found by maximizing the
likelihood.
• To make the math easier, we usually work with the log-likelihood
(the log is a monotone transform, thus it will give the same result
as maximizing the likelihood).
$$\log(L(\beta)) = \ell(\beta) = \sum_{i=1}^{n}\Big[ y_i \log p_i + (1-y_i)\log(1-p_i) \Big]
= \sum_{i=1}^{n}\Big[ y_i \log\frac{p_i}{1-p_i} + \log(1-p_i) \Big]
= \sum_{i=1}^{n}\Big[ y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}) - \log\big(1 + e^{\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}}\big) \Big].$$
20 / 102
• To maximize the log-likelihood function we find the p + 1 partial
derivatives and set them equal to 0.
• This gives us a set of p + 1 non-linear equations in the βs.
• This set of equations does not have a closed form solution.
• The equation system is therefore solved numerically using the
Newton-Raphson algorithm (or Fisher Scoring).
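As an illustration, a bare-bones sketch of the Newton-Raphson update for the simple model default ~ balance (no convergence check or step-halving; the variable names are our own, and in practice glm() does this for us):
X <- cbind(1, Default$balance)          # design matrix with intercept column
y <- Default$default                    # 0/1 response
beta <- rep(0, ncol(X))                 # starting values
for (iter in 1:25) {
  p <- 1 / (1 + exp(-X %*% beta))       # current fitted probabilities
  W <- as.vector(p * (1 - p))           # weights p_i (1 - p_i)
  score <- t(X) %*% (y - p)             # gradient of the log-likelihood
  info  <- t(X) %*% (X * W)             # information matrix X^T W X
  beta  <- beta + solve(info, score)    # Newton-Raphson step
}
drop(beta)                              # compare with coef(glm_default)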
21 / 102
Qualitative interpretation of the coefficients
summary(glm_default)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.651330614 0.3611573721 -29.49221 3.623124e-191
## balance 0.005498917 0.0002203702 24.95309 1.976602e-137
With this knowledge we can construct confidence intervals and test hypotheses
about the βs, with the aim of understanding which covariate(s) contribute to our
posterior probabilities and classification.
22 / 102
Quantitative interpretation of the coefficients
Remember from equation (1) that
$$\log\Big(\frac{p_i}{1-p_i}\Big) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip},$$
thus
$$\frac{p_i}{1-p_i} = e^{\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \ldots + \beta_p x_{ip}} = e^{\eta_i}.$$
Q: Answer in mentimeter:
1. You think your football team will win tonight with a probability
p = 80%. What is the odds that it will win?
2. The odds for the best horse in a race is 5 : 1. What is the
probability that this horse will win?
23 / 102
Why is the odds relevant?
Let’s again rearrange the odds in the logistic regression model:
$$\text{odds}(X = x_i) = \frac{p_i}{1-p_i} = \frac{P(Y_i = 1 \mid X = x_i)}{P(Y_i = 0 \mid X = x_i)}$$
24 / 102
The odds ratio
To understand the effect of a regression coefficient $\beta_j$, let’s see what
happens if we increase $x_{ij}$ to $x_{ij} + 1$, while all other covariates are kept
fixed.
Using simple algebra and the formula on the previous slide, you will see
that
$$\frac{\text{odds}(X_j = x_{ij} + 1)}{\text{odds}(X_j = x_{ij})} = \exp(\beta_j). \qquad (2)$$
Interpretation:
Increasing the covariate $x_{ij}$ by one unit changes the odds for $Y_i = 1$
by a factor $\exp(\beta_j)$.
Moreover:
Taking the log of equation (2), it follows that $\beta_j$ can be interpreted as
a log odds-ratio.
25 / 102
Let’s now fit the logistic regression model for default, given balance,
income and the binary variable student as predictors:
# glm_default2: the model described above (the object name is our choice)
glm_default2 = glm(default ~ balance + income + student, data = Default, family = "binomial")
summary(glm_default2)$coef
Questions:
• What happens with the odds to default when income increases by
10’000 dollars?
• What happens with the odds to default when balance increases
by 100 dollars?
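A small sketch of how these odds changes can be computed from the fitted coefficients (assuming glm_default2 from above; the numerical answers are left for the discussion):
beta <- coef(glm_default2)
exp(beta["income"] * 10000)   # factor by which the odds change per 10'000 dollars income
exp(beta["balance"] * 100)    # factor by which the odds change per 100 dollars balance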
26 / 102
Predictions
Let’s catch up. Why were we doing all this in the first place?
Answer: We wanted to build a model that predicts probabilities of
categories Y , given certain covariates X1 , . . . , Xp .
• For given parameter estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$ and a new observation
$x_0$, we can estimate the probability $\hat p(x_0)$ that the new
observation belongs to the class defined by Y = 1:
$$\hat p(x_0) = \frac{e^{\hat\beta_0 + \hat\beta_1 x_{01} + \ldots + \hat\beta_p x_{0p}}}{1 + e^{\hat\beta_0 + \hat\beta_1 x_{01} + \ldots + \hat\beta_p x_{0p}}},$$
27 / 102
So in the Default example, we can predict the probability that
someone defaults.
For example: “What is the estimated probability that a student
defaults with a balance of 2000, and an income of 40000?”
Using R:
# beta_hat^T x0 with x0 = (1, balance = 2000, income = 40000, studentYes = 1)
eta <- summary(glm_default2)$coef[,1] %*% c(1,2000,40000,1)
exp(eta)/(1+exp(eta))
## [,1]
## [1,] 0.5196218
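Equivalently, and often more conveniently, predict() can do the same calculation (a sketch, assuming glm_default2 as fitted above; type = "response" returns the probability for Y = 1):
predict(glm_default2,
        newdata = data.frame(balance = 2000, income = 40000, student = "Yes"),
        type = "response")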
28 / 102
Example: South African heart disease data set
29 / 102
The response value (chd) and covariates
• chd : coronary heart disease {yes, no} coded by the numbers {1,
0}
• sbp : systolic blood pressure
• tobacco : cumulative tobacco (kg)
• ldl : low density lipoprotein cholesterol
• famhist : family history of heart disease. Categorical variable
with two levels: {Absent, Present}.
• obesity : a numerical value
• alcohol : current alcohol consumption
• age : age at onset
30 / 102
library(ElemStatLearn)
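The code below uses a data frame d.heart; a minimal sketch of how it could be set up (the exact column selection is our assumption, based on the covariate list above):
data(SAheart)
d.heart <- SAheart[, c("sbp", "tobacco", "ldl", "famhist", "obesity", "alcohol", "age", "chd")]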
31 / 102
pairs() plot with Y = 1 (case) in green and Y = 0 (control) in blue:
[pairs() plot of sbp, tobacco, ldl, famhist, obesity, alcohol and age]
32 / 102
Fitting the model using all predictors in R:
# the object name glm_heart is our choice; the call matches the output below
glm_heart <- glm(chd ~ ., data = d.heart, family = "binomial")
summary(glm_heart)
##
## Call:
## glm(formula = chd ~ ., family = "binomial", data = d.heart)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.1295997 0.9641558 -4.283 1.84e-05 ***
## sbp 0.0057607 0.0056326 1.023 0.30643
## tobacco 0.0795256 0.0262150 3.034 0.00242 **
## ldl 0.1847793 0.0574115 3.219 0.00129 **
## famhistPresent 0.9391855 0.2248691 4.177 2.96e-05 ***
## obesity -0.0345434 0.0291053 -1.187 0.23529
## alcohol 0.0006065 0.0044550 0.136 0.89171
## age 0.0425412 0.0101749 4.181 2.90e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 596.11 on 461 degrees of freedom
## Residual deviance: 483.17 on 454 degrees of freedom
## AIC: 499.17
##
## Number of Fisher Scoring iterations: 4
33 / 102
• For which predictors do we have evidence that they are associated
with CHD (www.menti.com)?
• How would you now calculate p̂(X) that someone will develop
CHD, given covariates X?
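One possible calculation (a sketch: glm_heart is the fit from above, and the covariate values below are made up purely for illustration):
new_person <- data.frame(sbp = 140, tobacco = 5, ldl = 4, famhist = "Present",
                         obesity = 28, alcohol = 10, age = 50)
predict(glm_heart, newdata = new_person, type = "response")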
34 / 102
The Bayes classifier
• We were aiming for classification, but have now estimated
p̂(X) = Pr(Y = 1 | X) using logistic regression. Idea: this probability can
be used for classification.
• Assume that we know or can estimate the probability that a new
observation x0 belongs to class k, for K classes
C = {c1 , c2 , . . . , cK }, with elements numbered as 1, 2, ..., K
35 / 102
Properties of the Bayes classifier
36 / 102
Some terminology
Loss function: The misclassifications are given the loss 1 and the
correct classifications loss 0 - this is called 0/1-loss.
37 / 102
Training error
Training error rate: the fraction of misclassifications when our model is
applied to the training data,
$$\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat y_i),$$
38 / 102
Test error
Test error rate: The fraction of misclassifications when our model is
applied on a test set
$$\frac{1}{n_0}\sum_{i=1}^{n_0} I(y_{0i} \neq \hat y_{0i}),$$
39 / 102
The K-nearest neighbour (KNN) classifier
Attention: here K refers to the number of neighbours used by the
classifier, not to the number of classes (the latter is assumed known).
40 / 102
A synthetic example
[Scatter plot of simulated data from two classes A and B in the (X1, X2) plane]
42 / 102
Illustration:
[Figure: KNN class predictions and decision boundaries for k = 1, k = 3, k = 10 and k = 150]
• The small colored dots show the predicted classes for an evenly-spaced grid.
• The lines show the decision boundaries.
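A small sketch of how such a KNN classification can be done in R with knn() from the class package (the simulated data mimic the synthetic set-up described later: means (1,1) and (3,3), common covariance 2I; the sample sizes, seed and grid are our own choices):
library(class)
library(MASS)
set.seed(1)
n <- 100
xA <- mvrnorm(n, mu = c(1, 1), Sigma = 2 * diag(2))
xB <- mvrnorm(n, mu = c(3, 3), Sigma = 2 * diag(2))
df <- data.frame(X1 = c(xA[, 1], xB[, 1]), X2 = c(xA[, 2], xB[, 2]),
                 class = factor(rep(c("A", "B"), each = n)))
# predict the class on an evenly spaced grid using k = 3 neighbours
grid <- expand.grid(X1 = seq(-2, 6, length.out = 50), X2 = seq(-2, 6, length.out = 50))
grid_pred <- knn(train = df[, c("X1", "X2")], test = grid, cl = df$class, k = 3)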
43 / 102
How to choose K?
• K = 1: each observation is classified to the same class as its single
nearest neighbour.
• K large: the decision boundary tends towards a straight line
(which is the Bayes boundary in this set-up).
Discuss:
• Depending on the choice of K, when is the bias large, when is the
variance large?
• How to find the optimal K?
• Would you rather choose K even or odd? Why?
44 / 102
Bias-variance trade-off in a classification setting
Finding the optimal value of K: Test the predictive power for different
K, for example by cross-validation (Module 5).
[Figure: cross-validated test error rate for KNN, plotted against the number of neighbours K (1 to 50)]
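A sketch of one way to do this, using leave-one-out cross-validation via knn.cv() from the class package (df as constructed in the KNN sketch above; the curve in the figure may be based on a different CV scheme):
library(class)
K_grid <- 1:50
cv_error <- sapply(K_grid, function(k)
  mean(knn.cv(train = df[, c("X1", "X2")], cl = df$class, k = k) != df$class))
K_grid[which.min(cv_error)]   # K with the smallest leave-one-out CV error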
45 / 102
The curse of dimensionality
• KNN can be quite good if the number of predictors p is small and
the number of observations n is large. We need enough close
neighbors to make a good classification.
• The effectiveness of the KNN classifier falls quickly when the
dimension of the predictor space is high.
• Why? Because the nearest neighbors tend to be far away in high
dimensions and the method is no longer local. This is referred to
as the curse of dimensionality.
46 / 102
Bayes decision rule - two paradigms
Two approaches to estimate Pr(Y = k | X = x):
Diagnostic paradigm
This is what we did so far: the focus was on directly estimating the
posterior distribution for the classes
Pr(Y = k | X = x) .
Sampling paradigm
• Indirect approach: Model the conditional distribution of the predictors
$f_k(x) = \Pr(X = x \mid Y = k)$ for each class, and the prior
probabilities $\pi_k = \Pr(Y = k)$.
• Then classify to the class with the maximal product $\pi_k f_k(x)$. In
this paradigm, we need to model the pdf for each class.
47 / 102
Given a continuous X and a categorical Y , and
• the probability density function $f_k(x) = \Pr(X = x \mid Y = k)$ for X
in class k,
• the prior probability for class k, $\pi_k = \Pr(Y = k)$,
how do we get Pr(Y = k | X = x0)? That is, how can we “flip” the
conditioning around?
Bayes theorem
$$p_k(x) = \Pr(Y = k \mid X = x) = \frac{\Pr(X = x \cap Y = k)}{f(x)} = \frac{f_k(x)\pi_k}{\sum_{l=1}^{K} f_l(x)\pi_l}. \qquad (3)$$
49 / 102
Discriminant Analysis
Example
Suppose we have observations coming from two classes:
{green, orange}
50 / 102
We plot $\pi_k f_k(x)$ for the two classes, with $\pi_1 = \pi_2 = 0.5$:
[Figure: the two curves $\pi_1 f_1(x)$ and $\pi_2 f_2(x)$ plotted against x]
51 / 102
For different priors $\pi_1 = 0.3$ and $\pi_2 = 0.7$, the decision boundary shifts
to the left:
[Figure: $\pi_1 f_1(x)$ and $\pi_2 f_2(x)$ with $\pi_1 = 0.3$ and $\pi_2 = 0.7$]
52 / 102
Linear discriminant analysis (LDA) when p = 1
53 / 102
We can insert the expression for each class distribution into Bayes
formula to obtain the posterior probability pk (x) = Pr(Y = k|X = x)
$$p_k(x) = \frac{f_k(x)\pi_k}{f(x)} = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_k}{\sigma}\right)^2}}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2}\left(\frac{x-\mu_l}{\sigma}\right)^2}}.$$
54 / 102
Taking logs, and discarding terms that do not depend on k, we see that
this is equivalent to assigning x to the class with the largest
discriminant score $\delta_k(x)$:
$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k).$$
• The decision boundaries between the classes are linear in x.
• For K = 2 classes and $\pi_1 = \pi_2$, the decision boundary is at
$$x = \frac{\mu_1 + \mu_2}{2}.$$
55 / 102
Back to our example
[Figure: the two class densities $\pi_1 f_1(x)$ and $\pi_2 f_2(x)$ with $\pi_1 = \pi_2 = 0.5$]
56 / 102
In the above example we knew the true class densities $f_k(x)$ and the
priors $\pi_k$. But typically we don’t know these parameters; we only have
the training data.
Idea: we simply estimate the parameters and plug them into the rule.
57 / 102
Parameter estimators
• The prior probability for class k is (often) estimated by taking the
fraction of observations $n_k$ (out of n) coming from class k:
$\hat\pi_k = \frac{n_k}{n}$.
• The mean value for class k is simply the sample mean of all
observations from class k:
$$\hat\mu_k = \frac{1}{n_k}\sum_{i: y_i = k} x_i.$$
• The variance is estimated by the pooled (across classes) sample variance:
$$\hat\sigma^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i: y_i = k}(x_i - \hat\mu_k)^2 = \sum_{k=1}^{K}\frac{n_k - 1}{n-K}\cdot\hat\sigma_k^2.$$
58 / 102
How to test the goodness of the estimator?
Simulated example:
# two classes with P(class 1) = P(class 2) = 0.5, means -2 and 2, common sd 1.5
n=1000
pi1=pi2=0.5
mu1=-2
mu2=2
sigma=1.5
set.seed(1)
# draw the class sizes, then the observations for a training and a test set
n1train=rbinom(1,n,pi1)
n2train=n-n1train
n1test=rbinom(1,n,pi1)
n2test=n-n1test
train1=rnorm(n1train,mu1,sigma)
train2=rnorm(n2train,mu2,sigma)
test1=rnorm(n1test,mu1,sigma)
test2=rnorm(n2test,mu2,sigma)
# class-wise sample variances and the pooled variance estimate
var2.1=var(train1)
var2.2=var(train2)
var.pool=((n1train-1)*var2.1+(n2train-1)*var2.2)/(n-2)
59 / 102
Then we set the decision boundary according to the estimated parameters.
Exercise: Verify that the following code will give you the training and
test error rates:
rule=0.5*(mean(train1)+mean(train2))+
var.pool*(log(n2train/n)-log(n1train/n))/(mean(train1)-mean(train2))
c(trainingError,testError)
62 / 102
• Plugging this density into equation (3) gives the following
expression for the discriminant function:
$$\delta_k(x) = x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k.$$
63 / 102
Back to our synthetic example
• Consider again our simulation from a bivariate normal distribution
with mean vectors $\mu_A = (1, 1)^T$, $\mu_B = (3, 3)^T$, and covariance
matrix $\Sigma_A = \Sigma_B = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$.
• Aim: Use LDA to classify a new observation x0 to class A or B.
[Scatter plot of the simulated classes A and B in the (X1, X2) plane]
64 / 102
• Since the truth is known here, we can calculate the Bayes
boundary and the Bayes error.
• Since we have bivariate normal class distributions with common
covariance matrix, the optimal boundary is given by LDA, with
boundary given at ϱA (x) = ϱB (x).
$$x^T\Sigma^{-1}\mu_A - \frac{1}{2}\mu_A^T\Sigma^{-1}\mu_A + \log\pi_A = x^T\Sigma^{-1}\mu_B - \frac{1}{2}\mu_B^T\Sigma^{-1}\mu_B + \log\pi_B$$
$$x^T\Sigma^{-1}(\mu_A - \mu_B) - \frac{1}{2}\mu_A^T\Sigma^{-1}\mu_A + \frac{1}{2}\mu_B^T\Sigma^{-1}\mu_B + \log\pi_A - \log\pi_B = 0$$
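Plugging in $\mu_A = (1,1)^T$, $\mu_B = (3,3)^T$, $\Sigma = 2I$ and $\pi_A = \pi_B$ gives $-x_1 - x_2 - \tfrac{1}{2} + \tfrac{9}{2} = 0$, i.e. the Bayes boundary is the line $x_2 = 4 - x_1$ (used in the code on the next slide).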
65 / 102
Confusion matrix for the synthetic example
We can use the Bayes boundary to find the error rate:
r.pred <- ifelse(df$X2<4-df$X1,"A","B")
table(real=df$class, r.pred)
## r.pred
## real A B
## A 82 18
## B 21 79
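The corresponding error rate follows directly from the table (using r.pred and df from above):
mean(r.pred != df$class)   # (18 + 21) / 200 = 0.195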
66 / 102
Estimators for p>1:
• The prior probability for class k (unchanged from p = 1): $\hat\pi_k = \frac{n_k}{n}$.
• The mean value for class k is simply the sample mean of all
observations from class k (but now these are vectors):
$$\hat\mu_k = \frac{1}{n_k}\sum_{i: y_i = k} X_i.$$
• The covariance matrix for class k is estimated by the sample covariance:
$$\hat\Sigma_k = \frac{1}{n_k - 1}\sum_{i: y_i = k}(X_i - \hat\mu_k)(X_i - \hat\mu_k)^T$$
• Pooled version:
$$\hat\Sigma = \sum_{k=1}^{K}\frac{n_k - 1}{n - K}\cdot\hat\Sigma_k.$$
67 / 102
Analysing the synthetic example with lda()
Note: The training error is smaller than for the Bayes boundary. Why?
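The lda() call and its output are not reproduced here; a minimal sketch of the analysis (assuming the simulated data frame df with columns X1, X2 and class, as in the KNN sketch earlier):
library(MASS)
lda_fit <- lda(class ~ X1 + X2, data = df)
lda_pred <- predict(lda_fit, df)$class
table(real = df$class, predicted = lda_pred)   # training confusion matrix
mean(lda_pred != df$class)                     # training error rate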
68 / 102
Comparison of the Bayes with the estimated boundary
Solid line: Bayes boundary
Dashed line: Estimated boundary
[Figure: classes A and B in the (X1, X2) plane with the Bayes boundary (solid) and the estimated LDA boundary (dashed)]
69 / 102
Posterior probabilities
• Sometimes the probability that an observation comes from a class
k is more interesting than the actual classification itself.
• These class probabilities can be estimated from the priors and
class conditional distributions, or from the discriminant functions:
$$\hat P(Y = k \mid X = x) = \frac{\hat\pi_k \frac{1}{(2\pi)^{p/2}|\hat\Sigma|^{1/2}}\exp\!\big(-\frac{1}{2}(x-\hat\mu_k)^T\hat\Sigma^{-1}(x-\hat\mu_k)\big)}{\sum_{l=1}^{K}\hat\pi_l \frac{1}{(2\pi)^{p/2}|\hat\Sigma|^{1/2}}\exp\!\big(-\frac{1}{2}(x-\hat\mu_l)^T\hat\Sigma^{-1}(x-\hat\mu_l)\big)} = \frac{e^{\hat\delta_k(x)}}{\sum_{l=1}^{K} e^{\hat\delta_l(x)}}.$$
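In R, these posterior probabilities are returned directly by predict(); a sketch using the LDA fit lda_fit from the synthetic-example sketch above (the point (2, 2) is chosen arbitrarily):
predict(lda_fit, newdata = data.frame(X1 = 2, X2 = 2))$posterior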
70 / 102
Quadratic Discriminant Analysis (QDA)
In QDA, the observations from class k are assumed to follow
$$X \sim N(\mu_k, \Sigma_k),$$
i.e. each class has its own covariance matrix.
71 / 102
LDA vs QDA
QDA is more flexible than LDA, as it allows for group-specific
covariance matrices.
Q:
• But, if the covariance matrices in theory are equal - will they not
be estimated equal?
• Should we not always prefer QDA to LDA?
72 / 102
A:
73 / 102
LDA vs QDA – Illustration
Bayes (purple dashed), LDA (black dotted) and QDA (green solid)
decision boundaries for the cases where $\Sigma_1 = \Sigma_2$ (left) and $\Sigma_1 \neq \Sigma_2$
(right).
74 / 102
Example: Which type of iris species?
The iris flower data set was introduced by the British statistician and
biologist Ronald Fisher in 1936.
• Three plant species: {setosa, virginica, versicolor}.
• Four features: Sepal.Length, Sepal.Width, Petal.Length and
Petal.Width.
75 / 102
Example: Classification of iris plants
We will use sepal width and sepal length to build a classifier.
We have 50 observations from each class.
attach(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
76 / 102
[Scatter plot of Sepal.Length against Sepal.Width, coloured by Species]
77 / 102
Iris: LDA
library(MASS)
iris_lda = lda(Species~Sepal.Length+Sepal.Width, data=iris, prior=c(1,1,1)/3)
[Figure: LDA decision regions for the three species in the (Sepal.Width, Sepal.Length) plane]
78 / 102
Iris: QDA
iris_qda = qda(Species~Sepal.Length + Sepal.Width, data=iris, prior=c(1,1,1)/3)
[Figure: QDA decision regions for the three species in the (Sepal.Width, Sepal.Length) plane]
79 / 102
Iris: compare LDA and QDA
To compare the predictive performance of our two classifiers we divide
the original iris data set randomly into train and test samples of
equal size:
set.seed(1)
train = sample(1:150, 75)
iris_train = iris[train, ]
iris_test = iris[-train, ]
80 / 102
LDA training error: 14/75 = 0.19
# iris_lda2: the LDA model refitted on the training data only (the original call is not
# shown in the slides; this mirrors the fit of iris_lda above)
iris_lda2 = lda(Species~Sepal.Length+Sepal.Width, data=iris_train, prior=c(1,1,1)/3)
table(predict(iris_lda2, newdata=iris_train)$class,
      iris_train$Species)
##
## setosa versicolor virginica
## setosa 27 0 0
## versicolor 1 15 8
## virginica 0 5 19
LDA test error: 19/75 = 0.26.
iris_lda2_predict = predict(iris_lda2, newdata=iris_test)
table(iris_lda2_predict$class, iris$Species[-train])
##
## setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 22 11
## virginica 0 8 12
81 / 102
QDA training error: 13/75 = 0.17.
# iris_qda2: the QDA model refitted on the training data only (again, the original call
# is not shown in the slides)
iris_qda2 = qda(Species~Sepal.Length+Sepal.Width, data=iris_train, prior=c(1,1,1)/3)
table(predict(iris_qda2, newdata=iris_train)$class, iris_train$Species)
##
## setosa versicolor virginica
## setosa 28 0 0
## versicolor 0 16 9
## virginica 0 4 18
QDA test error: 24/75 = 0.32.
iris_qda2_predict = predict(iris_qda2, newdata=iris_test)
table(iris_qda2_predict$class, iris$Species[-train])
##
## setosa versicolor virginica
## setosa 22 0 0
## versicolor 0 18 12
## virginica 0 12 11
82 / 102
Result: The LDA classifier has given the smallest test error4 for
classifying iris plants based on sepal width and sepal length for our test
set and should be preferred in this case.
But:
1. Would another division of the data into training and test set give
the same conclusion (that LDA is better than QDA for this data
set)?
A: Not necessarily, but probably.
→ We will look into such questions in Module 5 (cross-validation).
2. What about the other two covariates? Would adding them to the
model (4 covariates) give a better classification rule?
A: Probably. Try if you want.
4 Note that the training error is of much less interest; it could be low due to
overfitting only.
83 / 102
Different forms of discriminant analysis
• LDA
• QDA
• Naive Bayes (“Idiot’s Bayes”): Assume that each class density is
the product of marginal densities - i.e. inputs are conditionally
independent in each class
$$f_k(x) = \prod_{j=1}^{p} f_{kj}(x_j).$$
84 / 102
Naive Bayes
• Naive Bayes is a method that is popular when p is large.
• The original naive Bayes: univariate normal marginal
distributions. Consequently
$$\delta_k(x) \propto \log\Big(\pi_k \prod_{j=1}^{p} f_{kj}(x_j)\Big) = -\frac{1}{2}\sum_{j=1}^{p}\frac{(x_j - \mu_{kj})^2}{\sigma_{kj}^2} + \log(\pi_k),$$
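For illustration, a sketch of a naive Bayes classifier in R via naiveBayes() from the e1071 package, applied to the iris training/test split from before (the package choice is ours and not part of the original slides):
library(e1071)
nb_fit <- naiveBayes(Species ~ Sepal.Length + Sepal.Width, data = iris_train)
nb_pred <- predict(nb_fit, newdata = iris_test)
table(nb_pred, iris_test$Species)    # test confusion matrix
mean(nb_pred != iris_test$Species)   # test error rate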
85 / 102
Summary of Classification Methods
• Logistic regression
• KNN
• Linear discriminant analysis
• Quadratic discriminant analysis
• Naive Bayes
Remember:
86 / 102
Which classification method is the best?
87 / 102
Linearity
Both logistic regression and LDA lead to log-odds that are linear in x,
thus the same linear form. The difference is in how the parameters
are estimated.
88 / 102
LDA vs logistic regression
• In practice the results are often very similar5 , but
• LDA is “more available” in the multi-class setting.
• if the class conditional distributions are multivariate normal then
LDA (or QDA) is preferred.
• logistic regression makes no assumptions about the distribution of the
covariates and is therefore preferred in many practical applications.
• in medicine for two-class problems logistic regression is often
preferred (for interpretability) and (always) together with ROC
and AUC (for model comparison).
and KNN?
• KNN is used when the class boundaries are non-linear.
5 Logistic regression can also fit quadratic boundaries like QDA, by explicitly
including quadratic terms in the model.
89 / 102
So: Which classification method is the best?
The answer is: it depends!
• Logistic regression is very popular for classification, especially
when K = 2.
• LDA is useful when n is small, or the classes are well separated,
and Gaussian assumptions are reasonable. Also when K > 2.
• Naive Bayes is useful when p is very large.
• KNN is nonparametric, thus no assumptions about the decision
boundary nor the distribution of the variables. Expected to work
better than LDA and logistic regression when boundary is very
non-linear. Caveats:
• No interpretation of the effect of the covariates is possible.
• Curse of dimensionality.
Please read yourself Section 4.5 of our coursebook (James et al. 2021).
90 / 102
Two-class problems: sensitivity, specificity
91 / 102
• We would like a classification rule (or a diagnostic test) to have both a
high sensitivity (the proportion of true positives correctly classified) and a
high specificity (the proportion of true negatives correctly classified).
92 / 102
Example Continued: South African heart disease
Go back to the multiple logistic model for the SAheart data set. We
divide the original data set randomly into a training and test dataset of
equal size:
set.seed(20)
train_ID = sample(1:nrow(d.heart), nrow(d.heart)/2)
train_SA = d.heart[train_ID, ]
test_SA = d.heart[-train_ID, ]
93 / 102
• The estimated probability of a chd event (Y = 1) is then given as
$$\hat p(X) = \frac{e^{\eta}}{1 + e^{\eta}},$$
with $\eta = \beta_0 + \beta_1 \cdot x_{\text{sbp}} + \beta_2 \cdot x_{\text{tobacco}} + \ldots + \beta_7 \cdot x_{\text{age}}$.
94 / 102
• The predict function does these calculations for us. When
specifying type="response" the function returns the probabilities
for Y = 1.
95 / 102
# glm_SA: the model refitted on the training half (the object names are our choice)
glm_SA = glm(chd ~ ., data = train_SA, family = "binomial")
probs_SA = predict(glm_SA, newdata = test_SA, type = "response")
pred_SA = ifelse(probs_SA > 0.5, 1, 0)
96 / 102
Confusion matrix for the test set:
table(predicted=pred_SA, true=d.heart[-train_ID,"chd"])
## true
## predicted 0 1
## 0 118 45
## 1 27 41
• Sensitivity: 41/(41 + 45) = 0.477
• Specificity: 118/(118 + 27) = 0.814.
97 / 102
ROC curves and AUC
• The receiver operating characteristics (ROC) curve gives a
graphical display of the sensitivity against (1 - specificity), as the
threshold value (cut-off on the probability) is moved from 0 to 1.
• An ideal classifier will give a ROC curve which hugs the top left
corner (sensitivity = specificity = 1), while the diagonal line
corresponds to a classifier that guesses the outcome at random.
• The AUC score is the area under the ROC curve. It ranges
between 0 and 1, where a higher value indicates a better
classifier.
• AUC is useful for comparing the performance of different
classifiers.
98 / 102
Example Continued: South African heart disease
In order to see how our model performs for different threshold values,
we can plot a ROC curve:
[Figure: ROC curve (sensitivity against 1 - specificity) for the test set; AUC = 0.7762]
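A sketch of how such a curve and the AUC can be computed, here with the pROC package (using probs_SA, the predicted test-set probabilities from above; the package choice is ours):
library(pROC)
roc_SA <- roc(response = test_SA$chd, predictor = probs_SA)
auc(roc_SA)      # area under the ROC curve
plot(roc_SA)     # the ROC curve itself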
99 / 102
To check where in the plot we find the default cut-off of 0.5, we need
to calculate the sensitivity and specificity for this cut-off:
res=table(pred_SA, d.heart[-train_ID,"chd"])
sens=res[2,2]/sum(res[,2])
spec=res[1,1]/sum(res[,1])
sens
## [1] 0.4767442
spec
## [1] 0.8137931
Observe that the point with 0.477 on the y-axis and 0.186 (1 - specificity)
on the x-axis lies on our ROC curve.
The ROC curve is made up of all possible cut-offs and their associated
sensitivity and specificity.
100 / 102
Further reading
101 / 102
References
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2021). An Introduction to Statistical Learning, with Applications in R (2nd ed.). Springer.
102 / 102