Logistic Regression
Outline
1. The classification problem
2. Why not linear regression?
3. Logistic regression formulation
4. Logistic regression cost function
5. Worked examples
The classification problem
The linear regression model discussed in the previous lesson assumes that the
response variable 𝑦 is quantitative (metrical)
• in many situations, the response variable is instead qualitative (categorical)
Qualitative variables take values in an unordered set 𝒞 = {c₁, …, c_K}, such as:
• eye color ∈ {brown, blue, green}
• email ∈ {spam, not spam}

Metric data:
• Describe a quantity
• An ordering is defined
• A distance is defined

Categorical data:
• Describe membership in categories
• It is not meaningful to apply an ordering
• It is not meaningful to compute distances
The classification problem
The process of estimating categorical outcomes using a set of regressors 𝝋 is called
classification
Estimating a categorical response for an observation 𝝋 can be referred to as classifying
that observation, since it involves assigning the observation to a category, or class
Often we are more interested in estimating the probabilities that 𝝋 belongs to each
category in 𝒞
The most probable category is then chosen as the class for the observation 𝝋
Examples of classification problems
• A person arrives at the emergency room with a set of symptoms that could possibly
be attributed to one of three medical conditions
Which of the three conditions does the individual have?
• An online banking system manages transactions, storing the user's IP address, past transaction history, and so forth
Is the transaction fraudulent or not?
• A biologist collects DNA sequence data for a number of patients with and without
a given disease
Which of the DNA mutations are deleterious (disease-causing) and which are not?
Example: cat vs dog classification
Suppose that we measure the weight and height of some dogs and cats

We want to learn the classifier function 𝑓(⋅) that can tell us if a given input vector 𝝋 = (𝜑₁, 𝜑₂) is a dog or a cat
• 𝜑₁: weight [kg]
• 𝜑₂: height [cm]

[Figure: cats and dogs plotted in the (weight, height) plane, separated by the classifier function 𝑓(⋅)]

QUIZ: The point in the figure is classified by the model as a ?
The classification problem
QUIZ: Consider a company that produces sliding gates. The gates can have four weights: {300 kg, 400 kg, 500 kg, 600 kg}. We want to detect the weight of the gate. This is a:
☐ A regression problem
☐ A classification problem
☐ Both a regression and a classification problem
Why not linear regression?
Suppose that we are trying to estimate the medical condition of a patient in the
emergency room based on her symptoms
There are three possibilities: stroke, drug overdose and epileptic seizure
We could consider encoding these values as a quantitative response variable, 𝑦, as
𝑦 = 1 if stroke,
    2 if drug overdose,
    3 if epileptic seizure

However, we are implicitly saying that the «difference» between drug overdose and stroke is the same as the «difference» between epileptic seizure and drug overdose, which does not make much sense
Why not linear regression?
We can also change the encoding to

𝑦 = 1 if epileptic seizure,
    2 if stroke,
    3 if drug overdose

This would imply a totally different relationship among the three conditions
• each of these codings would produce fundamentally different linear models…
• …that would ultimately lead to different sets of estimates on test observations

In general, there is no natural way to convert a qualitative response variable with more than two levels into a quantitative response that is ready for linear regression
Why not linear regression?
With two levels, the situation is better. For instance, perhaps there are only two
possibilities for the patient’s medical condition: stroke and drug overdose
𝑦 = 0 if stroke,
    1 if drug overdose

We can fit a linear regression to this binary response, and classify as drug overdose if 𝑦̂ > 0.5 and stroke otherwise, interpreting 𝑦̂ as the probability of drug overdose

However, if we use linear regression, some of our estimates might be outside the [0, 1] interval, which does not make sense as a probability: there is nothing that "saturates" the output between 0 and 1
Logistic regression
Purpose: Estimate the probability that a set of input regressors 𝝋 ∈ ℝ^(d×1) belongs to one of two classes 𝑦 ∈ {0, 1}

Define the linear combination

𝑎 = Σ_{j=0}^{d−1} 𝜑_j ⋅ 𝜃_j = 𝝋ᵀ𝜽

The function 𝑠(𝑎) is the logistic function (sigmoid):

𝑠(𝑎) = 1 / (1 + e^(−𝑎)) = e^𝑎 / (1 + e^𝑎)

• 𝑎 ≫ 0 ⇒ 𝑠(𝑎) ≈ 1
• 𝑎 ≪ 0 ⇒ 𝑠(𝑎) ≈ 0

[Figure: the logistic (sigmoid) function, rising from 0 through 0.5 at 𝑎 = 0 towards 1]
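As a concrete illustration, the logistic function and its saturation behaviour can be sketched in a few lines of Python (a minimal sketch, using only the standard library):

```python
import math

def sigmoid(a):
    """Logistic (sigmoid) function s(a) = 1 / (1 + e^(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

# Saturation behaviour: large positive a gives s(a) close to 1,
# large negative a gives s(a) close to 0, and s(0) = 0.5.
print(sigmoid(0.0))    # 0.5
print(sigmoid(10.0))   # close to 1
print(sigmoid(-10.0))  # close to 0
```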
Logistic regression
Purpose: Estimate the probability that a set of input regressors 𝝋 ∈ ℝ^(d×1) belongs to one of two classes 𝑦 ∈ {0, 1}

𝑃(𝑦 = 1 | 𝝋) = 𝑠(𝑎) = 𝑠(𝝋ᵀ𝜽) = 1 / (1 + e^(−𝝋ᵀ𝜽))

The output of 𝑠(𝝋ᵀ𝜽) is interpreted as a probability:
• 𝝋ᵀ𝜽 ≫ 0 ⇒ 𝑠(𝝋ᵀ𝜽) ≫ 0.5 ⇒ 𝑃(𝑦 = 1 | 𝝋) ≈ 1 ⇒ 𝝋 is classified to class 1
• 𝝋ᵀ𝜽 ≪ 0 ⇒ 𝑠(𝝋ᵀ𝜽) ≪ 0.5 ⇒ 𝑃(𝑦 = 1 | 𝝋) ≈ 0 ⇒ 𝝋 is classified to class 0
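A minimal Python sketch of this classification rule (the parameter values below are hypothetical, chosen only to illustrate the rule):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(phi, theta):
    """Return (probability of class 1, predicted class) for regressors phi.

    phi and theta are plain lists; phi[0] is the intercept regressor (1.0).
    """
    a = sum(p * t for p, t in zip(phi, theta))  # a = phi^T theta
    prob = sigmoid(a)
    return prob, 1 if prob >= 0.5 else 0

# Hypothetical parameters, for illustration only
theta = [-1.0, 2.0]
print(predict([1.0, 1.5], theta))  # a = 2 > 0  -> class 1
print(predict([1.0, 0.0], theta))  # a = -1 < 0 -> class 0
```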
Logistic regression cost function
Suppose we have at our disposal a dataset 𝒟 = {(𝝋(1), 𝑦(1)), …, (𝝋(𝑁), 𝑦(𝑁))}, where 𝝋(𝑖) ∈ ℝ^(d×1) and 𝑦(𝑖) ∈ {0, 1}, 𝑖 = 1, …, 𝑁, i.i.d.

Estimate a logistic regression model 𝑃(𝑦(𝑖) = 1 | 𝝋(𝑖)) = 1 / (1 + e^(−𝝋(𝑖)ᵀ𝜽)) ≡ 𝜋(𝑖)

The logistic regression cost function 𝐽(𝜽) is defined as:

𝐽(𝜽) = − Σ_{𝑖=1}^{𝑁} [ 𝑦(𝑖) ⋅ ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ⋅ ln(1 − 𝜋(𝑖)) ]
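The cost function can be evaluated directly from this definition. A minimal Python sketch on a toy dataset (the data values are made up for illustration):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cost(Phi, y, theta):
    """Cross-entropy cost J(theta) = -sum_i [y ln(pi) + (1-y) ln(1-pi)]."""
    J = 0.0
    for phi_i, y_i in zip(Phi, y):
        a = sum(p * t for p, t in zip(phi_i, theta))
        pi = sigmoid(a)
        J -= y_i * math.log(pi) + (1 - y_i) * math.log(1 - pi)
    return J

# Toy dataset: each row is [1 (intercept), regressor]; labels in {0, 1}
Phi = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]
# At theta = 0 every pi(i) = 0.5, so each term contributes ln 2
print(cost(Phi, y, [0.0, 0.0]))  # 3*ln(2), about 2.079
```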
Logistic regression cost function
QUIZ: In the logistic regression cost function, where are the parameters 𝜽 that we
want to estimate?
𝐽(𝜽) = − Σ_{𝑖=1}^{𝑁} [ 𝑦(𝑖) ⋅ ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ⋅ ln(1 − 𝜋(𝑖)) ]

☐ In the 𝑦(𝑖) terms
☐ In the ln terms
☐ In the 𝜋(𝑖) terms
Logistic regression cost function
Cost function interpretation
Suppose there is only one datum, 𝒟 = {(𝝋, 𝑦)}

⇒ 𝐽(𝜽) = −ln 𝜋 if 𝑦 = 1,
         −ln(1 − 𝜋) if 𝑦 = 0

Case 𝑦 = 1: 𝐽(𝜽) = −ln 𝜋
• 𝐽(𝜽) ≈ 0 if 𝑦 = 1 and 𝜋 ≈ 1
• 𝐽(𝜽) ≈ +∞ if 𝑦 = 1 and 𝜋 ≈ 0
Logistic regression cost function
Cost function interpretation
Suppose there is only one datum, 𝒟 = {(𝝋, 𝑦)}

⇒ 𝐽(𝜽) = −ln 𝜋 if 𝑦 = 1,
         −ln(1 − 𝜋) if 𝑦 = 0

Case 𝑦 = 0: 𝐽(𝜽) = −ln(1 − 𝜋)
• 𝐽(𝜽) ≈ 0 if 𝑦 = 0 and 𝜋 ≈ 0
• 𝐽(𝜽) ≈ +∞ if 𝑦 = 0 and 𝜋 ≈ 1
IN-DEPTH ANALYSIS
Computation of the minimum of 𝐽(𝜽)
We have to compute the gradient of 𝐽(𝜽) with respect to 𝜽 ∈ ℝ^(d×1). First, compute the derivative of 𝑠(𝑎) = 1 / (1 + e^(−𝑎)):

∂𝑠(𝑎)/∂𝑎 = ∂/∂𝑎 [ (1 + e^(−𝑎))^(−1) ] = −(1 + e^(−𝑎))^(−2) ⋅ (−e^(−𝑎)) = e^(−𝑎) / (1 + e^(−𝑎))²
= 1/(1 + e^(−𝑎)) ⋅ e^(−𝑎)/(1 + e^(−𝑎)) = 1/(1 + e^(−𝑎)) ⋅ (1 + e^(−𝑎) − 1)/(1 + e^(−𝑎))
= 1/(1 + e^(−𝑎)) ⋅ [ 1 − 1/(1 + e^(−𝑎)) ] = 𝑠(𝑎) ⋅ (1 − 𝑠(𝑎))

In the case where 𝑎 = 𝝋ᵀ𝜽, we have that

∇𝜽 𝑠(𝝋ᵀ𝜽) = 𝝋 ⋅ 𝑠(𝝋ᵀ𝜽) ⋅ (1 − 𝑠(𝝋ᵀ𝜽)) = 𝝋 ⋅ 𝜋 ⋅ (1 − 𝜋)

(dimensions: 𝝋 is d×1, the two scalar factors are 1×1, so the gradient is d×1)
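The identity s′(a) = s(a)(1 − s(a)) derived above can be checked numerically against a finite-difference approximation; a minimal Python sketch:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_derivative(a):
    """Closed form derived above: s'(a) = s(a) * (1 - s(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

# Check the closed form against a central finite difference
h = 1e-6
for a in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
    assert abs(numeric - sigmoid_derivative(a)) < 1e-8
print("s'(a) = s(a)(1 - s(a)) verified numerically")
```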
IN-DEPTH ANALYSIS
Computation of the minimum of 𝐽(𝜽)
We can now compute the gradient of 𝐽(𝜽), where 𝜋(𝑖) = 1 / (1 + e^(−𝝋(𝑖)ᵀ𝜽)):

𝐽(𝜽) = − Σ_{𝑖=1}^{𝑁} [ 𝑦(𝑖) ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ln(1 − 𝜋(𝑖)) ]

Using ∇𝜽 𝜋(𝑖) = 𝝋(𝑖) 𝜋(𝑖) (1 − 𝜋(𝑖)):

∇𝜽 𝐽(𝜽) = − Σ_{𝑖=1}^{𝑁} [ 𝑦(𝑖) ⋅ 𝝋(𝑖) 𝜋(𝑖)(1 − 𝜋(𝑖)) / 𝜋(𝑖) − (1 − 𝑦(𝑖)) ⋅ 𝝋(𝑖) 𝜋(𝑖)(1 − 𝜋(𝑖)) / (1 − 𝜋(𝑖)) ]
= Σ_{𝑖=1}^{𝑁} [ −𝑦(𝑖) 𝝋(𝑖) (1 − 𝜋(𝑖)) + (1 − 𝑦(𝑖)) 𝝋(𝑖) 𝜋(𝑖) ]
= Σ_{𝑖=1}^{𝑁} 𝝋(𝑖) ⋅ [ −𝑦(𝑖) + 𝑦(𝑖) 𝜋(𝑖) + 𝜋(𝑖) − 𝑦(𝑖) 𝜋(𝑖) ]
= Σ_{𝑖=1}^{𝑁} 𝝋(𝑖) ⋅ (𝜋(𝑖) − 𝑦(𝑖))

(a d×1 vector: 𝝋(𝑖) is d×1 and (𝜋(𝑖) − 𝑦(𝑖)) is a scalar)
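The closed-form gradient Σᵢ 𝝋(i)(π(i) − y(i)) can likewise be verified against finite differences of J(𝜽); a minimal Python sketch on a made-up toy dataset:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cost(Phi, y, theta):
    J = 0.0
    for phi_i, y_i in zip(Phi, y):
        pi = sigmoid(sum(p * t for p, t in zip(phi_i, theta)))
        J -= y_i * math.log(pi) + (1 - y_i) * math.log(1 - pi)
    return J

def gradient(Phi, y, theta):
    """Closed form derived above: grad J = sum_i phi(i) * (pi(i) - y(i))."""
    g = [0.0] * len(theta)
    for phi_i, y_i in zip(Phi, y):
        pi = sigmoid(sum(p * t for p, t in zip(phi_i, theta)))
        for j in range(len(theta)):
            g[j] += phi_i[j] * (pi - y_i)
    return g

# Check against central finite differences of J
Phi = [[1.0, 0.5], [1.0, 2.0], [1.0, -1.0]]
y = [0, 1, 0]
theta = [0.3, -0.7]
h = 1e-6
for j in range(len(theta)):
    tp = list(theta); tp[j] += h
    tm = list(theta); tm[j] -= h
    numeric = (cost(Phi, y, tp) - cost(Phi, y, tm)) / (2 * h)
    assert abs(numeric - gradient(Phi, y, theta)[j]) < 1e-6
print("gradient formula verified numerically")
```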
Gradient descent
It can be shown that:
• The cost function 𝐽(𝜽) is convex and admits a unique minimum
• The equations found by setting ∇𝜽𝐽(𝜽) = 𝟎 are nonlinear in 𝜽, and it is not possible to find a solution in closed form

✓ For this reason, we need to resort to iterative optimization algorithms

Use gradient descent:

𝜽(𝑘+1) = 𝜽(𝑘) − 𝛼 ⋅ ∇𝐽(𝜽)|𝜽=𝜽(𝑘),    𝛼 > 0: learning rate

(𝜽(𝑘+1), 𝜽(𝑘) and the gradient are d×1 vectors; 𝛼 is a scalar)
Gradient descent
𝐽(𝜽) = − Σ_{𝑖=1}^{𝑁} [ 𝑦(𝑖) ⋅ ln 𝜋(𝑖) + (1 − 𝑦(𝑖)) ⋅ ln(1 − 𝜋(𝑖)) ]

Repeat {
  𝜃₀ = 𝜃₀ − 𝛼 ⋅ Σ_{𝑖=1}^{𝑁} (𝜋(𝑖) − 𝑦(𝑖))
  𝜃₁ = 𝜃₁ − 𝛼 ⋅ Σ_{𝑖=1}^{𝑁} (𝜋(𝑖) − 𝑦(𝑖)) ⋅ 𝜑₁(𝑖)
  ⋮
  𝜃_{d−1} = 𝜃_{d−1} − 𝛼 ⋅ Σ_{𝑖=1}^{𝑁} (𝜋(𝑖) − 𝑦(𝑖)) ⋅ 𝜑_{d−1}(𝑖)
}
Logistic regression recap
The logistic regression model, despite its name, is not used for regression, but for
classification
Once the model estimates the probability of a class, we can classify a point to a particular
class if the probability for that class is above a threshold (usually 0.5)
The function that we are now trying to estimate is: 𝑓(𝝋) = 𝑃(𝑦 = 1 | 𝝋)

Logistic regression tries to model 𝑓 by using:

𝑠(𝝋ᵀ𝜽) = 1 / (1 + e^(−𝝋ᵀ𝜽))

The point 𝝋 can then be classified to class 𝑦 = 1 if 𝑠(𝝋ᵀ𝜽) ≥ 0.5
Logistic regression recap
The classification boundary found by logistic regression is linear

In fact, classifying with the rule
𝑦 = 1 if 𝑠(𝝋ᵀ𝜽) ≥ 0.5
is the same as saying
𝑦 = 1 if 𝝋ᵀ𝜽 ≥ 0

[Figure: a linear classifier separating cats and dogs in the (weight [kg], height [cm]) plane]
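The equivalence of the two rules follows from s being monotonically increasing with s(0) = 0.5; a quick numerical check in Python:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# The two rules agree at every test point: s is monotonic and s(0) = 0.5,
# so s(phi^T theta) >= 0.5 exactly when phi^T theta >= 0.
for a in [-5.0, -0.1, 0.0, 0.1, 5.0]:
    assert (sigmoid(a) >= 0.5) == (a >= 0)
print("s(a) >= 0.5  <=>  a >= 0")
```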
Students admissions classification

We want to estimate if a student will get admitted to a university given the results on two exams (𝜑₁, 𝜑₂)
• The training set consists of 𝑁 = 100 students with 𝜑₁(𝑖), 𝜑₂(𝑖) and 𝑦(𝑖) ∈ {0, 1}, for 𝑖 = 1, …, 𝑁
• Φ ∈ ℝ^(100×d)
• 𝒚 ∈ ℝ^(100×1)
• 𝜽 ∈ ℝ^(d×1)

% Read data from file
data = load('studentsdata.csv');
Phi = data(:, [1, 2]); y = data(:, 3);
% Set up the data matrix, adding a column of ones for the intercept term
[N, m] = size(Phi); d = m + 1;
Phi = [ones(N, 1) Phi];
% Initialize fitting parameters
theta = zeros(d, 1);
% Cost and gradient for the current theta
pi_s = sigmoid(Phi*theta);
J = -y'*log(pi_s) - (1-y)'*log(1-pi_s);
grad = Phi'*(pi_s - y);

Embed the cost and gradient in a function and pass it to an optimization algorithm that iteratively computes the gradient
The Framingham Heart Study

In the late 1940s, the U.S. Government set out to better understand cardiovascular disease

Plan: track a large cohort of initially healthy patients over time

The city of Framingham (MA) was selected as the site for the study in 1948:
• Appropriate size
• Stable population
• Cooperative doctors and residents

A total of 5209 patients aged 30-59 were enrolled. They had to take a survey and an exam every 2 years:
• Physical characteristics and behavioral characteristics
• Test results
The Framingham Heart Study
We will build models using the Framingham data to estimate and prevent heart disease
We will estimate the 10-year risk of Coronary Heart Disease
• CHD is a disease of the blood vessels supplying the heart
Heart disease has been the leading cause of death
worldwide since 1921:
• 7.3 million people died from CHD in 2008
• Since 1950, age-adjusted death rates have declined 60%
The Framingham Heart Study

Demographic risk factors
• male: sex of patient
• age: age in years at first examination
• education: Some high school (1), high school (2), some college (3), college (4)

Behavioral risk factors
• currentSmoker: currently smokes (0/1)
• cigsPerDay: cigarettes per day

Medical history risk factors
• BPmeds: On blood pressure medication at time of first examination
• prevalentStroke: Previously had a stroke
• prevalentHyp: Currently hypertensive
• diabetes: Currently has diabetes
The Framingham Heart Study

Risk factors from first examination
• totChol: Total cholesterol (mg/dL)
• sysBP: Systolic blood pressure
• diaBP: Diastolic blood pressure
• BMI: Body Mass Index (kg/m²)
• heartRate: Heart rate (beats/minute)
• glucose: Blood glucose level (mg/dL)

Use logistic regression to estimate whether or not a patient experienced CHD within 10 years of first examination
The Framingham Heart Study

[Figure: most critical identified risk factors]