Logistic regression and regularization
DSA5103 Lecture 3
Yangjing Zhang
29-Aug-2023
NUS
Today’s content
1. Logistic regression for classification
2. Ridge/lasso regularization
Logistic regression
Classification
Binary classification:
• Email: spam/not spam
• Patient: cancer/healthy
• Student: fail/pass
We usually assign
label = 0: normal state/negative class (e.g., not spam)
label = 1: abnormal state/positive class (e.g., spam)
However, the label assignment can be arbitrary:
0 = not spam, 1 = spam or 0 = spam, 1 = not spam
Data xi ∈ Rp , yi ∈ {0, 1}, i = 1, 2, . . . , n.
Classification
Multi-class classification:
• Iris flower (3 species: Setosa, Versicolor, Virginica)
• Optical character recognition
Data xi ∈ Rp , yi ∈ {1, . . . , K}, i = 1, 2, . . . , n.
Figure 1: Convert images of text to machine-readable format. (Image from the internet.)
Linear regression for classification?
Data xi , yi ∈ {0, 1}
We fit f(x) = β0 + βᵀx, and predict that a new input x̃ belongs to
• class 1 if f(x̃) ≥ 0.5
• class 0 if f(x̃) < 0.5
Linear regression vs. logistic regression
Linear regression
• Data xi , yi ∈ R
• Fit f (x) = β T x + β0 = β̂ T x̂, β̂ = [β0 ; β], x̂ = [1; x]
Logistic regression
• Data xi , yi ∈ {0, 1}
• Fit f(x) = g(β̂ᵀx̂), where g(z) = 1/(1 + e^{−z}) is the sigmoid/logistic function
. 0 < g(z) < 1, an increasing function
. g(0) = 0.5
. g(z) → 1 as z → +∞
. g(z) → 0 as z → −∞
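To make this concrete, here is a minimal NumPy sketch (not from the slides) of the sigmoid and the 0.5-threshold prediction rule; the function names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(beta_hat, x_hat):
    """Predict class 1 if f(x) = g(beta_hat^T x_hat) >= 0.5, else class 0."""
    return int(sigmoid(x_hat @ beta_hat) >= 0.5)

# g(0) = 0.5, g(z) -> 1 as z -> +inf, g(z) -> 0 as z -> -inf
print(sigmoid(np.array([-10.0, 0.0, 10.0])))                 # approx [4.5e-05, 0.5, 0.99995]
print(predict(np.array([-4.0, 2.0]), np.array([1.0, 3.0])))  # beta_hat^T x_hat = 2 >= 0 -> 1
```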
Graph illustration
β0 , β1 will change the shape and location of the function
Probabilistic interpretation
In logistic regression, we interpret
f (x) = probability(input x ∈ class 1)
= p(y = 1|x; β̂)
Then 1 − f (x) = probability(input x ∈ class 0) = p(y = 0|x; β̂)
Logistic regression
f(x) = g(β̂ᵀx̂),  g(z) = 1/(1 + e^{−z})
f(x) = probability(input x ∈ class 1) = p(y = 1|x; β̂)
Predict y = 1 (x ∈ class 1) if
• f(x) ≥ 0.5, i.e., β̂ᵀx̂ ≥ 0
Predict y = 0 (x ∈ class 0) if
• f(x) < 0.5, i.e., β̂ᵀx̂ < 0
Example
(1) (One feature)
Say β̂ = [β0 ; β1 ] = [−4; 2]. Then f (x) = g(β0 + β1 x).
Predict y = 1 if
β0 + β1 x = −4 + 2x ≥ 0, i.e., x ≥ 2
(2) (Two features)
Say β̂ = [β0 ; β1 ; β2 ] = [−4; 2; 1]. Then f (x) = g(β0 + β1 x1 + β2 x2 ).
Predict y = 1 if
β0 + β1 x1 + β2 x2 = −4 + 2x1 + x2 ≥ 0, i.e., 2x1 + x2 ≥ 4
Example
(3) (Three features)
Say β̂ = [β0 ; β1 ; β2 ; β3 ] = [−4; 2; 1; 2]. Then
f (x) = g(β0 + β1 x1 + β2 x2 + β3 x3 ).
Predict y = 1 if
β0 + β1x1 + β2x2 + β3x3 = −4 + 2x1 + x2 + 2x3 ≥ 0, i.e., 2x1 + x2 + 2x3 ≥ 4
Decision boundary
The set of all x ∈ Rp such that
β0 + β T x = 0
is called the decision boundary between classes 0 and 1.
Logistic regression has a linear decision boundary; it is
• a point when p = 1
• a line when p = 2
• a plane when p = 3
• in general, a (p − 1)-dimensional hyperplane
Feature expansion
Use the expanded features x̂ = [1; x1; x2; x1²; x2²; x1x2], and say
β̂ = [β0; β1; . . . ; β5] = [−4; 0; 0; 1; 1; 0]. The decision boundary is
−4 + x1² + x2² = 0,
a circle of radius 2 centered at the origin.
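A small NumPy sketch (illustrative only, not from the slides) of this expansion: map each x = (x1, x2) to the expanded feature vector and apply the linear rule, which classifies points outside the circle x1² + x2² = 4 as class 1.

```python
import numpy as np

beta_hat = np.array([-4.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # [beta0; beta1; ...; beta5]

def expand(x1, x2):
    """Expanded feature vector [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

for x1, x2 in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    score = expand(x1, x2) @ beta_hat            # beta_hat^T x_hat
    print((x1, x2), "class 1" if score >= 0 else "class 0")
# (0,0) and (1,1) lie inside the circle (class 0); (2,2) lies outside (class 1)
```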
Maximum likelihood estimation
• Data (xi , yi ), i = 1, 2, . . . , n, xi ∈ Rp , yi ∈ {0, 1}.
• The likelihood of a single training example (xi , yi ) is
probability(xi ∈ class yi)
= p(yi = 1|xi; β̂) = f(xi), if yi = 1
= p(yi = 0|xi; β̂) = 1 − f(xi), if yi = 0
= [f(xi)]^{yi} [1 − f(xi)]^{1−yi}
• Hope the likelihood is close to 1 for every training example (xi , yi )
Maximum likelihood estimation
• Assuming independence of the training samples, the likelihood is
∏_{i=1}^n [f(xi)]^{yi} [1 − f(xi)]^{1−yi}
• Want to find β̂ that maximizes the log-likelihood. Note that
maximizing a (positive) function is the same as maximizing the log
of that function (because log is monotone increasing)
Maximum likelihood estimation
If f (x) > 0 for all x,
max_x f(x) ⇐⇒ max_x log(f(x))
since log is monotone increasing, i.e., log(x) ≥ log(y) if x ≥ y > 0
x∗ is a global maximizer of f (·)
⇐⇒ f (x∗ ) ≥ f (x) ∀ x
⇐⇒ log(f (x∗ )) ≥ log(f (x)) ∀ x
⇐⇒ x∗ is a global maximizer of log(f (·))
A similar argument works for local maximizers.
Note that the likelihood is always positive, so we can take its log.
Cost function
Cost function¹
L(β̂) = − log( ∏_{i=1}^n [f(xi)]^{yi} [1 − f(xi)]^{1−yi} )
     = − Σ_{i=1}^n [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]
For a particular training example (xi , yi ), the cost is
−yi log(f(xi)) − (1 − yi) log(1 − f(xi))
= −log(f(xi)), if yi = 1
= −log(1 − f(xi)), if yi = 0
¹ We use natural logarithms in logistic regression
Understand the cost function
cost = − log(f (xi )), when yi = 1
• f (xi ) = 1, cost = 0 (“perfect” scenario ⇒ zero cost)
• f (xi ) = 0.9, cost = 0.11 (“good” scenario ⇒ small cost)
• f (xi ) = 0.1, cost = 2.3 (“bad” scenario ⇒ large cost)
Understand the cost function
cost = − log(1 − f (xi )), when yi = 0
• f (xi ) = 0, cost = 0 (“perfect” scenario ⇒ zero cost)
• f (xi ) = 0.1, cost = 0.11 (“good” scenario ⇒ small cost)
• f (xi ) = 0.9, cost = 2.3 (“bad” scenario ⇒ large cost)
Simplify the cost function
L(β̂) = L(β0, β) = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]

Derivation*. Recall that f(x) = 1/(1 + e^{−β̂ᵀx̂}) and
L(β̂) = − Σ_{i=1}^n [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]
      = − Σ_{i=1}^n [ yi log( f(xi)/(1 − f(xi)) ) + log(1 − f(xi)) ]
It remains to prove two equalities:
1. log( f(xi)/(1 − f(xi)) ) = β̂ᵀx̂i
2. log(1 − f(xi)) = −log(1 + e^{β̂ᵀx̂i})
Simplify the cost function
1. log( f(xi)/(1 − f(xi)) )
   = log( [1/(1 + e^{−β̂ᵀx̂i})] / [1 − 1/(1 + e^{−β̂ᵀx̂i})] )
   = log( 1/(1 + e^{−β̂ᵀx̂i} − 1) )
   = log( e^{β̂ᵀx̂i} ) = β̂ᵀx̂i
2. log(1 − f(xi))
   = log( 1 − 1/(1 + e^{−β̂ᵀx̂i}) )
   = log( e^{−β̂ᵀx̂i}/(1 + e^{−β̂ᵀx̂i}) )
   = log( 1/(1 + e^{β̂ᵀx̂i}) ) = −log(1 + e^{β̂ᵀx̂i})
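As a sanity check on the simplified formula, here is a short NumPy sketch (not part of the slides) that evaluates the cost; np.logaddexp(0, z) computes log(1 + e^z) in a numerically stable way. The data below is made up purely for illustration.

```python
import numpy as np

def logistic_cost(beta0, beta, X, y):
    """L(beta0, beta) = sum_i [ log(1 + exp(beta0 + beta^T x_i)) - y_i * (beta0 + beta^T x_i) ].

    X: (n, p) feature matrix, y: (n,) labels in {0, 1}.
    """
    z = beta0 + X @ beta                    # z_i = beta0 + beta^T x_i
    return np.sum(np.logaddexp(0.0, z) - y * z)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([1, 0, 0])
print(logistic_cost(-0.5, np.array([1.0, 0.5]), X, y))
```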
Gradient of the cost function
• Cost function
L(β0, β1, . . . , βp) = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ],
where βᵀxi = β1xi1 + β2xi2 + · · · + βpxip
• Calculate
∂L/∂β0 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] = Σ_{i=1}^n (f(xi) − yi)
∂L/∂β1 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xi1 = Σ_{i=1}^n (f(xi) − yi) xi1
∂L/∂β2 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xi2 = Σ_{i=1}^n (f(xi) − yi) xi2
...
∂L/∂βp = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xip = Σ_{i=1}^n (f(xi) − yi) xip
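The gradient above vectorizes naturally. Below is a minimal gradient-descent sketch (illustrative only; the step size, iteration count, and toy data are assumptions, not from the lecture).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, step=0.05, n_iter=2000):
    """Minimize L(beta0, beta) by plain gradient descent.

    Uses dL/dbeta0 = sum_i (f(x_i) - y_i) and dL/dbeta_j = sum_i (f(x_i) - y_i) x_ij.
    """
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        residual = sigmoid(beta0 + X @ beta) - y   # f(x_i) - y_i, shape (n,)
        beta0 -= step * residual.sum()
        beta  -= step * (X.T @ residual)
    return beta0, beta

# Toy one-feature data (not linearly separable, so a minimizer exists)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 0, 1])
print(fit_logistic(X, y))
```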
Linear regression vs. logistic regression
Linear regression:
L = (1/2) Σ_{i=1}^n (βᵀxi + β0 − yi)²
Gradient:
∂L/∂β0 = Σ_{i=1}^n (βᵀxi + β0 − yi) = Σ_{i=1}^n (f(xi) − yi)
∂L/∂βj = Σ_{i=1}^n (βᵀxi + β0 − yi) xij = Σ_{i=1}^n (f(xi) − yi) xij,  for j = 1, 2, . . . , p

Logistic regression:
L = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]
Gradient:
∂L/∂β0 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] = Σ_{i=1}^n (f(xi) − yi)
∂L/∂βj = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xij = Σ_{i=1}^n (f(xi) − yi) xij,  for j = 1, 2, . . . , p

In both cases the gradient takes the same form, with f(xi) = βᵀxi + β0 in linear regression and f(xi) = g(β0 + βᵀxi) in logistic regression.
Solution may not exist
The solution (global minimizer) of the minimization problem
minimize_{β0, β1, . . . , βp}  Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]
may not exist. (Regularization will help solve this issue)
Example. n = 1, x1 = −1, y1 = 0. Then the cost function is
L(β0, β1) = log(1 + e^{β0 − β1}).
The infimum of L is 0: fixing β0 = 0 and letting β1 → +∞ gives L(0, β1) = log(1 + e^{−β1}) → 0. However, no finite (β0, β1) attains the value 0, so a global minimizer does not exist.
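A quick numerical check of this example (illustrative only): the cost keeps decreasing toward 0 as β1 grows, but never reaches it.

```python
import numpy as np

# Single example x1 = -1, y1 = 0, with beta0 fixed at 0:
# L(0, beta1) = log(1 + exp(-beta1))
for beta1 in [0.0, 5.0, 10.0, 20.0]:
    print(beta1, np.log1p(np.exp(-beta1)))
# Output decreases toward 0 but is strictly positive for every finite beta1.
```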
Multi-class classification: one-vs-rest
Idea: reduce multi-class classification to multiple binary classification problems
Data: xi ∈ Rp , yi ∈ {1, 2, . . . , K}, i = 1, 2, . . . , n.
For each k ∈ {1, 2, . . . , K}
1. Construct a new label ỹi = 1 if yi = k and ỹi = 0 otherwise
2. Learn a binary classifier fk with data xi , ỹi
The multi-class classifier predicts the class k that achieves the maximal value:
k* = argmax_{k∈{1,2,...,K}} fk(x)
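A compact sketch of one-vs-rest (not from the slides), using scikit-learn's LogisticRegression as the binary classifier fk; the helper names and toy data are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, K):
    """Fit one binary logistic classifier per class k = 1, ..., K."""
    classifiers = []
    for k in range(1, K + 1):
        y_tilde = (y == k).astype(int)            # 1 if y_i = k, else 0
        classifiers.append(LogisticRegression().fit(X, y_tilde))
    return classifiers

def one_vs_rest_predict(classifiers, X_new):
    """Predict the class whose binary classifier gives the largest f_k(x)."""
    probs = np.column_stack([c.predict_proba(X_new)[:, 1] for c in classifiers])
    return probs.argmax(axis=1) + 1               # classes are labeled 1, ..., K

# Toy data mirroring the table on the next slide
X = np.array([[1.0, 5.0], [0.8, 4.5], [1.5, 1.5], [4.0, 4.0]])
y = np.array([1, 1, 2, 3])
clfs = one_vs_rest_fit(X, y, K=3)
print(one_vs_rest_predict(clfs, np.array([[0.9, 4.8]])))
```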
Multi-class classification: one-vs-rest
feature 1 feature 2 label y
1 5 1
0.8 4.5 1
1.5 1.5 2
4 4 3
Say for a new input x̃, we have f1 (x̃) = 0.8, f2 (x̃) = 0.1, f3 (x̃) = 0.6.
Then we say
x̃ belongs to class 1 with probability 80%
x̃ belongs to class 2 with probability 10%
x̃ belongs to class 3 with probability 60%
and we predict it belongs to class 1.
Ridge/lasso regularization
Over-fitting
Figure 2: under-fitted vs. good fit/just right vs. over-fitted. (Image from the internet.)
• Under-fitting: a model is too simple and does not adequately
capture the underlying structure of the data
• Over-fitting: a model is too complicated and contains more
parameters than can be justified by the data; it does not generalize
well from training data to test data
• Good fit: a model adequately learns the training data and
generalizes well to test data
Ridge regularization
In linear/logistic regression, over-fitting occurs frequently. Regularization
makes the model simpler and works well for most regression/classification
problems.
• Ridge regularization:
λ‖β‖² = λ Σ_{j=1}^p βj²
λ: regularization parameter, ‖β‖²: regularizer
• It is differentiable. It forces βj ’s to be small
• Extreme case: if λ is a huge number, it pushes all βj's toward zero
and the model becomes naive
Ridge regularized problems
• Logistic regression + ridge regularization (Gradient methods can be
used, a solution exists)
minimize_{β0, β1, . . . , βp}  Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ] + λ Σ_{j=1}^p βj²
• Linear regression + ridge regularization (Apply either normal
equation or gradient methods)
minimize_{β0, β1, . . . , βp}  (1/2) Σ_{i=1}^n (βᵀxi + β0 − yi)² + λ Σ_{j=1}^p βj²
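For the ridge-regularized logistic problem, the only change to the earlier gradient is the extra term 2λβj for j = 1, . . . , p (the intercept β0 is not penalized). A minimal sketch of the gradient computation (illustrative only):

```python
import numpy as np

def ridge_logistic_gradient(beta0, beta, X, y, lam):
    """Gradient of sum_i [ log(1 + exp(beta0 + beta^T x_i)) - y_i (beta0 + beta^T x_i) ] + lam * ||beta||^2."""
    residual = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta))) - y   # f(x_i) - y_i
    grad_beta0 = residual.sum()                    # the intercept is not penalized
    grad_beta = X.T @ residual + 2.0 * lam * beta  # ridge adds 2*lam*beta_j
    return grad_beta0, grad_beta
```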
Normal equation for ridge regularized linear regression
X = [x1ᵀ; x2ᵀ; . . . ; xnᵀ] ∈ R^{n×p},   Y = [y1; y2; . . . ; yn] ∈ R^n
For simplicity, we assume² β0 = 0
minimize_{β∈R^p}  (1/2) Σ_{i=1}^n (βᵀxi − yi)² + λ Σ_{j=1}^p βj²  =  (1/2)‖Xβ − Y‖² + λ‖β‖²
Compute the gradient:
Gradient = Xᵀ(Xβ − Y) + 2λβ
Normal equation (setting the gradient to zero):
(2λI + XᵀX)β = XᵀY  ⇒  β = (2λI + XᵀX)⁻¹XᵀY
² Note that the intercept should be zero (β0 = 0) if the data is standardized
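The closed-form solution is easy to check numerically; a brief NumPy sketch (illustrative only, with synthetic data and assuming β0 = 0):

```python
import numpy as np

def ridge_normal_equation(X, Y, lam):
    """Solve (2*lam*I + X^T X) beta = X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(2.0 * lam * np.eye(p) + X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
print(ridge_normal_equation(X, Y, lam=1.0))   # coefficients shrunk slightly toward zero
```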
Lasso regularization
• Lasso (Least Absolute Shrinkage and Selection Operator)
regularization:
λ‖β‖₁ = λ Σ_{j=1}^p |βj|
• It is non-differentiable. It forces some βj ’s to be exactly zero
• It can be used for feature selection (model selection). It selects
important features (removing non-informative or redundant features)
• When λ is larger, fewer features will be selected
Lasso regularized problems
• Logistic regression + lasso regularization (gradient methods are no
longer applicable)
minimize_{β0, β1, . . . , βp}  Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ] + λ Σ_{j=1}^p |βj|
• Linear regression + lasso regularization (gradient methods are no
longer applicable)
minimize_{β0, β1, . . . , βp}  (1/2) Σ_{i=1}^n (βᵀxi + β0 − yi)² + λ Σ_{j=1}^p |βj|
• In the following, we always assume β0 = 0. Note that the intercept
should be zero β0 = 0 if the data is standardized.
Given feature matrix X ∈ R^{n×p} and response vector Y ∈ R^n, the famous
lasso problem [1] is
minimize_{β∈R^p}  (1/2)‖Xβ − Y‖² + λ‖β‖₁
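To see the feature-selection effect in practice, here is a short sketch (not from the slides) using scikit-learn's Lasso solver. Note that sklearn's Lasso minimizes (1/(2n))‖Xβ − Y‖² + α‖β‖₁, so its alpha corresponds to λ/n in the formulation above; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0])
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam = 10.0                                        # lambda in the lecture's formulation
model = Lasso(alpha=lam / n, fit_intercept=False).fit(X, Y)
print(model.coef_)                                # several coefficients are exactly zero
print(np.nonzero(model.coef_)[0])                 # indices of the selected features
```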
References
[1] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.