Logistic regression and regularization
DSA5103 Lecture 3
Yangjing Zhang
29-Aug-2023
NUS
Today’s content
1. Logistic regression for classification
2. Ridge/lasso regularization
Logistic regression
Classification
Binary classification:
• Email: spam/not spam
• Patient: cancer/healthy
• Student: fail/pass
We usually assign
label = 0: normal state/negative class (e.g., not spam)
label = 1: abnormal state/positive class (e.g., spam)
However, the label assignment can be arbitrary:
0 = not spam, 1 = spam or 0 = spam, 1 = not spam
Data xi ∈ Rp , yi ∈ {0, 1}, i = 1, 2, . . . , n.
Classification
Multi-class classification:
• Iris flower (3 species: Setosa, Versicolor, Virginica)
• Optical character recognition
Data xi ∈ Rp , yi ∈ {1, . . . , K}, i = 1, 2, . . . , n.
Figure 1: Convert images of text to machine-readable format. (Image from the internet.)
Linear regression for classification?
Data xi , yi ∈ {0, 1}
We fit f(x) = β0 + βᵀx, and predict that a new input x̃ belongs to
• class 1 if f(x̃) ≥ 0.5
• class 0 if f(x̃) < 0.5
Linear regression vs. logistic regression
Linear regression
• Data xi , yi ∈ R
• Fit f (x) = β T x + β0 = β̂ T x̂, β̂ = [β0 ; β], x̂ = [1; x]
Logistic regression
• Data xi , yi ∈ {0, 1}
• Fit f(x) = g(β̂ᵀx̂), where g(z) = 1/(1 + e^{−z}) is the sigmoid/logistic function
. 0 < g(z) < 1, an increasing function
. g(0) = 0.5
. g(z) → 1 as z → +∞
. g(z) → 0 as z → −∞
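To make this concrete, here is a minimal NumPy sketch (not from the slides) of the sigmoid and the 0.5-threshold prediction rule; the function names are illustrative only.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(beta_hat, x_hat):
    """Predict class 1 if f(x) = g(beta_hat^T x_hat) >= 0.5, else class 0."""
    return int(sigmoid(x_hat @ beta_hat) >= 0.5)

# g(0) = 0.5, g(z) -> 1 as z -> +inf, g(z) -> 0 as z -> -inf
print(sigmoid(np.array([-10.0, 0.0, 10.0])))                 # approx [4.5e-05, 0.5, 0.99995]
print(predict(np.array([-4.0, 2.0]), np.array([1.0, 3.0])))  # beta_hat^T x_hat = 2 >= 0 -> 1
```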
Graph illustration
β0 , β1 will change the shape and location of the function
Probabilistic interpretation
In logistic regression, we interpret
f (x) = probability(input x ∈ class 1)
= p(y = 1|x; β̂)
Then 1 − f (x) = probability(input x ∈ class 0) = p(y = 0|x; β̂)
Logistic regression
f(x) = g(β̂ᵀx̂),  g(z) = 1/(1 + e^{−z})
f(x) = probability(input x ∈ class 1) = p(y = 1|x; β̂)
Predict y = 1 (x ∈ class 1) if
• f(x) ≥ 0.5, i.e., β̂ᵀx̂ ≥ 0
Predict y = 0 (x ∈ class 0) if
• f(x) < 0.5, i.e., β̂ᵀx̂ < 0
Example
(1) (One feature)
Say β̂ = [β0 ; β1 ] = [−4; 2]. Then f (x) = g(β0 + β1 x).
Predict y = 1 if
β0 + β1 x = −4 + 2x ≥ 0, i.e., x ≥ 2
(2) (Two features)
Say β̂ = [β0 ; β1 ; β2 ] = [−4; 2; 1]. Then f (x) = g(β0 + β1 x1 + β2 x2 ).
Predict y = 1 if
β0 + β1 x1 + β2 x2 = −4 + 2x1 + x2 ≥ 0, i.e., 2x1 + x2 ≥ 4
Example
(3) (Three features)
Say β̂ = [β0 ; β1 ; β2 ; β3 ] = [−4; 2; 1; 2]. Then
f (x) = g(β0 + β1 x1 + β2 x2 + β3 x3 ).
Predict y = 1 if
β0 + β1x1 + β2x2 + β3x3 = −4 + 2x1 + x2 + 2x3 ≥ 0, i.e., 2x1 + x2 + 2x3 ≥ 4
Decision boundary
The set of all x ∈ Rp such that
β0 + β T x = 0
is called the decision boundary between classes 0 and 1.
Logistic regression has a linear decision boundary; it is
• a point when p = 1
• a line when p = 2
• a plane when p = 3
• in general, a (p − 1)-dimensional hyperplane
Feature expansion
Use the expanded features x̂ = [1; x1; x2; x1²; x2²; x1x2], and say
β̂ = [β0; β1; . . . ; β5] = [−4; 0; 0; 1; 1; 0]. The decision boundary is
−4 + x1² + x2² = 0,
a circle of radius 2 centered at the origin.
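A small NumPy sketch (illustrative only, not from the slides) of this expansion: map each x = (x1, x2) to the expanded feature vector and apply the linear rule, which classifies points outside the circle x1² + x2² = 4 as class 1.

```python
import numpy as np

beta_hat = np.array([-4.0, 0.0, 0.0, 1.0, 1.0, 0.0])   # [beta0; beta1; ...; beta5]

def expand(x1, x2):
    """Expanded feature vector [1, x1, x2, x1^2, x2^2, x1*x2]."""
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

for x1, x2 in [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]:
    score = expand(x1, x2) @ beta_hat            # beta_hat^T x_hat
    print((x1, x2), "class 1" if score >= 0 else "class 0")
# (0,0) and (1,1) lie inside the circle (class 0); (2,2) lies outside (class 1)
```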
Maximum likelihood estimation
• Data (xi , yi ), i = 1, 2, . . . , n, xi ∈ Rp , yi ∈ {0, 1}.
• The likelihood of a single training example (xi , yi ) is
probability(xi ∈ class yi)
= p(yi = 1|xi; β̂) = f(xi), if yi = 1
= p(yi = 0|xi; β̂) = 1 − f(xi), if yi = 0
= [f(xi)]^{yi} [1 − f(xi)]^{1−yi}
• Hope the likelihood is close to 1 for every training example (xi , yi )
Maximum likelihood estimation
• Assuming independence of the training samples, the likelihood is
∏_{i=1}^n [f(xi)]^{yi} [1 − f(xi)]^{1−yi}
• Want to find β̂ that maximizes the log-likelihood. Note that
maximizing a (positive) function is the same as maximizing the log
of that function (because log is monotone increasing)
Maximum likelihood estimation
If f (x) > 0 for all x,
max_x f(x) ⇐⇒ max_x log(f(x))
since log is monotone increasing, i.e., log(x) ≥ log(y) if x ≥ y > 0
x∗ is a global maximizer of f (·)
⇐⇒ f (x∗ ) ≥ f (x) ∀ x
⇐⇒ log(f (x∗ )) ≥ log(f (x)) ∀ x
⇐⇒ x∗ is a global maximizer of log(f (·))
A similar argument works for local maximizers.
Note that the likelihood is always positive, so we can take its log.
Cost function
Cost function¹
L(β̂) = − log( ∏_{i=1}^n [f(xi)]^{yi} [1 − f(xi)]^{1−yi} )
     = − Σ_{i=1}^n [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]
For a particular training example (xi , yi ), the cost is
−yi log(f(xi)) − (1 − yi) log(1 − f(xi))
= −log(f(xi)), if yi = 1
= −log(1 − f(xi)), if yi = 0
¹ We use natural logarithms in logistic regression
Understand the cost function
cost = − log(f (xi )), when yi = 1
• f (xi ) = 1, cost = 0 (“perfect” scenario ⇒ zero cost)
• f (xi ) = 0.9, cost = 0.11 (“good” scenario ⇒ small cost)
• f (xi ) = 0.1, cost = 2.3 (“bad” scenario ⇒ large cost)
Understand the cost function
cost = − log(1 − f (xi )), when yi = 0
• f (xi ) = 0, cost = 0 (“perfect” scenario ⇒ zero cost)
• f (xi ) = 0.1, cost = 0.11 (“good” scenario ⇒ small cost)
• f (xi ) = 0.9, cost = 2.3 (“bad” scenario ⇒ large cost)
Simplify the cost function
L(β̂) = L(β0, β) = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]

Derivation*. Recall that f(x) = 1/(1 + e^{−β̂ᵀx̂}) and
L(β̂) = − Σ_{i=1}^n [ yi log(f(xi)) + (1 − yi) log(1 − f(xi)) ]
      = − Σ_{i=1}^n [ yi log( f(xi)/(1 − f(xi)) ) + log(1 − f(xi)) ]
It remains to prove two equalities:
1. log( f(xi)/(1 − f(xi)) ) = β̂ᵀx̂i
2. log(1 − f(xi)) = −log(1 + e^{β̂ᵀx̂i})
Simplify the cost function
1. log( f(xi)/(1 − f(xi)) )
   = log( [1/(1 + e^{−β̂ᵀx̂i})] / [1 − 1/(1 + e^{−β̂ᵀx̂i})] )
   = log( 1/(1 + e^{−β̂ᵀx̂i} − 1) )
   = log( e^{β̂ᵀx̂i} ) = β̂ᵀx̂i
2. log(1 − f(xi))
   = log( 1 − 1/(1 + e^{−β̂ᵀx̂i}) )
   = log( e^{−β̂ᵀx̂i}/(1 + e^{−β̂ᵀx̂i}) )
   = log( 1/(1 + e^{β̂ᵀx̂i}) ) = −log(1 + e^{β̂ᵀx̂i})
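As a sanity check on the simplified formula, here is a short NumPy sketch (not part of the slides) that evaluates the cost; np.logaddexp(0, z) computes log(1 + e^z) in a numerically stable way. The data below is made up purely for illustration.

```python
import numpy as np

def logistic_cost(beta0, beta, X, y):
    """L(beta0, beta) = sum_i [ log(1 + exp(beta0 + beta^T x_i)) - y_i * (beta0 + beta^T x_i) ].

    X: (n, p) feature matrix, y: (n,) labels in {0, 1}.
    """
    z = beta0 + X @ beta                    # z_i = beta0 + beta^T x_i
    return np.sum(np.logaddexp(0.0, z) - y * z)

X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -2.0]])
y = np.array([1, 0, 0])
print(logistic_cost(-0.5, np.array([1.0, 0.5]), X, y))
```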
Gradient of the cost function
• Cost function
L(β0, β1, . . . , βp) = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ],
where βᵀxi = β1xi1 + β2xi2 + · · · + βpxip
• Calculate
∂L/∂β0 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] = Σ_{i=1}^n (f(xi) − yi)
∂L/∂β1 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xi1 = Σ_{i=1}^n (f(xi) − yi) xi1
∂L/∂β2 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xi2 = Σ_{i=1}^n (f(xi) − yi) xi2
...
∂L/∂βp = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xip = Σ_{i=1}^n (f(xi) − yi) xip
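The gradient above vectorizes naturally. Below is a minimal gradient-descent sketch (illustrative only; the step size, iteration count, and toy data are assumptions, not from the lecture).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, step=0.05, n_iter=2000):
    """Minimize L(beta0, beta) by plain gradient descent.

    Uses dL/dbeta0 = sum_i (f(x_i) - y_i) and dL/dbeta_j = sum_i (f(x_i) - y_i) x_ij.
    """
    beta0, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(n_iter):
        residual = sigmoid(beta0 + X @ beta) - y   # f(x_i) - y_i, shape (n,)
        beta0 -= step * residual.sum()
        beta  -= step * (X.T @ residual)
    return beta0, beta

# Toy one-feature data (not linearly separable, so a minimizer exists)
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 0, 1])
print(fit_logistic(X, y))
```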
Linear regression vs. logistic regression
Linear regression:
L = (1/2) Σ_{i=1}^n (βᵀxi + β0 − yi)²
Gradient:
∂L/∂β0 = Σ_{i=1}^n (βᵀxi + β0 − yi) = Σ_{i=1}^n (f(xi) − yi)
∂L/∂βj = Σ_{i=1}^n (βᵀxi + β0 − yi) xij = Σ_{i=1}^n (f(xi) − yi) xij,  for j = 1, 2, . . . , p

Logistic regression:
L = Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]
Gradient:
∂L/∂β0 = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] = Σ_{i=1}^n (f(xi) − yi)
∂L/∂βj = Σ_{i=1}^n [ 1/(1 + e^{−(β0 + βᵀxi)}) − yi ] xij = Σ_{i=1}^n (f(xi) − yi) xij,  for j = 1, 2, . . . , p

In both cases the gradient takes the same form, with f(xi) = βᵀxi + β0 in linear regression and f(xi) = g(β0 + βᵀxi) in logistic regression.
Solution may not exist
The solution (global minimizer) of the minimization problem
minimize_{β0, β1, . . . , βp}  Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ]
may not exist. (Regularization will help solve this issue)
Example. n = 1, x1 = −1, y1 = 0. Then the cost function is
L(β0, β1) = log(1 + e^{β0 − β1}).
The infimum of L is 0: fixing β0 = 0 and letting β1 → +∞ gives L(0, β1) = log(1 + e^{−β1}) → 0. However, no finite (β0, β1) attains the value 0, so a global minimizer does not exist.
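A quick numerical check of this example (illustrative only): the cost keeps decreasing toward 0 as β1 grows, but never reaches it.

```python
import numpy as np

# Single example x1 = -1, y1 = 0, with beta0 fixed at 0:
# L(0, beta1) = log(1 + exp(-beta1))
for beta1 in [0.0, 5.0, 10.0, 20.0]:
    print(beta1, np.log1p(np.exp(-beta1)))
# Output decreases toward 0 but is strictly positive for every finite beta1.
```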
Multi-class classification: one-vs-rest
Idea: reduce multi-class classification to multiple binary classification problems
Data: xi ∈ Rp , yi ∈ {1, 2, . . . , K}, i = 1, 2, . . . , n.
For each k ∈ {1, 2, . . . , K}
1. Construct a new label ỹi = 1 if yi = k and ỹi = 0 otherwise
2. Learn a binary classifier fk with data xi , ỹi
The multi-class classifier predicts the class k that achieves the maximal value:
k* = argmax_{k∈{1,2,...,K}} fk(x)
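A compact sketch of one-vs-rest (not from the slides), using scikit-learn's LogisticRegression as the binary classifier fk; the helper names and toy data are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, K):
    """Fit one binary logistic classifier per class k = 1, ..., K."""
    classifiers = []
    for k in range(1, K + 1):
        y_tilde = (y == k).astype(int)            # 1 if y_i = k, else 0
        classifiers.append(LogisticRegression().fit(X, y_tilde))
    return classifiers

def one_vs_rest_predict(classifiers, X_new):
    """Predict the class whose binary classifier gives the largest f_k(x)."""
    probs = np.column_stack([c.predict_proba(X_new)[:, 1] for c in classifiers])
    return probs.argmax(axis=1) + 1               # classes are labeled 1, ..., K

# Toy data mirroring the table on the next slide
X = np.array([[1.0, 5.0], [0.8, 4.5], [1.5, 1.5], [4.0, 4.0]])
y = np.array([1, 1, 2, 3])
clfs = one_vs_rest_fit(X, y, K=3)
print(one_vs_rest_predict(clfs, np.array([[0.9, 4.8]])))
```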
Multi-class classification: one-vs-rest
feature 1 feature 2 label y
1 5 1
0.8 4.5 1
1.5 1.5 2
4 4 3
Say for a new input x̃, we have f1 (x̃) = 0.8, f2 (x̃) = 0.1, f3 (x̃) = 0.6.
Then we say
x̃ belongs to class 1 with probability 80%
x̃ belongs to class 2 with probability 10%
x̃ belongs to class 3 with probability 60%
and we predict it belongs to class 1.
Ridge/lasso regularization
Over-fitting
Figure 2: under-fitted vs. good fit/just right vs. over-fitted. (Image from the internet.)
• Under-fitting: a model is too simple and does not adequately
capture the underlying structure of the data
• Over-fitting: a model is too complicated and contains more
parameters than can be justified by the data; it does not generalize
well from training data to test data
• Good fit: a model adequately learns the training data and
generalizes well to test data
Ridge regularization
In linear/logistic regression, over-fitting occurs frequently. Regularization
makes the model simpler and works well for most regression/classification
problems.
• Ridge regularization:
λ‖β‖² = λ Σ_{j=1}^p βj²
λ: regularization parameter, ‖β‖²: regularizer
• It is differentiable. It forces βj ’s to be small
• Extreme case: if λ is a huge number, it pushes all βj's toward zero
and the model becomes naive
Ridge regularized problems
• Logistic regression + ridge regularization (Gradient methods can be
used, a solution exists)
minimize_{β0, β1, . . . , βp}  Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ] + λ Σ_{j=1}^p βj²
• Linear regression + ridge regularization (Apply either normal
equation or gradient methods)
minimize_{β0, β1, . . . , βp}  (1/2) Σ_{i=1}^n (βᵀxi + β0 − yi)² + λ Σ_{j=1}^p βj²
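For the ridge-regularized logistic problem, the only change to the earlier gradient is the extra term 2λβj for j = 1, . . . , p (the intercept β0 is not penalized). A minimal sketch of the gradient computation (illustrative only):

```python
import numpy as np

def ridge_logistic_gradient(beta0, beta, X, y, lam):
    """Gradient of sum_i [ log(1 + exp(beta0 + beta^T x_i)) - y_i (beta0 + beta^T x_i) ] + lam * ||beta||^2."""
    residual = 1.0 / (1.0 + np.exp(-(beta0 + X @ beta))) - y   # f(x_i) - y_i
    grad_beta0 = residual.sum()                    # the intercept is not penalized
    grad_beta = X.T @ residual + 2.0 * lam * beta  # ridge adds 2*lam*beta_j
    return grad_beta0, grad_beta
```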
Normal equation for ridge regularized linear regression
X = [x1ᵀ; x2ᵀ; . . . ; xnᵀ] ∈ R^{n×p},   Y = [y1; y2; . . . ; yn] ∈ R^n
For simplicity, we assume² β0 = 0
minimize_{β∈R^p}  (1/2) Σ_{i=1}^n (βᵀxi − yi)² + λ Σ_{j=1}^p βj²  =  (1/2)‖Xβ − Y‖² + λ‖β‖²
Compute the gradient:
Gradient = Xᵀ(Xβ − Y) + 2λβ
Normal equation (setting the gradient to zero):
(2λI + XᵀX)β = XᵀY  ⇒  β = (2λI + XᵀX)⁻¹XᵀY
² Note that the intercept should be zero (β0 = 0) if the data is standardized
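The closed-form solution is easy to check numerically; a brief NumPy sketch (illustrative only, with synthetic data and assuming β0 = 0):

```python
import numpy as np

def ridge_normal_equation(X, Y, lam):
    """Solve (2*lam*I + X^T X) beta = X^T Y."""
    p = X.shape[1]
    return np.linalg.solve(2.0 * lam * np.eye(p) + X.T @ X, X.T @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
print(ridge_normal_equation(X, Y, lam=1.0))   # coefficients shrunk slightly toward zero
```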
Lasso regularization
• Lasso (Least Absolute Shrinkage and Selection Operator)
regularization:
λ‖β‖₁ = λ Σ_{j=1}^p |βj|
• It is non-differentiable. It forces some βj ’s to be exactly zero
• It can be used for feature selection (model selection). It selects
important features (removing non-informative or redundant features)
• When λ is larger, fewer features will be selected
Lasso regularized problems
• Logistic regression + lasso regularization (gradient methods are no
longer applicable)
minimize_{β0, β1, . . . , βp}  Σ_{i=1}^n [ log(1 + e^{β0 + βᵀxi}) − yi(β0 + βᵀxi) ] + λ Σ_{j=1}^p |βj|
• Linear regression + lasso regularization (gradient methods are no
longer applicable)
minimize_{β0, β1, . . . , βp}  (1/2) Σ_{i=1}^n (βᵀxi + β0 − yi)² + λ Σ_{j=1}^p |βj|
• In the following, we always assume β0 = 0. Note that the intercept
should be zero β0 = 0 if the data is standardized.
Given feature matrix X ∈ R^{n×p} and response vector Y ∈ R^n, the famous
lasso problem [1] is
minimize_{β∈R^p}  (1/2)‖Xβ − Y‖² + λ‖β‖₁
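To see the feature-selection effect in practice, here is a short sketch (not from the slides) using scikit-learn's Lasso solver. Note that sklearn's Lasso minimizes (1/(2n))‖Xβ − Y‖² + α‖β‖₁, so its alpha corresponds to λ/n in the formulation above; the synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0])
Y = X @ beta_true + 0.1 * rng.standard_normal(n)

lam = 10.0                                        # lambda in the lecture's formulation
model = Lasso(alpha=lam / n, fit_intercept=False).fit(X, Y)
print(model.coef_)                                # several coefficients are exactly zero
print(np.nonzero(model.coef_)[0])                 # indices of the selected features
```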
References
[1] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.