Lecture 3: Linear Regression and Logistic Regression
Sunok Kim
Department of Software Engineering
sunok.kim@kau.ac.kr
Today’s Topic
• Linear Regression
  • What is Regression?
  • Regression Function
  • Linear Regression
  • Cost Function for Linear Regression
  • Gradient Descent
• Logistic Regression
  • What is Classification?
  • Logistic Regression
  • Cost Function for Logistic Regression
Supervised Learning vs Unsupervised Learning
Linear Regression
What is Regression?
• Regression vs. Classification (in CV)
• Examples:
What is Regression?
• Examples of Regression
[Figure: housing prices: Price ($1000s) vs. Size (feet²)]
• Training Set of Housing Prices

    Size in feet² (x)    Price ($1000s) (y)
    2,104                460
    1,416                232
    1,534                315
    852                  178
    …                    …
What is Regression?
• Regression relates input variables to the output variable, in order to predict
  outputs for new inputs and/or to understand the effect of the inputs on the output.
    $y_n \in \mathbb{R}$   ("output")
    $\mathbf{x}_n \in \mathbb{R}^D$   ("input")
What is Regression?
• Dataset for Regression
• In regression, the dataset consists of pairs $\{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$,
  where $y_n$ is the $n$-th output and $\mathbf{x}_n$ is a vector of $D$ inputs.
• The number of pairs $N$ is the dataset size.
• $D$ is the dimensionality.
What is Regression?
• Two Goals of Regression
• In prediction, we wish to predict the output for a new input vector,
e.g., what is the weight of a person who is 170 cm tall?
• In interpretation, we wish to understand the effect of inputs on output,
e.g., are taller people heavier too?
• Examples:
• What is the life-expectancy of a person who has been smoking for 10
years?
• Does smoking cause cancer?
• Will a massive earthquake occur in California within the next 30 years?
The Regression Function (Hypothesis)
• We need to find a mapping function that approximates
the output “well enough” given inputs.
[Diagram: Training set → Learning Algorithm → mapping function]
Linear Regression
• Linear regression is a model that assumes a linear relationship
between input and output.
[Figure: "Weight" plotted against "Height" with a fitted line]
• Why learn about linear regression?
• Plenty of reasons: simple, easy to understand, most widely used, easily
generalized to non-linear models.
Most importantly, you can learn almost all of the fundamental concepts of
machine learning with linear regression alone.
Simple Linear Regression
• With one-dimensional input, we get a simple linear regression.
    $y_n \approx f(x_n) = f_{\mathbf{w}}(x_n) = w_0 + w_1 x_n$,  with $x_n \in \mathbb{R}$
[Figure: "Height" plotted against the one-dimensional input, with the fitted line]
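For concreteness, here is a minimal Python sketch of this hypothesis (the function and variable names and the sample values are illustrative, not from the slides):

```python
import numpy as np

def f_w(x, w0, w1):
    """Simple linear hypothesis: f_w(x) = w0 + w1 * x."""
    return w0 + w1 * x

# Illustrative inputs and parameters only.
x = np.array([1.0, 2.0, 3.0])
print(f_w(x, w0=0.5, w1=1.0))  # -> [1.5 2.5 3.5]
```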
Learning or Estimation or Fitting
• Given data, we would like to find the parameters $\mathbf{w}^*$.
• This is called learning or estimating the parameters or fitting
the model.
• To do so, we need an optimization algorithm.
Cost Function
• Training Set of Housing Prices

    Size in feet² (x)        Price ($1000s) (y)
    $x_1$ = 2,104            $y_1$ = 460
    $x_2$ = 1,416            $y_2$ = 232
    $x_3$ = 1,534            $y_3$ = 315
    $x_4$ = 852              $y_4$ = 178
    …                        …
Cost Function
• Consider the following model: $y_n \approx f_{\mathbf{w}}(x_n) = w_0 + w_1 x_n$
[Figure: three example fits of $f_{\mathbf{w}}(x)$]
    $(w_0, w_1) = (1.5,\ 0)$      $(w_0, w_1) = (0,\ 0.5)$      $(w_0, w_1) = (1,\ 0.5)$
Cost Function
• Idea: Choose $\mathbf{w} = (w_0, w_1)$ so that $f_{\mathbf{w}}(x)$ is close to $y$ for the
  training dataset $\{(x_n, y_n)\}_{n=1}^{N}$:

    $\min_{\mathbf{w}} \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(x_n) - y_n\big)^2$

    $J(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \mathcal{L}\big(f_{\mathbf{w}}(x_n), y_n\big)$

    $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
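A minimal sketch of this cost in Python, assuming the $\frac{1}{N}$ normalization shown above (some courses use $\frac{1}{2N}$ instead); the data and parameter values are illustrative:

```python
import numpy as np

def f_w(x, w0, w1):
    """Simple linear hypothesis: f_w(x) = w0 + w1 * x."""
    return w0 + w1 * x

def cost_J(w0, w1, x, y):
    """MSE cost J(w) = (1/N) * sum_n (f_w(x_n) - y_n)^2."""
    return np.mean((f_w(x, w0, w1) - y) ** 2)

# Toy data: a perfect fit gives zero cost, a worse fit costs more.
x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(cost_J(0.0, 1.0, x, y))  # 0.0
print(cost_J(0.0, 0.5, x, y))  # about 1.17
```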
Cost Function
• A cost function (or energy, loss, training objective) is used to
learn the parameters (w) that explain the data well.
Cost Function
• Model (simplified):
    $f_{\mathbf{w}}(x) = w_1 x$,  with parameter $w_1$
• Model (full):
    $f_{\mathbf{w}}(x) = w_0 + w_1 x$,  with parameters $\mathbf{w} = (w_0, w_1)$
• Cost function (in both cases):
    $J(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(x_n) - y_n\big)^2$
• Goal (in both cases):
    $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
Cost Function
[Figure: left: $f_{\mathbf{w}}(x)$ (for fixed $w_1$, this is a function of $x$);
 right: $J(w_1)$ (a function of the parameter $w_1$)]
Cost Function
• Model:
    $f_{\mathbf{w}}(x) = w_0 + w_1 x$
• Parameters:
    $\mathbf{w} = (w_0, w_1)$
• Cost function:
    $J(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(x_n) - y_n\big)^2$
• Goal:
    $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
Cost Function
[Figure: left: $f_{\mathbf{w}}(x)$ (for fixed $\mathbf{w}$, a function of $x$): housing data, Price ($1000s)
 vs. Size (feet²), with the fitted line $f_{\mathbf{w}}(x) = 50 + 0.06x$, i.e. $w_0 = 50$, $w_1 = 0.06$;
 right: $J(\mathbf{w})$ (a function of the parameters $w_0$ and $w_1$)]
Cost Function
• Examples
• Mean Squared Error (MSE): one of the most popular cost functions.

    $\mathrm{MSE}(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(\mathbf{x}_n) - y_n\big)^2$
Cost Function
• Desirable Properties of Cost Functions
• When the target $y$ is real-valued, it is often desirable that the
  cost is symmetric around 0, since positive and negative errors
  should be penalized equally.
• Also, the cost function should penalize “large” mistakes and
“very-large” mistakes similarly.
[Figure: the loss $\mathcal{L}(e_n)$ plotted as a function of the error $e_n$]
Cost Function
• Desirable Properties of Cost Functions
• Statistical vs. Computational Trade-off
• If we want better statistical properties,
then we have to give up good
computational properties.
• So which loss function is the best?
Gradient Descent
• Have some function $J(\mathbf{w})$; we want $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$.
Gradient Descent
• Have some function $J(\mathbf{w})$; we want $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$.
[Figure: from the original $\mathbf{w}$, move in the negative gradient direction $-\nabla J(\mathbf{w})$]
Gradient Descent
• Have some function $J(\mathbf{w})$; we want $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$.
Gradient Descent
Q: Where do we start?
Q: In which direction do we move?
Q: How far do we move?
[Figure: surface of $J(w_0, w_1)$ over the $(w_0, w_1)$ plane, with the initial state marked]
Gradient Descent
• Gradient Descent Algorithm
• Repeat until convergence {

      $w_0 \leftarrow w_0 - \alpha \dfrac{\partial}{\partial w_0} J(w_0, w_1)$

      $w_1 \leftarrow w_1 - \alpha \dfrac{\partial}{\partial w_1} J(w_0, w_1)$

  }
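A minimal sketch of this loop for the simple linear model, using the MSE gradients $\frac{\partial J}{\partial w_0} = \frac{2}{N}\sum_n (f_{\mathbf{w}}(x_n) - y_n)$ and $\frac{\partial J}{\partial w_1} = \frac{2}{N}\sum_n (f_{\mathbf{w}}(x_n) - y_n)\,x_n$; the learning rate, iteration count, and toy data are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.05, num_iters=2000):
    """Gradient descent for f_w(x) = w0 + w1*x with the MSE cost J(w)."""
    w0, w1 = 0.0, 0.0
    N = len(x)
    for _ in range(num_iters):
        residual = (w0 + w1 * x) - y                 # f_w(x_n) - y_n
        grad_w0 = (2.0 / N) * np.sum(residual)       # dJ/dw0
        grad_w1 = (2.0 / N) * np.sum(residual * x)   # dJ/dw1
        w0, w1 = w0 - alpha * grad_w0, w1 - alpha * grad_w1  # simultaneous update
    return w0, w1

# Toy data generated from y = 1 + 2x (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x
print(gradient_descent(x, y))  # should approach (1.0, 2.0)
```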
Gradient Descent
    $w_1 \leftarrow w_1 - \alpha \dfrac{\partial}{\partial w_1} J(w_0, w_1)$

    If the slope is positive: $w_1 \leftarrow w_1 - \alpha \cdot (\text{positive number})$, so $w_1$ decreases.
    If the slope is negative: $w_1 \leftarrow w_1 - \alpha \cdot (\text{negative number})$, so $w_1$ increases.
Gradient Descent
    $w_1 \leftarrow w_1 - \alpha \dfrac{\partial}{\partial w_1} J(w_0, w_1)$
[Figure: gradient descent steps on $J$ plotted against $w_1$]
Gradient Descent
• Limitation of Gradient Descent
    $w_1 \leftarrow w_1 - \alpha \dfrac{\partial}{\partial w_1} J(w_0, w_1)$

    At a point where the slope is zero: $w_1 \leftarrow w_1 - \alpha \cdot 0$,
    so $w_1$ stops changing, even if this point is only a local (not the global) minimum.
Multivariate Linear Regression
• Reminder: Single Variable
• Housing Prices (Portland, OR)
[Figure: Price (in $1000s of dollars) vs. Size (feet²)]
• Training Set of Housing Prices

    Size in feet² (x)    Price ($1000s) (y)
    2,104                460
    1,416                232
    1,534                315
    852                  178
    …                    …
Multivariate Linear Regression
• Multiple Variables
    Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000s)
    2,104          5                    1                  45                    460
    1,416          3                    2                  40                    232
    1,534          3                    2                  30                    315
    852            2                    1                  36                    178
    …              …                    …                  …                     …

    Previously: $f_{\mathbf{w}}(x_n) = w_0 + w_1 x_n$
Multivariate Linear Regression
• Mapping Function for Multiple Variables
    $f_{\mathbf{w}}(\mathbf{x}_n) = \mathbf{w}^{\top} \mathbf{x}_n$
Multivariate Linear Regression
• Mapping Function
    $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^{\top} \mathbf{x}$
Gradient Descent for Multiple Variables
• Single variable:
    Repeat {
        $w_0 \leftarrow w_0 - \alpha \dfrac{\partial}{\partial w_0} J(w_0, w_1)$
        $w_1 \leftarrow w_1 - \alpha \dfrac{\partial}{\partial w_1} J(w_0, w_1)$
    }
• Multiple variables:
    Repeat {
        $w_j \leftarrow w_j - \alpha \dfrac{\partial}{\partial w_j} J(\mathbf{w})
             = w_j - \alpha \dfrac{\partial}{\partial w_j} \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(\mathbf{x}_n) - y_n\big)^2
             = w_j - \alpha \dfrac{2}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(\mathbf{x}_n) - y_n\big)\, x_{nj}$
    }
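A vectorized sketch of the multiple-variable update (with $x_{n0} = 1$ as the bias feature); again the data, learning rate, and iteration count are illustrative:

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.05, num_iters=2000):
    """Gradient descent for f_w(x) = w^T x with the MSE cost.
    The gradient of J(w) is (2/N) * X^T (X w - y)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        residual = X @ w - y                         # f_w(x_n) - y_n for all n
        w -= alpha * (2.0 / N) * (X.T @ residual)    # update every w_j at once
    return w

# Toy data with a leading column of ones for the bias term (illustrative only).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = 1.0 + 2.0 * X[:, 1]
print(gradient_descent_multi(X, y))  # should approach [1.0, 2.0]
```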
Logistic Regression
What is Classification?
• Similar to regression, classification relates the input variable $\mathbf{x}$ to the output
  variable $y$, but $y$ can only take on discrete values. We say that $y$ is a categorical variable.
• Examples:
• Email: Spam / Not Spam?
• Online Transactions: Fraudulent (Yes / No)?
• Tumor: Malignant / Benign?
    $y \in \{0, 1\}$:  0 = "Negative Class" (e.g., benign tumor),
                       1 = "Positive Class" (e.g., malignant tumor)
What is Classification?
• Examples of Classification
[Figure: tumor example: Malignant? (Yes = 1, No = 0) plotted against Tumor Size]
The Classification Function (Hypothesis)
• $f_{\mathbf{w}}(\mathbf{x})$: estimated probability that $y = 1$ on input $\mathbf{x}$
• $f_{\mathbf{w}}(\mathbf{x}) = P(y = 1 \mid \mathbf{x}; \mathbf{w})$
  "Probability that $y = 1$, given $\mathbf{x}$, parameterized by $\mathbf{w}$"
• $P(y = 0 \mid \mathbf{x}; \mathbf{w}) + P(y = 1 \mid \mathbf{x}; \mathbf{w}) = 1$
• $P(y = 0 \mid \mathbf{x}; \mathbf{w}) = 1 - P(y = 1 \mid \mathbf{x}; \mathbf{w})$
• Predict $y = 1$ if $-3 + x_1 + x_2 \geq 0$
• Predict $y = 0$ if $-3 + x_1 + x_2 < 0$
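A tiny check of the linear decision rule above (the test points are illustrative only):

```python
def predict(x1, x2):
    """Decision rule from the slide: predict y = 1 if -3 + x1 + x2 >= 0, else 0."""
    return 1 if -3.0 + x1 + x2 >= 0 else 0

print(predict(1.0, 1.0))  # -3 + 2 = -1 < 0  -> predict 0
print(predict(2.0, 2.0))  # -3 + 4 =  1 >= 0 -> predict 1
```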
Non-linear Decision Boundary
    $f_{\mathbf{w}}(\mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x}) = \sigma(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)$
[Figure: circular decision boundary $x_1^2 + x_2^2 = 1$ in the $(x_1, x_2)$ plane]
• Predict $y = 1$ if $-1 + x_1^2 + x_2^2 \geq 0$
Logistic Regression
• Training set: $\{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_N, y_N)\}$
• $n$-th example: $\mathbf{x}_n = [x_{n0}, x_{n1}, \ldots, x_{nD}]^{\top} \in \mathbb{R}^{D+1}$, with $x_{n0} = 1$ and $y_n \in \{0, 1\}$
• The mapping function:

    $f_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^{\top}\mathbf{x})}$
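A minimal sketch of this mapping function (the weights and inputs are illustrative; the first feature is the constant $x_{n0} = 1$):

```python
import numpy as np

def f_w(X, w):
    """Logistic mapping function: f_w(x) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ w))

# Each row is an example with a bias feature x_0 = 1 (illustrative values only).
X = np.array([[1.0, 0.5], [1.0, -2.0]])
w = np.array([0.0, 1.0])
print(f_w(X, w))  # values in (0, 1), read as P(y = 1 | x; w)
```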
Cost Function
• A simple possible cost function, e.g., Mean Squared Error (MSE):

    $J(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(\mathbf{x}_n) - y_n\big)^2$

[Figure: with the logistic mapping, this $J(\mathbf{w})$ is "non-convex" (many local minima),
 whereas we want a "convex" $J(\mathbf{w})$ with a single global minimum]
Cost Function for Logistic Regression
    $\mathcal{L}\big(f_{\mathbf{w}}(\mathbf{x}_n), y_n\big) =
      \begin{cases} -\log f_{\mathbf{w}}(\mathbf{x}_n) & \text{if } y_n = 1 \\
                    -\log\big(1 - f_{\mathbf{w}}(\mathbf{x}_n)\big) & \text{if } y_n = 0 \end{cases}$

• If $y_n = 1$: the loss is 0 when $f_{\mathbf{w}}(\mathbf{x}_n) = 1$,
  but as $f_{\mathbf{w}}(\mathbf{x}_n) \to 0$, $\mathcal{L}\big(f_{\mathbf{w}}(\mathbf{x}_n), y_n\big) \to \infty$.
  If $f_{\mathbf{w}}(\mathbf{x}_n) = 0$ (i.e., we predict $P(y = 1 \mid \mathbf{x}; \mathbf{w}) = 0$) but $y_n = 1$,
  the learning algorithm is penalized by a very large cost.
[Figure: $-\log f_{\mathbf{w}}(\mathbf{x})$ plotted against $f_{\mathbf{w}}(\mathbf{x})$ for the case $y = 1$]
Cost Function for Logistic Regression
    $\mathcal{L}\big(f_{\mathbf{w}}(\mathbf{x}_n), y_n\big) =
      \begin{cases} -\log f_{\mathbf{w}}(\mathbf{x}_n) & \text{if } y_n = 1 \\
                    -\log\big(1 - f_{\mathbf{w}}(\mathbf{x}_n)\big) & \text{if } y_n = 0 \end{cases}$

• If $y_n = 0$:
[Figure: $-\log\big(1 - f_{\mathbf{w}}(\mathbf{x})\big)$ plotted against $f_{\mathbf{w}}(\mathbf{x})$ for the case $y = 0$]
Cost Function for Logistic Regression
    $J(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \mathcal{L}\big(f_{\mathbf{w}}(\mathbf{x}_n), y_n\big)$

    $\mathcal{L}\big(f_{\mathbf{w}}(\mathbf{x}_n), y_n\big) =
      \begin{cases} -\log f_{\mathbf{w}}(\mathbf{x}_n) & \text{if } y_n = 1 \\
                    -\log\big(1 - f_{\mathbf{w}}(\mathbf{x}_n)\big) & \text{if } y_n = 0 \end{cases}$

    Note: $y_n = 0$ or $y_n = 1$ always.
Cost Function for Logistic Regression
    $J(\mathbf{w}) = \dfrac{1}{N} \sum_{n=1}^{N} \mathcal{L}\big(f_{\mathbf{w}}(\mathbf{x}_n), y_n\big)$

    $J(\mathbf{w}) = -\dfrac{1}{N} \sum_{n=1}^{N} \Big[ y_n \log f_{\mathbf{w}}(\mathbf{x}_n) + (1 - y_n) \log\big(1 - f_{\mathbf{w}}(\mathbf{x}_n)\big) \Big]$

• To fit the parameters: $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$
• To make a prediction given a new $\mathbf{x}$:

    $f_{\mathbf{w}}(\mathbf{x}) = \dfrac{1}{1 + \exp(-\mathbf{w}^{\top}\mathbf{x})}$
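A minimal sketch of this cost (the `eps` clipping guards against $\log 0$ and is an implementation detail, not part of the slides; the data and weights are illustrative):

```python
import numpy as np

def f_w(X, w):
    """Logistic mapping: f_w(x) = 1 / (1 + exp(-w^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ w))

def cost_J(w, X, y, eps=1e-12):
    """Cross-entropy cost: J(w) = -(1/N) * sum_n [y_n log f + (1 - y_n) log(1 - f)]."""
    f = np.clip(f_w(X, w), eps, 1.0 - eps)
    return -np.mean(y * np.log(f) + (1.0 - y) * np.log(1.0 - f))

# Toy data with a bias feature x_0 = 1 (illustrative only).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
print(cost_J(np.array([0.0, 1.0]), X, y))   # about 0.22 (good fit, low cost)
print(cost_J(np.array([0.0, -1.0]), X, y))  # about 1.72 (wrong sign, high cost)
```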
Gradient Descent for Logistic Regression
    $J(\mathbf{w}) = -\dfrac{1}{N} \sum_{n=1}^{N} \Big[ y_n \log f_{\mathbf{w}}(\mathbf{x}_n) + (1 - y_n) \log\big(1 - f_{\mathbf{w}}(\mathbf{x}_n)\big) \Big]$

    Want $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w})$

    Repeat {
        $w_j \leftarrow w_j - \alpha \dfrac{\partial}{\partial w_j} J(\mathbf{w})
             = w_j - \alpha \dfrac{1}{N} \sum_{n=1}^{N} \big(f_{\mathbf{w}}(\mathbf{x}_n) - y_n\big)\, x_{nj}$
    }
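A minimal sketch of this loop; note that the update has the same form as for linear regression, but $f_{\mathbf{w}}$ is now the logistic function (learning rate, iteration count, and data are illustrative):

```python
import numpy as np

def logistic_gd(X, y, alpha=0.1, num_iters=2000):
    """Gradient descent for logistic regression; the gradient of the
    cross-entropy cost is (1/N) * X^T (f_w(X) - y)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(num_iters):
        f = 1.0 / (1.0 + np.exp(-X @ w))     # f_w(x_n) for all n
        w -= alpha * (X.T @ (f - y)) / N     # simultaneous update of all w_j
    return w

# Toy data with a bias feature x_0 = 1 (illustrative only).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_gd(X, y)
print((1.0 / (1.0 + np.exp(-X @ w)) >= 0.5).astype(int))  # -> [0 0 1 1]
```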
Multi-Class Classification: One vs. All
• E-mail Foldering/Tagging: Work, Friends, Family, Hobby
Multi-Class Classification: One vs. All
Multi-Class Classification: One vs. All
[Figure: one-vs-all with three classes (Class 1, Class 2, Class 3) in the $(x_1, x_2)$ plane:
 one binary classifier per class, each separating that class from the rest]

    $f_{\mathbf{w}}^{(m)}(\mathbf{x}) = P(y = m \mid \mathbf{x}; \mathbf{w}) \quad (m = 1, 2, 3)$
Multi-Class Classification: One vs. All
• Train a logistic regression classifier $f_{\mathbf{w}}^{(m)}(\mathbf{x})$ for each class $m$
  to predict the probability that $y = m$.
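A minimal sketch of one-vs-all built on the logistic regression pieces above (class labels, data, and hyperparameters are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, num_classes, alpha=0.5, num_iters=5000):
    """Train one logistic regression classifier per class m, treating
    class m as the positive class (label 1) and all others as 0."""
    N, D = X.shape
    W = np.zeros((num_classes, D))
    for m in range(num_classes):
        y_binary = (y == m).astype(float)            # one-vs-all relabeling
        for _ in range(num_iters):
            f = sigmoid(X @ W[m])
            W[m] -= alpha * (X.T @ (f - y_binary)) / N
    return W

def predict_one_vs_all(W, X):
    """Pick the class whose classifier reports the highest probability."""
    return np.argmax(sigmoid(X @ W.T), axis=1)

# Toy data: 3 classes spread along one feature, bias x_0 = 1 (illustrative only).
X = np.array([[1.0, -2.0], [1.0, -1.5], [1.0, 0.0],
              [1.0, 0.5], [1.0, 2.0], [1.0, 2.5]])
y = np.array([0, 0, 1, 1, 2, 2])
W = train_one_vs_all(X, y, num_classes=3)
print(predict_one_vs_all(W, X))  # expected: [0 0 1 1 2 2]
```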
Up Next:
Thank you!