SW4428: Computer Vision
Lecture 3: Linear Regression and Logistic Regression
Sunok Kim
Department of Software Engineering
sunok.kim@kau.ac.kr
Today’s Topic
• Linear Regression
  • What is Regression?
  • Regression Function
  • Linear Regression
  • Cost Function for Linear Regression
  • Gradient Descent
• Logistic Regression
  • What is Classification?
  • Logistic Regression
  • Cost Function for Logistic Regression
Supervised Learning vs Unsupervised Learning
  [Figure: supervised learning vs. unsupervised learning]
Linear Regression
What is Regression?
• Regression vs. Classification (in CV)
• Examples:
What is Regression?
• Examples of Regression
  • Housing Prices (Portland, OR)
    [Figure: Price (in 1000's of dollars) vs. Size (feet^2)]
• Training Set of Housing Prices

  Size in feet^2 (x) | Price ($) in 1000's (y)
  -------------------|------------------------
  2,104              | 460
  1,416              | 232
  1,534              | 315
  852                | 178
  …                  | …
What is Regression?
• Regression relates input variables to the output variable, to predict
  outputs for new inputs and/or to understand the effect of the input
  on the output.

  y_i ∈ ℝ    “output”
  x_i ∈ ℝ^D  “input”
What is Regression?
• Dataset for Regression
  • In regression, the dataset consists of pairs {(x_i, y_i)}_{i=1}^{N},
    where y_i is the i-th output and x_i is a vector of D inputs.
  • The number of pairs N is the dataset size.
  • D is the dimensionality.
* Machine Learning Terms for Regression
  • Input variables are also known as features, covariates, independent
    variables, predictors, etc.
  • Output variables are also known as targets, labels, responses,
    outcomes, dependent variables, measured variables, etc.
What is Regression?
• Two Goals of Regression
• In prediction, we wish to predict the output for a new input vector,
e.g., what is the weight of a person who is 170 cm tall?
• In interpretation, we wish to understand the effect of inputs on output,
e.g., are taller people heavier too?
• Examples:
• What is the life expectancy of a person who has been smoking for 10
  years?
• Does smoking cause cancer?
• Will a massive-scale earthquake occur in California within the next
  30 years?
The Regression Function (Hypothesis)
• We need to find a mapping function that approximates
the output “well enough” given inputs.
  y_i ≈ f(x_i), for all i

  Training set → Learning Algorithm → Mapping Function f
  x_i → f → y_i ≈ f(x_i)
Linear Regression
• Linear regression is a model that assumes a linear relationship
between input and output.
  [Figure: “Weight” vs. “Height” with a fitted line]
• Why learn about linear regression?
  • Plenty of reasons: it is simple, easy to understand, the most widely
    used, and easily generalized to non-linear models.
    Most importantly, you can learn almost all fundamental concepts of
    machine learning with linear regression alone.
Simple Linear Regression
• With one-dimensional input, we get a simple linear regression.
  y_i ≈ f(x_i) = f_w(x_i) = w_0 + w_1 x_i

• Here, w = (w_0, w_1) are the two parameters of the model.
  They describe f_w(x).

  [Figure: y_i ∈ ℝ (“Weight”) vs. x_i ∈ ℝ (“Height”), with the line
   f_w(x): intercept w_0, slope w_1]
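A minimal sketch of the hypothesis above, y ≈ f_w(x) = w_0 + w_1 x. The parameter values are made up for illustration, not fitted to data.

```python
def f(x, w0, w1):
    """Simple linear regression hypothesis: intercept w0, slope w1."""
    return w0 + w1 * x

# e.g., predict "Weight" from "Height" with hypothetical parameters:
w0, w1 = -100.0, 1.0
prediction = f(170.0, w0, w1)   # predicted weight for a 170 cm tall person
```

With these illustrative parameters the prediction is 70.0; learning is the process of choosing w_0 and w_1 from data instead.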
Learning or Estimation or Fitting
• Given data, we would like to find the best parameters w*.
• This is called learning or estimating the parameters, or fitting
  the model.
• To do so, we need an optimization algorithm.
Cost Function
• Training Set of Housing Prices

  Size in feet^2 (x) | Price ($) in 1000's (y)
  -------------------|------------------------
  x_1 = 2,104        | y_1 = 460
  x_2 = 1,416        | y_2 = 232
  x_3 = 1,534        | y_3 = 315
  x_4 = 852          | y_4 = 178
  …                  | …

• Consider the following model: y_i ≈ f_w(x_i) = w_0 + w_1 x_i
• How can we estimate (or guess) values of w = (w_0, w_1) given
  the data?
Cost Function
• Consider the following model: y_i ≈ f_w(x_i) = w_0 + w_1 x_i

  [Figure: three example fits on the same data —
   (w_0 = 0, w_1 = 1.5), (w_0 = 0.5, w_1 = 0), (w_0 = 0.5, w_1 = 1)]
Cost Function
• Idea: Choose w = (w_0, w_1) so that f_w(x) is close to y for the
  training dataset {(x_i, y_i)}_{i=1}^{N}.

  min_w (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

  J(w) = (1/N) Σ_{i=1}^{N} ℓ(f_w(x_i), y_i)

  w* = argmin_w J(w)
Cost Function
• A cost function (or energy, loss, training objective) is used to
learn the parameters (w) that explain the data well.
• The cost function quantifies
how well our model explains the data or
how costly our mistakes are.
Cost Function
• Model:        f_w(x) = w_1 x      vs.   f_w(x) = w_0 + w_1 x
• Parameters:   w_1                 vs.   w = (w_0, w_1)
• Cost function (same for both models):

  J(w) = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

• Goal:         w* = argmin_w J(w)
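The cost above can be sketched directly. This is a minimal implementation of J(w) for the two-parameter model; the data points are illustrative (a perfectly linear toy set, y = x).

```python
def cost(w0, w1, xs, ys):
    """MSE cost J(w) = (1/N) * sum_i (w0 + w1*x_i - y_i)^2."""
    n = len(xs)
    return sum((w0 + w1 * x - y) ** 2 for x, y in zip(xs, ys)) / n

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]   # perfectly linear: y = x

perfect = cost(0.0, 1.0, xs, ys)   # the exact fit has zero cost
worse   = cost(0.5, 0.0, xs, ys)   # any other parameters cost more
```

Evaluating `cost` over a range of (w_0, w_1) values produces exactly the bowl-shaped J(w) surface the following slides plot.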
Cost Function
  [Figure: left — f_w(x) (for a fixed w_1, this is a function of x);
   right — J(w_1) (a function of the parameter w_1)]
Cost Function
• Model:
  f_w(x) = w_0 + w_1 x
• Parameters:
  w = (w_0, w_1)
• Cost function:

  J(w) = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

• Goal:
  w* = argmin_w J(w)
Cost Function
  [Figure: left — f_w(x) (for fixed w, a function of x): the line
   f_w(x) = 50 + 0.06x (w_0 = 50, w_1 = 0.06) over Price ($) in 1000's
   vs. Size in feet^2;
   right — J(w) (a function of the parameters w), drawn as a contour
   plot over (w_0, w_1)]

  “Contour Plots” or “Contour Figures”
Cost Function
  [Figure: left — a better-fitting line over Price ($) in 1000's vs.
   Size in feet^2; right — the corresponding point on the contour plot
   of J(w) over (w_0, w_1)]
Cost Function
• Examples
  • Mean Square Error (MSE): one of the most popular cost functions.

    MSE(w) = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

  • Mean Absolute Error (MAE)

    MAE(w) = (1/N) Σ_{i=1}^{N} |f_w(x_i) − y_i|
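A quick sketch comparing the two costs on the same residuals. Because MSE squares each error, one very large mistake dominates the total; MAE grows only linearly. The numbers are illustrative.

```python
def mse(preds, ys):
    """Mean Square Error over paired predictions and targets."""
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def mae(preds, ys):
    """Mean Absolute Error over paired predictions and targets."""
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

ys    = [0.0, 0.0, 0.0, 0.0]
preds = [1.0, 1.0, 1.0, 10.0]   # one outlier prediction

# MSE = (1 + 1 + 1 + 100) / 4 = 25.75
# MAE = (1 + 1 + 1 + 10)  / 4 = 3.25
```

The single outlier inflates MSE far more than MAE, which is exactly the robustness trade-off the next slide discusses.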
Cost Function
• Desirable Properties of Cost Functions
• When the target y is real-valued, it is often desirable that the
  cost is symmetric around 0, since positive and negative errors
  should be penalized equally.
• Also, the cost function should penalize “large” mistakes and
  “very-large” mistakes similarly.

  [Figure: loss ℓ(e_i) as a function of the error e_i]
Cost Function
• Desirable Properties of Cost Functions
• Statistical vs.
Computational Trade-off
• If we want better statistical properties,
then we have to give up good
computational properties.
• So which loss function is the best?
[Figure taken from Patrick Breheny’s slides]
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).
• Grid search is one of the simplest optimization algorithms.
  We compute the cost over all values w in a grid, and pick the best.
• This is brute-force, but extremely simple.
• However, it needs one nested “for loop” per parameter, so its
  computational complexity grows exponentially with the number of
  parameters.
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).

  [Figure: surface of J(w) over (w_0, w_1)]
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).
• Follow the Gradient: a gradient (at a point) is the slope of the
  tangent to the function (at that point). It points in the direction
  of largest increase of the function.
• The direction of gradient descent is the negative gradient: −∇J(w).
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).
• Outline:
  • Start with some w = (w_0, w_1) (e.g., w_0 = 0, w_1 = 1).
  • Keep changing w = (w_0, w_1) to reduce J(w),
    until we hopefully end up at a minimum.
Gradient Descent
Q: Where do we start?
Q: In which direction do we move?
Q: How far do we move?

  [Figure: surface of J(w_0, w_1) with the initial state marked]
Gradient Descent
• Gradient Descent Algorithm
• Repeat until convergence
  {
    w_0 ← w_0 − α ∂/∂w_0 J(w_0, w_1)
    w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)
  }
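The repeat-until-convergence loop above can be sketched for the simple linear regression cost. The gradient expressions follow by differentiating (1/N) Σ (w_0 + w_1 x_i − y_i)^2 with respect to each parameter; the learning rate and step count are arbitrary illustrative choices.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=1000):
    """Minimize the MSE cost of f_w(x) = w0 + w1*x by gradient descent."""
    w0, w1, n = 0.0, 0.0, len(xs)
    for _ in range(steps):
        errs = [w0 + w1 * x - y for x, y in zip(xs, ys)]
        g0 = 2.0 / n * sum(errs)                              # dJ/dw0
        g1 = 2.0 / n * sum(e * x for e, x in zip(errs, xs))   # dJ/dw1
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1             # simultaneous update
    return w0, w1

w0, w1 = gradient_descent([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])   # toy data: y = 1 + 2x
```

Note that both parameters are updated simultaneously from the same old w, matching the algorithm on the slide.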
Gradient Descent
  w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)

• If the slope at w_1 is positive:
  w_1 ← w_1 − α(positive number), so w_1 decreases.
• If the slope at w_1 is negative:
  w_1 ← w_1 − α(negative number), so w_1 increases.
• Either way, w_1 moves toward the minimum.
Gradient Descent
  w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)

• If α is too small, gradient descent can be slow.
• If α is too large, gradient descent can overshoot the minimum.
  It may fail to converge, or even diverge.
Gradient Descent
• Limitation of Gradient Descent

  w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)

• At a local minimum the gradient is 0, so w_1 ← w_1 − α(0):
  the update stops there, even if it is not the global minimum.
Multivariate Linear Regression
• Reminder: Single Variable
  • Housing Prices (Portland, OR)
    [Figure: Price (in 1000's of dollars) vs. Size (feet^2)]
• Training Set of Housing Prices

  Size in feet^2 (x) | Price ($) in 1000's (y)
  -------------------|------------------------
  2,104              | 460
  1,416              | 232
  1,534              | 315
  852                | 178
  …                  | …
Multivariate Linear Regression
• Multiple Variables

  Size (feet^2) | Number of bedrooms | Number of floors | Age of home (years) | Price ($1000)
  --------------|--------------------|------------------|---------------------|--------------
  2,104         | 5                  | 1                | 45                  | 460
  1,416         | 3                  | 2                | 40                  | 232
  1,534         | 3                  | 2                | 30                  | 315
  852           | 2                  | 1                | 36                  | 178
  …             | …                  | …                | …                   | …

  x_i = [x_i1, x_i2, …, x_iD]^T, output y_i
• Input (features) of the i-th training example: x_i
• Value of feature j in the i-th training example: x_ij
Multivariate Linear Regression
• Mapping Function for Multiple Variables
  Previously: f_w(x_i) = w_0 + w_1 x_i
Multivariate Linear Regression
• Mapping Function for Multiple Variables
  f_w(x_i) = w_0 + w_1 x_i1 + w_2 x_i2 + w_3 x_i3 + ⋯ + w_D x_iD

• For convenience of notation, define x_i0 = 1:

  x_i = [x_i0, x_i1, …, x_iD]^T ∈ ℝ^{D+1},  w = [w_0, w_1, …, w_D]^T ∈ ℝ^{D+1}

  f_w(x_i) = w^T x_i

• This is called “Multivariate Linear Regression”.
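The x_0 = 1 trick above can be sketched directly: prepend a constant 1 so the intercept w_0 folds into the dot product w^T x. The weight values below are hypothetical, purely for illustration.

```python
def predict(w, features):
    """Vectorized hypothesis f_w(x) = w^T x, with x_0 = 1 prepended."""
    x = [1.0] + list(features)                      # define x_0 = 1
    return sum(wj * xj for wj, xj in zip(w, x))     # dot product w^T x

# Hypothetical weights for [bias, size, bedrooms, floors, age]:
w = [80.0, 0.1, 10.0, -5.0, -0.5]
price = predict(w, [2104.0, 5.0, 1.0, 45.0])        # first row of the table
```

The bias term 80.0 plays the role of w_0; with these made-up weights the first house is priced at about 312.9 (in $1000's).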
Multivariate Linear Regression
• Mapping Function:
  f_w(x) = w^T x
• Parameters:
  w = [w_0, w_1, …, w_D]^T ∈ ℝ^{D+1}
• Cost function:

  J(w) = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2
Gradient Descent for Multiple Variables
• Single variable:
  Repeat {
    w_0 ← w_0 − α ∂/∂w_0 J(w_0, w_1)
    w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)
  }
• Multiple variables:
  Repeat {
    w_j ← w_j − α ∂/∂w_j J(w)
        = w_j − α ∂/∂w_j (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2
        = w_j − α (2/N) Σ_{i=1}^{N} (f_w(x_i) − y_i) x_ij
  }
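The multivariate update above, w_j ← w_j − α (2/N) Σ (f_w(x_i) − y_i) x_ij, applied to every coordinate j in one step. The toy data and hyperparameters are illustrative; each row already includes x_0 = 1.

```python
def step(w, X, ys, alpha):
    """One gradient descent step on the MSE cost for f_w(x) = w^T x."""
    n, d = len(X), len(w)
    preds = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]   # f_w(x_i)
    return [w[j] - alpha * 2.0 / n
            * sum((p - y) * x[j] for p, y, x in zip(preds, ys, X))
            for j in range(d)]

# Rows are [x_0 = 1, x_1, x_2]; targets follow y = 1 + 2*x_1 - x_2.
X  = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
ys = [1.0, 3.0, 0.0, 2.0]

w = [0.0, 0.0, 0.0]
for _ in range(2000):
    w = step(w, X, ys, alpha=0.1)
```

On this consistent toy problem the iterates approach the generating weights [1, 2, −1].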
Logistic Regression
What is Classification?
• Similar to regression, classification relates the input variable x
  to the output variable y, but y can only take on discrete values.
  We say that y is a categorical variable.
• Examples:
  • Email: Spam / Not Spam?
  • Online Transactions: Fraudulent (Yes / No)?
  • Tumor: Malignant / Benign?

  y ∈ {0, 1}   0: “Negative Class” (e.g., benign tumor)
               1: “Positive Class” (e.g., malignant tumor)
What is Classification?
• Examples of Classification
  [Figure: Malignant? (Yes = 1 / No = 0) vs. Tumor Size]
• Solution? Normalize the output of the mapping function f_w(x) = w^T x
  to lie between 0 and 1, and threshold at 0.5:
  • If f_w(x) ≥ 0.5, predict y = 1
  • If f_w(x) < 0.5, predict y = 0
The Classification Function (Hypothesis)
• Want 0 ≤ f_w(x) ≤ 1

  f_w(x) = σ(w^T x), where σ(z) = 1/(1 + exp(−z))

  → f_w(x) = 1 / (1 + exp(−w^T x))
  → Sigmoid Function / Logistic Function
  [Figure: sigmoid curve rising from 0 to 1, equal to 0.5 at z = 0]

• Predict y = 1 if f_w(x) ≥ 0.5, i.e., w^T x ≥ 0
• Predict y = 0 if f_w(x) < 0.5, i.e., w^T x < 0
The Classification Function (Hypothesis)
• f_w(x): estimated probability that y = 1 on input x
• f_w(x) = P(y = 1 | x; w)
  “Probability that y = 1, given x, parameterized by w”
• P(y = 0 | x; w) + P(y = 1 | x; w) = 1
• P(y = 0 | x; w) = 1 − P(y = 1 | x; w)
• Example: If x = [x_0, x_1]^T = [1, TumorSize]^T and f_w(x) = 0.7:
  • Tell the patient there is a 70% chance of the tumor being malignant
  • Tell the patient there is a 30% chance of the tumor being
    non-malignant
Decision Boundary
  f_w(x) = σ(w^T x) = σ(w_0 + w_1 x_1 + w_2 x_2)

  f_w(x) = 1 / (1 + exp(−w^T x))
         = 1 / (1 + exp(−w_0 − w_1 x_1 − w_2 x_2))

  [Figure: the line x_1 + x_2 = 3 separating the (x_1, x_2) plane]

• With w = (−3, 1, 1):
  • Predict y = 1 if −3 + x_1 + x_2 ≥ 0
  • Predict y = 0 if −3 + x_1 + x_2 < 0
Non-linear Decision Boundary
  f_w(x) = σ(w^T x) = σ(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2)

  [Figure: the circle x_1^2 + x_2^2 = 1 separating the (x_1, x_2) plane]

• Predict y = 1 if −1 + x_1^2 + x_2^2 ≥ 0
• Predict y = 0 if −1 + x_1^2 + x_2^2 < 0
Logistic Regression
• Training set: {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)}
• i-th example: x_i = [x_i0, x_i1, …, x_iD]^T ∈ ℝ^{D+1}, x_i0 = 1,
  y_i ∈ {0, 1}
• The mapping function:

  f_w(x) = 1 / (1 + exp(−w^T x))

• How do we choose the parameters w?
Cost Function
• Simplest possible cost function: e.g., Mean Square Error (MSE)

  J(w) = (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i)^2

• With the sigmoid inside f_w, this J(w) is “non-convex” (many local
  minima). We would rather have a “convex” cost.

  [Figure: a non-convex J(w) with many local minima vs. a convex J(w)
   with a single global minimum]
Cost Function for Logistic Regression
• ℓ(f_w(x_i), y_i) = −log f_w(x_i)        if y_i = 1
                     −log(1 − f_w(x_i))   if y_i = 0

• If y_i = 1:
  [Figure: −log f_w(x) as a function of f_w(x)]
  • ℓ(f_w(x_i), y_i) = 0 if f_w(x_i) = 1.
  • But as f_w(x_i) → 0, ℓ(f_w(x_i), y_i) → ∞.
  • If f_w(x_i) = 0 (predicting P(y = 1 | x; w) = 0) but y_i = 1,
    we penalize the learning algorithm with a very large cost.
Cost Function for Logistic Regression
• ℓ(f_w(x_i), y_i) = −log f_w(x_i)        if y_i = 1
                     −log(1 − f_w(x_i))   if y_i = 0

• If y_i = 0:
  [Figure: −log(1 − f_w(x)) as a function of f_w(x)]
Cost Function for Logistic Regression
  J(w) = (1/N) Σ_{i=1}^{N} ℓ(f_w(x_i), y_i)

  ℓ(f_w(x_i), y_i) = −log f_w(x_i)        if y_i = 1
                     −log(1 − f_w(x_i))   if y_i = 0

  Note: y_i = 0 or y_i = 1 always, so the two cases combine into

  ℓ(f_w(x_i), y_i) = −y_i log f_w(x_i) − (1 − y_i) log(1 − f_w(x_i))
Cost Function for Logistic Regression
  J(w) = (1/N) Σ_{i=1}^{N} ℓ(f_w(x_i), y_i)

  J(w) = −(1/N) Σ_{i=1}^{N} [ y_i log f_w(x_i) + (1 − y_i) log(1 − f_w(x_i)) ]

  To fit the parameters: w* = argmin_w J(w)

  To make a prediction given a new x:

  f_w(x) = 1 / (1 + exp(−w^T x))
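The combined cost above can be sketched directly. One practical detail not on the slide: predictions are clipped slightly away from 0 and 1 so the logarithms stay finite (an implementation choice, with an arbitrary epsilon).

```python
import math

def cross_entropy(preds, ys, eps=1e-12):
    """J(w) = -(1/N) * sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ]."""
    total = 0.0
    for p, y in zip(preds, ys):
        p = min(max(p, eps), 1.0 - eps)    # clip to keep log() finite
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(ys)

# Confident correct predictions cost almost nothing;
# confident mistakes cost a lot, as the loss curves above show.
low  = cross_entropy([0.99, 0.01], [1, 0])   # correct and confident
high = cross_entropy([0.01, 0.99], [1, 0])   # wrong and confident
```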
Gradient Descent for Logistic Regression
  J(w) = −(1/N) Σ_{i=1}^{N} [ y_i log f_w(x_i) + (1 − y_i) log(1 − f_w(x_i)) ]

  Want w* = argmin_w J(w)

  Repeat {
    w_j ← w_j − α ∂/∂w_j J(w)
        = w_j − α (1/N) Σ_{i=1}^{N} (f_w(x_i) − y_i) x_ij
  }
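Putting the pieces together: a sketch of the full training loop using the update w_j ← w_j − α (1/N) Σ (f_w(x_i) − y_i) x_ij from the slide. The toy dataset (1-D, with x_0 = 1 prepended) and the hyperparameters are illustrative choices.

```python
import math

def train_logistic(X, ys, alpha=0.5, steps=5000):
    """Fit logistic regression by gradient descent on the cross-entropy cost."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(steps):
        # f_w(x_i) = sigmoid(w^T x_i) for every example
        preds = [1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
                 for x in X]
        # simultaneous update of every coordinate w_j
        w = [w[j] - alpha / n
             * sum((p - y) * x[j] for p, y, x in zip(preds, ys, X))
             for j in range(len(w))]
    return w

# Separable toy data: y = 1 exactly when the feature exceeds 2.5.
X  = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
ys = [0, 0, 1, 1]
w = train_logistic(X, ys)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0 for x in X]
```

Note the update has the same form as the linear regression rule, but with the sigmoid hypothesis inside f_w.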
Multi-Class Classification: One vs. All
• E-mail Foldering/Tagging: Work, Friends, Family, Hobby
• Medical Diagnosis: Not ill, Cold, Flu
• Weather: Sunny, Cloudy, Rain, Snow
Multi-Class Classification: One vs. All
  [Figure: left — binary classification, two classes in the (x_1, x_2)
   plane; right — multi-class classification, three classes]
Multi-Class Classification: One vs. All
  [Figure: one-vs-all — three binary classifiers, each separating
   Class 1, Class 2, or Class 3 from the other two in the (x_1, x_2)
   plane]

  f_w^{(m)}(x) = P(y = m | x; w)   (m = 1, 2, 3)
Multi-Class Classification: One vs. All
• Train a logistic regression classifier f_w^{(m)}(x) for each class m
  to predict the probability that y = m.
• On a new input x, to make a prediction, pick the class m that
  maximizes:

  m* = argmax_m f_w^{(m)}(x)
Up Next:
• Neural Network and Back-propagation
Thank you!