
SW4428: Computer Vision

Lecture 3: Linear Regression and Logistic Regression

Sunok Kim
Department of Software Engineering
sunok.kim@kau.ac.kr
Today’s Topic
• Linear Regression
• What is Regression?
• Regression Function
• Linear Regression
• Cost Function for Linear Regression
• Gradient Descent
• Logistic Regression
• What is Classification?
• Logistic Regression
• Cost Function for Logistic Regression

Supervised Learning vs Unsupervised Learning

[Figure: side-by-side example panels, “Supervised Learning” and “Unsupervised Learning”]
Linear Regression

What is Regression?
• Regression vs. Classification (in CV)
• Examples:

What is Regression?
• Examples of Regression
• Housing Prices (Portland, OR)

[Scatter plot: Size (feet²) on the x-axis (0 to 3000) vs. Price (in 1000's of dollars) on the y-axis (0 to 500)]

• Training Set of Housing Prices

  Size in feet² (x)      Price ($) in 1000's (y)
  2,104                  460
  1,416                  232
  1,534                  315
  852                    178
  …                      …
What is Regression?
• Regression relates input variables to the output variable, either to
predict outputs for new inputs, and/or to understand the effect of the
input on the output.

y_n ∈ ℝ    “output”

x_n ∈ ℝ^D   “input”
What is Regression?
• Dataset for Regression
• In regression, the dataset consists of pairs { (x_n, y_n) }_{n=1}^{N},
where y_n is the n-th output and x_n is a vector of D inputs.
• The number of pairs N is the dataset size.
• D is the dimensionality.

* Machine Learning Terms for Regression

• Input variables are also known as features, covariates, independent
variables, predictors, etc.
• Output variables are also known as target, label, response, outcome,
dependent variable, measured variable, etc.
What is Regression?
• Two Goals of Regression
• In prediction, we wish to predict the output for a new input vector,
e.g., what is the weight of a person who is 170 cm tall?
• In interpretation, we wish to understand the effect of inputs on output,
e.g., are taller people heavier too?

• Examples:
• What is the life-expectancy of a person who has been smoking for 10
years?
• Does smoking cause cancer?
• Will a massive-scale earthquake occur in California within the next 30 years?
The Regression Function (Hypothesis)
• We need to find a mapping function that approximates
the output “well enough” given inputs:

y_n ≈ f(x_n), for all n

[Diagram: Training set → Learning Algorithm → Mapping Function;
an input x_n fed to the Mapping Function gives y_n ≈ f(x_n)]
Linear Regression
• Linear regression is a model that assumes a linear relationship
between input and output.

[Figure: scatter plot of “Height” (input) against “Weight” (output) with a fitted line]

• Why learn about linear regression?
• Plenty of reasons: simple, easy to understand, most widely used, easily
generalized to non-linear models.
Most importantly, you can learn almost all fundamental concepts of
machine learning with linear regression alone.
Simple Linear Regression
• With one-dimensional input, we get a simple linear regression:

y_n ≈ f(x_n) = f_w(x_n) = w_0 + w_1 x_n

• Here, w = (w_0, w_1) are the two parameters of the model.
They describe f_w(x).

[Figure: a line f_w(x) fit to “Height” (x_n ∈ ℝ) vs. “Weight” (y_n ∈ ℝ) data,
with w_0 the intercept and w_1 the slope]
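As a concrete illustration, a minimal Python sketch of this model; the function name and the parameter values are made-up placeholders, not fitted values.

def f_w(x, w0, w1):
    """Simple linear regression model: f_w(x) = w_0 + w_1 * x."""
    return w0 + w1 * x

# Hypothetical intercept and slope for a height (cm) -> weight (kg) model.
w0, w1 = -100.0, 0.95
print(f_w(170.0, w0, w1))   # predicted weight for a 170 cm tall person
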
Learning or Estimation or Fitting
• Given data, we would like to find the parameters w*.
• This is called learning or estimating the parameters, or fitting
the model.
• To do so, we need an optimization algorithm.
Cost Function
• Training Set of Housing Prices

  Size in feet² (x)      Price ($) in 1000's (y)
  x_1 = 2,104            y_1 = 460
  x_2 = 1,416            y_2 = 232
  x_3 = 1,534            y_3 = 315
  x_4 = 852              y_4 = 178
  …                      …

• Consider the following model: y_n ≈ f_w(x_n) = w_0 + w_1 x_n

• How can we estimate (or guess) values of w = (w_0, w_1) given
the data?
Cost Function
• Consider the following model: y_n ≈ f_w(x_n) = w_0 + w_1 x_n

[Figure: three example lines on axes from 0 to 3, for (w_0, w_1) = (1.5, 0), (0, 0.5), and (1, 0.5)]
Cost Function
• Idea: Choose w = (w_0, w_1) so that f_w(x) is close to y for the
training dataset { (x_n, y_n) }_{n=1}^{N}:

min_w  (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²

J(w) = (1/N) Σ_{n=1}^{N} ℓ(f_w(x_n), y_n)

w* = argmin_w J(w)
Cost Function
• A cost function (or energy, loss, training objective) is used to
learn the parameters (w) that explain the data well.

• The cost function quantifies


how well our model explains the data or
how costly our mistakes are.

Cost Function
• Model (without intercept):
  f_w(x) = w_1 x
• Parameters:
  w_1
• Model (with intercept):
  f_w(x) = w_0 + w_1 x
• Parameters:
  w = (w_0, w_1)
• Cost function (same in both cases):
  J(w) = (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²
• Goal:
  w* = argmin_w J(w)
Cost Function

[Figure: left, f_w(x) (for a fixed w_1, this is a function of x) plotted over the data on axes from 0 to 3;
right, J(w_1) (a function of the parameter w_1) plotted for w_1 from −0.5 to 2.5]
Cost Function
• Model:
  f_w(x) = w_0 + w_1 x
• Parameters:
  w = (w_0, w_1)
• Cost function:
  J(w) = (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²
• Goal:
  w* = argmin_w J(w)
Cost Function

[Figure: left, f_w(x) (for fixed w, a function of x): the example fit f_w(x) = 50 + 0.06x,
i.e. w_0 = 50, w_1 = 0.06, drawn over the housing data (Size in feet² vs. Price ($) in 1000's);
right, J(w) (a function of the parameters w) drawn as contours over (w_0, w_1)]

“Contour Plots” or “Contour Figures”
Cost Function

[Figure: left, another candidate fit f_w(x) over the housing data (Size in feet² vs. Price ($) in 1000's);
right, the corresponding point on the contour plot of J(w) over (w_0, w_1)]
Cost Function
• Examples
• Mean Square Error (MSE): one of the most popular cost functions.

  MSE(w) = (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²

• Mean Absolute Error (MAE)

  MAE(w) = (1/N) Σ_{n=1}^{N} |f_w(x_n) − y_n|
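Both cost functions translate directly into NumPy; a minimal sketch (the toy data reuse the housing numbers from the earlier slides, and the weights are the illustrative fit w_0 = 50, w_1 = 0.06):

import numpy as np

def mse(y_pred, y):
    """Mean Square Error: (1/N) * sum((f_w(x_n) - y_n)^2)."""
    return np.mean((y_pred - y) ** 2)

def mae(y_pred, y):
    """Mean Absolute Error: (1/N) * sum(|f_w(x_n) - y_n|)."""
    return np.mean(np.abs(y_pred - y))

# Toy check on the housing examples (size in feet², price in $1000's).
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
w0, w1 = 50.0, 0.06
y_pred = w0 + w1 * x
print(mse(y_pred, y), mae(y_pred, y))
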
Cost Function
• Desirable Properties of Cost Functions
• When the target y is real-valued, it is often desirable that the
cost is symmetric around 0, since both positive and negative
errors should be penalized equally.
• Also, the cost function should penalize “large” mistakes and
“very-large” mistakes similarly.

[Figure: the per-example loss ℓ(e_n) plotted against the error e_n]
Cost Function
• Desirable Properties of Cost Functions
• Statistical vs. Computational Trade-off
• If we want better statistical properties,
then we have to give up good
computational properties.
• So which loss function is the best?

[Figure taken from Patrick Breheny’s slides]


Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).

• Grid search is one of the simplest optimization algorithms:
compute the cost over all values w in a grid,
and pick the best.
• This is brute force, but extremely simple.
• However, it needs one nested “for loop” per parameter, so the
computational complexity grows exponentially with the number of parameters.
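A minimal sketch of this grid search for the two-parameter housing model; the grid ranges and resolutions are arbitrary choices for illustration:

import numpy as np

def cost(w0, w1, x, y):
    """MSE cost J(w) for the model f_w(x) = w0 + w1 * x."""
    return np.mean((w0 + w1 * x - y) ** 2)

def grid_search(x, y, w0_grid, w1_grid):
    """Brute force: one loop per parameter, evaluate J everywhere, keep the best."""
    best_w0, best_w1, best_cost = None, None, np.inf
    for w0 in w0_grid:
        for w1 in w1_grid:
            j = cost(w0, w1, x, y)
            if j < best_cost:
                best_w0, best_w1, best_cost = w0, w1, j
    return best_w0, best_w1, best_cost

x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(grid_search(x, y, np.linspace(0, 100, 21), np.linspace(0, 0.5, 51)))
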
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).

• Follow the Gradient: A gradient (at a point) is the slope of the
tangent to the function (at that point). It points in the direction
of largest increase of the function.
• The direction of gradient descent is the negative gradient.

[Figure: from the original w, the negative gradient direction −∇J(w) points downhill]
Gradient Descent
• Have some function J(w); want w* = argmin_w J(w).

• Follow the Gradient: A gradient (at a point) is the slope of the
tangent to the function (at that point). It points in the direction
of largest increase of the function.
• Outline:
• Start with some w = (w_0, w_1) (e.g., w_0 = 0, w_1 = 1).
• Keep changing w = (w_0, w_1) to reduce J(w)
until we hopefully end up at a minimum.
Gradient Descent
Q: Where do we start?
Q: In which direction do we move?
Q: How far do we move?

[Figure: surface of J(w_0, w_1) over the (w_0, w_1) plane, with the initial state marked]
Gradient Descent
• Gradient Descent Algorithm
• Repeat until convergence
{
    w_0 ← w_0 − α ∂/∂w_0 J(w_0, w_1)

    w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)
}
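A minimal sketch of one such simultaneous update, using finite-difference derivatives so it works for any differentiable cost; the toy cost, step size, and iteration count are arbitrary assumptions:

def gd_step(J, w0, w1, alpha, eps=1e-6):
    """One simultaneous gradient-descent update of (w0, w1) on the cost J(w0, w1)."""
    dJ_dw0 = (J(w0 + eps, w1) - J(w0 - eps, w1)) / (2 * eps)
    dJ_dw1 = (J(w0, w1 + eps) - J(w0, w1 - eps)) / (2 * eps)
    # Both parameters are updated from the *old* values (simultaneous update).
    return w0 - alpha * dJ_dw0, w1 - alpha * dJ_dw1

# Toy cost with minimum at (3, -1).
J = lambda w0, w1: (w0 - 3.0) ** 2 + (w1 + 1.0) ** 2
w0, w1 = 0.0, 1.0                       # starting point as on the slide
for _ in range(200):                    # "repeat until convergence" (fixed count here)
    w0, w1 = gd_step(J, w0, w1, alpha=0.1)
print(w0, w1)                           # close to (3, -1)
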
Gradient Descent

w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)

• If the slope at w_1 is positive, then w_1 ← w_1 − α(positive number),
so w_1 decreases and moves toward the minimum.
• If the slope at w_1 is negative, then w_1 ← w_1 − α(negative number),
so w_1 increases and moves toward the minimum.
Gradient Descent

w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)

• If α is too small, gradient descent can
be slow.
• If α is too large, gradient descent
can overshoot the minimum. It may
fail to converge, or even diverge.
Gradient Descent
• Limitation of Gradient Descent

w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)

• At a local minimum (or any point with zero slope), the derivative is 0, so the update becomes

w_1 ← w_1 − α(0)

and w_1 stops changing: gradient descent can get stuck at a local optimum.
Multivariate Linear Regression
• Reminder: Single Variable
• Housing Prices (Portland, OR)

[Scatter plot: Size (feet²) on the x-axis (0 to 3000) vs. Price (in 1000's of dollars) on the y-axis (0 to 500)]

• Training Set of Housing Prices

  Size in feet² (x)      Price ($) in 1000's (y)
  2,104                  460
  1,416                  232
  1,534                  315
  852                    178
  …                      …
Multivariate Linear Regression
• Multiple Variables

  Size (feet²)   Number of bedrooms   Number of floors   Age of home (years)   Price ($1000)
  2,104          5                    1                  45                    460
  1,416          3                    2                  40                    232
  1,534          3                    2                  30                    315
  852            2                    1                  36                    178
  …              …                    …                  …                     …

x_n = [x_n1, x_n2, …, x_nD]ᵀ,   y_n

• Input (features) of the n-th training example: x_n
• Value of feature j in the n-th training example: x_nj
Multivariate Linear Regression
• Mapping Function for Multiple Variables

Previously: f_w(x_n) = w_0 + w_1 x_n
Multivariate Linear Regression
• Mapping Function for Multiple Variables

f_w(x_n) = w_0 + w_1 x_n1 + w_2 x_n2 + w_3 x_n3 + ⋯ + w_D x_nD

• For convenience of notation, define x_n0 = 1:

x_n = [x_n0, x_n1, …, x_nD]ᵀ ∈ ℝ^(D+1),   w = [w_0, w_1, …, w_D]ᵀ ∈ ℝ^(D+1)

f_w(x_n) = wᵀ x_n

• This is called “Multivariate Linear Regression”.
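A small NumPy sketch of this vectorized mapping with the constant feature x_0 = 1 prepended; the weight values are placeholders, not fitted parameters:

import numpy as np

# One example from the table: [size, bedrooms, floors, age].
x_raw = np.array([2104.0, 5.0, 1.0, 45.0])

# Prepend x_0 = 1 so the intercept w_0 is absorbed into the dot product.
x = np.concatenate(([1.0], x_raw))            # x ∈ ℝ^(D+1)

# Placeholder parameter vector w ∈ ℝ^(D+1).
w = np.array([80.0, 0.1, 5.0, 3.0, -1.0])

print(w @ x)                                  # f_w(x) = wᵀx
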
Multivariate Linear Regression
• Mapping Function

f_w(x) = wᵀ x

• Parameters: w = [w_0, w_1, …, w_D]ᵀ ∈ ℝ^(D+1)

• Cost function:

J(w) = (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²
Gradient Descent for Multiple Variables
• Single variable:
Repeat {
    w_0 ← w_0 − α ∂/∂w_0 J(w_0, w_1)
    w_1 ← w_1 − α ∂/∂w_1 J(w_0, w_1)
}

• Multiple variables:
Repeat {
    w_j ← w_j − α ∂/∂w_j J(w)
        ← w_j − α ∂/∂w_j (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²
        ← w_j − α (2/N) Σ_{n=1}^{N} (f_w(x_n) − y_n) x_nj
}
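A minimal batch gradient-descent sketch for multivariate linear regression with the MSE cost; it keeps the 2/N factor from the update above, and the learning rate, iteration count, and feature scaling are ad-hoc choices:

import numpy as np

def fit_linear_regression(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for f_w(x) = wᵀx with the MSE cost.
    X has shape (N, D+1) and already contains the constant feature x_0 = 1."""
    N, D1 = X.shape
    w = np.zeros(D1)
    for _ in range(num_iters):
        residual = X @ w - y                      # f_w(x_n) - y_n for all n
        grad = (2.0 / N) * (X.T @ residual)       # (2/N) Σ (f_w(x_n) - y_n) x_nj
        w -= alpha * grad                         # simultaneous update of all w_j
    return w

# Toy usage on the housing sizes (scaled to keep the step size stable).
sizes = np.array([2104.0, 1416.0, 1534.0, 852.0]) / 1000.0
y = np.array([460.0, 232.0, 315.0, 178.0])
X = np.column_stack([np.ones_like(sizes), sizes])  # prepend x_0 = 1
print(fit_linear_regression(X, y))
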
Logistic Regression

What is Classification?
• Similar to regression,
classification relates the input variable x to the output variable y,
but y can only take on discrete values.
We say that y is a categorical variable.

• Examples:
• Email: Spam / Not Spam?
• Online Transactions: Fraudulent (Yes / No)?
• Tumor: Malignant / Benign?

y ∈ {0, 1}    0: “Negative Class” (e.g., benign tumor)
              1: “Positive Class” (e.g., malignant tumor)
What is Classification?
• Examples of Classification

[Figure: Malignant? (Yes = 1 / No = 0) plotted against Tumor Size]

• Solution? Normalize the output of the mapping function
f_w(x) = wᵀx to lie between 0 and 1, and threshold at 0.5:
• If f_w(x) ≥ 0.5, predict y = 1
• If f_w(x) < 0.5, predict y = 0
The Classification Function (Hypothesis)
• Want 0 ≤ f_w(x) ≤ 1

f_w(x) = σ(wᵀx),   where σ(z) = 1 / (1 + exp(−z))

→ f_w(x) = 1 / (1 + exp(−wᵀx))

→ Sigmoid Function / Logistic Function

[Figure: the sigmoid σ(z) rising from 0 to 1, crossing 0.5 at z = 0]

• Predict y = 1 if f_w(x) ≥ 0.5, i.e., wᵀx ≥ 0
• Predict y = 0 if f_w(x) < 0.5, i.e., wᵀx < 0
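A minimal sketch of this hypothesis and the 0.5 threshold rule; the weight and feature values at the bottom are made-up placeholders:

import numpy as np

def sigmoid(z):
    """Logistic function: σ(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """f_w(x) = σ(wᵀx): estimated probability that y = 1."""
    return sigmoid(w @ x)

def predict_label(w, x):
    """Predict y = 1 iff f_w(x) >= 0.5, i.e. iff wᵀx >= 0."""
    return int(predict_proba(w, x) >= 0.5)

w = np.array([-3.0, 1.0, 1.0])        # placeholder weights [w_0, w_1, w_2]
x = np.array([1.0, 2.0, 2.0])         # feature vector with x_0 = 1
print(predict_proba(w, x), predict_label(w, x))   # wᵀx = 1 ≥ 0, so predict y = 1
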
The Classification Function (Hypothesis)
• f_w(x): estimated probability that y = 1 on input x
• f_w(x) = P(y = 1 | x; w)
“Probability that y = 1, given x, parameterized by w”

• P(y = 0 | x; w) + P(y = 1 | x; w) = 1
• P(y = 0 | x; w) = 1 − P(y = 1 | x; w)

• Example: If x = [x_0, x_1]ᵀ = [1, TumorSize]ᵀ and f_w(x) = 0.7:
• tell the patient there is a 70% chance of the tumor being malignant,
• and a 30% chance of it being non-malignant.
Decision Boundary

f_w(x) = σ(wᵀx) = σ(w_0 + w_1 x_1 + w_2 x_2)

f_w(x) = 1 / (1 + exp(−wᵀx))

f_w(x) = 1 / (1 + exp(−w_0 − w_1 x_1 − w_2 x_2))

[Figure: the (x_1, x_2) plane with the linear decision boundary x_1 + x_2 = 3]

• Example: with w = (−3, 1, 1):
• Predict y = 1 if −3 + x_1 + x_2 ≥ 0
• Predict y = 0 if −3 + x_1 + x_2 < 0
Non-linear Decision Boundary

f_w(x) = σ(wᵀx) = σ(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2²)

[Figure: the (x_1, x_2) plane with the circular decision boundary x_1² + x_2² = 1]

• Predict y = 1 if −1 + x_1² + x_2² ≥ 0
• Predict y = 0 if −1 + x_1² + x_2² < 0
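Such a boundary comes from feeding quadratic features into the same linear form; a minimal sketch with the weights of the −1 + x_1² + x_2² example on this slide:

import numpy as np

def quadratic_features(x1, x2):
    """Map (x1, x2) to [1, x1, x2, x1², x2²] so that wᵀx can describe a quadratic boundary."""
    return np.array([1.0, x1, x2, x1 ** 2, x2 ** 2])

# Weights for the boundary -1 + x1² + x2² = 0 (the unit circle).
w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])

def predict(x1, x2):
    """Predict y = 1 outside the unit circle, y = 0 inside (threshold wᵀx at 0)."""
    return int(w @ quadratic_features(x1, x2) >= 0)

print(predict(0.2, 0.3), predict(1.5, 0.0))   # inside → 0, outside → 1
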
Logistic Regression
• Training set: { (x_1, y_1), (x_2, y_2), …, (x_N, y_N) }
• n-th example: x_n = [x_n0, x_n1, …, x_nD]ᵀ ∈ ℝ^(D+1),  x_n0 = 1,
y_n ∈ {0, 1}
• The mapping function:

f_w(x) = 1 / (1 + exp(−wᵀx))

• How do we choose the parameters w?
Cost Function
• A simple possible cost function: e.g., Mean Square Error (MSE)

J(w) = (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n)²

[Figure: with the sigmoid inside the squared error, J(w) is “non-convex” (many local minima);
we want a “convex” J(w) with a single minimum]
Cost Function for Logistic Regression

• ℓ(f_w(x_n), y_n) =  −log(f_w(x_n))        if y_n = 1
                      −log(1 − f_w(x_n))    if y_n = 0

If y = 1:  ℓ(f_w(x_n), y_n) = 0 when y_n = 1 and f_w(x_n) = 1.
But as f_w(x_n) → 0, ℓ(f_w(x_n), y_n) → ∞.

So if f_w(x_n) = 0 (i.e., we predict P(y = 1 | x; w) = 0)
but in fact y_n = 1,
we penalize the learning algorithm
with a very large cost.

[Figure: −log(f_w(x)) plotted against f_w(x)]
Cost Function for Logistic Regression

• ℓ(f_w(x_n), y_n) =  −log(f_w(x_n))        if y_n = 1
                      −log(1 − f_w(x_n))    if y_n = 0

If y = 0: the cost is 0 when f_w(x_n) = 0, and grows to ∞ as f_w(x_n) → 1.

[Figure: −log(1 − f_w(x)) plotted against f_w(x)]
Cost Function for Logistic Regression

J(w) = (1/N) Σ_{n=1}^{N} ℓ(f_w(x_n), y_n)

ℓ(f_w(x_n), y_n) =  −log(f_w(x_n))        if y_n = 1
                    −log(1 − f_w(x_n))    if y_n = 0

Note: y_n = 0 or y_n = 1 always, so the two cases can be written as one expression:

ℓ(f_w(x_n), y_n) = −y_n log(f_w(x_n)) − (1 − y_n) log(1 − f_w(x_n))
Cost Function for Logistic Regression

J(w) = (1/N) Σ_{n=1}^{N} ℓ(f_w(x_n), y_n)

J(w) = −(1/N) Σ_{n=1}^{N} [ y_n log(f_w(x_n)) + (1 − y_n) log(1 − f_w(x_n)) ]

To fit the parameters: w* = argmin_w J(w)

To make a prediction given a new x:

f_w(x) = 1 / (1 + exp(−wᵀx))
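A minimal NumPy sketch of this cross-entropy cost; the small eps inside the logs is a numerical safeguard, and the data and weights are made up for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_cost(w, X, y, eps=1e-12):
    """J(w) = -(1/N) Σ [ y_n log f_w(x_n) + (1 - y_n) log(1 - f_w(x_n)) ].
    X has shape (N, D+1) with x_0 = 1; y holds 0/1 labels."""
    f = sigmoid(X @ w)
    return -np.mean(y * np.log(f + eps) + (1.0 - y) * np.log(1.0 - f + eps))

X = np.array([[1.0, 0.5, 1.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.5]])
y = np.array([0.0, 1.0, 1.0])
w = np.array([-3.0, 1.0, 1.0])
print(cross_entropy_cost(w, X, y))
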
Gradient Descent for Logistic Regression

J(w) = −(1/N) Σ_{n=1}^{N} [ y_n log(f_w(x_n)) + (1 − y_n) log(1 − f_w(x_n)) ]

Want w* = argmin_w J(w)

Repeat {
    w_j ← w_j − α ∂/∂w_j J(w)
        ← w_j − α (1/N) Σ_{n=1}^{N} (f_w(x_n) − y_n) x_nj
}
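A sketch of the full training loop implied by this update rule; the learning rate, iteration count, and toy data are assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, alpha=0.5, num_iters=2000):
    """Batch gradient descent on the cross-entropy cost, using the gradient
    (1/N) Σ (f_w(x_n) - y_n) x_nj from the slide."""
    N, D1 = X.shape
    w = np.zeros(D1)
    for _ in range(num_iters):
        f = sigmoid(X @ w)                 # f_w(x_n) for all n
        grad = (X.T @ (f - y)) / N         # ∂J/∂w_j for all j at once
        w -= alpha * grad                  # simultaneous update
    return w

# Made-up 1-D example: points with x_1 ≥ 2 are labeled 1.
x1 = np.array([0.5, 1.0, 1.5, 2.5, 3.0, 3.5])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x1), x1])       # prepend x_0 = 1
w = fit_logistic_regression(X, y)
print(w, sigmoid(X @ w).round(2))
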
Multi-Class Classification: One vs. All
• E-mail Foldering/Tagging: Work, Friends, Family, Hobby

• Medical Diagnosis: Not ill, Cold, Flu

• Weather: Sunny, Cloudy, Rain, Snow
Multi-Class Classification: One vs. All

[Figure: left, binary classification in the (x_1, x_2) plane (two classes);
right, multi-class classification in the (x_1, x_2) plane (three classes)]
Multi-Class Classification: One vs. All

[Figure: the three-class problem in the (x_1, x_2) plane split into three one-vs-all
binary problems, one per class (Class 1, Class 2, Class 3), each with its own decision boundary]

f_w^(m)(x) = P(y = m | x; w)   (m = 1, 2, 3)
Multi-Class Classification: One vs. All
• Train a logistic regression classifier f_w^(m)(x) for each class m
to predict the probability that y = m.

• On a new input x, to make a prediction, pick the class m that
maximizes:

m* = argmax_m f_w^(m)(x)
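A hedged sketch of the one-vs-all scheme on top of the logistic regression trainer sketched earlier; the class labels, toy data, and hyperparameters are all illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_binary(X, y, alpha=0.5, num_iters=2000):
    """Binary logistic regression by batch gradient descent (as in the earlier sketch)."""
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w -= alpha * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w

def train_one_vs_all(X, y, classes):
    """One classifier per class m: relabel the data as (y == m) and fit."""
    return {m: fit_binary(X, (y == m).astype(float)) for m in classes}

def predict_one_vs_all(X, classifiers):
    """m* = argmax_m f_w^(m)(x) for each row of X."""
    classes = sorted(classifiers)
    probs = np.column_stack([sigmoid(X @ classifiers[m]) for m in classes])
    return np.array(classes)[probs.argmax(axis=1)]

# Made-up 2-D data with three classes (labels 1, 2, 3) and the constant feature x_0 = 1.
rng = np.random.default_rng(0)
centers = {1: (0.0, 0.0), 2: (4.0, 0.0), 3: (2.0, 4.0)}
X_raw = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in centers.values()])
y = np.repeat([1, 2, 3], 20)
X = np.column_stack([np.ones(len(X_raw)), X_raw])
classifiers = train_one_vs_all(X, y, classes=[1, 2, 3])
print((predict_one_vs_all(X, classifiers) == y).mean())   # training accuracy on toy data
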
Up Next:

• Neural Network and Back-propagation

Thank you!

