Machine Learning Basics
Lecture 2: Linear Classification
Princeton University COS 495
Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes the empirical loss $\hat{L}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell(f, x_i, y_i)$
• s.t. the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[\ell(f, x, y)]$ is small
Machine learning 1-2-3
• Collect data and extract features (experience)
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $\ell$ (prior knowledge)
• Optimization: minimize the empirical loss
Example: Linear regression
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes the $\ell_2$ loss $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$
Why $\ell_2$ loss
• Why not choose another loss?
  • $\ell_1$ loss, hinge loss, exponential loss, …
• Empirical: easy to optimize
  • For the linear case there is a closed form: $w = (X^T X)^{-1} X^T y$ (see the sketch below)
• Theoretical: a way to encode prior knowledge
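A quick numerical check of this closed-form solution (a minimal sketch on synthetic data; the data and values here are illustrative, not from the lecture):

```python
import numpy as np

# Minimal sketch: verify the closed-form least-squares solution
# w = (X^T X)^{-1} X^T y on synthetic data (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # n = 100 points, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)   # noisy linear targets

# Solve the normal equations (prefer solve/lstsq over an explicit inverse)
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                                  # close to w_true
```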
Questions:
• What kind of prior knowledge?
• Is there a principled way to derive the loss?
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE)
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• Would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well
• "Fitness" of $\theta$ to one data point $(x_i, y_i)$:
  $\text{likelihood}(\theta; x_i, y_i) := P_\theta(x_i, y_i)$
• "Fitness" of $\theta$ to the i.i.d. data points $\{(x_i, y_i)\}$:
  $\text{likelihood}(\theta; \{(x_i, y_i)\}) := P_\theta(\{(x_i, y_i)\}) = \prod_i P_\theta(x_i, y_i)$
• MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points:
  $\theta_{ML} = \operatorname{argmax}_{\theta \in \Theta} \prod_i P_\theta(x_i, y_i) = \operatorname{argmax}_{\theta \in \Theta} \sum_i \log P_\theta(x_i, y_i)$
• Equivalently, MLE minimizes the negative log-likelihood loss:
  $\ell(P_\theta, x_i, y_i) = -\log P_\theta(x_i, y_i)$, so $L(P_\theta) = -\sum_i \log P_\theta(x_i, y_i)$
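As a concrete illustration (a sketch with toy data, not from the slides): for i.i.d. coin flips, the negative log-likelihood is minimized at the empirical frequency.

```python
import numpy as np

# Sketch: MLE for a Bernoulli parameter theta on toy data.
# The negative log-likelihood -sum_i log P_theta(y_i) is minimized
# at the empirical frequency of 1s.
y = np.array([1, 0, 1, 1, 0, 1, 1, 0])        # toy i.i.d. samples
thetas = np.linspace(0.01, 0.99, 99)          # candidate parameters
nll = [-np.sum(y * np.log(t) + (1 - y) * np.log(1 - t)) for t in thetas]
print(thetas[np.argmin(nll)], y.mean())       # both are about 0.62
```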
MLE: conditional log-likelihood
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of conditional distributions indexed by $\theta$: we only care about predicting $y$ from $x$ and do not model $p(x)$
• Modeling $P(y \mid x)$ is discriminative; modeling $P(x, y)$ is generative
• MLE: minimize the negative conditional log-likelihood loss
  $\theta_{ML} = \operatorname{argmax}_{\theta \in \Theta} \sum_i \log P_\theta(y_i \mid x_i)$
  $\ell(P_\theta, x_i, y_i) = -\log P_\theta(y_i \mid x_i)$, so $L(P_\theta) = -\sum_i \log P_\theta(y_i \mid x_i)$
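A one-line illustration of this loss (a sketch; the probabilities are made-up model outputs, not from the lecture):

```python
import numpy as np

# Sketch: negative conditional log-likelihood for a model that outputs
# p_i = P_theta(y=1 | x_i). Labels y_i in {0,1}; values are illustrative.
p = np.array([0.9, 0.2, 0.7])                 # model's P(y=1|x_i)
y = np.array([1, 0, 1])
nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
print(nll)                                    # sums -log(0.9), -log(0.8), -log(0.7)
```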
Example: $\ell_2$ loss
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$

$\ell_2$ loss: Normal + MLE
• Define $P_\theta(y \mid x) = \text{Normal}(y; f_\theta(x), \sigma^2)$
• $\log P_\theta(y_i \mid x_i) = -\frac{1}{2\sigma^2}(f_\theta(x_i) - y_i)^2 - \log \sigma - \frac{1}{2}\log(2\pi)$
• $\theta_{ML} = \operatorname{argmin}_{\theta \in \Theta} \frac{1}{n}\sum_{i=1}^{n} (f_\theta(x_i) - y_i)^2$
• So minimizing the $\ell_2$ loss is exactly MLE under a Gaussian noise model
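One way to see this numerically (a sketch on synthetic data; the model $y = wx$ and the values below are illustrative): the Gaussian negative log-likelihood and the plain $\ell_2$ loss pick the same parameter.

```python
import numpy as np

# Sketch: with P_theta(y|x) = Normal(y; w*x, sigma^2), the negative
# log-likelihood and the l2 loss rank candidate w identically,
# since they differ only by scaling and an additive constant.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 0.5 * rng.normal(size=50)       # toy data with true w = 2
ws = np.linspace(0.0, 4.0, 401)
sigma = 0.5
l2  = [np.mean((w * x - y) ** 2) for w in ws]
nll = [np.sum((w * x - y) ** 2 / (2 * sigma**2)
              + np.log(sigma) + 0.5 * np.log(2 * np.pi)) for w in ws]
print(ws[np.argmin(l2)], ws[np.argmin(nll)])  # same minimizer
```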
Linear classification
Example 1: image classification
[Figure: example photos labeled indoor vs. outdoor]
Example 2: Spam detection

             #"$"   #"Mr."   #"sale"   …   Spam?
Email 1       2       1        1            Yes
Email 2       0       1        0            No
Email 3       1       1        1            Yes
…
Email n       0       0        0            No
New email     0       0        1            ??
Why classification
• Classification: a kind of summary
• Easy to interpret
• Easy for making decisions
Linear classification
[Figure: hyperplane $w^T x = 0$ with normal vector $w$; Class 1 on the side where $w^T x > 0$, Class 0 on the side where $w^T x < 0$]
• $w$ acts as a normal vector to the separating hyperplane
• points on the side opposite to $w$ have $w^T x < 0$
• the hyperplane $w^T x = 0$ splits the space into the two halves $w^T x < 0$ and $w^T x > 0$
Linear classification: natural attempt
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Hypothesis $f_w(x) = w^T x$ (linear model $\mathcal{H}$)
  • $y = 1$ if $w^T x > 0$
  • $y = 0$ if $w^T x < 0$
• Prediction: $y = \text{step}(f_w(x)) = \text{step}(w^T x)$
Linear classification: natural attempt
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ to minimize the 0-1 loss $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[\text{step}(w^T x_i) \ne y_i]$
• Drawback: difficult to optimize; NP-hard in the worst case (see the sketch below)
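The difficulty can be seen directly (a sketch in one dimension with toy data): the 0-1 loss is piecewise constant in $w$, so its gradient is zero almost everywhere and gives no descent direction.

```python
import numpy as np

# Sketch: the empirical 0-1 loss as a function of a scalar w is a step
# function; gradient-based methods get no signal from its flat plateaus.
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = (x > 0).astype(int)                       # toy labels, separable by any w > 0
for w in np.linspace(-1.0, 1.0, 9):
    loss = np.mean((w * x > 0).astype(int) != y)
    print(f"w = {w:+.2f}  0-1 loss = {loss:.2f}")   # flat plateaus with jumps
```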
Linear classification: simple approach
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (w^T x_i - y_i)^2$
• That is, reduce to linear regression and ignore the fact that $y \in \{0, 1\}$
Linear classification: simple approach
• Drawback: not robust to "outliers"
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
Compare the two
[Figure: $y = w^T x$ and $y = \text{step}(w^T x)$ plotted as functions of $w^T x$]
Between the two
• Prediction bounded in $[0, 1]$
• Smooth
• Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$
[Figure borrowed from Pattern Recognition and Machine Learning, Bishop]
Linear classification: sigmoid prediction
• Squash the output of the linear function:
  $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n}\sum_{i=1}^{n} (\sigma(w^T x_i) - y_i)^2$
Linear classification: logistic regression
• Squash the output of the linear function:
  $\text{Sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• A better approach: interpret the output as a probability
  $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
  $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$
Linear classification: logistic regression
• Given training data $\{(x_i, y_i) : 1 \le i \le n\}$ i.i.d. from distribution $D$
• Find $w$ that minimizes the negative conditional log-likelihood
  $\hat{L}(w) = -\frac{1}{n}\sum_{i=1}^{n} \log P_w(y_i \mid x_i) = -\frac{1}{n}\sum_{y_i = 1} \log \sigma(w^T x_i) - \frac{1}{n}\sum_{y_i = 0} \log[1 - \sigma(w^T x_i)]$
• Logistic regression: MLE with the sigmoid model
• No closed-form solution; need to use gradient descent
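A minimal gradient-descent sketch for this loss (synthetic data; the learning rate and iteration count are illustrative). It uses the standard gradient of the negative log-likelihood, $\nabla \hat{L}(w) = \frac{1}{n}\sum_i (\sigma(w^T x_i) - y_i)\, x_i$.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sketch: logistic regression by batch gradient descent on toy data.
# Gradient of the negative log-likelihood: (1/n) * X^T (sigmoid(Xw) - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([3.0, -2.0])
y = (rng.random(200) < sigmoid(X @ w_true)).astype(float)  # toy labels

w = np.zeros(2)
lr = 0.5
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    w -= lr * grad
print(w)   # roughly recovers the direction of w_true
```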
Properties of sigmoid function
• Bounded:
  $\sigma(a) = \frac{1}{1 + \exp(-a)} \in (0, 1)$
• Symmetric:
  $1 - \sigma(a) = \frac{\exp(-a)}{1 + \exp(-a)} = \frac{1}{\exp(a) + 1} = \sigma(-a)$
• Gradient:
  $\sigma'(a) = \frac{\exp(-a)}{(1 + \exp(-a))^2} = \sigma(a)(1 - \sigma(a))$
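These identities are easy to check numerically (a small sketch; the test point and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sketch: check sigma(-a) = 1 - sigma(a), and check the gradient identity
# sigma'(a) = sigma(a)(1 - sigma(a)) against a finite-difference estimate.
a, eps = 0.7, 1e-6
print(np.isclose(sigmoid(-a), 1 - sigmoid(a)))
fd = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)
print(np.isclose(fd, sigmoid(a) * (1 - sigmoid(a))))
```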