794 Lec Intro Handout
Yaoliang Yu
September 9, 2022
Course Information
• Instructor: Yao-Liang Yu (yaoliang.yu@uwaterloo.ca)
• Website: cs.uwaterloo.ca/~y328yu/mycourses/794
slides, notes, videos, assignments, policy, etc.
• Piazza: piazza.com/uwaterloo.ca/fall2022/co673cs794
announcements, questions, discussions, etc.
• Learn: learn.uwaterloo.ca/d2l/home/825963
assignments, solutions, grades, etc.
Prerequisites
• Basic linear algebra, calculus, probability, and algorithms
• Coding
https://www.python.org/ https://julialang.org/
Textbooks
• No required textbook
• Notes, slides, and code will be posted on the course website
• Some fine textbooks for the ambitious ones:
Workload
• Roughly 24 lectures, each lasting 80 mins (I hope)
Policy
• Do your work independently and individually
– discussion is fine, but no sharing of text or code
• Ignorance is no excuse
– good online discussion, more on course website
Machine Learning is Everywhere
• Everyone uses ML every day
At the Core is Optimization
[Figure: a diagram with Optimization at the core, connected to Models, Stats, System, and Magic.]
What You Will Learn
• Learn the basic theory and algorithms
Let the Journey Begin
What a Dataset Looks Like
          x1   x2   x3   x4   ···   xn  |  x    x′
          0    1    0    1    ···   1   |  1    0.9
 R^d ∋    0    0    1    1    ···   0   |  1    1.1
          ⋮    ⋮    ⋮    ⋮          ⋮   |  ⋮    ⋮
          1    0    1    0    ···   1   |  1   −0.1
     y    +    +    –    +    ···   –   |  ?    ?!
OR Dataset
     x1   x2   x3   x4
     0    1    0    1
     0    0    1    1
 y   –    +    +    +

[Figure: the four points plotted in the plane, with x1 labeled – and x2, x3, x4 labeled +.]
The Early Hype in AI...
NEW NAVY DEVICE LEARNS BY DOING: Psychologist Shows Embryo of Computer Designed to Read and Grow ...
New York Times, Jul 8, 1958, p. 25 (ProQuest Historical Newspapers: The New York Times)
Frank Rosenblatt
(1928 – 1971)
Perceptron as an Optimization Problem
• Affine function: f(x) = ⟨x, w⟩ + b, where ⟨x, w⟩ := ∑_j x_j w_j
• Find w ∈ S ⊆ R^d
  – we often can only describe S partially
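One natural instantiation of S (for linearly separable data; the slide leaves S abstract) is the set of linear separators, with the bias b absorbed into w by appending a constant 1 feature to each x_i:

\[
S \;=\; \{\, w \in \mathbb{R}^d \;:\; y_i \langle x_i, w\rangle > 0 \ \text{ for all } i = 1,\dots,n \,\}.
\]

Writing out all n constraints requires the whole dataset at once, which is one sense in which S is only described "partially": the perceptron algorithm below sees the constraints one index at a time.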
Geometrically
[Figure: in the plane (x1, x2), the hyperplane ⟨x, w⟩ + b = 0 separates the + examples from the – examples; w = (w1, w2) is its normal vector and b controls its offset.]
Algorithm 1: Perceptron
  Input: dataset D = {(x_i, y_i) ∈ R^d × {±1} : i = 1, …, n},
         initialization w ∈ R^d and b ∈ R, threshold δ ≥ 0
  Output: approximate solution w and b
  for t = 1, 2, … do
      receive index I_t ∈ {1, …, n}                  // I_t can be random
      if y_{I_t}(⟨x_{I_t}, w⟩ + b) ≤ δ then
          w ← w + y_{I_t} x_{I_t}                    // update after a "mistake"
          b ← b + y_{I_t}
• Typically δ = 0 and w = 0, b = 0 at initialization
F. Rosenblatt (1958). “The perceptron: A probabilistic model for information storage and organization in the
brain”. Psychological Review, vol. 65, no. 6, pp. 386–408.
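A minimal Python (NumPy) sketch of Algorithm 1, not the course's posted code: it cycles through the examples in order instead of receiving an arbitrary index I_t, and stops after a full pass with no mistakes or after max_passes passes.

    import numpy as np

    def perceptron(X, y, delta=0.0, max_passes=100):
        """Perceptron (Algorithm 1): X is n-by-d, y is in {-1, +1}^n."""
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(max_passes):
            mistakes = 0
            for i in range(n):                      # receive indices in order
                if y[i] * (X[i] @ w + b) <= delta:  # a "mistake"
                    w += y[i] * X[i]
                    b += y[i]
                    mistakes += 1
            if mistakes == 0:                       # every example on the correct side
                break
        return w, b

    # OR dataset from the earlier slide: x1 = (0,0) is -, the rest are +
    X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
    y = np.array([-1., 1., 1., 1.])
    w, b = perceptron(X, y)
    print(w, b, np.sign(X @ w + b))   # predictions should match y

On the OR dataset this stops after a few passes; on the XOR dataset shown on the next slide no separating line exists, so the inner loop never reaches zero mistakes and only max_passes stops it.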
Does it work? [code]
[Figure: on the OR dataset, perceptron finds a separating line.]
XOR dataset:
     x1   x2   x3   x4
     0    1    0    1
     0    0    1    1
 y   –    +    +    –
[Figure: no line separates the + and – points of XOR, so perceptron keeps making mistakes.]
Perceptron and the 1st AI Winter
T. S. Motzkin and I. J. Schoenberg (1954). “The Relaxation Method for Linear Inequalities”. Canadian
Journal of Mathematics, vol. 6, pp. 393–404; S. Agmon (1954). “The Relaxation Method for Linear Inequalities”.
Canadian Journal of Mathematics, vol. 6, pp. 382–392.
Interpreting Perceptron
Theorem: int cone∗ A ̸= ∅ ⇐⇒ int cone∗ A ∩ cone A ̸= ∅.

cone A := {Aλ : λ ≥ 0}
cone∗ A := {w : A⊤w ≥ 0}
int cone∗ A := {w : A⊤w > 0}

[Figure: in 2D, the cone generated by the columns a1 and a2 of A, and the interior of the dual cone int cone∗ A.]

(Roughly, the columns of A are the signed examples y_i x_i, with the bias absorbed, so perceptron succeeds exactly when int cone∗ A ̸= ∅.)
Is Perceptron Unique?
[Figure: the OR dataset again; many different lines separate the + and – examples, and perceptron returns only one of them.]
Support Vector Machines: Primal
[Figure: the maximum-margin separating line for the OR dataset.]

    max_{w : ŷi yi > 0 ∀i}   min_{i=1,…,n}   ŷi yi / ∥w∥ ,    where ŷi := ⟨xi, w⟩ + b
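For reference (a standard reformulation, not spelled out on the slide): the objective is invariant to positive scaling of (w, b), so one may fix the smallest margin to 1 and obtain the usual hard-margin SVM

\[
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
\quad\text{subject to}\quad
y_i\big(\langle x_i, w\rangle + b\big) \ \ge\ 1,
\qquad i = 1,\dots,n .
\]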
Support Vector Machines: Dual
[Figure: the two classes of the OR dataset; the dual seeks the closest pair of points in the convex hulls of the + and – examples.]

    min_{µ ∈ ∆+}   min_{ν ∈ ∆−}   ∥ ∑_{i: yi = +} µi xi  −  ∑_{j: yj = −} νj xj ∥
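Here ∆+ and ∆− are, presumably, the probability simplices over the positive and negative examples, so each sum is a point in the convex hull of one class:

\[
\Delta_{+} := \Big\{ \mu \ge 0 : \textstyle\sum_{i:\, y_i = +} \mu_i = 1 \Big\},
\qquad
\Delta_{-} := \Big\{ \nu \ge 0 : \textstyle\sum_{j:\, y_j = -} \nu_j = 1 \Big\},
\]

and the optimal value is the distance between the two convex hulls.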
Beyond Separability
[Figure: a dataset where the + and – examples are interleaved, so no line separates the two classes exactly.]
Empirical Risk Minimization
[Figure: classification losses ℓ plotted as a function of the margin yŷ.]
Regularization
[Figure: regularizers plotted as a function of the weight w.]
Regression
[Figure: regression losses ℓ plotted as a function of the residual y − ŷ: square, ϵ-insensitive, absolute, and Huber.]
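A small Python sketch of the four regression losses named above, as functions of the residual r = y − ŷ; the ϵ and Huber threshold values (and the exact scaling) are illustrative choices, not taken from the slide.

    import numpy as np

    def square(r):
        return r**2

    def absolute(r):
        return np.abs(r)

    def eps_insensitive(r, eps=0.5):
        # zero inside the tube |r| <= eps, linear outside
        return np.maximum(np.abs(r) - eps, 0.0)

    def huber(r, delta=1.0):
        # quadratic near zero, linear in the tails
        a = np.abs(r)
        return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

    r = np.linspace(-3, 3, 7)
    for loss in (square, absolute, eps_insensitive, huber):
        print(loss.__name__, np.round(loss(r), 2))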
Plan I: Basic
• Lec04: Proximal Gradient: smooth ℓ + nonsmooth reg
• Lec05: Subgradient: nonsmooth ℓ + nonsmooth reg
• Lec10: Acceleration: optimal algorithm under smoothness
• Lec11: Smoothing: nonsmooth → smooth
• Lec12: Alternating: divide and conquer
• Lec13: Coordinate Gradient: large model
• Lec18: Stochastic Gradient: large dataset
• Lec22: Newton: even faster under smoothness
• Lec23: Quasi-Newton: Newton made economical
Denoising
    min_z   ½ ∥x − z∥²₂  +  λ · ∥z∥tv
            (fidelity)      (regularization)
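A minimal Python sketch of this objective for a 1D signal, taking ∥z∥tv to be the usual sum of absolute differences and running a plain subgradient-descent loop; the step size and λ are illustrative, and subgradient descent is just one of the methods the course covers for such problems.

    import numpy as np

    def tv(z):
        # 1D total variation: sum of absolute differences of neighbours
        return np.sum(np.abs(np.diff(z)))

    def objective(z, x, lam):
        return 0.5 * np.sum((x - z)**2) + lam * tv(z)   # fidelity + regularization

    def tv_denoise(x, lam=1.0, step=0.05, iters=500):
        z = x.copy()
        for _ in range(iters):
            s = np.sign(np.diff(z))                     # subgradient of |.| at the differences
            g_tv = np.concatenate(([0.0], s)) - np.concatenate((s, [0.0]))  # D^T s
            g = (z - x) + lam * g_tv                    # subgradient of the whole objective
            z -= step * g
        return z

    # noisy piecewise-constant signal
    rng = np.random.default_rng(0)
    x = np.concatenate((np.zeros(50), np.ones(50))) + 0.3 * rng.standard_normal(100)
    z = tv_denoise(x)
    print(objective(x, x, 1.0), objective(z, x, 1.0))   # objective before vs. after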
[Figure: a feedforward neural network with input layer x1, …, xd, two hidden layers h·,1 and h·,2 of p units each, and output layer y1, …, yc; the running example classifies dog images (Shetland vs. Collie).]
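A minimal NumPy sketch of the forward pass of such a two-hidden-layer network; the ReLU activation and softmax output are illustrative choices, not specified on the slide.

    import numpy as np

    def relu(u):
        return np.maximum(u, 0.0)

    def softmax(u):
        e = np.exp(u - u.max())          # subtract max for numerical stability
        return e / e.sum()

    def forward(x, params):
        """x in R^d -> hidden h1, h2 in R^p -> output y in R^c (class probabilities)."""
        (W1, b1), (W2, b2), (W3, b3) = params
        h1 = relu(W1 @ x + b1)           # layer 1
        h2 = relu(W2 @ h1 + b2)          # layer 2
        return softmax(W3 @ h2 + b3)     # output layer

    d, p, c = 8, 5, 2                    # e.g. c = 2 classes: Shetland vs. Collie
    rng = np.random.default_rng(0)
    params = [(rng.standard_normal((p, d)), np.zeros(p)),
              (rng.standard_normal((p, p)), np.zeros(p)),
              (rng.standard_normal((c, p)), np.zeros(c))]
    print(forward(rng.standard_normal(d), params))   # entries sum to 1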
Adversarial Attacks
Plan II: Game-theoretic
• Lec07: Fictitious Play: playing against oneself
Generative Adversarial Networks
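For reference (the standard formulation from Goodfellow et al., 2014, not spelled out on this slide): GAN training is a two-player min-max game between a generator G and a discriminator D,

\[
\min_{G}\ \max_{D}\ \
\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
\;+\;
\mathbb{E}_{z \sim p_{z}}\big[\log\big(1 - D(G(z))\big)\big],
\]

which connects directly to the game-theoretic lectures in Plan II.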
No Free Lunch
• On average, no algorithm is better than any other¹
• In general, optimization problems are unsolvable²
• Implications:
– don’t try to solve all problems; one (class) at a time!
– “efficient optimization methods can be developed only by intelligently
employing the structure of particular instances of problems”
– know your algorithms and their limits
– be open to the impossible