
CS794/CO673: Optimization for Data Science

Lec 00: Introduction

Yaoliang Yu

September 9, 2022
Course Information
• Instructor: Yao-Liang Yu (yaoliang.yu@uwaterloo.ca)

• Office hours: Friday 4-5pm (DC3617) or by email appointment

• TA: Zeou Hu (zeou.hu@uwaterloo.ca)

• Website: cs.uwaterloo.ca/~y328yu/mycourses/794
slides, notes, videos, assignments, policy, etc.

• Piazza: piazza.com/uwaterloo.ca/fall2022/co673cs794
announcements, questions, discussions, etc.

• Learn: learn.uwaterloo.ca/d2l/home/825963
assignments, solutions, grades, etc.

Prerequisites
• Basic linear algebra, calculus, probability, and algorithms

• Some relevant books on course website

• Coding

https://www.python.org/ https://julialang.org/

“Coding is to programming what typing is to writing.”

— Leslie Lamport

Textbooks
• No required textbook
• Notes, slides, and code will be posted on the course website
• Some fine textbooks for the ambitious ones:

links available on the course website

Workload
• Roughly 24 lectures, each lasting 80 mins (I hope)

• Expect 5 assignments, roughly one every two weeks

– 20 points each; total: 100

– with prior approval, you may substitute one assignment with a course project

• Small, constant progress every week

• Submit on LEARN. Submit early and often

– typesetting in LaTeX is recommended

Policy
• Do your work independently and individually
– discussion is fine, but no sharing of text or code

– explicitly acknowledge any source that helped you

• Ignorance is no excuse
– good online discussion, more on course website

• Serious offenses will result in expulsion...


• NO late submissions!
– except hospitalization, family emergency, etc.; notify beforehand

• Appeal within two weeks

Machine Learning is Everywhere
• Everyone uses ML every day

• Lots of cool applications

• Excellent for job-hunting

At the Core is Optimization

[Diagram: Optimization at the core, surrounded by Models, Stats, System, and Magic]
What You Will Learn
• Learn the basic theory and algorithms

• Gain some implementation experience

• Know when to use which algorithm with what guarantees

• Start to formulate problems with algorithms in mind

Let the Journey Begin
What a Dataset Looks Like

        x1   x2   x3   x4   ···   xn  |  x     x′
         0    1    0    1   ···    1  |  1    0.9
R^d ∋    0    0    1    1   ···    0  |  1    1.1
         ⋮    ⋮    ⋮    ⋮          ⋮  |  ⋮     ⋮
         1    0    1    0   ···    1  |  1   −0.1
    y    +    +    –    +   ···    –  |  ?    ?!

• each column is a data point: n in total; each has d features

• bottom y is the label vector; binary in this case

• x and x′ are test samples whose labels need to be predicted
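
In code, this layout might look as follows; a minimal NumPy sketch (the Python/NumPy choice and the toy numbers are illustrative, not part of the slides):

import numpy as np

# Toy dataset: each COLUMN of X is one data point with d = 3 features.
X = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])              # shape (d, n) = (3, 4)
y = np.array([+1, +1, -1, +1])            # one binary label per column
x_test = np.array([1, 1, 1])              # a test point whose label must be predicted

d, n = X.shape
assert y.shape == (n,)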

OR Dataset

        x1  x2  x3  x4
         0   1   0   1
         0   0   1   1
    y    –   +   +   +

[Plot: the four OR points in the (x1, x2) plane]
The Early Hype in AI...
NEW NAVY DEVICE LEARNS BY DOING: Psychologist Shows Embryo of Computer Designed to Read and Grow ...
New York Times (1923-); Jul 8, 1958; ProQuest Historical Newspapers: The New York Times
pg. 25

Reproduced with permission of the copyright owner (The New York Times, 1958).
...due to Perceptron

Frank Rosenblatt
(1928 – 1971)

Perceptron as an Optimization Problem
• Affine function: f(x) = ⟨x, w⟩ + b, where ⟨x, w⟩ := Σ_j xj wj

        find w ∈ R^d, b ∈ R such that ∀i, yi(⟨xi, w⟩ + b) > 0.

• Perceptron solves the above optimization problem!


– it is iterative: going through the data one by one
– it converges faster if the problem is easier
– it behaves benignly even if no solution exists
• Abstract a bit more:

find w ∈ S ⊆ R^d.
– we often can only describe S partially

Geometrically

[Figure: the hyperplane ⟨x, w⟩ + b = 0 in the (x1, x2) plane, with normal vector w; the + points lie on its positive side]
Algorithm 1: Perceptron
Input: dataset D = {(xi, yi) ∈ R^d × {±1} : i = 1, . . . , n},
       initialization w ∈ R^d and b ∈ R, threshold δ ≥ 0
Output: approximate solution w and b
1  for t = 1, 2, . . . do
2      receive index It ∈ {1, . . . , n}          // It can be random
3      if yIt(⟨xIt, w⟩ + b) ≤ δ then
4          w ← w + yIt xIt                        // update after a “mistake”
5          b ← b + yIt

• Typically δ = 0 and w0 = 0, b = 0

– yŷ > 0 vs. yŷ < 0 vs. yŷ = 0, where ŷ = ⟨x, w⟩ + b

• Lazy update: “if it ain’t broke, don’t fix it”

F. Rosenblatt (1958). “The perceptron: A probabilistic model for information storage and organization in the
brain”. Psychological Review, vol. 65, no. 6, pp. 386–408.
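
A minimal NumPy sketch of Algorithm 1; the cyclic pass over the data, zero initialization, and epoch cap are illustrative choices, not part of the pseudocode:

import numpy as np

def perceptron(X, y, delta=0.0, max_epochs=100):
    """Run Algorithm 1 on X (shape (d, n), columns are points) with labels y in {+1, -1}."""
    d, n = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):                        # here I_t simply cycles through the data
            if y[i] * (X[:, i] @ w + b) <= delta:
                w += y[i] * X[:, i]               # update after a "mistake"
                b += y[i]
                mistakes += 1
        if mistakes == 0:                         # every point has margin > delta: stop
            break
    return w, b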
Does it work? (code)
[Plot: the OR dataset together with the separating line found by Perceptron]

        w = [2, 2],  b = −1,  ŷ = sign(⟨x, w⟩ + b),

where sign(0) is undefined (i.e., always counted as a mistake).
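
A quick numerical check of this claim; a minimal NumPy sketch (the array layout follows the dataset convention above):

import numpy as np

X = np.array([[0, 1, 0, 1],               # OR data: columns are the four points
              [0, 0, 1, 1]])
y = np.array([-1, +1, +1, +1])

w, b = np.array([2.0, 2.0]), -1.0         # the solution reported above
margins = y * (X.T @ w + b)               # y * (<x, w> + b) for every point
print(margins)                            # [1. 1. 1. 3.] -- all strictly positive
print(bool(np.all(margins > 0)))          # True: (w, b) separates the OR data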
XOR Dataset
        x1  x2  x3  x4
         0   1   0   1
         0   0   1   1
    y    –   +   +   –

[Plot: the four XOR points in the (x1, x2) plane]

• Prove that no line can separate + from –

• What happens if we run Perceptron regardless?

Perceptron and the 1st AI Winter

Marvin Minsky (1927 – 2016) and Seymour Papert (1928 – 2016)

M. L. Minsky and S. A. Papert (1969). “Perceptrons”. MIT Press.


Projection Algorithms

find w ∈ R^d, b ∈ R such that ∀i, yi(⟨xi, w⟩ + b) > 0

find w = [w; b] ∈ R^{d+1} such that ∀i, ⟨ai, w⟩ ≤ ci, where ai := −yi [xi; 1]
find w ∈ R^p such that A⊤w ≤ c

Algorithm 2: Projection Algorithm for Linear Inequalities

Input: A ∈ R^{p×n}, c ∈ R^n, initialization w ∈ R^p, relaxation parameter η ∈ (0, 2]
1  for t = 1, 2, . . . do
2      select index It ∈ {1, . . . , n}           // index It can be random
3      w ← (1 − η) w + η [ w − ((⟨aIt, w⟩ − cIt)₊ / ∥aIt∥₂) · (aIt / ∥aIt∥₂) ]

T. S. Motzkin and I. J. Schoenberg (1954). “The Relaxation Method for Linear Inequalities”. Canadian
Journal of Mathematics, vol. 6, pp. 393–404; S. Agmon (1954). “The Relaxation Method for Linear Inequalities”.
Canadian Journal of Mathematics, vol. 6, pp. 382–392.
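
A minimal NumPy sketch of Algorithm 2 with a cyclic choice of It; the stopping test, tolerance, and iteration cap are illustrative additions (the columns of A are assumed nonzero):

import numpy as np

def relaxation(A, c, eta=1.0, max_iters=1000, tol=1e-8):
    """Algorithm 2: seek w with A.T @ w <= c, where the columns of A are the a_i."""
    p, n = A.shape
    w = np.zeros(p)
    for t in range(max_iters):
        if np.all(A.T @ w <= c + tol):            # feasible up to tolerance: stop
            break
        a, ci = A[:, t % n], c[t % n]             # I_t chosen cyclically here
        viol = max(a @ w - ci, 0.0)               # the positive part (<a, w> - c_i)_+
        w = (1 - eta) * w + eta * (w - viol / (a @ a) * a)
    return w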
Interpreting Perceptron

Theorem:  int cone∗ A ≠ ∅  ⟺  int cone∗ A ∩ cone A ≠ ∅.

        cone A      := {Aλ : λ ≥ 0}
        cone∗ A     := {w : A⊤w ≥ 0}
        int cone∗ A := {w : A⊤w > 0}

[Figure: the cone generated by a1 and a2, and the interior of its dual cone int cone∗ A]
Is Perceptron Unique?

[Plot: the OR dataset; many different separating lines exist]
Support Vector Machines: Primal
[Plot: the OR dataset with the maximum-margin separating line]

        max_{w, b : ∀i, ŷi yi > 0}   min_{i=1,...,n}   (ŷi yi) / ∥w∥,    where ŷi := ⟨xi, w⟩ + b
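
As a worked instance of this objective, the margin attained on the OR data by the perceptron solution from the earlier slide (an illustrative check; the SVM would maximize this quantity over all separating (w, b)):

import numpy as np

X = np.array([[0, 1, 0, 1],               # OR data, columns are points
              [0, 0, 1, 1]])
y = np.array([-1, +1, +1, +1])

w, b = np.array([2.0, 2.0]), -1.0         # the perceptron solution from before
yhat = X.T @ w + b
print(np.min(yhat * y) / np.linalg.norm(w))   # min_i yhat_i * y_i / ||w||  ~= 0.354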
Support Vector Machines: Dual
[Plot: the OR dataset]

        min_{µ ∈ ∆+}  min_{ν ∈ ∆−}   ∥ Σ_{i: yi = +} µi xi  −  Σ_{j: yj = −} νj xj ∥
Beyond Separability
[Plot: a dataset with interleaved + and – points that no line separates perfectly]

        min_w   Ê ℓ(yŷ) + reg(w),   s.t.  ŷ := ⟨x, w⟩ + b
Empirical Risk Minimization

        min_w   Ê ℓ(yŷ),   s.t.  ŷ := ⟨x, w⟩ + b

[Plot: common surrogate losses ℓ as functions of the margin yŷ: zero-one, hinge, square hinge, logistic, exponential, and Perceptron]
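
For concreteness, a small sketch of these surrogate losses as functions of the margin t = yŷ (standard textbook forms; the scaling on the slide's plot may differ slightly):

import numpy as np

def zero_one(t):        return (t <= 0).astype(float)   # yŷ = 0 counted as a mistake
def hinge(t):           return np.maximum(1 - t, 0.0)
def square_hinge(t):    return np.maximum(1 - t, 0.0) ** 2
def logistic(t):        return np.log1p(np.exp(-t))     # log(1 + e^{-t})
def exponential(t):     return np.exp(-t)
def perceptron_loss(t): return np.maximum(-t, 0.0)

t = np.linspace(-4, 2, 7)
print(hinge(t))                                          # [5. 4. 3. 2. 1. 0. 0.]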
Regularization

        min_w   reg(w),   s.t.  ŷ := ⟨x, w⟩ + b

[Plot: common regularizers reg(w) as functions of a scalar weight w: ℓ0, ℓ1, and ℓ2²]
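
And a one-line sketch of each plotted regularizer (ℓ0 counts nonzero entries and is not a norm; ℓ1 and squared ℓ2 are the usual ones):

import numpy as np

def l0(w):     return float(np.count_nonzero(w))   # number of nonzero entries
def l1(w):     return np.sum(np.abs(w))
def l2_sq(w):  return np.sum(w ** 2)

w = np.array([0.0, -3.0, 0.5])
print(l0(w), l1(w), l2_sq(w))                      # 2.0 3.5 9.25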
Regression

        min_w   Ê ℓ(y − ŷ) + reg(w),   s.t.  ŷ := ⟨x, w⟩ + b

[Plot: common regression losses ℓ as functions of the residual y − ŷ: square, ϵ-insensitive, absolute, and Huber]
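
Similarly, a short sketch of the plotted regression losses as functions of the residual r = y − ŷ (the parameters ϵ and δ below are illustrative):

import numpy as np

def square(r):                      return r ** 2
def absolute(r):                    return np.abs(r)
def eps_insensitive(r, eps=0.5):    return np.maximum(np.abs(r) - eps, 0.0)   # flat inside the tube
def huber(r, delta=1.0):            # quadratic near 0, linear in the tails
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

r = np.linspace(-3, 3, 7)
print(huber(r))                     # [2.5 1.5 0.5 0.  0.5 1.5 2.5]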
Plan I: Basic
• Lec04: Proximal Gradient: smooth ℓ + nonsmooth reg
• Lec05: Subgradient: nonsmooth ℓ + nonsmooth reg
• Lec10: Acceleration: optimal algorithm under smoothness
• Lec11: Smoothing: nonsmooth → smooth
• Lec12: Alternating: divide and conquer
• Lec13: Coordinate Gradient: large model
• Lec18: Stochastic Gradient: large dataset
• Lec22: Newton: even faster under smoothness
• Lec23: Quasi-Newton: Newton made economical

Denoising

        min_z   (1/2) ∥x − z∥₂²  +  λ · ∥z∥tv
                 (fidelity)         (regularization)

• λ controls the trade-off


• regularization encodes prior knowledge
• crucial to not over-smooth
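
As a concrete instance, a sketch of this objective for a 1D signal, taking ∥z∥tv to be the sum of absolute differences of neighbouring entries (one common convention; the toy signal and λ are illustrative):

import numpy as np

def denoising_objective(z, x, lam):
    """(1/2)||x - z||_2^2 + lam * TV(z) with TV(z) = sum_i |z_{i+1} - z_i|."""
    return 0.5 * np.sum((x - z) ** 2) + lam * np.sum(np.abs(np.diff(z)))

rng = np.random.default_rng(0)
clean = np.repeat([0.0, 1.0, 0.0], 20)                  # piecewise-constant signal
x = clean + 0.1 * rng.standard_normal(clean.size)       # noisy observation
print(denoising_objective(x, x, lam=1.0))               # z = x: zero fidelity, large TV
print(denoising_objective(clean, x, lam=1.0))           # z = clean: small fidelity, TV = 2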
Adversarial Examples
[Figure: a feedforward neural network with input layer (x1, . . . , xd), two hidden layers, and output layer (y1, . . . , yc), flanked by images labelled “Shetland” and “Collie”]
Adversarial Attacks

• Mathematically, a neural network is a function f (w; x)


• Typically, input x is given and network weights w optimized
• Could also freeze weights w and optimize x, adversarially!

        min_δ   size(δ)   s.t.   pred[f(w; x + δ)] ≠ y

• More generally:   max_δ   ℓ(w; x + δ, y)   s.t.   size(δ) ≤ ϵ
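
A minimal sketch of the “more general” attack for a linear model with logistic loss and size(δ) = ∥δ∥∞; for this linear case the single gradient-sign step below solves the inner maximization exactly (the model, loss, and numbers are illustrative assumptions, not the slide's setting):

import numpy as np

def loss(w, b, x, y):
    return np.log1p(np.exp(-y * (x @ w + b)))           # logistic loss of a linear model

def attack(w, b, x, y, eps):
    """Maximize loss(w, b, x + delta, y) over ||delta||_inf <= eps (exact for a linear model)."""
    grad_x = -y * w / (1.0 + np.exp(y * (x @ w + b)))   # gradient of the loss w.r.t. x
    return eps * np.sign(grad_x)                        # step to a corner of the box

w, b = np.array([2.0, 2.0]), -1.0                       # the classifier from the OR example
x, y = np.array([1.0, 0.0]), +1                         # a correctly classified point
delta = attack(w, b, x, y, eps=0.6)
print(loss(w, b, x, y), loss(w, b, x + delta, y))       # the loss increases (and the label flips)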


Robustness as Optimization
• Empirical risk minimization recalled:

        min_w  Ê ℓ(w; x, y)

• Adversarial attack perturbs (x, y) while fixing w:

        max_{size(δ) ≤ ϵ}  ℓ(w; x + δ, y)

• Robustness by anticipating the worst case:

        min_w  Ê  max_{size(δ) ≤ ϵ}  ℓ(w; x + δ, y)

• The game continues by anticipating the anticipation:

        max_{size(δ) ≤ ϵ}  ℓ(w; x + δ, y)      (leader)
        min_w  Ê ℓ(w; x + δ, y)                (follower)
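
A compact sketch of the resulting min-max (“robust”) training for the same linear logistic setting as above: the inner maximization has the closed form δ = −ϵ y sign(w), so the worst-case margin is y(⟨x, w⟩ + b) − ϵ∥w∥₁, and the outer problem is handled by plain (sub)gradient descent; all modelling choices are illustrative:

import numpy as np

def adversarial_training(X, y, eps, lr=0.1, steps=500):
    """Descend E-hat of max_{||delta||_inf <= eps} loss(w; x + delta, y) for a linear
    logistic model. X has shape (d, n) with columns as points; y is in {+1, -1}^n."""
    d, n = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        # Worst-case margin per point: y(<x, w> + b) - eps * ||w||_1.
        margins = y * (X.T @ w + b) - eps * np.sum(np.abs(w))
        s = -np.exp(-np.logaddexp(0.0, margins))         # = -1/(1 + e^{margin}), computed stably
        # (Sub)gradient of the averaged worst-case logistic loss w.r.t. (w, b):
        grad_w = (X @ (s * y)) / n - eps * np.mean(s) * np.sign(w)
        grad_b = np.mean(s * y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical usage: w_rob, b_rob = adversarial_training(X_train, y_train, eps=0.1)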
Plan II: Game-theoretic
• Lec07: Fictitious Play: playing against oneself

• Lec14: Minimax: understanding duality

• Lec15: Averaging: projected gradient descent ascent

• Lec16: Extragradient: faster under smoothness

• Lec17: Splitting: exploiting structure

• Lec20: Randomized Smoothing: simulating gradient

Generative Adversarial Networks

        min_θ  max_φ   Ê log Sφ(x)  +  Ê log(1 − Sφ ∘ Tθ(z))

I. Goodfellow et al. (2014). “Generative Adversarial Nets”. In: NIPS.


Plan III: Exotic

• Lec06: Conditional Gradient: model weights quantization


• Lec09: Metric Gradient: model gradient quantization
• Lec08: Mirror Descent: gradient under non-Euclidean geometry
History Goes A Long Way Back
“Nothing in the world takes place without optimization,
and there is no doubt that all aspects of the world that have
a rational basis can be explained by optimization methods.”
— Leonhard Euler, 1744
“Every year I meet Ph.D. students of different specializations who ask me for advice
on reasonable numerical schemes for their optimization models. And very often they
seem to have come too late. In my experience, if an optimization model is created
without taking into account the abilities of numerical schemes, the chances that it
will be possible to find an acceptable numerical solution are close to zero. In any
field of human activity, if we create something, we know in advance why we are doing
so and what we are going to do with the result.” — Yurii Nesterov

No Free Lunch
• On average, no algorithm is better than any other [1]
• In general, optimization problems are unsolvable [2]
• Implications:
– don’t try to solve all problems; one (class) at a time!
– “efficient optimization methods can be developed only by intelligently
employing the structure of particular instances of problems”
– know your algorithms and their limits
– be open to the impossible

“There are no inferior algorithms, only inferior engineers.”


[1] D. H. Wolpert and W. G. Macready (1997). “No free lunch theorems for optimization”. IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82.
[2] K. G. Murty and S. N. Kabadi (1987). “Some NP-complete problems in quadratic and nonlinear programming”. Mathematical Programming, vol. 39, no. 2, pp. 117–129.
