
CS794/CO673: Optimization for Data Science

Lec 00: Introduction

Yaoliang Yu

September 9, 2022
Course Information
• Instructor: Yao-Liang Yu (yaoliang.yu@uwaterloo.ca)

• Office hours: Friday 4-5pm (DC3617) or by email appointment

• TA: Zeou Hu (zeou.hu@uwaterloo.ca)

• Website: cs.uwaterloo.ca/~y328yu/mycourses/794
slides, notes, videos, assignments, policy, etc.

• Piazza: piazza.com/uwaterloo.ca/fall2022/co673cs794
announcements, questions, discussions, etc.

• Learn: learn.uwaterloo.ca/d2l/home/825963
assignments, solutions, grades, etc.

Prerequisites
• Basic linear algebra, calculus, probability, and algorithms

• Some relevant books on course website

• Coding

https://www.python.org/ https://julialang.org/

“Coding is to programming what typing is to writing.”

— Leslie Lamport

Textbooks
• No required textbook
• Notes, slides, and code will be posted on the course website
• Some fine textbooks for the ambitious ones:

links available on the course website

Workload
• Roughly 24 lectures, each lasting 80 mins (I hope)

• Expect 5 assignments, roughly one every two weeks

– 20 points each; total: 100

– with prior approval, you may substitute one assignment with a course project

• Small, constant progress every week

• Submit on LEARN. Submit early and often

– typesetting in LaTeX is recommended

Policy
• Do your work independently and individually
– discussion is fine, but no sharing of text or code

– explicitly acknowledge any source that helped you

• Ignorance is no excuse
– good online discussion, more on course website

• Serious offenses will result in expulsion...


• NO late submissions!
– except hospitalization, family emergency, etc.; notify beforehand

• Appeal within two weeks

Machine Learning is Everywhere
• Everyone uses ML every day

• Lots of cool applications

• Excellent for job-hunting

At the Core is Optimization

[Diagram: Optimization at the core, surrounded by Models, Stats, System, and Magic]
What You Will Learn
• Learn the basic theory and algorithms

• Gain some implementation experience

• Know when to use which algorithm with what guarantees

• Start to formulate problems with algorithms in mind

Let the Journey Begin
What a Dataset Looks Like

        x1   x2   x3   x4   ···   xn  |  x     x′
         0    1    0    1   ···    1  |  1    0.9
R^d ∋    0    0    1    1   ···    0  |  1    1.1
         ⋮    ⋮    ⋮    ⋮          ⋮  |  ⋮     ⋮
         1    0    1    0   ···    1  |  1   −0.1
    y    +    +    –    +   ···    –  |  ?    ?!

• each column is a data point: n in total; each has d features

• bottom y is the label vector; binary in this case

• x and x′ are test samples whose labels need to be predicted
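
In code, this layout might look as follows; a minimal NumPy sketch (the Python/NumPy choice and the toy numbers are illustrative, not part of the slides):

import numpy as np

# Toy dataset: each COLUMN of X is one data point with d = 3 features.
X = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 1, 0]])              # shape (d, n) = (3, 4)
y = np.array([+1, +1, -1, +1])            # one binary label per column
x_test = np.array([1, 1, 1])              # a test point whose label must be predicted

d, n = X.shape
assert y.shape == (n,)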

OR Dataset

        x1  x2  x3  x4
         0   1   0   1
         0   0   1   1
    y    –   +   +   +

[Plot: the four OR points in the (x1, x2) plane]
The Early Hype in AI...
NEW NAVY DEVICE LEARNS BY DOING: Psychologist Shows Embryo of Computer Designed to Read and Grow ...
New York Times (1923-); Jul 8, 1958; ProQuest Historical Newspapers: The New York Times
pg. 25

Reproduced with permission of the copyright owner (The New York Times, 1958).
...due to Perceptron

Frank Rosenblatt
(1928 – 1971)

Perceptron as an Optimization Problem
• Affine function: f(x) = ⟨x, w⟩ + b, where ⟨x, w⟩ := Σ_j xj wj

        find w ∈ R^d, b ∈ R such that ∀i, yi(⟨xi, w⟩ + b) > 0.

• Perceptron solves the above optimization problem!


– it is iterative: going through the data one by one
– it converges faster if the problem is easier
– it behaves benignly even if no solution exists
• Abstract a bit more:

find w ∈ S ⊆ R^d.
– we often can only describe S partially

Geometrically

[Figure: the hyperplane ⟨x, w⟩ + b = 0 in the (x1, x2) plane, with normal vector w; the + points lie on its positive side]
Algorithm 1: Perceptron
Input: dataset D = {(xi, yi) ∈ R^d × {±1} : i = 1, . . . , n},
       initialization w ∈ R^d and b ∈ R, threshold δ ≥ 0
Output: approximate solution w and b
1  for t = 1, 2, . . . do
2      receive index It ∈ {1, . . . , n}          // It can be random
3      if yIt(⟨xIt, w⟩ + b) ≤ δ then
4          w ← w + yIt xIt                        // update after a “mistake”
5          b ← b + yIt

• Typically δ = 0 and w0 = 0, b = 0

– yŷ > 0 vs. yŷ < 0 vs. yŷ = 0, where ŷ = ⟨x, w⟩ + b

• Lazy update: “if it ain’t broke, don’t fix it”

F. Rosenblatt (1958). “The perceptron: A probabilistic model for information storage and organization in the
brain”. Psychological Review, vol. 65, no. 6, pp. 386–408.
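
A minimal NumPy sketch of Algorithm 1; the cyclic pass over the data, zero initialization, and epoch cap are illustrative choices, not part of the pseudocode:

import numpy as np

def perceptron(X, y, delta=0.0, max_epochs=100):
    """Run Algorithm 1 on X (shape (d, n), columns are points) with labels y in {+1, -1}."""
    d, n = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):                        # here I_t simply cycles through the data
            if y[i] * (X[:, i] @ w + b) <= delta:
                w += y[i] * X[:, i]               # update after a "mistake"
                b += y[i]
                mistakes += 1
        if mistakes == 0:                         # every point has margin > delta: stop
            break
    return w, b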
Does it work? (code)
[Plot: the OR dataset together with the separating line found by Perceptron]

        w = [2, 2],  b = −1,  ŷ = sign(⟨x, w⟩ + b),

where sign(0) is undefined (i.e., always counted as a mistake).
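
A quick numerical check of this claim; a minimal NumPy sketch (the array layout follows the dataset convention above):

import numpy as np

X = np.array([[0, 1, 0, 1],               # OR data: columns are the four points
              [0, 0, 1, 1]])
y = np.array([-1, +1, +1, +1])

w, b = np.array([2.0, 2.0]), -1.0         # the solution reported above
margins = y * (X.T @ w + b)               # y * (<x, w> + b) for every point
print(margins)                            # [1. 1. 1. 3.] -- all strictly positive
print(bool(np.all(margins > 0)))          # True: (w, b) separates the OR data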
XOR Dataset
        x1  x2  x3  x4
         0   1   0   1
         0   0   1   1
    y    –   +   +   –

[Plot: the four XOR points in the (x1, x2) plane]

• Prove that no line can separate + from –

• What happens if we run Perceptron regardless?

Perceptron and the 1st AI Winter

Marvin Minsky (1927 – 2016) and Seymour Papert (1928 – 2016)

M. L. Minsky and S. A. Papert (1969). “Perceptrons”. MIT Press.


Projection Algorithms

find w ∈ R^d, b ∈ R such that ∀i, yi(⟨xi, w⟩ + b) > 0

find w = [w; b] ∈ R^{d+1} such that ∀i, ⟨ai, w⟩ ≤ ci, where ai := −yi [xi; 1]
find w ∈ R^p such that A⊤w ≤ c

Algorithm 2: Projection Algorithm for Linear Inequalities

Input: A ∈ R^{p×n}, c ∈ R^n, initialization w ∈ R^p, relaxation parameter η ∈ (0, 2]
1  for t = 1, 2, . . . do
2      select index It ∈ {1, . . . , n}           // index It can be random
3      w ← (1 − η) w + η [ w − ((⟨aIt, w⟩ − cIt)₊ / ∥aIt∥₂) · (aIt / ∥aIt∥₂) ]

T. S. Motzkin and I. J. Schoenberg (1954). “The Relaxation Method for Linear Inequalities”. Canadian
Journal of Mathematics, vol. 6, pp. 393–404; S. Agmon (1954). “The Relaxation Method for Linear Inequalities”.
Canadian Journal of Mathematics, vol. 6, pp. 382–392.
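
A minimal NumPy sketch of Algorithm 2 with a cyclic choice of It; the stopping test, tolerance, and iteration cap are illustrative additions (the columns of A are assumed nonzero):

import numpy as np

def relaxation(A, c, eta=1.0, max_iters=1000, tol=1e-8):
    """Algorithm 2: seek w with A.T @ w <= c, where the columns of A are the a_i."""
    p, n = A.shape
    w = np.zeros(p)
    for t in range(max_iters):
        if np.all(A.T @ w <= c + tol):            # feasible up to tolerance: stop
            break
        a, ci = A[:, t % n], c[t % n]             # I_t chosen cyclically here
        viol = max(a @ w - ci, 0.0)               # the positive part (<a, w> - c_i)_+
        w = (1 - eta) * w + eta * (w - viol / (a @ a) * a)
    return w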
Interpreting Perceptron

Theorem:  int cone∗ A ≠ ∅  ⟺  int cone∗ A ∩ cone A ≠ ∅.

        cone A      := {Aλ : λ ≥ 0}
        cone∗ A     := {w : A⊤w ≥ 0}
        int cone∗ A := {w : A⊤w > 0}

[Figure: the cone generated by a1 and a2, and the interior of its dual cone int cone∗ A]
Is Perceptron Unique?

[Plot: the OR dataset; many different separating lines exist]
Support Vector Machines: Primal
[Plot: the OR dataset with the maximum-margin separating line]

        max_{w, b : ∀i, ŷi yi > 0}   min_{i=1,...,n}   (ŷi yi) / ∥w∥,    where ŷi := ⟨xi, w⟩ + b
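
As a worked instance of this objective, the margin attained on the OR data by the perceptron solution from the earlier slide (an illustrative check; the SVM would maximize this quantity over all separating (w, b)):

import numpy as np

X = np.array([[0, 1, 0, 1],               # OR data, columns are points
              [0, 0, 1, 1]])
y = np.array([-1, +1, +1, +1])

w, b = np.array([2.0, 2.0]), -1.0         # the perceptron solution from before
yhat = X.T @ w + b
print(np.min(yhat * y) / np.linalg.norm(w))   # min_i yhat_i * y_i / ||w||  ~= 0.354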
Support Vector Machines: Dual
[Plot: the OR dataset]

        min_{µ ∈ ∆+}  min_{ν ∈ ∆−}   ∥ Σ_{i: yi = +} µi xi  −  Σ_{j: yj = −} νj xj ∥
Beyond Separability
[Plot: a dataset with interleaved + and – points that no line separates perfectly]

        min_w   Ê ℓ(yŷ) + reg(w),   s.t.  ŷ := ⟨x, w⟩ + b
Empirical Risk Minimization

        min_w   Ê ℓ(yŷ),   s.t.  ŷ := ⟨x, w⟩ + b

[Plot: common surrogate losses ℓ as functions of the margin yŷ: zero-one, hinge, square hinge, logistic, exponential, and Perceptron]
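
For concreteness, a small sketch of these surrogate losses as functions of the margin t = yŷ (standard textbook forms; the scaling on the slide's plot may differ slightly):

import numpy as np

def zero_one(t):        return (t <= 0).astype(float)   # yŷ = 0 counted as a mistake
def hinge(t):           return np.maximum(1 - t, 0.0)
def square_hinge(t):    return np.maximum(1 - t, 0.0) ** 2
def logistic(t):        return np.log1p(np.exp(-t))     # log(1 + e^{-t})
def exponential(t):     return np.exp(-t)
def perceptron_loss(t): return np.maximum(-t, 0.0)

t = np.linspace(-4, 2, 7)
print(hinge(t))                                          # [5. 4. 3. 2. 1. 0. 0.]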
Regularization

        min_w   reg(w),   s.t.  ŷ := ⟨x, w⟩ + b

[Plot: common regularizers reg(w) as functions of a scalar weight w: ℓ0, ℓ1, and ℓ2²]
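
And a one-line sketch of each plotted regularizer (ℓ0 counts nonzero entries and is not a norm; ℓ1 and squared ℓ2 are the usual ones):

import numpy as np

def l0(w):     return float(np.count_nonzero(w))   # number of nonzero entries
def l1(w):     return np.sum(np.abs(w))
def l2_sq(w):  return np.sum(w ** 2)

w = np.array([0.0, -3.0, 0.5])
print(l0(w), l1(w), l2_sq(w))                      # 2.0 3.5 9.25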
Regression

        min_w   Ê ℓ(y − ŷ) + reg(w),   s.t.  ŷ := ⟨x, w⟩ + b

[Plot: common regression losses ℓ as functions of the residual y − ŷ: square, ϵ-insensitive, absolute, and Huber]
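
Similarly, a short sketch of the plotted regression losses as functions of the residual r = y − ŷ (the parameters ϵ and δ below are illustrative):

import numpy as np

def square(r):                      return r ** 2
def absolute(r):                    return np.abs(r)
def eps_insensitive(r, eps=0.5):    return np.maximum(np.abs(r) - eps, 0.0)   # flat inside the tube
def huber(r, delta=1.0):            # quadratic near 0, linear in the tails
    return np.where(np.abs(r) <= delta, 0.5 * r ** 2, delta * (np.abs(r) - 0.5 * delta))

r = np.linspace(-3, 3, 7)
print(huber(r))                     # [2.5 1.5 0.5 0.  0.5 1.5 2.5]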
Plan I: Basic
• Lec04: Proximal Gradient: smooth ℓ + nonsmooth reg
• Lec05: Subgradient: nonsmooth ℓ + nonsmooth reg
• Lec10: Acceleration: optimal algorithm under smoothness
• Lec11: Smoothing: nonsmooth → smooth
• Lec12: Alternating: divide and conquer
• Lec13: Coordinate Gradient: large model
• Lec18: Stochastic Gradient: large dataset
• Lec22: Newton: even faster under smoothness
• Lec23: Quasi-Newton: Newton made economical

Denoising

        min_z   (1/2) ∥x − z∥₂²  +  λ · ∥z∥tv
                 (fidelity)         (regularization)

• λ controls the trade-off


• regularization encodes prior knowledge
• crucial to not over-smooth
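
As a concrete instance, a sketch of this objective for a 1D signal, taking ∥z∥tv to be the sum of absolute differences of neighbouring entries (one common convention; the toy signal and λ are illustrative):

import numpy as np

def denoising_objective(z, x, lam):
    """(1/2)||x - z||_2^2 + lam * TV(z) with TV(z) = sum_i |z_{i+1} - z_i|."""
    return 0.5 * np.sum((x - z) ** 2) + lam * np.sum(np.abs(np.diff(z)))

rng = np.random.default_rng(0)
clean = np.repeat([0.0, 1.0, 0.0], 20)                  # piecewise-constant signal
x = clean + 0.1 * rng.standard_normal(clean.size)       # noisy observation
print(denoising_objective(x, x, lam=1.0))               # z = x: zero fidelity, large TV
print(denoising_objective(clean, x, lam=1.0))           # z = clean: small fidelity, TV = 2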
Adversarial Examples
[Figure: a feedforward neural network with input layer (x1, . . . , xd), two hidden layers, and output layer (y1, . . . , yc), flanked by images labelled “Shetland” and “Collie”]
Adversarial Attacks

• Mathematically, a neural network is a function f (w; x)


• Typically, input x is given and network weights w optimized
• Could also freeze weights w and optimize x, adversarially!

        min_δ   size(δ)   s.t.   pred[f(w; x + δ)] ≠ y

• More generally:   max_δ   ℓ(w; x + δ, y)   s.t.   size(δ) ≤ ϵ
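
A minimal sketch of the “more general” attack for a linear model with logistic loss and size(δ) = ∥δ∥∞; for this linear case the single gradient-sign step below solves the inner maximization exactly (the model, loss, and numbers are illustrative assumptions, not the slide's setting):

import numpy as np

def loss(w, b, x, y):
    return np.log1p(np.exp(-y * (x @ w + b)))           # logistic loss of a linear model

def attack(w, b, x, y, eps):
    """Maximize loss(w, b, x + delta, y) over ||delta||_inf <= eps (exact for a linear model)."""
    grad_x = -y * w / (1.0 + np.exp(y * (x @ w + b)))   # gradient of the loss w.r.t. x
    return eps * np.sign(grad_x)                        # step to a corner of the box

w, b = np.array([2.0, 2.0]), -1.0                       # the classifier from the OR example
x, y = np.array([1.0, 0.0]), +1                         # a correctly classified point
delta = attack(w, b, x, y, eps=0.6)
print(loss(w, b, x, y), loss(w, b, x + delta, y))       # the loss increases (and the label flips)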


Robustness as Optimization
• Empirical risk minimization recalled:

        min_w  Ê ℓ(w; x, y)

• Adversarial attack perturbs (x, y) while fixing w:

        max_{size(δ) ≤ ϵ}  ℓ(w; x + δ, y)

• Robustness by anticipating the worst case:

        min_w  Ê  max_{size(δ) ≤ ϵ}  ℓ(w; x + δ, y)

• The game continues by anticipating the anticipation:

        max_{size(δ) ≤ ϵ}  ℓ(w; x + δ, y)      (leader)
        min_w  Ê ℓ(w; x + δ, y)                (follower)
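
A compact sketch of the resulting min-max (“robust”) training for the same linear logistic setting as above: the inner maximization has the closed form δ = −ϵ y sign(w), so the worst-case margin is y(⟨x, w⟩ + b) − ϵ∥w∥₁, and the outer problem is handled by plain (sub)gradient descent; all modelling choices are illustrative:

import numpy as np

def adversarial_training(X, y, eps, lr=0.1, steps=500):
    """Descend E-hat of max_{||delta||_inf <= eps} loss(w; x + delta, y) for a linear
    logistic model. X has shape (d, n) with columns as points; y is in {+1, -1}^n."""
    d, n = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        # Worst-case margin per point: y(<x, w> + b) - eps * ||w||_1.
        margins = y * (X.T @ w + b) - eps * np.sum(np.abs(w))
        s = -np.exp(-np.logaddexp(0.0, margins))         # = -1/(1 + e^{margin}), computed stably
        # (Sub)gradient of the averaged worst-case logistic loss w.r.t. (w, b):
        grad_w = (X @ (s * y)) / n - eps * np.mean(s) * np.sign(w)
        grad_b = np.mean(s * y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical usage: w_rob, b_rob = adversarial_training(X_train, y_train, eps=0.1)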
Plan II: Game-theoretic
• Lec07: Fictitious Play: playing against oneself

• Lec14: Minimax: understanding duality

• Lec15: Averaging: projected gradient descent ascent

• Lec16: Extragradient: faster under smoothness

• Lec17: Splitting: exploiting structure

• Lec20: Randomized Smoothing: simulating gradient

Generative Adversarial Networks

        min_θ  max_φ   Ê log Sφ(x)  +  Ê log(1 − Sφ ∘ Tθ(z))

I. Goodfellow et al. (2014). “Generative Adversarial Nets”. In: NIPS.


Plan III: Exotic

• Lec06: Conditional Gradient: model weights quantization


• Lec09: Metric Gradient: model gradient quantization
• Lec08: Mirror Descent: gradient under non-Euclidean geometry
History Goes A Long Way Back
“Nothing in the world takes place without optimization,
and there is no doubt that all aspects of the world that have
a rational basis can be explained by optimization methods.”
— Leonhard Euler, 1744
“Every year I meet Ph.D. students of different specializations who ask me for advice
on reasonable numerical schemes for their optimization models. And very often they
seem to have come too late. In my experience, if an optimization model is created
without taking into account the abilities of numerical schemes, the chances that it
will be possible to find an acceptable numerical solution are close to zero. In any
field of human activity, if we create something, we know in advance why we are doing
so and what we are going to do with the result.” — Yurii Nesterov

No Free Lunch
• On average, no algorithm is better than any other [1]
• In general, optimization problems are unsolvable [2]
• Implications:
– don’t try to solve all problems; one (class) at a time!
– “efficient optimization methods can be developed only by intelligently
employing the structure of particular instances of problems”
– know your algorithms and their limits
– be open to the impossible

“There are no inferior algorithms, only inferior engineers.”


[1] D. H. Wolpert and W. G. Macready (1997). “No free lunch theorems for optimization”. IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82.
[2] K. G. Murty and S. N. Kabadi (1987). “Some NP-complete problems in quadratic and nonlinear programming”. Mathematical Programming, vol. 39, no. 2, pp. 117–129.
