Lec 01
https://distill.pub/2017/feature-visualization/
How you represent your data determines what questions are easy to
answer.
E.g. a dict of word counts is good for questions like “What is the most
common word in Hamlet?”
It’s not so good for semantic questions like “If Alice liked Harry Potter,
will she like The Hunger Games?”
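As a minimal sketch of the word-count representation (the filename hamlet.txt is hypothetical), counting words answers the first kind of question in a few lines, while nothing in the counts encodes meaning:

from collections import Counter

# Assumes a plain-text copy of the play in "hamlet.txt" (hypothetical file).
with open("hamlet.txt") as f:
    counts = Counter(f.read().lower().split())

print(counts.most_common(1))   # easy: the most common word
# The counts carry no notion of meaning, so semantic questions like the
# Harry Potter / Hunger Games one are out of reach for this representation.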
t-SNE
Denton et al., 2015, Deep generative image models using a Laplacian pyramid of adversarial networks
https://talktotransformer.com/
Linear models
One of the fundamental building blocks in deep learning is the linear
model, where you make a decision based on a linear function of the input
vector.
Here, we will review linear models, some other fundamental concepts
(e.g. gradient descent, generalization), and some of the common
supervised learning problems:
Regression: predict a scalar-valued target (e.g. stock price)
Binary classification: predict a binary label (e.g. spam vs. non-spam
email)
Multiway classification: predict a discrete label (e.g. object category,
from a list)
We can organize all the training examples into a matrix X with one
row per training example, and all the targets into a vector t.
y = Xw + b1    (where 1 is the vector of all ones)

J = (1/(2N)) ‖y − t‖²
In Python:
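Here is a minimal NumPy sketch; the toy data is mine, and only the last two lines come from the definitions above:

import numpy as np

# Toy data with the shapes defined above: X is N x D, t has length N.
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))
t = rng.normal(size=N)
w = np.zeros(D)
b = 0.0

y = X @ w + b                          # y = Xw + b1
cost = np.sum((y - t) ** 2) / (2 * N)  # J = (1/(2N)) ||y - t||^2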
∂y/∂w_j = ∂/∂w_j ( Σ_{j'} w_{j'} x_{j'} + b ) = x_j

∂y/∂b = ∂/∂b ( Σ_{j'} w_{j'} x_{j'} + b ) = 1
w ← w − α ∇J(w)
  = w − (α/N) Σ_{i=1}^{N} (y^(i) − t^(i)) x^(i)
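A minimal sketch of this update as a training loop; the toy problem, learning rate, and iteration count are arbitrary choices for illustration, not from the slides:

import numpy as np

# Synthetic regression problem for illustration.
rng = np.random.default_rng(0)
N, D = 100, 3
true_w, true_b = np.array([1.0, -2.0, 0.5]), 0.3
X = rng.normal(size=(N, D))
t = X @ true_w + true_b + 0.01 * rng.normal(size=N)

w, b, alpha = np.zeros(D), 0.0, 0.1
for _ in range(500):
    y = X @ w + b
    w -= (alpha / N) * (X.T @ (y - t))  # the update derived above
    b -= (alpha / N) * np.sum(y - t)    # same rule applied to the bias
# w and b now closely approximate true_w and true_b.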
Visualization:
http://www.cs.toronto.edu/~guerzhoy/321/lec/W01/linear_regression.pdf#page=21
y = w_0 + w_1 x + · · · + w_D x^D
[Figure: degree-M polynomial fits to the same data for M = 0, 1, 3, and 9; each panel plots t against x. The M = 3 panel shows y = w_0 + w_1 x + w_2 x² + w_3 x³; the M = 9 panel shows y = w_0 + w_1 x + · · · + w_9 x⁹.]
Underfitting: the model is too simple to fit the data (e.g. the M = 0 fit above).
Overfitting: the model is too complex; it fits the training points perfectly but does not generalize (e.g. the M = 9 fit above).
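You can reproduce this effect numerically; the noisy-sine data below is my choice of example, not from the slides:

import numpy as np

# Ten noisy samples of a smooth curve.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, size=10))
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)

for M in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, deg=M)   # least-squares polynomial fit
    y = np.polyval(coeffs, x)
    print(M, np.mean((y - t) ** 2))    # training error falls as M grows
# M = 9 drives the training error to ~0 (it interpolates the ten points)
# yet generalizes worst: exactly the overfitting regime described above.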
z = wᵀx + b
output = 1 if z ≥ 0
         0 if z < 0
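A direct transcription of this decision rule, with toy values for w, b, and x:

import numpy as np

w = np.array([1.0, -1.0])    # toy weights
b = -0.5                     # toy bias
x = np.array([2.0, 0.5])

z = w @ x + b                # z = w^T x + b
output = 1 if z >= 0 else 0  # hard threshold at z = 0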
z = wᵀx + b
y = σ(z)
L_CE(y, t) = −log y        if t = 1
             −log(1 − y)   if t = 0
           = −t log y − (1 − t) log(1 − y)
z = wᵀx + b
y = σ(z)
  = 1 / (1 + e^(−z))
L_CE = −t log y − (1 − t) log(1 − y)
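A minimal sketch of the logistic model and cross-entropy loss on one training case; the parameter values are arbitrary, and the small epsilon guarding log(0) is an implementation detail, not part of the formula above:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

w = np.array([0.5, -0.3])   # toy parameters
b = 0.1
x = np.array([1.0, 2.0])
t = 1                       # binary target

z = w @ x + b
y = sigmoid(z)
eps = 1e-12                 # guard against log(0)
L_CE = -t * np.log(y + eps) - (1 - t) * np.log(1 - y + eps)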
t = (0, . . . , 0, 1, 0, . . . , 0)    (entry k is 1; a one-hot vector)
Vectorized:
z = Wx + b
y = softmax(z)
L_CE = −tᵀ (log y)
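A minimal sketch of the vectorized multiclass model; the sizes and parameters are toy values, and the max-subtraction inside the softmax is a standard numerical-stability trick rather than part of the formula:

import numpy as np

def softmax(z):
    z = z - np.max(z)   # shift for numerical stability; output unchanged
    e = np.exp(z)
    return e / e.sum()

# Toy sizes: K classes, D input dimensions.
rng = np.random.default_rng(0)
K, D = 3, 4
W = rng.normal(size=(K, D))
b = np.zeros(K)
x = rng.normal(size=D)

t = np.zeros(K)
t[1] = 1.0              # one-hot target: entry k is 1

z = W @ x + b
y = softmax(z)
L_CE = -t @ np.log(y)   # -t^T (log y)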