Unit 1 Lecture 3
If before we’d have predicted “the label is 3”, we now predict “the label is 3 with 80% chance and 1 with 20% chance”. This hard-coded 80% could suffice.◦ But let’s do better: intuitively, a 3 is more likely when threeness is huge than when threeness is nearly zero. So let’s replace that 80% by some smooth function of threeness. A popular, theoretically warranted choice is σ(z) = 1/(1 + exp(−z)):◦

    sigma = lambda z: 1. / (1. + np.exp(-z))
    prediction = {3: sigma(threeness[0]),
                  1: 1. - sigma(threeness[0])}

◦ As always, it depends on what specific thing we’re trying to do!
◦ σ, the logistic or sigmoid function, has linear log-odds: σ(z)/(1 − σ(z)) = exp(z). It tends exponentially to the step function. It’s symmetrical: σ(−z) = 1 − σ(z). Its derivative concentrates near zero: σ′(z) = σ(z)σ(−z).
Food For Thought: Plot σ(z) by hand.
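Before plotting, it’s easy to check the margin note’s three claims numerically; a minimal sketch, assuming numpy is imported as np:

    import numpy as np

    sigma = lambda z: 1. / (1. + np.exp(-z))

    z = np.linspace(-6., 6., 13)
    print(np.allclose(sigma(z) / (1. - sigma(z)), np.exp(z)))  # odds are exp(z)
    print(np.allclose(sigma(-z), 1. - sigma(z)))               # symmetry
    # derivative sigma'(z) = sigma(z)*sigma(-z), checked by finite differences:
    h = 1e-6
    print(np.allclose((sigma(z + h) - sigma(z - h)) / (2 * h),
                      sigma(z) * sigma(-z)))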
Given training inputs x_i, a hypothesis will have “hunches” about the training outputs y_i. Three hypotheses h_three!, h_three, and h_one might, respectively, confidently assert y_42 = 3; merely lean toward y_42 = 3; and think y_42 = 1. If in reality y_42 = 1, then we’d say h_one did a good job, h_three a bad job, and h_three! a very bad job on the 42nd example. So the training set “surprises” different hypotheses to different degrees. We may seek a hypothesis h⋆ that is minimally surprised, i.e., usually confidently right and, when wrong, not confidently so. In short, by outputting probabilities instead of mere labels, we’ve earned this awesome upshot: the machine can automatically calibrate its confidence levels! It’s easy to imagine how important this calibration is in language, self-driving, etc.
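To make “surprise” concrete, a standard measure is the negative log of the probability a hypothesis assigned to what actually happened. A minimal sketch, with made-up confidence numbers for the three hypotheses:

    import numpy as np

    # Probability each hypothesis assigns to the true event y_42 = 1
    # (these particular numbers are just for illustration):
    p_of_truth = {'h_three!': 0.01,   # confidently asserted y_42 = 3
                  'h_three':  0.30,   # merely leaned toward y_42 = 3
                  'h_one':    0.80}   # thought y_42 = 1

    for name, p in p_of_truth.items():
        print(name, 'surprise:', -np.log(p))
    # h_one is least surprised; h_three! is most surprised.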
interpreting weights — We note two aspects of the ‘intuitive logic’ of weights. Just because two features both correlate with a positive label (y = +1) doesn’t mean both features will have positive weights. In other words, it could be that the blah-feature correlates with y = +1 in the training set and yet, according to the best hypothesis for that training set, the bigger a fresh input’s blah-feature is, the less likely its label is to be +1, all else being equal. That last phrase “all else being equal” is crucial, since it refers to our choice of coordinates. See the figure, left three panels.
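To see this concretely, here is a minimal sketch with made-up data, fitting a logistic model by plain gradient descent: feature b correlates positively with the label, yet its fitted weight comes out negative, since once a is accounted for, larger b makes +1 less likely.

    import numpy as np

    rng = np.random.default_rng(0)

    # Two correlated features; the label truly depends on 3a - b,
    # so the best weight on b is negative...
    n = 1000
    a = rng.normal(size=n)
    b = a + 0.3 * rng.normal(size=n)     # b tracks a closely
    y = np.sign(3 * a - b)               # ...yet b correlates with y!
    X = np.stack([a, b], axis=1)

    print(np.corrcoef(b, y)[0, 1])       # positive correlation

    # Fit a logistic model by gradient descent (a minimal sketch):
    w = np.zeros(2)
    for _ in range(2000):
        d = X @ w
        grad = -((y / (1. + np.exp(y * d)))[:, None] * X).mean(axis=0)
        w -= 0.5 * grad
    print(w)                             # the weight on b comes out negative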
Moreover, transforming coordinates, even linearly, can alter predictions. For example, if we shear two features together — say, by using cooktime-plus-preptime and cooktime as features rather than preptime and cooktime as features — this can impact the decision boundary. Of course, the decision boundary will look different because we’re in new coordinates; but we mean something more profound: if we train in old coordinates and then predict a datapoint represented in old coordinates, we may get a different prediction than if we train in new coordinates and then predict that same datapoint represented in new coordinates. See the figure, right panels.

Figure 10: Relations between feature statistics and optimal weights. Each of these six figures shows a different binary classification task along with a maximum-margin hypothesis. We shade the datapoints that achieve the margin. — Left: positive weights don’t imply positive correlation! — Right: presenting the same information in different coordinates alters predictions!
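Here is a tiny concrete instance, a minimal numpy sketch with made-up points: with one positive and one negative training point, the maximum-margin separator is simply the perpendicular bisector of the segment joining them, which we can write in closed form. Training before and after a shear gives different predictions for the same test point.

    import numpy as np

    # Max-margin separator for one positive and one negative point:
    # the perpendicular bisector of the segment joining them.
    def max_margin(pos, neg):
        w = pos - neg
        b = -w @ (pos + neg) / 2.
        return w, b

    shear = np.array([[1., 1.],
                      [0., 1.]])   # (preptime, cooktime) -> (prep+cook, cooktime)

    pos, neg = np.array([1., 1.]), np.array([-1., -1.])
    x_test = np.array([0.3, -0.25])

    w, b = max_margin(pos, neg)                     # train in old coordinates
    w2, b2 = max_margin(shear @ pos, shear @ neg)   # train in sheared coordinates

    print(np.sign(w @ x_test + b))                  # +1.0
    print(np.sign(w2 @ (shear @ x_test) + b2))      # -1.0: same datapoint,
                                                    # different prediction!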
humble models — Let’s modify logistic classification to allow for unknown unknowns. We’ll do this by allowing a classifier to allot probability mass not only among labels in Y but also to a special class ⋆ that means “no comment” or “alien input”. A logistic classifier always sets p_{y|x}[⋆|x] = 0, but other probability models may put nonzero mass on “no comment”. For example, consider:
                      logistic           perceptron         svm
    p_{y|x}[+1|x]     ⊕/(⊖+⊕)            ⊕·(⊖∧⊕)/2          ⊕·(⊖∧⊕/e)/2
    p_{y|x}[−1|x]     ⊖/(⊖+⊕)            ⊖·(⊖∧⊕)/2          ⊖·(⊖/e∧⊕)/2
    p_{y|x}[⋆|x]      1 − above = 0      1 − above          1 − above
    outliers          responsive         robust             robust
    inliers           sensitive          blind              sensitive
    accuracy bound    good               bad                good
    loss name         softplus(·)        srelu(·)           hinge(·)
    loss formula      log2(1 + e^(·))    max(0, ·) + 1      max(0, · + 1)
    update            1/(1 + e^{+yd})    step(−yd)          step(1 − yd)

Table 1: Three popular models for binary classification. Top rows: modeled chance given x that y = +1, −1, or ⋆. We use d = w · x, ⊕ = e^{+d/2}, ⊖ = e^{−d/2}, and a ∧ b = min(a, b) to save ink. Middle rows: all models respond to misclassifications; but are they robust to well-classified outliers? Sensitive to well-classified inliers? Bottom rows: for optimization, which we’ll discuss later, we list (negative log-probability) losses, each evaluated at (·) = −yd. An SGD step looks like w_{t+1} = w_t + η · update · y · x.
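To see the “no comment” mass appear, here is a minimal numpy sketch of the table’s top rows (symbols as in the caption; the value d = 1.0 is made up):

    import numpy as np

    # Top rows of Table 1: probability masses on +1, -1, and the
    # "no comment" class, as functions of d.
    def masses(d, model):
        up, down = np.exp(+d / 2.), np.exp(-d / 2.)       # ⊕ and ⊖
        if model == 'logistic':
            p_pos, p_neg = up / (down + up), down / (down + up)
        elif model == 'perceptron':
            p_pos = up   * min(down, up) / 2.
            p_neg = down * min(down, up) / 2.
        elif model == 'svm':
            p_pos = up   * min(down, up / np.e) / 2.
            p_neg = down * min(down / np.e, up) / 2.
        return p_pos, p_neg, 1. - p_pos - p_neg           # masses on +1, -1, ⋆

    for model in ('logistic', 'perceptron', 'svm'):
        print(model, masses(1.0, model))
    # logistic's ⋆ mass is (numerically) zero; the other two reserve
    # strictly positive mass for "no comment".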
MLE with the perceptron model or svm model minimizes the same thing, but with srelu(z) = max(0, z) + 1 or hinge(z) = max(0, z + 1) instead of softplus(z). Two essential properties of softplus are that: (a) it is convex◦ and (b) it upper bounds the step function. Note that srelu and hinge also enjoy these properties. Property (a) ensures that the optimization problem is relatively easy — under mild conditions, gradient descent will find a global minimum. By property (b), the total loss on a training set upper bounds the rate of erroneous classification on that training set. So loss is a surrogate for (in)accuracy: if the minimized loss is nearly zero, then the training accuracy is nearly 100%.◦

◦ A function is convex when its graph is bowl-shaped rather than wriggly. It’s easy to minimize convex functions by ‘rolling downhill’, since we’ll never get stuck in a local wriggle. Don’t worry about remembering or understanding this word.
◦ The perceptron satisfies (b) in a trivial way that yields a vacuous bound of 100% on the error rate.

So we have a family of related models: logistic, perceptron, and SVM. In Project 1 we’ll find hypotheses optimal with respect to the perceptron and SVM models (the latter under a historical name of pegasos), but soon we’ll focus mainly on logistic models, since they fit best with deep learning.
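Property (b) is easy to check numerically; a minimal sketch, treating each loss as a function of z = −yd as in Table 1. (The base-2 logarithm in softplus matters here: the natural-log version would dip below the step just right of z = 0.)

    import numpy as np

    softplus = lambda z: np.log2(1. + np.exp(z))   # logistic loss
    srelu    = lambda z: np.maximum(0., z) + 1.    # perceptron loss
    hinge    = lambda z: np.maximum(0., z + 1.)    # svm loss
    step     = lambda z: (z > 0).astype(float)     # 1 on errors, 0 otherwise

    z = np.linspace(-5., 5., 1001)                 # z = -y*d
    for loss in (softplus, srelu, hinge):
        print(np.all(loss(z) >= step(z)))          # True, True, True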
richer outputs: multiple classes — We’ve explored hypotheses f_W(x) = readout(W · featurize(x)) where W represents the linear-combination step we tune to data. We began with hard binary classification, wherein we map inputs to definite labels (say, y = cow or y = dog):

    readout(d) = cow if 0 < d else dog

I hope some of you felt bugged by the above formula’s asymmetry: W measures “cow-ishness minus dog-ishness” — why not the other way around? Let’s
describe the same set of hypotheses but in a more symmetrical way. A common
theme in mathematical problem solving is to trade irredundancy for symmetry
(or vice versa). So let’s posit both a W_cow and a W_dog. One measures “cow-ishness”; the other, “dog-ishness”. They assemble to give W, which is now a matrix of shape 2 × number-of-features. So d is now a list of 2 numbers: d_cow and d_dog. Now d_cow − d_dog plays the role that d used to play.
Then we can do hard classification by:

    readout(d) = argmax_y d_y
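A minimal sketch of this symmetrical setup, with made-up weights and features:

    import numpy as np

    # One weight row per class: W has shape (number of classes, number of features).
    W = np.array([[ 0.2, -1.0, 0.5],    # W_cow: measures "cow-ishness"
                  [-0.3,  0.8, 0.1]])   # W_dog: measures "dog-ishness"
    labels = ['cow', 'dog']

    features = np.array([1.0, 0.5, -2.0])   # featurize(x), hypothetical values
    d = W @ features                         # d = [d_cow, d_dog]
    print(labels[np.argmax(d)])              # readout(d) = argmax_y d_y; prints 'dog'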
For regression (predicting a number rather than a class label), we can instead use a single output d together with the identity readout:

    readout(d) = d
◦ VERY OPTIONAL PASSAGE
This is far from the only choice! For example, if we know that the true ys will always be positive, then readout(d) = exp(d) may make more sense. I’ve encountered a learning task (about alternating current in power lines) where what domain knowledge suggested — and what ended up working best — were trigonometric functions for featurization and readout! There are also many ways to return a distribution instead of a number. One way to do such soft one-output regression is to use normal distributions:◦

    readout(d) = “normal distribution with mean d and variance 25”

◦ Ask on Piazza about the interesting world of alternatives and how they influence learning!
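A minimal sketch of that probabilistic readout, assuming scipy is available (variance 25 means standard deviation 5):

    from scipy import stats

    def readout(d):
        # Return a distribution over outputs, not a single number:
        return stats.norm(loc=d, scale=5.)   # variance 25 => std dev 5

    dist = readout(3.2)
    print(dist.mean(), dist.pdf(0.))   # e.g. query the density at y = 0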
Food For Thought: Outputs can be richer than class labels or numbers. Consider phylogenetic trees, written like

    (dog.5mya.(cow.2mya.raccoon))

or

    ((chicken.63mya.snake).64mya.(cow)).120mya.(snail.1mya.slug)

That is, a tree is either a pair of trees together with a real number OR a species name. The numbers represent how long ago various clades diverged. Propose an architecture that, given a list of species, predicts a phylogenetic tree for those species. Don’t worry about featurization.
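That grammar (a tree is a pair of trees with a number, or else a species name) translates directly into code; a minimal sketch, with names of my own choosing:

    from dataclasses import dataclass
    from typing import Union

    @dataclass
    class Split:
        left: 'Tree'
        right: 'Tree'
        mya: float            # how long ago the two clades diverged

    Tree = Union[Split, str]  # a tree is a Split OR a species name

    # The first example tree from above:
    example = Split('dog', Split('cow', 'raccoon', mya=2.), mya=5.)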