AI Lec2.1 MLsupervised
Fangzhen Lin
Given that the feature names are fixed, there is no need to write them out explicitly, so a state can be represented by a feature vector: for an input x, its feature vector is φ(x) = [φ1(x), ..., φn(x)], where φi(x) is the value of the ith feature of x.
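For instance, a minimal Python sketch of mapping a raw input to such a feature vector (the input format and the three features chosen here are made up for illustration, not part of the lecture's running example):

# Sketch: map a raw input (here a short text) to a fixed-order feature vector.
def feature_vector(x):
    # phi1: length of the text, phi2: number of '!' characters,
    # phi3: 1 if the word "free" occurs, else 0.
    return [len(x), x.count("!"), 1 if "free" in x.lower() else 0]

print(feature_vector("FREE offer, act now!"))  # [20, 1, 1]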
Meaning: the input x is in the class if the output is 1, not in the class if
the output is -1, and don’t care/can’t tell/flip a coin if the output is 0.
Regression: just use score(x, w).
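As an illustrative sketch (the weights and feature values below are made up), the same linear score can be used both ways: take its sign for binary classification, or use it directly for regression.

# Sketch: one linear score used for binary classification (take the sign)
# and for regression (use the score directly).
def score(phi_x, w):
    return sum(wi * fi for wi, fi in zip(w, phi_x))

def classify(phi_x, w):
    s = score(phi_x, w)
    return 1 if s > 0 else (-1 if s < 0 else 0)   # 0 = can't tell

def predict(phi_x, w):                            # regression: just the score
    return score(phi_x, w)

w = [0.5, -1.0, 2.0]
print(classify([1.0, 2.0, 0.3], w))  # score is about -0.9, so -1
print(predict([1.0, 2.0, 0.3], w))   # about -0.9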
It’s called a linear predictor because the relationship between the score and the features is linear.
How good a linear predictor can be depends crucially on whether you have the right set of features - remember that it’s linear in the features.
Consider the input x to be your body temperature (in Celsius), and the output y to be a measure of how unhealthy you are: the smaller y is, the healthier you are, e.g.
y = |x − 37| or y = (x − 37)².
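Neither function is linear in x itself, but the second is linear in the right features: (x − 37)² = x² − 74x + 1369 is a linear function of φ(x) = [1, x, x²]. A small Python sketch of this check (the feature map and weights are chosen just for this illustration):

# (x - 37)**2 is not linear in x, but it is linear in the features
# phi(x) = [1, x, x**2], with weights w = [1369, -74, 1].
def phi(x):
    return [1.0, x, x * x]

w = [37 ** 2, -2 * 37, 1.0]   # 1369, -74, 1

def score(x, w):
    return sum(wi * fi for wi, fi in zip(w, phi(x)))

for temp in [36.0, 37.0, 39.5]:
    assert abs(score(temp, w) - (temp - 37) ** 2) < 1e-9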
Squared loss: Losssq(Dt, w) = Σ(x,y)∈Dt (score(x, w) − y)² = Σ(x,y)∈Dt (w · φ(x) − y)².
Consider an extreme case of Dt = {(1, 1), (1, 0), (1, 11)}, with φ(x) = x, so that score(x, w) = wx = w on every training instance.
Thus to compute
w∗ = arg minw Losssq(Dt, w),
we compute the derivative of Losssq(Dt, w) with respect to w and set it to 0:
(w − 1) + (w − 0) + (w − 11) = 0,
which gives 3w = 12, i.e. w∗ = 4.
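As a sanity check, here is a small numerical sketch (plain gradient descent with an arbitrary step size and iteration count, not the lecture's derivation) that recovers the same minimizer on this toy data:

# Sketch: minimize the squared loss on Dt = {(1, 1), (1, 0), (1, 11)} with
# phi(x) = x, so score(x, w) = w * x = w.  The exact minimizer is the
# label mean, w* = 4.
Dt = [(1, 1), (1, 0), (1, 11)]
w = 0.0
eta = 0.1
for _ in range(200):
    grad = sum(2 * (w * x - y) * x for x, y in Dt)
    w -= eta * grad
print(round(w, 3))  # 4.0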
Margin: Given an input x, its feature vector φ(x), the current weight vector w, and its correct label y, the margin on (x, y) is
margin(x, y, w) = [w · φ(x)] y
= score(x, w) y.
The margin is defined for binary classifiers, and it measures how correct (or incorrect) the current classification is.
A corresponding loss function, called the zero-one loss, is the following Boolean-valued function:
Loss0-1(x, y, w) = 1 if margin(x, y, w) ≤ 0, and 0 otherwise.
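A short Python sketch of both quantities (the weights and feature vectors below are made-up illustrations):

# Sketch: margin(x, y, w) = (w . phi(x)) * y, and the zero-one loss, which
# is 1 exactly when the margin is <= 0 (a misclassification).
def margin(phi_x, y, w):
    return sum(wi * fi for wi, fi in zip(w, phi_x)) * y

def zero_one_loss(phi_x, y, w):
    return 1 if margin(phi_x, y, w) <= 0 else 0

w = [1.0, -2.0]
print(margin([3.0, 1.0], +1, w), zero_one_loss([3.0, 1.0], +1, w))  # 1.0 0
print(margin([3.0, 1.0], -1, w), zero_one_loss([3.0, 1.0], -1, w))  # -1.0 1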
Same idea, but instead of doing a pass over the entire training set in each step, it goes through the training instances one at a time.
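A minimal sketch of such an update loop in Python (using the squared loss, φ(x) = x, and the earlier toy data; the step size and number of passes are arbitrary illustrative choices):

import random

# Sketch of SGD: update w using the gradient of the loss on a single
# training instance at a time, instead of summing over the whole set.
Dt = [(1, 1), (1, 0), (1, 11)]
w = 0.0
eta = 0.05
for _ in range(20):                 # a few passes over the data
    random.shuffle(Dt)
    for x, y in Dt:
        grad = 2 * (w * x - y) * x  # gradient on this one instance
        w -= eta * grad
print(w)                            # hovers near the minimizer w* = 4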
There are reports that, in practice, doing one pass over the training instances with SGD often performs comparably to taking ten passes over the data with GD.