AI Lec2.1 MLsupervised


CSIT5900

Machine Learning (Supervised Learning) - The Ideas

Fangzhen Lin

Department of Computer Science and Engineering


Hong Kong University of Science and Technology

Fangzhen Lin (HKUST) ML 1/1


ML tasks

Classification: classify the given input (data) into one of the
pre-defined classes. Examples: Is the mail spam/not spam? Is the
applicant qualified/not qualified? Which meat is the dish made of,
chicken/pork/beef/lamb/duck? Which digit does the handwritten one
most likely stand for?
- Binary (two-class, binomial) classification: decide whether the input is
  in the class or not.
- Multi-class classification: decide which class the input belongs to.
Regression: predict a value from the given input. Examples: predict
the stock price of a given company on a given date; predict the
electricity bill of a future month.



Feature Extraction

As in designing agents, instead of using the raw input, we
extract/compute features that are relevant to the task at hand.
To predict which direction the robot should move given the readings
from the sonar rings, we extract/compute 8 features s1, ..., s8
corresponding to the 8 surrounding directions.
To classify whether a given string is an email address, we extract features
such as "presence of @", "endsWith .com", "endsWith .org", etc.
What features to use for classifying a given email into spam/not
spam?
What features to use for classifying a given image (a matrix of pixels)
into pretty/not pretty?



Feature Vectors

If we have n features h1, ..., hn for an input x, then we can represent it
by a dictionary:

{h1: v1, ..., hn: vn}

where vi is the value of the feature hi. Example: for the string input
"abc@google.com", we may have:

{presence of @: 1, endsWith .com: 1, endsWith .org: 0}

Given that the feature names are fixed, there is no need to store them, so
the input can be represented by a feature vector: for an input x, its feature
vector is φ(x) = [φ1(x), ..., φn(x)], where φi(x) is the value of the ith
feature of x.
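The dictionary-to-vector representation above can be sketched in a few lines of Python; the three features mirror the email example, and the function name phi is an illustrative assumption rather than anything from the slides.

```python
# A minimal sketch of a feature map for the email-address example.
# The three features mirror the slide: presence of @, endsWith .com,
# endsWith .org.
def phi(x: str) -> list:
    return [
        1 if "@" in x else 0,            # presence of @
        1 if x.endswith(".com") else 0,  # endsWith .com
        1 if x.endswith(".org") else 0,  # endsWith .org
    ]

print(phi("abc@google.com"))  # -> [1, 1, 0]
```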



Weight Vector and Score

An n-dimensional weight vector w is an n-vector of real numbers.

Assuming features are also real-valued, the score of input x under
weight vector w is the weighted sum of its features:

score(x, w) = w · φ(x) = Σ_{i=1}^{n} w_i φ_i(x).
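The score is just a dot product, which can be sketched directly; the argument names are illustrative.

```python
# A sketch of the score as a plain dot product over the feature vector.
def score(phi_x, w):
    """w · phi(x): the weighted sum of the feature values."""
    return sum(wi * fi for wi, fi in zip(w, phi_x))

print(score([1.0, 2.0, 0.0], [0.5, -1.0, 3.0]))  # -> -1.5
```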



Linear Predictors

Prediction: given input x, predict output y.

Linear predictors: use score(x, w) = w · φ(x) to predict y.
- Binary classifier: output sign(score(x, w)), where

  sign(u) = 1 if u > 0, 0 if u = 0, −1 if u < 0.

  Meaning: the input x is in the class if the output is 1, not in the class if
  the output is -1, and don't care/can't tell/flip a coin if the output is 0.
- Regression: just use score(x, w).
It's called a linear predictor because the relationship between the
score and the features is linear.
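The binary classifier above can be sketched as follows; the weight values are illustrative.

```python
# A sketch of the binary linear classifier: the sign of the score.
def sign(u):
    return 1 if u > 0 else (-1 if u < 0 else 0)

def predict(phi_x, w):
    """Return 1 (in the class), -1 (not in the class), or 0 (can't tell)."""
    return sign(sum(wi * fi for wi, fi in zip(w, phi_x)))

w = [2.0, 0.5, -1.0]
print(predict([1, 1, 0], w))  # score 2.5 -> 1
print(predict([0, 0, 1], w))  # score -1.0 -> -1
```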



Linear Predictors

How good a linear predictor can be depends crucially on whether you have
the right set of features - remember that it's linear in the features.
Consider the input x to be your body temperature (in Celsius), and the
output y to be a health score: the smaller it is, the healthier you are.



Linear Predictors

Some possible equations for y:

y = |x − 37|  or  y = (x − 37)^2.

Neither of them is linear! So we need to invent some features so that the
output is linear in the features!
Consider

y = (x − 37)^2 = x^2 − 74x + 37^2.

So we introduce two features: φ1(x) = x and φ2(x) = x^2. Now y is linear
in φ(x)!
Question: What features would make exclusive-or linear?
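The (x − 37)^2 trick can be checked numerically; a constant feature 1 is an assumption added here to absorb the 37^2 term, which the slide leaves implicit.

```python
# Check that y = (x - 37)**2 is linear in the features [1, x, x**2]
# with weights [37**2, -74, 1]. The constant feature is an added assumption.
def phi(x):
    return [1.0, x, x * x]

w = [37.0 ** 2, -74.0, 1.0]  # y = 37^2 - 74x + x^2

def score(x):
    return sum(wi * fi for wi, fi in zip(w, phi(x)))

for x in [35.0, 37.0, 40.5]:
    assert abs(score(x) - (x - 37.0) ** 2) < 1e-9
print("y is linear in phi(x)")
```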



Supervised Learning

Goal: learn a function of x: y = f (x).


Supervised learning: learn a function of φ(x): y = f(φ(x)) from a
training set Dt of feature vectors and their labels:

Dt = {[φ(x1), y1], ..., [φ(xk), yk]}.

Learning linear predictor: learn a weight vector w so that


sign(w · φ(x)) (binary classifier) or w · φ(x) (regression) agrees well
with the training set.



Data

Supervised learning is only as good as your training set.

A good training set needs to have good coverage.
Learning is more than just fitting the data in the training set; it
uses the data to come up with a generalization.
What good is a perfect match like the following (rote learning): for
any x, if [φ(x), y] is in Dt for some y, then output y; else pick a
random answer (or say don't know).
Split your data into three sets: a training set, for training your function;
a validation set, for fine-tuning your trained function; and a test set, for
testing how good your learned function is.



Learning as Optimization

Recall that a linear predictor uses sign(score(x, w)) (binary classifier) or
just score(x, w) (regression) to make a prediction:

score(x, w) = w · φ(x) = Σ_{i=1}^{n} w_i φ_i(x).

So it all comes down to learning the weight vector from Dt.

One way is to cast it as the problem of minimizing the training loss:
compute the weight vector w* so that

w* = arg min_w Loss(Dt, w).



Loss

Loss(Dt, w) is supposed to measure the "loss", the "difference" between
the desired output and the actual output using w, over the training
set Dt. It is normally the average of the individual losses:

Loss(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} Loss(x, y, w),

where Loss(x, y, w) is the "loss", or the "difference", between the desired
output y and the output predicted using w.



Loss

For regression, two popular loss functions are

Absolute difference loss:

Loss_abs(x, y, w) = |score(x, w) − y|.

Squared loss:

Loss_sq(x, y, w) = (score(x, w) − y)^2.
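Both losses are one-liners on a single example; score_xw below stands for the already-computed prediction w · φ(x).

```python
# The two regression losses on one example; score_xw is the prediction.
def loss_abs(score_xw, y):
    return abs(score_xw - y)

def loss_sq(score_xw, y):
    return (score_xw - y) ** 2

print(loss_abs(5.0, 3.0), loss_sq(5.0, 3.0))  # -> 2.0 4.0
```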



Loss

Consider an extreme case of Dt = {(1, 1), (1, 0), (1, 11)}, and

w* = arg min_w Loss(Dt, w).

Under Loss_abs, w* = 1 (the median). The median is less sensitive to
outliers.
Under Loss_sq, w* = 12/3 = 4 (the mean). The mean tries to take all the
data into account. This is a popular loss function because it is easier to
work with.
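The median/mean claim can be verified numerically on this dataset; the grid search below is an illustrative sketch (since each feature value is 1, score(x, w) = w).

```python
# Numerical check on Dt = {(1,1), (1,0), (1,11)}: the abs loss is minimized
# at the median of the labels (w = 1), the squared loss at their mean (w = 4).
ys = [1.0, 0.0, 11.0]

def avg_loss(w, power):
    return sum(abs(w - y) ** power for y in ys) / len(ys)

grid = [i / 100 for i in range(0, 1201)]        # candidate w in [0, 12]
w_abs = min(grid, key=lambda w: avg_loss(w, 1))
w_sq = min(grid, key=lambda w: avg_loss(w, 2))
print(w_abs, w_sq)  # -> 1.0 4.0
```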



Loss
Consider Loss_sq(Dt, w):

Loss_sq(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} Loss_sq(x, y, w)
              = (1/3) Σ_{(x,y) ∈ Dt} (score(x, w) − y)^2
              = (1/3)[(score(1, w) − 1)^2 + (score(1, w) − 0)^2 + (score(1, w) − 11)^2]
              = (1/3)[(w − 1)^2 + w^2 + (w − 11)^2]

Thus to compute
w* = arg min_w Loss(Dt, w),
we take the derivative of Loss_sq(Dt, w) and set it to 0:

(w − 1) + w + (w − 11) = 0,

which gives w* = 12/3 = 4.



Loss for Binary Classification

Margin: given input x, its feature vector φ(x), the current weight vector
w, and its correct label y ∈ {−1, +1}, the margin on (x, y) is

margin(x, y, w) = [w · φ(x)]y = score(x, w)y.

The margin is for binary classifiers, and measures how correct (or
incorrect) the current classification is.
A corresponding loss function, called the zero-one loss, is the following
Boolean-valued function:

Loss_{0-1}(x, y, w) = 1[margin(x, y, w) ≤ 0].
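The margin and zero-one loss can be sketched directly; the weight values below are illustrative.

```python
# A sketch of the margin and the zero-one loss; labels y are in {-1, +1}.
def margin(phi_x, y, w):
    return sum(wi * fi for wi, fi in zip(w, phi_x)) * y

def loss_01(phi_x, y, w):
    # 1 when the prediction is wrong (or on the boundary), 0 when correct
    return 1 if margin(phi_x, y, w) <= 0 else 0

w = [2.0, -1.0]
print(loss_01([1.0, 0.0], +1, w), loss_01([1.0, 0.0], -1, w))  # -> 0 1
```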



Algorithms
Once a loss function is chosen, it is time to compute the weight vector
that minimizes it over the training set. This is a computationally hard
problem, and we can try a local search ("hill climbing" for maximizing and
"descending a canyon" for minimizing) for it:
Initialize w (e.g. to 0);
while C {
    w = successor(w);
    update C; }
where C is the condition for continuing to update the weights; e.g. it can
be i ≤ N for some fixed N, or a condition about the desired accuracy.
successor(w) returns a "better" w and is computed using the training
instances and the loss function.
In ML, two popular ways of "descending the canyon" are gradient descent
and stochastic gradient descent.



Gradient descent (GD)

In gradient descent, we use the entire training set Dt to compute the
gradient of the loss function at the current weight vector, and decrease
the weights by a "fraction" of it:

successor(w) = w − c∇_w Loss(Dt, w),

where c is the learning rate or step size.

Intuitively, ∇_w Loss(Dt, w) is the direction that increases the loss the most.



Gradient descent - least squares loss
Recall

Loss_sq(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} Loss_sq(x, y, w)
              = (1/|Dt|) Σ_{(x,y) ∈ Dt} (score(x, w) − y)^2
              = (1/|Dt|) Σ_{(x,y) ∈ Dt} (w · φ(x) − y)^2

Now use the chain rule to compute the gradient:

∇_w Loss_sq(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} 2(w · φ(x) − y)φ(x).

This is slow when Dt is large!
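The update rule can be sketched as a small batch-GD routine; the toy dataset, learning rate, and step count below are illustrative assumptions.

```python
# A sketch of batch gradient descent with the least-squares gradient above.
def gd_least_squares(data, n_features, c=0.1, n_steps=100):
    """data: list of (phi_x, y) pairs; returns the learned weight vector."""
    w = [0.0] * n_features
    for _ in range(n_steps):
        grad = [0.0] * n_features
        for phi_x, y in data:                 # full pass over Dt every step
            err = sum(wi * fi for wi, fi in zip(w, phi_x)) - y
            for i, fi in enumerate(phi_x):
                grad[i] += 2.0 * err * fi
        w = [wi - c * gi / len(data) for wi, gi in zip(w, grad)]
    return w

# Recovering y = 3x from two points; w converges to about [3.0].
print(gd_least_squares([([1.0], 3.0), ([2.0], 6.0)], n_features=1))
```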


Stochastic gradient descent (SGD)

Same idea, but instead of doing a pass over the entire training set in
each step, it goes through the training instances one at a time:

successor(w, x, y) = w − c∇_w Loss(x, y, w).

For each (x, y) ∈ Dt:
    w = successor(w, x, y)

There are reports that, in practice, doing one pass over the training
instances with SGD often performs comparably to taking ten passes over
the data with GD.
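The per-instance update can be sketched as follows; the toy data, learning rate, and number of passes are illustrative assumptions.

```python
# A sketch of SGD with squared loss: one weight update per training instance.
def sgd_pass(data, w, c=0.1):
    for phi_x, y in data:
        err = sum(wi * fi for wi, fi in zip(w, phi_x)) - y
        w = [wi - c * 2.0 * err * fi for wi, fi in zip(w, phi_x)]
    return w

w = [0.0]
for _ in range(50):  # 50 passes over the toy data
    w = sgd_pass([([1.0], 3.0), ([2.0], 6.0)], w)
print(w)  # converges to about [3.0]
```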



(Artificial) Neural Networks
An (artificial) neural network is a directed graph, most often acyclic.
The sources (nodes with no incoming arcs) are inputs and the sinks (no
outgoing arcs) are outputs. The internal nodes represent "hidden"
features.

Figure: From Percy Liang’s AI note


