AI Lec2.1 MLsupervised


CSIT5900

Machine Learning (Supervised Learning) - The Ideas

Fangzhen Lin

Department of Computer Science and Engineering


Hong Kong University of Science and Technology

Fangzhen Lin (HKUST) ML 1/1


ML tasks

Classification: classify the given input (data) into one of the
pre-defined classes. Examples: Is the mail spam/not spam? Is the
applicant qualified/not qualified? Which meat is the dish made of,
chicken/pork/beef/lamb/duck? Which digit does the handwritten one
most likely stand for?
- Binary (two-class, binomial) classification: decide whether the input is
  in the class or not.
- Multi-class classification: decide which class the input belongs to.
Regression: predict a value from the given input. Examples: predict
the stock price of a given company on a given date; predict the
electricity bill of a future month.



Feature Extraction

As in designing agents, instead of using the raw input, we
extract/compute features that are relevant to the task at hand.
To predict which direction the robot should move given the readings
from the sonar rings, we extract/compute 8 features s1, ..., s8
corresponding to the 8 surrounding directions.
To classify whether a given string is an email address, we extract features
such as "presence of @", "endsWith .com", "endsWith .org", etc.
What features to use for classifying a given email into spam/not
spam?
What features to use for classifying a given image (a matrix of pixels)
into pretty/not pretty?



Feature Vectors

If we have n features h1, ..., hn for an input x, then we can represent it
by a dictionary:

{h1: v1, ..., hn: vn}

where vi is the value of the feature hi. Example: for the string input
"abc@google.com", we may have:

{presence of @: 1, endsWith .com: 1, endsWith .org: 0}

Given that the feature names are fixed, there is no need to store them, so
the input can be represented by a feature vector: for an input x, its feature
vector is φ(x) = [φ1(x), ..., φn(x)], where φi(x) is the value of the ith
feature of x.
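The dictionary-to-vector representation above can be sketched in a few lines of Python; the three features mirror the email example, and the function name phi is an illustrative assumption rather than anything from the slides.

```python
# A minimal sketch of a feature map for the email-address example.
# The three features mirror the slide: presence of @, endsWith .com,
# endsWith .org.
def phi(x: str) -> list:
    return [
        1 if "@" in x else 0,            # presence of @
        1 if x.endswith(".com") else 0,  # endsWith .com
        1 if x.endswith(".org") else 0,  # endsWith .org
    ]

print(phi("abc@google.com"))  # -> [1, 1, 0]
```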



Weight Vector and Score

An n-dimensional weight vector w is an n-vector of real numbers.

Assuming features are also real-valued, the score of input x under
weight vector w is the weighted sum of its features:

score(x, w) = w · φ(x) = Σ_{i=1}^{n} w_i φ_i(x).
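The score is just a dot product, which can be sketched directly; the argument names are illustrative.

```python
# A sketch of the score as a plain dot product over the feature vector.
def score(phi_x, w):
    """w · phi(x): the weighted sum of the feature values."""
    return sum(wi * fi for wi, fi in zip(w, phi_x))

print(score([1.0, 2.0, 0.0], [0.5, -1.0, 3.0]))  # -> -1.5
```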



Linear Predictors

Prediction: given input x, predict output y.

Linear predictors: use score(x, w) = w · φ(x) to predict y.
- Binary classifier: output sign(score(x, w)), where

  sign(u) = 1 if u > 0, 0 if u = 0, −1 if u < 0.

  Meaning: the input x is in the class if the output is 1, not in the class if
  the output is -1, and don't care/can't tell/flip a coin if the output is 0.
- Regression: just use score(x, w).
It's called a linear predictor because the relationship between the
score and the features is linear.
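The binary classifier above can be sketched as follows; the weight values are illustrative.

```python
# A sketch of the binary linear classifier: the sign of the score.
def sign(u):
    return 1 if u > 0 else (-1 if u < 0 else 0)

def predict(phi_x, w):
    """Return 1 (in the class), -1 (not in the class), or 0 (can't tell)."""
    return sign(sum(wi * fi for wi, fi in zip(w, phi_x)))

w = [2.0, 0.5, -1.0]
print(predict([1, 1, 0], w))  # score 2.5 -> 1
print(predict([0, 0, 1], w))  # score -1.0 -> -1
```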



Linear Predictors

How good a linear predictor can be depends crucially on whether you have
the right set of features - remember that it's linear in the features.
Consider the input x to be your body temperature (in Celsius), and the
output y to be a health score: the smaller it is, the healthier you are.



Linear Predictors

Some possible equations for y:

y = |x − 37|  or  y = (x − 37)^2.

Neither of them is linear! So we need to invent some features so that the
output is linear in the features!
Consider

y = (x − 37)^2 = x^2 − 74x + 37^2.

So we introduce two features: φ1(x) = x and φ2(x) = x^2. Now y is linear
in φ(x)!
Question: What features would make exclusive-or linear?
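The (x − 37)^2 trick can be checked numerically; a constant feature 1 is an assumption added here to absorb the 37^2 term, which the slide leaves implicit.

```python
# Check that y = (x - 37)**2 is linear in the features [1, x, x**2]
# with weights [37**2, -74, 1]. The constant feature is an added assumption.
def phi(x):
    return [1.0, x, x * x]

w = [37.0 ** 2, -74.0, 1.0]  # y = 37^2 - 74x + x^2

def score(x):
    return sum(wi * fi for wi, fi in zip(w, phi(x)))

for x in [35.0, 37.0, 40.5]:
    assert abs(score(x) - (x - 37.0) ** 2) < 1e-9
print("y is linear in phi(x)")
```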



Supervised Learning

Goal: learn a function of x: y = f (x).


Supervised learning: learn a function of φ(x): y = f(φ(x)) from a
training set Dt of feature vectors and their labels:

Dt = {[φ(x1), y1], ..., [φ(xk), yk]}.

Learning linear predictor: learn a weight vector w so that


sign(w · φ(x)) (binary classifier) or w · φ(x) (regression) agrees well
with the training set.



Data

Supervised learning is only as good as your training set.

A good training set needs to have good coverage.
Learning is more than just fitting the data in the training set; it
uses the data to come up with a generalization.
What good is a perfect match like the following (rote learning): for
any x, if [φ(x), y] is in Dt for some y, then output y; else pick a
random answer (or say don't know).
Split your data into three sets: a training set, for training your function;
a validation set, for fine-tuning your trained function; and a test set, for
testing how good your learned function is.



Learning as Optimization

Recall that a linear predictor uses sign(score(x, w)) (binary classifier) or
just score(x, w) (regression) to make a prediction:

score(x, w) = w · φ(x) = Σ_{i=1}^{n} w_i φ_i(x).

So it all comes down to learning the weight vector from Dt.

One way is to cast it as the problem of minimizing the training loss:
compute the weight vector w* so that

w* = arg min_w Loss(Dt, w).



Loss

Loss(Dt, w) is supposed to measure the "loss", the "difference" between
the desired output and the actual output using w, over the training
set Dt. It is normally the average of the individual losses:

Loss(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} Loss(x, y, w),

where Loss(x, y, w) is the "loss", or the "difference", between the desired
output y and the output predicted using w.



Loss

For regression, two popular loss functions are

Absolute difference loss:

Loss_abs(x, y, w) = |score(x, w) − y|.

Squared loss:

Loss_sq(x, y, w) = (score(x, w) − y)^2.
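Both losses are one-liners on a single example; score_xw below stands for the already-computed prediction w · φ(x).

```python
# The two regression losses on one example; score_xw is the prediction.
def loss_abs(score_xw, y):
    return abs(score_xw - y)

def loss_sq(score_xw, y):
    return (score_xw - y) ** 2

print(loss_abs(5.0, 3.0), loss_sq(5.0, 3.0))  # -> 2.0 4.0
```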



Loss

Consider an extreme case of Dt = {(1, 1), (1, 0), (1, 11)}, and

w* = arg min_w Loss(Dt, w).

Under Loss_abs, w* = 1 (the median). The median is less sensitive to
outliers.
Under Loss_sq, w* = 12/3 = 4 (the mean). The mean tries to take all the
data into account. This is a popular loss function because it is easier to
work with.
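The median/mean claim can be verified numerically on this dataset; the grid search below is an illustrative sketch (since each feature value is 1, score(x, w) = w).

```python
# Numerical check on Dt = {(1,1), (1,0), (1,11)}: the abs loss is minimized
# at the median of the labels (w = 1), the squared loss at their mean (w = 4).
ys = [1.0, 0.0, 11.0]

def avg_loss(w, power):
    return sum(abs(w - y) ** power for y in ys) / len(ys)

grid = [i / 100 for i in range(0, 1201)]        # candidate w in [0, 12]
w_abs = min(grid, key=lambda w: avg_loss(w, 1))
w_sq = min(grid, key=lambda w: avg_loss(w, 2))
print(w_abs, w_sq)  # -> 1.0 4.0
```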



Loss
Consider Loss_sq(Dt, w):

Loss_sq(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} Loss_sq(x, y, w)
              = (1/3) Σ_{(x,y) ∈ Dt} (score(x, w) − y)^2
              = (1/3)[(score(1, w) − 1)^2 + (score(1, w) − 0)^2 + (score(1, w) − 11)^2]
              = (1/3)[(w − 1)^2 + w^2 + (w − 11)^2]

Thus to compute
w* = arg min_w Loss(Dt, w),
we take the derivative of Loss_sq(Dt, w) and set it to 0:

(w − 1) + w + (w − 11) = 0,

which gives w* = 12/3 = 4.



Loss for Binary Classification

Margin: given input x, its feature vector φ(x), the current weight vector
w, and its correct label y ∈ {−1, +1}, the margin on (x, y) is

margin(x, y, w) = [w · φ(x)]y = score(x, w)y.

The margin is for binary classifiers, and measures how correct (or
incorrect) the current classification is.
A corresponding loss function, called the zero-one loss, is the following
Boolean-valued function:

Loss_{0-1}(x, y, w) = 1[margin(x, y, w) ≤ 0].
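The margin and zero-one loss can be sketched directly; the weight values below are illustrative.

```python
# A sketch of the margin and the zero-one loss; labels y are in {-1, +1}.
def margin(phi_x, y, w):
    return sum(wi * fi for wi, fi in zip(w, phi_x)) * y

def loss_01(phi_x, y, w):
    # 1 when the prediction is wrong (or on the boundary), 0 when correct
    return 1 if margin(phi_x, y, w) <= 0 else 0

w = [2.0, -1.0]
print(loss_01([1.0, 0.0], +1, w), loss_01([1.0, 0.0], -1, w))  # -> 0 1
```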



Algorithms
Once a loss function is chosen, it is time to compute the weight vector
that minimizes it over the training set. This is a computationally hard
problem, and we can try a local search ("hill climbing" for maximizing and
"descending a canyon" for minimizing) for it:
Initialize w (e.g. to 0);
while C {
    w = successor(w);
    update C; }
where C is the condition for continuing to update the weights; e.g. it can
be i ≤ N for some fixed N, or a condition about the desired accuracy.
successor(w) returns a "better" w and is computed using the training
instances and the loss function.
In ML, two popular ways of "descending the canyon" are gradient descent
and stochastic gradient descent.



Gradient descent (GD)

In gradient descent, we use the entire training set Dt to compute the
gradient of the loss function at the current weight vector, and decrease
the weights by a "fraction" of it:

successor(w) = w − c∇_w Loss(Dt, w),

where c is the learning rate or step size.

Intuitively, ∇_w Loss(Dt, w) is the direction that increases the loss the most.



Gradient descent - least squares loss
Recall

Loss_sq(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} Loss_sq(x, y, w)
              = (1/|Dt|) Σ_{(x,y) ∈ Dt} (score(x, w) − y)^2
              = (1/|Dt|) Σ_{(x,y) ∈ Dt} (w · φ(x) − y)^2

Now use the chain rule to compute the gradient:

∇_w Loss_sq(Dt, w) = (1/|Dt|) Σ_{(x,y) ∈ Dt} 2(w · φ(x) − y)φ(x).

This is slow when Dt is large!
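The update rule can be sketched as a small batch-GD routine; the toy dataset, learning rate, and step count below are illustrative assumptions.

```python
# A sketch of batch gradient descent with the least-squares gradient above.
def gd_least_squares(data, n_features, c=0.1, n_steps=100):
    """data: list of (phi_x, y) pairs; returns the learned weight vector."""
    w = [0.0] * n_features
    for _ in range(n_steps):
        grad = [0.0] * n_features
        for phi_x, y in data:                 # full pass over Dt every step
            err = sum(wi * fi for wi, fi in zip(w, phi_x)) - y
            for i, fi in enumerate(phi_x):
                grad[i] += 2.0 * err * fi
        w = [wi - c * gi / len(data) for wi, gi in zip(w, grad)]
    return w

# Recovering y = 3x from two points; w converges to about [3.0].
print(gd_least_squares([([1.0], 3.0), ([2.0], 6.0)], n_features=1))
```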


Stochastic gradient descent (SGD)

Same idea, but instead of doing a pass over the entire training set in
each step, it goes through the training instances one at a time:

successor(w, x, y) = w − c∇_w Loss(x, y, w).

For each (x, y) ∈ Dt:
    w = successor(w, x, y)

There are reports that, in practice, doing one pass over the training
instances with SGD often performs comparably to taking ten passes over
the data with GD.
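The per-instance update can be sketched as follows; the toy data, learning rate, and number of passes are illustrative assumptions.

```python
# A sketch of SGD with squared loss: one weight update per training instance.
def sgd_pass(data, w, c=0.1):
    for phi_x, y in data:
        err = sum(wi * fi for wi, fi in zip(w, phi_x)) - y
        w = [wi - c * 2.0 * err * fi for wi, fi in zip(w, phi_x)]
    return w

w = [0.0]
for _ in range(50):  # 50 passes over the toy data
    w = sgd_pass([([1.0], 3.0), ([2.0], 6.0)], w)
print(w)  # converges to about [3.0]
```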



(Artificial) Neural Networks
An (artificial) neural network is a directed graph, most often acyclic.
The sources (nodes with no incoming arcs) are inputs and the sinks (no
outgoing arcs) are outputs. The internal nodes represent "hidden"
features.

Figure: From Percy Liang’s AI note


