Review of Lecture 17

Occam's Razor
"The simplest model that fits the data is also the most plausible."
    complexity of h ↔ complexity of H
    unlikely event significant if it happens

Sampling bias
[Figure: P(x) of the training distribution versus the testing distribution]

Data snooping
[Figure: cumulative profit % over 500 days of trading, with snooping versus no snooping]
Lecture 18: Epilogue
Outline

Bayesian learning
Aggregation methods
Acknowledgments
semi-supervised learning, overfitting, Gaussian processes, distribution-free, collaborative filtering, deterministic noise, linear regression, VC dimension, nonlinear transformation, decision trees, data snooping, sampling bias, Q learning, SVM, learning curves, mixture of experts, neural networks, no free lunch, ensemble learning, types of learning, error measures, is learning feasible?, clustering, regularization, kernel methods, soft-order constraint, weight decay, Occam's razor, Boltzmann machines
The map

THEORY: VC, bias-variance, complexity, bayesian
TECHNIQUES:
    models: linear, neural networks, SVM, nearest neighbors, RBF, gaussian processes, SVD, graphical models
    methods: regularization, validation, aggregation, input processing
PARADIGMS: supervised, unsupervised, reinforcement, active, online
Bayesian learning
Probabilistic approach

P(D | h = f) decides which h (likelihood)

How about P(h = f | D)?

[Figure: the learning diagram - unknown target function f: X → Y plus noise, unknown input distribution P(x), data set D = (x1, y1), ..., (xN, yN), learning algorithm, hypothesis set H, final hypothesis g: X → Y with g(x) ≈ f(x)]
The prior

P(h = f | D) = P(D | h = f) P(h = f) / P(D) ∝ P(D | h = f) P(h = f)

P(h = f) is the prior
P(h = f | D) is the posterior
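To make the formula concrete, here is a minimal sketch (not from the lecture) with a finite hypothesis set: hypothetical coin-bias hypotheses, an assumed prior, and the posterior obtained by normalizing likelihood × prior.

```python
import numpy as np

# Hypothetical setup: H = three candidate coin biases, D = observed flips.
# Posterior: P(h = f | D) ∝ P(D | h = f) P(h = f), normalized by P(D).
H = np.array([0.3, 0.5, 0.7])        # finite hypothesis set (heads probability)
prior = np.array([0.25, 0.5, 0.25])  # assumed prior P(h = f)

D = [1, 1, 0, 1, 1]                  # observed flips (1 = heads)
heads = sum(D)
tails = len(D) - heads

likelihood = H**heads * (1 - H)**tails   # P(D | h = f)
posterior = likelihood * prior           # numerator of Bayes' rule
posterior /= posterior.sum()             # dividing by P(D) normalizes over H

for h, p in zip(H, posterior):
    print(f"P(h = {h} | D) = {p:.3f}")
```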
Example of a prior

Consider a perceptron: h is determined by w = w0, w1, ..., wd

A possible prior on w: each wi is independent, uniform over [-1, 1]

This determines a prior over h: P(h = f)

Given D, we can compute P(D | h = f)

Putting them together, we get P(h = f | D) ∝ P(h = f) P(D | h = f)
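A minimal Monte Carlo sketch of this recipe, under assumptions of my own: a toy data set, and a likelihood in which each label agrees with sign(w·x) with probability 1 − ε. Sampling w uniformly over [-1, 1]^(d+1) and weighting each sample by its likelihood approximates the posterior.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, M = 2, 20, 5000      # input dimension, data set size, prior samples

# Toy data set D (hypothetical); x0 = 1 is the bias coordinate.
X = np.hstack([np.ones((N, 1)), rng.uniform(-1, 1, (N, d))])
y = np.sign(X @ np.array([0.1, 0.8, -0.5]))

# Prior on w: each w_i independent, uniform over [-1, 1].
W = rng.uniform(-1, 1, (M, d + 1))

# Assumed noise model: each label agrees with sign(w.x) w.p. 1 - eps.
eps = 0.1
agree = (np.sign(W @ X.T) == y).sum(axis=1)
log_lik = agree * np.log(1 - eps) + (N - agree) * np.log(eps)

# Posterior weights over the sampled hypotheses: the prior is uniform,
# so the posterior is proportional to the likelihood alone.
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

print("posterior mean of w:", np.round(post @ W, 2))
```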
A prior is an assumption

Even the most "neutral" prior:

    x is unknown  ⟹  x is random, uniform
    [Figure: uniform P(x) over the range of x]

The true equivalent would be:

    x is unknown  ⟹  x is random, with distribution δ(x − a) for an unknown a
    [Figure: the delta function δ(x − a)]
If we knew the prior ...

... we could compute P(h = f | D) for every h ∈ H

⟹ we can derive E(h(x)) for every x
⟹ we can derive the error bar for every x
When is Bayesian learning justified?

1. The prior is valid
2. The prior is irrelevant
Aggregation methods
What is aggregation?

Combining different solutions h1, h2, ..., hT that were trained on D:

[Figure: several solutions combined into one final hypothesis]

Also known as ensemble learning and boosting
In a 2-layer model, all the units learn jointly:

[Figure: one learning algorithm fits all the units to the training data at once]

In aggregation, they learn independently:

[Figure: each unit is fit to the training data by its own learning algorithm, then the outputs are combined]
Blending:

[Figure: the training data fed to a learning algorithm producing a solution]
Decorrelation - boosting

Create h1, ..., ht, ... sequentially: make ht decorrelated with the previous h's:

[Figure: training data → learning algorithm → ht]

Emphasize points in D that were misclassified
Choose the weight of ht based on E_in(ht)
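One well-known instance of this recipe is AdaBoost, sketched below under assumptions of my own (decision stumps as the base learner, labels in {-1, +1}). The weight αt = ½ ln((1 − εt)/εt) and the reweighting factor e^(−αt y ht(x)) are AdaBoost's standard choices for "choose the weight of ht from its error" and "emphasize the misclassified points".

```python
import numpy as np

def train_stump(X, y, w):
    """Best weighted decision stump: (weighted error, (feature, threshold, sign))."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = s * np.sign(X[:, j] - thr + 1e-12)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best_params = err, (j, thr, s)
    return best_err, best_params

def stump_predict(params, X):
    j, thr, s = params
    return s * np.sign(X[:, j] - thr + 1e-12)

def adaboost(X, y, T):
    """y in {-1, +1}. Returns a list of (alpha_t, stump_t)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                   # uniform emphasis to start
    models = []
    for _ in range(T):
        eps, params = train_stump(X, y, w)    # error of h_t under current emphasis
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))  # weight of h_t
        pred = stump_predict(params, X)
        w *= np.exp(-alpha * y * pred)        # emphasize points h_t got wrong
        w /= w.sum()
        models.append((alpha, params))
    return models

def predict(models, X):
    return np.sign(sum(a * stump_predict(p, X) for a, p in models))

# Usage (hypothetical data):
# models = adaboost(X_train, y_train, T=50)
# y_pred = predict(models, X_test)
```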
Blending h1, h2, ..., hT:

    g(x) = Σ_{t=1}^{T} αt ht(x)

Principled choice of the αt's: the pseudo-inverse (least-squares solution)

Some αt's can come out negative

Most valuable ht in the blend?
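A sketch of the pseudo-inverse choice of the αt's, with assumptions of my own: the trained ht's are callables, and their predictions on a held-out set (X_agg, y_agg are hypothetical names) are stacked as the columns of a matrix, so the least-squares αt's fall out of np.linalg.pinv.

```python
import numpy as np

def blend_weights(hs, X_agg, y_agg):
    """Least-squares alphas for g(x) = sum_t alpha_t * h_t(x).

    hs:    list of trained hypotheses (each callable on a batch of inputs)
    X_agg: inputs of a set held out from training the h_t's
    y_agg: targets of that held-out set
    """
    A = np.column_stack([h(X_agg) for h in hs])  # column t = h_t's predictions
    return np.linalg.pinv(A) @ y_agg             # pseudo-inverse solution

def blend_predict(hs, alphas, X):
    return np.column_stack([h(X) for h in hs]) @ alphas
```

Nothing in the least-squares solution forces a convex combination, which is why some αt's can come out negative: a hypothesis with a negative coefficient still contributes by correcting the other ht's.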
Acknowledgments
Course content

Professor Malik Magdon-Ismail (RPI)
Professor Hsuan-Tien Lin (NTU)
Course staff

Carlos Gonzalez (Head TA)
Ron Appel
Costis Sideris
Doris Xin
Caltech support

IST - Mathieu Desbrun and Mani Chandy
E&AS Division - Ares Rosakis and Melany Hunt
Provost's Office - Ed Stolper
Many others

Caltech TAs and staff members
Caltech alumni and the Alumni Association
Colleagues all over the world
Faiza A. Ibrahim