INTRODUCTION TO MACHINE LEARNING VIA
LINEAR REGRESSION
Study of key concepts based on supervised learning.
3.1 Supervised learning
Problem formulation: given a training set D containing N training points (x_n, t_n), n = 1, ..., N.
- x_n are independent variables (covariates, domain points, explanatory variables)
- t_n are dependent variables (labels, responses)
[Figure: scatter plot of a training set of N = 10 points (x_n, t_n).]
Goal: predict t for an unobserved domain point x. To do so we need to make assumptions about the mechanism generating the data (inductive bias); this is the difference between memorizing and learning.
3.2 Statistical inference
Task: predict the r.v. t given the observation x and the known joint distribution p(x, t).
Define a non-negative loss function ℓ(t, t̂): the cost (loss, risk) if the correct value is t and the estimate is t̂.
ℓ_q loss: ℓ_q(t, t̂) = |t − t̂|^q
Examples: quadratic loss ℓ_2(t, t̂) = (t − t̂)², 0-1 loss ℓ_0(t, t̂) = 1(t̂ ≠ t).
Generalization loss (risk): average over p(x, t) of the loss of the prediction t̂(x):
L_p(t̂) = E_{(x,t)~p(x,t)}[ ℓ(t, t̂(x)) ]
The optimal prediction t̂*(x) is obtained by minimizing the posterior expected loss:
t̂*(x) = argmin_{t̂} E_{t~p(t|x)}[ ℓ(t, t̂) | x = x ]
Only the posterior distribution p(t|x) is needed; the joint distribution p(x, t) is not required.
- For ℓ_2: t̂*(x) = E[t | x = x], since
  E_{t~p(t|x)}[(t − t̂)²] = E[t²|x] − 2 t̂ E[t|x] + t̂²
  and d/dt̂ (...) = −2 E[t|x] + 2 t̂ = 0  ⇒  t̂ = E[t|x].
Example: p(t|x) = 0.8 δ(t − x) + 0.2 δ(t + x)  ⇒  t̂*(x) = 0.8·x + 0.2·(−x) = 0.6x.
The performance of a predictor t̂(x) is measured by the difference between L_p(t̂) and the minimum generalization loss L_p(t̂*) of the optimal predictor.
In the following, three learning approaches for t̂(x) are discussed: the frequentist approach, the Bayesian approach, and the MDL approach.
3.3 Frequentist approach
Assumption: the training data points (x_n, t_n) ∈ D are drawn i.i.d. from a true but unknown distribution p(x, t): (x_n, t_n) ~ p(x, t), n = 1, ..., N.
Since the distribution is unknown, the optimal inference strategy from Section 3.2 cannot be applied.
Two ways to approach the problem of the unknown distribution:
- Separate learning and inference: learn an approximation p̂(t|x) of the distribution based on the data and use
  t̂_D(x) = argmin_{t̂} E_{t~p̂(t|x)}[ ℓ(t, t̂) ]    (1)
- Direct inference via empirical risk minimization (ERM): learn an approximation t̂_D(·) of the optimal decision as
  t̂_D(·) = argmin_{t̂(·)} L_D(t̂(·))
  with the empirical risk (loss)
  L_D(t̂(·)) = (1/N) Σ_{n=1}^N ℓ(t_n, t̂(x_n))
Remark: in contrast to the generalization loss, where the expectation is taken over the true distribution, here we take the average over the available data.
We first look at a linear regression example. Assume p(x, t) = p(x) p(t|x) with x ~ Unif(0, 1) and p(t|x) = N(t | sin(2πx), 0.1).
[Figure: training set of N = 10 points together with the underlying function sin(2πx).]
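As an aside, a minimal sketch (Python/NumPy, not from the notes) of how such a training set can be generated under the assumed distribution; the seed and sample size are illustrative:

```python
import numpy as np

def generate_data(N, noise_var=0.1, rng=None):
    """Draw N pairs (x_n, t_n) with x ~ Unif(0, 1) and t | x ~ N(sin(2*pi*x), noise_var)."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(0.0, 1.0, size=N)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(noise_var), size=N)
    return x, t

x_train, t_train = generate_data(N=10, rng=np.random.default_rng(0))
```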
If the distribution is known, the optimal predictor under the ℓ_2 loss is t̂*(x) = E[t|x] = sin(2πx).
Minimum generalization loss:
L_p(t̂*) = ∫∫ (t − sin(2πx))² p(x, t) dx dt
         = E[ E[t²|x] − 2 sin(2πx) E[t|x] + sin²(2πx) ]
         = E[ E[t²|x] − (E[t|x])² ] = Var(t|x) = 0.1
3.3.1 Discriminative vs. generative models
Formulate a hypothesis class (family of parametric probabilistic models) and learn the parameters of the model which best fits the data.
Model the label t as a polynomial of the domain point x plus Gaussian noise (corruption):
μ(x, w) = Σ_{j=0}^{M} w_j x^j = w^T φ(x)
with weight vector w = [w_0, ..., w_M]^T and feature vector φ(x) = [1, x, x², ..., x^M]^T, where M is the model order.
Now define the parametric probabilistic model
p(t | x, θ) = N(t | μ(x, w), β⁻¹),  model parameters θ = (w, β),
where β is the precision (inverse variance).
Discriminative model: learn the conditional distribution p(t | x, θ) by learning the parameter vector θ directly from the data; the estimator in (1) can then be calculated directly. The model discriminates t based on the inputs x. This is the main focus in this section.
Generative probabilistic model: learn the joint distribution p(t, x) parameterized by θ, i.e. p(x, t | θ).
Remark: this models also the distribution p(x) of the covariates; the model provides a realization of x by computing the marginal distribution p(x | θ). Use Bayes' theorem to obtain p(t | x, θ) and then the estimate t̂(x) as in (1).
3.3.2 ML learning
Assume the model order M is fixed; we want to learn the model parameters θ from the N data points.
Discriminative model:
p(t_D | x_D, w, β) = Π_{n=1}^N p(t_n | x_n, w, β) = Π_{n=1}^N N(t_n | μ(x_n, w), β⁻¹)
Take the log on both sides to obtain the log-likelihood (LL) function:
ln p(t_D | x_D, w, β) = Σ_{n=1}^N ln p(t_n | x_n, w, β) = Σ_{n=1}^N ln N(t_n | μ(x_n, w), β⁻¹)
The ML learning problem is defined as the minimization of the negative LL (NLL), which is a function of the model parameters only:
min_{w,β} −(1/N) Σ_{n=1}^N ln p(t_n | x_n, w, β)
(cross-entropy / log-loss criterion)
Why? By the strong law of large numbers,
−(1/N) Σ_{n=1}^N ln p(t_n | x_n, w, β) → E_{(x,t)~p(x,t)}[ −ln p(t | x, w, β) ]
= E_{x~p(x)}[ KL( p(t|x) || p(t|x, w, β) ) ] + term that does not depend on (w, β)
(expected cross-entropy). The ML problem therefore attempts to make the model-based p(t | x, w, β) close to the actual p(t|x).
Under the ℓ_2 loss the problem requires only learning the a posteriori mean μ(x, w); the β term can be ignored.
Then:
min_w L_D(w)    (2)
with the training loss
L_D(w) = (1/N) Σ_{n=1}^N (t_n − w^T φ(x_n))² = (1/N) ||t_D − Φ_D w||²    (3)
where t_D = [t_1, ..., t_N]^T and Φ_D = [φ(x_1), ..., φ(x_N)]^T is an N × (M+1) matrix. Problem (2) can be solved in closed form.
Minimization of (3) is a least-squares (LS) problem with the solution
w_ML = (Φ_D^T Φ_D)⁻¹ Φ_D^T t_D    (overdetermined case N ≥ M + 1)
(Φ_D^T Φ_D)⁻¹ Φ_D^T is also known under the name pseudo-inverse.
- Differentiating the NLL with respect to β yields 1/β_ML = L_D(w_ML).
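A minimal sketch of the ML solution above (Python/NumPy; polynomial features, LS solution via np.linalg.lstsq, and 1/β_ML as the training loss; names are illustrative):

```python
import numpy as np

def features(x, M):
    """Feature matrix Phi_D with rows phi(x_n) = [1, x_n, ..., x_n^M]."""
    return np.vander(np.asarray(x), M + 1, increasing=True)

def ml_fit(x, t, M):
    """Return w_ML (least-squares solution of (3)) and 1/beta_ML = L_D(w_ML)."""
    Phi = features(x, M)
    w_ml, *_ = np.linalg.lstsq(Phi, np.asarray(t), rcond=None)
    train_loss = np.mean((np.asarray(t) - Phi @ w_ml) ** 2)  # L_D(w_ML)
    return w_ml, train_loss

# Example usage on synthetic data as in the example above.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, np.sqrt(0.1), 10)
w_ml, beta_ml_inv = ml_fit(x, t, M=3)
```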
Overfitting and underfitting
Assumption: ℓ_2 loss. Going back to the example with p(t|x) = N(t | sin(2πx), 0.1) and the optimal but unknown predictor t̂*(x) = sin(2πx).
How does this compare with the ML predictor t̂(x) = μ(x, w_ML) trained on D with N = 10?
[Figure: ML fits μ(x, w_ML) for model orders M = 1, M = 3 and M = 9, together with the training points.]
- M = 1: the predictor underfits the data (large training loss).
- M = 9: the predictor overfits the data (small training loss, but large generalization loss, L_p(w_ML) ≫ L_D(w_ML)); the model memorizes the training set only.
- M = 3: good choice.
[Figure: training and generalization loss vs. model order M, showing the underfitting and overfitting regimes.]
What happens for a large training set?
[Figure: training loss L_D and test/generalization loss vs. training set size N for fixed M.]
Remark: if N is large enough compared to the number of parameters in w, L_D(w) ≈ L_p(w), i.e. the weight vector w_ML that minimizes L_D also approximately minimizes L_p:
L_D(w_ML) ≈ L_p(w_ML) ≈ L_p(t̂*)   for large N (we make this precise later).
L_D(w_ML) increases with N, as it becomes more difficult to fit the data.
Error analysis
Two error types: bias and estimation error. Write the generalization loss as follows:
L_p(w_ML) − L_p(t̂*) = [ L_p(w_ML) − L_p(w*) ] + [ L_p(w*) − L_p(t̂*) ]
where L_p(w*) = min_w L_p(w) is the best generalization loss for the given model (hypothesis class).
- L_p(w*) − L_p(t̂*): (generalization) bias caused by the choice of the hypothesis class.
- L_p(w_ML) − L_p(w*): estimation error, since w_ML is only the ML estimate for finite N.
[Figure: square root of the loss vs. N: √L_p(w_ML) decreases toward √L_p(w*), with √L_p(t̂*) as lower bound.]
Validation and testing
Problem: how to select the model order M? The distribution p(x, t) is unknown and L_p(t̂) cannot be evaluated.
Solution: divide the available data into two parts, a training set and a hold-out (validation) set, and use the hold-out set to obtain the empirical average
L_p(ŵ) ≈ (1/N_v) Σ_{n=1}^{N_v} ℓ(t_n, μ(x_n, ŵ))
over a validation set of size N_v, where ŵ has been obtained from the training set.
This L_p estimate is used for model order selection. A test set is additionally needed to compute an estimate of L_p for the finally determined choice of M and θ.
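A sketch of hold-out model-order selection under the ℓ_2 loss (Python/NumPy; the split fraction and candidate orders are illustrative assumptions, not from the notes):

```python
import numpy as np

def holdout_select(x, t, orders, val_frac=0.3, rng=None):
    """Fit each candidate model order M on the training part and return the M
    with the smallest empirical l2 loss on the held-out validation part."""
    rng = np.random.default_rng() if rng is None else rng
    x, t = np.asarray(x), np.asarray(t)
    idx = rng.permutation(len(x))
    n_val = max(1, int(len(x) * val_frac))
    val, tr = idx[:n_val], idx[n_val:]
    best_M, best_loss = None, np.inf
    for M in orders:
        Phi_tr = np.vander(x[tr], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_tr, t[tr], rcond=None)
        Phi_val = np.vander(x[val], M + 1, increasing=True)
        loss = np.mean((t[val] - Phi_val @ w) ** 2)  # validation estimate of L_p
        if loss < best_loss:
            best_M, best_loss = M, loss
    return best_M, best_loss
```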
3.3.3 MAP learning
A priori information on the vector of weights w can be used to reduce the effects of overfitting: ||w_ML|| explodes for increasing M (cf. Section 2.3.4).
Remedy: apply an a priori distribution which puts less weight on large values, e.g.
w_j ~ N(0, α⁻¹) i.i.d. (variance α⁻¹).
ML: maximize the LL, i.e. p(t_D | x_D, w, β).
MAP: maximize
p(t_D, w | x_D, β) = p(w) Π_{n=1}^N p(t_n | x_n, w, β)
MAP learning criterion:
min_{w,β} −Σ_{n=1}^N ln p(t_n | x_n, w, β) − ln p(w)
With the training loss L_D(w) = (1/N) ||t_D − Φ_D w||² and λ = α/β we obtain
min_w L_D(w) + (λ/N) ||w||²
= ML criterion + regularization term R(w). The solution follows by standard LS analysis:
w_MAP = (λ I + Φ_D^T Φ_D)⁻¹ Φ_D^T t_D
As N grows, the contribution of the regularization term becomes negligible and the MAP estimate approaches the ML estimate.
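A sketch of the MAP (ridge) solution above (Python/NumPy); lam stands for λ = α/β, and lam = 0 recovers the ML/LS solution:

```python
import numpy as np

def map_fit(x, t, M, lam):
    """w_MAP = (lam*I + Phi_D^T Phi_D)^{-1} Phi_D^T t_D (ridge regression)."""
    Phi = np.vander(np.asarray(x), M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    return np.linalg.solve(A, Phi.T @ np.asarray(t))
```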
[Figure: training loss L_D and generalization loss L_p vs. ln λ.]
Increasing λ has the same effect as reducing the model order M.
Other example for the a priori distribution: Laplace pdf, leading to the regularizer R(w) = ||w||_1 = Σ_{j=0}^{M} |w_j|.
The resulting MAP problem
min_w L_D(w) + (λ/N) ||w||_1
is known as LASSO (least absolute shrinkage and selection operator).
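The ℓ_1-regularized criterion has no closed-form solution; a sketch using scikit-learn's Lasso (the mapping alpha = λ/(2N) to match the criterion above, and fit_intercept=False because φ(x) already contains a constant feature, are choices of this sketch):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_fit(x, t, M, lam):
    """Approximately minimize L_D(w) + (lam/N)*||w||_1.
    sklearn's objective is (1/(2N))*||t - Phi w||^2 + alpha*||w||_1,
    which matches ours (up to a factor 1/2) for alpha = lam/(2N)."""
    Phi = np.vander(np.asarray(x), M + 1, increasing=True)
    model = Lasso(alpha=lam / (2 * len(x)), fit_intercept=False, max_iter=10000)
    model.fit(Phi, np.asarray(t))
    return model.coef_
```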
3.4 Bayesian approach
Frequentist approach: the existence of a true but unknown distribution p(x, t) is assumed; the ML/MAP problem tries to find a parameter vector w such that the model distribution p(t | x, w, β) is close to the true distribution.
Bayesian approach:
(i) The data points are jointly distributed according to a known distribution.
(ii) The model parameters w are jointly distributed with the data.
(We ignore β in the following, assuming only w is to be optimized.)
The approach is characterized by the joint distribution p(t_D, w, t | x_D, x) for a new domain point x and a new label t to be predicted.
The Bayesian model evaluates the a posteriori (posterior) distribution p(t | x_D, t_D, x) = p(t | D, x), given x and D, to predict the new label.
posteriori distribution obtained by manipulating
dis hi b da
By why the chain rule of cond probability
a p l bl c
p l ab l c p al b c we obtain
p I t bet I xp x I I tou l t s x pl t I ta tis uh
p
plusI xD y p I t b l exist pI t l D I
a p ri distribution
PI tis he t I xD x
ply pltblxb.ie pl t Ix y H
a penni Uist likelihood hist if new label
likelihood term p I ta l xD E
II Mtn Hulin uh B I
o
potty new label PLH x El NI t I felt It AY
The factorization can be graphically represented by a Bayesian network. We first drop the dependency on the domain points in (4):
p(t_D, w, t) = p(w) p(t_D | w) p(t | w)
Place a vertex for each involved r.v. and a directed edge from all conditioning r.v.s to the main r.v. of each distribution.
[Figure: Bayesian network with w pointing to t_1, ..., t_N and to the new label t.]
The Bayesian approach is an inference-based approach: the learning standpoint is hidden, as all quantities in the model are r.v.s.
We obtain the a posteriori distribution by manipulating the joint distribution:
p(t | D, x) = p(t_D, t | x_D, x) / p(t_D | x_D) = ∫ p(w | D) p(t | x, w) dw    (5)
where p(w | D) is the a posteriori distribution of the weights and p(t | D, x) is the predictive distribution for the new label.
For the weights, using Bayes' theorem:
p(w | D) = p(w) p(t_D | x_D, w) / p(t_D | x_D)
(a posteriori / posterior belief)
The computation is in general difficult, but for the example where p(t_D | x_D, w) and p(t | x, w) are Gaussian distributions, (5) involves the convolution of these distributions and (ℓ_2 loss assumed) we obtain
p(t | x, D) = N(t | μ(x, w_MAP), s²(x)) with
s²(x) = β⁻¹ (1 + φ(x)^T (λ I + Φ_D^T Φ_D)⁻¹ φ(x))
(cf. Bishop, eqs. (3.58), (3.59))
Here the optimal predictor under the ℓ_2 loss is the MAP predictor; this is not true in general.
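A sketch of the Gaussian predictive distribution above (Python/NumPy): it returns the mean μ(x, w_MAP) and the variance s²(x) for new inputs; β and λ are assumed given:

```python
import numpy as np

def bayes_predictive(x_train, t_train, x_new, M, lam, beta):
    """Predictive mean mu(x, w_MAP) and variance
    s^2(x) = (1/beta) * (1 + phi(x)^T (lam*I + Phi_D^T Phi_D)^{-1} phi(x))."""
    Phi = np.vander(np.asarray(x_train), M + 1, increasing=True)
    A = lam * np.eye(M + 1) + Phi.T @ Phi
    w_map = np.linalg.solve(A, Phi.T @ np.asarray(t_train))
    phi_new = np.vander(np.atleast_1d(x_new), M + 1, increasing=True)
    mean = phi_new @ w_map
    var = (1.0 / beta) * (1.0 + np.sum((phi_new @ np.linalg.inv(A)) * phi_new, axis=1))
    return mean, var
```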
3.4.1 Comparison with ML and MAP
Bayesian: the a posteriori distribution p(t | x, D) allows for a more refined prediction of the labels t. Compared to ML, where the predictive distribution p(t | x, w_ML) = N(t | μ(x, w_ML), β⁻¹) (similarly for MAP) has the same variance for all data points, in the Bayesian approach the accuracy of the prediction depends on the value of x, set by the uneven distribution of the data points x_n.
[Figure: predictive mean μ(x, w_MAP) with uncertainty band ± s(x) and the training points.]
Adaptive uncertainty for the Bayesian approach.
For N → ∞: w_MAP → w_ML and s²(x) → β⁻¹, i.e. the Bayesian approach approaches ML.
3.4.2 Marginal likelihood and model selection
Advantage: Bayesian model selection is possible without validation, via the marginal likelihood
p(t_D | x_D) = ∫ p(w) Π_{n=1}^N p(t_n | x_n, w) dw
A larger M can result in a smaller marginal likelihood, in contrast to the ML approach, where p(t_D | x_D, w_ML) can only increase with increasing M; this allows selecting the model order.
[Figure: marginal likelihood p(t_D | x_D) vs. model order M; its maximum indicates the selected model order.]
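A sketch estimating the log marginal likelihood by Monte Carlo averaging of the likelihood over prior samples (Python/NumPy; a closed form also exists for this Gaussian model, e.g. Bishop eq. (3.86), but is not reproduced here; the number of samples S is an arbitrary choice):

```python
import numpy as np

def log_marginal_likelihood_mc(x, t, M, alpha, beta, S=20000, rng=None):
    """Estimate log p(t_D | x_D), i.e. the log of the average of
    prod_n N(t_n | w^T phi(x_n), 1/beta) over prior samples w ~ N(0, alpha^{-1} I),
    using log-sum-exp for numerical stability."""
    rng = np.random.default_rng() if rng is None else rng
    x, t = np.asarray(x), np.asarray(t)
    Phi = np.vander(x, M + 1, increasing=True)
    W = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=(S, M + 1))  # samples from the prior
    resid = t[None, :] - W @ Phi.T                               # (S, N) residuals
    log_lik = np.sum(0.5 * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid**2, axis=1)
    return np.logaddexp.reduce(log_lik) - np.log(S)
```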
3.5 Minimum description length (MDL)
Compression: for a r.v. x ∈ X with distribution p(x), a lossless compression scheme can be designed which needs ⌈−log p(x)⌉ bits to represent x.
- MDL selects a model which losslessly compresses D to the shortest possible description.
- In the simplest case, the model order M is selected which minimizes the description length
  −Σ_{n=1}^N log p(t_n | x_n, w_ML, β_ML) + C(M)
  where the first term is the number of bits needed to describe the labels under the model and C(M) is the smallest number of bits needed to describe the quantized parameters w_ML, β_ML.
- The second term acts as a regularizer.
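A sketch of the MDL criterion above (Python/NumPy), under the simplifying assumption that each parameter costs a fixed number of bits, so C(M) = bits_per_param · (M + 2) for w_ML and β_ML; this cost model is an illustrative choice, not from the notes:

```python
import numpy as np

def description_length(x, t, M, bits_per_param=16):
    """Description length in bits: -sum_n log2 p(t_n | x_n, w_ML, beta_ML) + C(M)."""
    x, t = np.asarray(x), np.asarray(t)
    Phi = np.vander(x, M + 1, increasing=True)
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    resid = t - Phi @ w_ml
    beta_ml = 1.0 / np.mean(resid**2)                  # 1/beta_ML = L_D(w_ML)
    log2_lik = (0.5 * np.log2(beta_ml / (2 * np.pi))
                - 0.5 * beta_ml * resid**2 / np.log(2))
    data_bits = -np.sum(log2_lik)                      # bits to describe the labels
    C_M = bits_per_param * (M + 2)                     # bits to describe (w_ML, beta_ML)
    return data_bits + C_M
```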
Not discussed in class: see Chapter 2.6 in the text; read about the pitfalls associated with learned distributions (Chapter 2.7).