7. Algorithm-Independent Learning

Aleix M. Martinez
aleix@ece.osu.edu
Handouts for ECE 874, 2007

Algorithm-Independent Learning
• In this course we have defined a large number of PR algorithms.
• The obvious question to ask next is: which one is "best"?
• Of course, to properly answer this, we would need to define what "best" means. Let's assume we agree on that; e.g., according to the Bayes error.
• Even then, no PR algorithm is inherently superior to any other.
• The expected off-training-set classification error, given the training set D, is:

  E[\text{error} \mid D] = \sum_{h,F} \sum_{x \notin D} P(x)\, [1 - \delta(F(x), h(x))]\, P(h \mid D)\, P(F \mid D),

  where \delta(\cdot) is the Kronecker delta function.
• And the expected off-training-set classification error for algorithm k is:

  E_k[\text{error} \mid F, n] = \sum_{x \notin D} P(x)\, [1 - \delta(F(x), h(x))]\, P_k(h(x) \mid D).

Theorem: No Free Lunch
• For any two learning algorithms P1(h|D) and P2(h|D), the following is true, independently of the sampling distribution P(x) and the number n of training samples:
  1. Uniformly averaged over all target functions F, E1[error|F,n] - E2[error|F,n] = 0.
  2. For any fixed training set D, uniformly averaged over F, E1[error|F,D] - E2[error|F,D] = 0.
  3. Uniformly averaged over all priors P(F), E1[error|n] - E2[error|n] = 0.
  4. For any fixed training set D, uniformly averaged over P(F), E1[error|D] - E2[error|D] = 0.
[Figure: problem space]
• 1 says that, uniformly averaged over all target functions, the expected off-training-set error is the same for all learning algorithms. I.e., if all target functions are equally likely, the "good" algorithms will not outperform the "bad" ones.
• 2 says that even if we know D, the off-training-set error averaged over all target functions is the same.
• 3 & 4 concern nonuniform target function distributions.
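To make statement 2 above concrete, here is a toy enumeration in Python; the 3-bit feature space, the particular training set, and the two candidate learners (1-nearest-neighbor and a majority-label rule) are illustrative assumptions, not anything prescribed by the slides.

```python
# Toy check of No Free Lunch, statement 2: for a fixed training set D,
# averaging the off-training-set error over ALL possible target labelings of
# the remaining points gives the same value (1/2) for any learning algorithm.
import itertools
import numpy as np

# Feature space: all 3-bit binary patterns.
X = np.array(list(itertools.product([0, 1], repeat=3)))

# Fixed training set D: three patterns with (arbitrary) labels.
train_idx = [0, 3, 5]
y_train = np.array([0, 1, 1])
test_idx = [i for i in range(len(X)) if i not in train_idx]

def nearest_neighbor(x):
    d = np.abs(X[train_idx] - x).sum(axis=1)        # Hamming distance to D
    return y_train[np.argmin(d)]

def majority(x):
    return int(y_train.mean() >= 0.5)               # ignore x, predict majority label

errors = {"1-NN": [], "majority": []}
# Enumerate every target function F that is free on the off-training-set points.
for F_off in itertools.product([0, 1], repeat=len(test_idx)):
    for name, h in [("1-NN", nearest_neighbor), ("majority", majority)]:
        err = np.mean([h(X[i]) != F_off[k] for k, i in enumerate(test_idx)])
        errors[name].append(err)

for name, e in errors.items():
    print(f"{name}: mean off-training-set error over all targets = {np.mean(e):.3f}")
# Both print 0.500: neither algorithm is better when all targets are equally likely.
```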
Venn diagrams
• Each pattern represents a d-tuple of binary features f_i.

Rank
• "The rank r of a predicate is the number of the simplest or indivisible elements it contains;" e.g., x1, x2, x1 OR x2, x1 AND x2, etc.
Similarity
• This theorem tells us that even the apparently simple notion of similarity between patterns is fundamentally based on implicit assumptions about the problem domain.

Minimum Description Length
• Obviously, we can (efficiently) represent our data in many ways.
• Nevertheless, one is always interested in finding the smallest representation.
• Occam's razor: "entities" should not be multiplied beyond necessity => in PR: one should not use classifiers that are more complicated than "necessary" (where "necessary" means quality of fit).
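To make the Similarity claim above concrete, here is a minimal numerical check; it assumes the theorem being referred to is the Ugly Duckling theorem of Duda, Hart & Stork, under which, with no prior assumptions, any two distinct patterns share exactly the same number of predicates.

```python
# Count, for every pair of distinct binary patterns, how many predicates
# (subsets of the pattern space) contain both. The count is identical for
# every pair, so "similarity" cannot come from predicate counting alone.
import itertools

patterns = list(itertools.product([0, 1], repeat=3))   # 8 binary patterns (d = 3)
# A predicate is any subset of patterns; represent each as a frozenset.
predicates = [frozenset(s)
              for r in range(len(patterns) + 1)
              for s in itertools.combinations(patterns, r)]

def shared(a, b):
    """Number of predicates satisfied by both pattern a and pattern b."""
    return sum(1 for p in predicates if a in p and b in p)

counts = {(a, b): shared(a, b)
          for a, b in itertools.combinations(patterns, 2)}
print(set(counts.values()))   # {64}: identical for every pair of distinct patterns
```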
Example
• x consists of n 1s. We need K(x) = \log_2 n bits.
• To represent a constant such as \pi, K(x) = O(1).
• To represent an arbitrary string of n bits, K(x) = n.

MDL
• In general, the members of a class have common and different features.
• We want to learn the "essential" ones.
• We do not want to keep redundant (overfit) or noisy features.
• Minimize the sum of the model's algorithmic complexity and the description of the training data:

  K(h, D) = K(h) + K(D \text{ using } h).
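A toy sketch of the criterion K(h, D) = K(h) + K(D using h): choose a polynomial degree by minimizing bits for the model plus bits for the residuals. The specific coding costs (32 bits per coefficient, a Gaussian code for the residuals) and the synthetic data are crude assumptions made only to illustrate the trade-off.

```python
# Two-part code for model selection: K(h) grows with the number of
# coefficients, K(D using h) shrinks as the fit improves.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.2, size=x.size)  # data from a degree-2 model

def description_length(degree):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    k_model = 32.0 * (degree + 1)                        # bits to state the coefficients
    mse = max(np.mean(residuals**2), 1e-12)
    k_data = 0.5 * x.size * np.log2(mse) + 0.5 * x.size * np.log2(2 * np.pi * np.e)
    return k_model + k_data                              # K(h) + K(D using h)

lengths = {d: description_length(d) for d in range(0, 10)}
best = min(lengths, key=lengths.get)
print("degree with minimum description length:", best)
# High-degree fits shrink the residual term but pay more for the model term.
```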
• When n → ∞, it can be shown that MDL converges to the "ideal" (true) model.
• However, we may not be clever enough to find that "best" representation in the finite case.
• Variations of MDL use a weighted version.

Relation to Bayes

  p(h \mid D) = \frac{P(h)\, P(D \mid h)}{P(D)}

• The optimal hypothesis h^* is:

  h^* = \arg\max_h [P(h)\, P(D \mid h)].
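One way to relate the MDL criterion to the MAP hypothesis above is to read the algorithmic complexities as idealized code lengths, K ≈ -\log_2 P; this is a heuristic identification, not an exact equality:

  h^* = \arg\max_h P(h)\, P(D \mid h)
      = \arg\min_h \left[ -\log_2 P(h) - \log_2 P(D \mid h) \right]
      \approx \arg\min_h \left[ K(h) + K(D \text{ using } h) \right].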
Bias & variance for regression
• We create an estimate g(x; D) of the true but unknown function F(x).
• For some D this approximation will be excellent, while for others it will be poor.
• The mean-square deviation from its true value is:

  E_D[(g(x; D) - F(x))^2] = (E_D[g(x; D) - F(x)])^2 + E_D[(g(x; D) - E_D[g(x; D)])^2].

• Algorithms with more parameters (i.e., more flexibility) tend to have lower bias but higher variance => they will fit the data very well.
• Simpler algorithms tend to have high bias but lower variance (they are more predictable).
• It is not always so simple, though.
• The best way to have low bias and variance is to have some knowledge of the problem.
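A small simulation of the trade-off just described; the target function, the sample size, and the two polynomial degrees are illustrative choices.

```python
# Over many training sets D drawn from the same source, a degree-1 fit (rigid)
# shows high bias and low variance at a test point, while a degree-7 fit
# (flexible) typically shows the reverse.
import numpy as np

rng = np.random.default_rng(1)
F = lambda x: np.sin(2 * np.pi * x)          # true (unknown) function F(x)
x_test = 0.3                                  # evaluate g(x; D) at a single point

def fit_and_predict(degree):
    preds = []
    for _ in range(500):                      # 500 independent training sets D
        x = rng.uniform(0, 1, 15)
        y = F(x) + rng.normal(scale=0.3, size=x.size)
        coeffs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    bias_sq = (preds.mean() - F(x_test)) ** 2      # (E_D[g] - F)^2
    variance = preds.var()                          # E_D[(g - E_D[g])^2]
    return bias_sq, variance

for degree in (1, 7):
    b, v = fit_and_predict(degree)
    print(f"degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}")
```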
Jackknife
• It is easy to calculate the sample mean and sample covariance:

  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
  \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2.

• But this cannot be used for other estimates like the median or the mode.
• To do this we compute the mean leaving the jth sample out:

  \mu_{(j)} = \frac{1}{n-1} \sum_{i \neq j} x_i.

• The mean is then given by:

  \mu_{(\cdot)} = \frac{1}{n} \sum_{j=1}^{n} \mu_{(j)}.

• And the variance:

  \mathrm{Var}[\hat{\mu}] = \frac{n-1}{n} \sum_{j=1}^{n} (\mu_{(j)} - \mu_{(\cdot)})^2.

• In general, for any estimator \hat{\theta}:
• Jackknife Bias Estimate:

  \mathrm{bias}_{\mathrm{jack}} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}).

• Jackknife Variance Estimate:

  \mathrm{Var}_{\mathrm{jack}}[\hat{\theta}] = \frac{n-1}{n} \sum_{j=1}^{n} (\hat{\theta}_{(j)} - \hat{\theta}_{(\cdot)})^2.
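A direct transcription of the leave-one-out formulas above, applied to the median (one of the statistics the slide says the closed-form expressions do not cover); the data values are arbitrary illustrative numbers.

```python
# Jackknife bias and variance estimates for the sample median.
import numpy as np

x = np.array([3.2, 1.7, 4.5, 2.8, 5.1, 3.9, 2.2, 4.0])
n = x.size
theta_hat = np.median(x)                                   # plain estimate

# Leave-one-out estimates theta_(j), j = 1..n.
theta_loo = np.array([np.median(np.delete(x, j)) for j in range(n)])
theta_dot = theta_loo.mean()                               # theta_(.)

bias_jack = (n - 1) * (theta_dot - theta_hat)              # jackknife bias estimate
var_jack = (n - 1) / n * np.sum((theta_loo - theta_dot) ** 2)  # jackknife variance estimate

print(f"median = {theta_hat:.3f}, jackknife bias = {bias_jack:.3f}, "
      f"jackknife variance = {var_jack:.3f}")
```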
Example: mode
• D = {0, 10, 10, 10, 20, 20}, n = 6 and \hat{\theta} = 10.

  \hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)} = \frac{1}{6}(10 + 15 + 15 + 15 + 10 + 10) = 12.5.

  \mathrm{bias}_{\mathrm{jack}} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}) = 5(12.5 - 10) = 12.5.

  \mathrm{Var}_{\mathrm{jack}}[\hat{\theta}] = \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2
  = \frac{5}{6}\left[(10 - 12.5)^2 + 3(15 - 12.5)^2 + 2(10 - 12.5)^2\right] = 31.25,

  i.e., a jackknife standard deviation of \sqrt{31.25} \approx 5.6.

Bootstrap
• It is similar to the previous method, only that we randomly select n' samples (with replacement) B times:

  \hat{\theta}^{(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)}.

• Bootstrap Bias Estimate:

  \mathrm{bias}_{\mathrm{boot}} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)} - \hat{\theta} = \hat{\theta}^{(\cdot)} - \hat{\theta}.

• Bootstrap Variance Estimate:

  \mathrm{Var}_{\mathrm{boot}}[\hat{\theta}] = \frac{1}{B} \sum_{b=1}^{B} \left(\hat{\theta}^{(b)} - \hat{\theta}^{(\cdot)}\right)^2.
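The bootstrap counterpart of the jackknife sketch above: resample the data with replacement B times and read the bias and variance estimates directly off the definitions. It uses the data set from the example, but the median rather than the mode as the estimator, to keep the code short; B = 2000 is an arbitrary choice.

```python
# Bootstrap bias and variance estimates for the sample median.
import numpy as np

rng = np.random.default_rng(2)
x = np.array([0, 10, 10, 10, 20, 20], dtype=float)   # the data set from the example
n, B = x.size, 2000
theta_hat = np.median(x)                             # plain estimate

theta_boot = np.array([np.median(rng.choice(x, size=n, replace=True))
                       for _ in range(B)])
theta_dot = theta_boot.mean()                        # theta^(.)

bias_boot = theta_dot - theta_hat                    # (1/B) sum_b theta^(b) - theta_hat
var_boot = np.mean((theta_boot - theta_dot) ** 2)    # (1/B) sum_b (theta^(b) - theta^(.))^2

print(f"bootstrap bias = {bias_boot:.3f}, bootstrap variance = {var_boot:.3f}")
```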
Resampling for Classification
• Arcing (adaptive reweighting and combining) is the idea of reusing (or selecting specific) data to improve upon current classification results.
• Bagging (bootstrap aggregation) uses many subsets of n' samples (n' < n) drawn from D with replacement to train different component classifiers (a sketch follows the boosting steps below).
• Classification is based on the vote given by each of the component classifiers, which are usually of the same form; e.g., Perceptrons.

Boosting
1. We randomly select n1 (n1 < n) samples from D; we call this D1.
2. We train the first classifier, C1, with D1.
3. Now we create a second training set, D2, where half of the samples are correctly classified by C1 and half are not; D1 ∩ D2 = ∅.
4. Train C2 with D2. (Note that D2 is complementary to D1.)
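A minimal bagging sketch for the Bagging bullet above: train several Perceptrons on bootstrap subsets and classify by majority vote. The synthetic two-Gaussian data and the use of scikit-learn's Perceptron are illustrative assumptions.

```python
# Bagging: bootstrap subsets -> component classifiers -> majority vote.
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(3)
# Two Gaussian classes in 2-D.
X = np.vstack([rng.normal(loc=[0, 0], scale=1.0, size=(100, 2)),
               rng.normal(loc=[2, 2], scale=1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

n, n_prime, n_classifiers = len(X), 120, 11
components = []
for _ in range(n_classifiers):
    idx = rng.choice(n, size=n_prime, replace=True)      # bootstrap subset of n' samples
    components.append(Perceptron().fit(X[idx], y[idx]))

def bagged_predict(X_new):
    votes = np.stack([c.predict(X_new) for c in components])   # one row per component
    return (votes.mean(axis=0) >= 0.5).astype(int)              # majority vote

print("training accuracy of the vote:", (bagged_predict(X) == y).mean())
```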
Notes on boosting
• The component classifier, C_i, needs only be a weak learner: an algorithm with results above chance.
• The classification improves as we add component classifiers.
• There is no general way to select n1. Ideally one might want n1 = n2 = n3 = n/3, but there is no way to guarantee this. In practice we might need to run the algorithm several times to optimize n1.

AdaBoost ("adaptive boosting")
• Allows us to keep adding (weak) learners.
• In this case each sample in D receives a weighting factor, W_k. At each iteration we randomly select n' samples according to W_k.

  \alpha_k = \frac{1}{2} \ln\frac{1 - E_k}{E_k}

  W_{k+1}(i) = \frac{W_k(i)}{Z_k} \times
  \begin{cases}
  e^{-\alpha_k} & \text{if } x_i \text{ is correctly classified,} \\
  e^{+\alpha_k} & \text{otherwise,}
  \end{cases}

  where Z_k is a normalizing constant; a sample's weight increases or decreases according to the results.
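As a quick numerical check of the update above, take an illustrative weighted error of E_k = 0.3:

  \alpha_k = \tfrac{1}{2} \ln\frac{1 - 0.3}{0.3} \approx 0.42,

so correctly classified samples are reweighted by e^{-0.42} \approx 0.66 and misclassified samples by e^{+0.42} \approx 1.53, after which all weights are divided by Z_k so that they again sum to one.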
• AdaBoost focuses on the most informative or most difficult patterns.
• For each group of n' samples, we train a component classifier.
• We can keep learning until the training error is below a pre-defined threshold.
• The ensemble training error is bounded by

  E = \prod_{k=1}^{k_{\max}} 2\sqrt{E_k(1 - E_k)} < \exp\!\left(-2 \sum_{k=1}^{k_{\max}} G_k^2\right), \qquad G_k = \tfrac{1}{2} - E_k.

• The final classification function is:

  g(x) = \sum_{k=1}^{k_{\max}} \alpha_k h_k(x).
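A compact AdaBoost sketch following the formulas on these slides: \alpha_k = \tfrac{1}{2}\ln[(1-E_k)/E_k], the exponential weight update, and the final score g(x) = \sum_k \alpha_k h_k(x). Decision stumps as weak learners and the 1-D synthetic data (labels in {-1, +1}) are illustrative assumptions.

```python
# AdaBoost with decision stumps on a noisy 1-D threshold concept.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 200)
y = np.where(x + 0.1 * rng.normal(size=x.size) > 0, 1, -1)   # noisy threshold concept

def best_stump(x, y, w):
    """Weighted-error-minimizing stump h(x) = s * sign(x - threshold)."""
    best = None
    for thr in np.unique(x):
        for s in (1, -1):
            pred = s * np.sign(x - thr + 1e-12)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, thr, s)
    return best

k_max = 20
W = np.ones(x.size) / x.size
stumps, alphas = [], []
for _ in range(k_max):
    E_k, thr, s = best_stump(x, y, W)
    alpha_k = 0.5 * np.log((1 - E_k) / max(E_k, 1e-12))
    pred = s * np.sign(x - thr + 1e-12)
    W *= np.exp(-alpha_k * y * pred)      # e^{-alpha} if correct, e^{+alpha} otherwise
    W /= W.sum()                          # Z_k: renormalize
    stumps.append((thr, s))
    alphas.append(alpha_k)

g = sum(a * s * np.sign(x - thr + 1e-12) for a, (thr, s) in zip(alphas, stumps))
print("training error of sign(g(x)):", np.mean(np.sign(g) != y))
```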
AdaBoost-based Feature Selection
• The same idea defined above can now be used to select those features that best classify (discriminate) our sample vectors.
• It is only applicable to the 2-class problem, though.
• The idea is that, at each iteration, we select the feature f_i associated with the smallest classification error.

Initialize: w_{1,i} = \frac{1}{2m} if y_i = 1 (the m samples in class 1), and w_{1,i} = \frac{1}{2l} if y_i = 0 (the l samples in class 2).
For t = 1, …, n:
  • Normalize the weights: w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}.
  • Train a classifier h_j for each of the p features.
  • Error: E_j = \sum_i w_i \, |h_j(x_i) - y_i|.
  • Choose the classifier with the smallest error.
  • Update the weights: w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i},
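A sketch of the feature-selection loop above: one threshold classifier per feature, the weighted error E_j = \sum_i w_i |h_j(x_i) - y_i|, keep the best feature each round, and update w_{t+1,i} = w_{t,i} \beta_t^{1-e_i}. The slide's update is cut off before \beta_t is defined, so \beta_t = E_t/(1 - E_t) below is taken from Viola and Jones' formulation; the synthetic data and the threshold grid are illustrative assumptions.

```python
# Viola-Jones-style AdaBoost feature selection with labels in {0, 1}.
import numpy as np

rng = np.random.default_rng(5)
n, p = 200, 6
X = rng.normal(size=(n, p))
y = (X[:, 2] + 0.5 * rng.normal(size=n) > 0).astype(int)   # feature 2 (plus noise) drives the class

def stump_predict(col, thr, polarity):
    return ((polarity * (col - thr)) > 0).astype(int)

def best_threshold(col, y, w):
    best = (np.inf, 0.0, 1)
    for thr in np.quantile(col, np.linspace(0.05, 0.95, 19)):
        for polarity in (1, -1):
            err = np.sum(w * np.abs(stump_predict(col, thr, polarity) - y))
            if err < best[0]:
                best = (err, thr, polarity)
    return best

m, l = np.sum(y == 1), np.sum(y == 0)
w = np.where(y == 1, 1 / (2 * m), 1 / (2 * l))       # initialization from the slide
for t in range(5):
    w = w / w.sum()                                  # normalize the weights
    results = [best_threshold(X[:, j], y, w) for j in range(p)]
    j_best = int(np.argmin([r[0] for r in results]))
    E_t, thr, pol = results[j_best]
    beta_t = max(E_t, 1e-12) / (1 - E_t)             # Viola-Jones choice of beta_t
    e = np.abs(stump_predict(X[:, j_best], thr, pol) - y)   # 0 if correct, 1 otherwise
    w = w * beta_t ** (1 - e)                        # shrink weights of correct samples
    print(f"round {t}: selected feature {j_best} (weighted error {E_t:.3f})")
```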
• Voting-based or committee-based query selection (multiclass): select the patterns that yield the greatest disagreement among the k discriminant functions.
• Advantages:
  – We need not guess (or learn) the form of the underlying distribution of the data. Instead, we need to estimate the classification boundary.
  – We need not test our classifier with data drawn from the same distribution.
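A sketch of committee-based query selection as described above: train k classifiers on the labeled pool, then query the unlabeled patterns on which the committee disagrees the most. It is shown for two classes (disagreement measured by vote entropy) for brevity; the data, the committee size, and the use of scikit-learn's Perceptron are illustrative assumptions.

```python
# Query-by-committee: pick the most-contested unlabeled patterns.
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(6)
X_labeled = rng.normal(size=(30, 2))
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(500, 2))                     # unlabeled candidates

k = 5
committee = []
for _ in range(k):
    idx = rng.choice(len(X_labeled), size=len(X_labeled), replace=True)
    committee.append(Perceptron().fit(X_labeled[idx], y_labeled[idx]))

votes = np.stack([c.predict(X_pool) for c in committee])   # shape (k, 500)
p1 = votes.mean(axis=0)                                    # fraction voting for class 1
p1 = np.clip(p1, 1e-12, 1 - 1e-12)
disagreement = -(p1 * np.log(p1) + (1 - p1) * np.log(1 - p1))   # vote entropy

query_idx = np.argsort(disagreement)[-5:]                  # 5 most-contested patterns
print("patterns to query next:", query_idx)
```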