Algorithm-Independent Learning

The document discusses several key concepts: 1) The "No Free Lunch" theorem states that averaged over all problems, no learning algorithm performs better than any other. Prior knowledge about the problem is needed to determine the best algorithm. 2) Similarly, the "Ugly Duckling" theorem indicates that in the absence of assumptions, there is no privileged or best set of features to use. The notion of similarity between patterns depends on the assumptions. 3) Venn diagrams and predicate ranks are introduced to represent combinations of binary features of patterns, but without prior knowledge, unsatisfactory results may occur in determining similarity.


7. Algorithm-Independent Learning
Aleix M. Martinez
aleix@ece.osu.edu
Handouts for ECE 874, 2007

The "Optimal" Algorithm

•In this course we have defined a large number of PR algorithms.
•The obvious question to ask next is: which one is "best"?
•Of course, to properly answer this we would need to define what "best" means. Let's assume we agree on that; e.g., according to the Bayes error.
•Even then, no PR algorithm is inherently superior to any other.

No Free Lunch theorem

•Recall that we do not know the function to be learned. We do not even know whether it is parametric or not, linearly separable, etc.
•We usually use test data to estimate the performance of an algorithm.
•If the training and testing sets are independent, then all algorithms will perform poorly.
•If the training set is very large, then the testing set will overlap with it and we will merely be testing what we had learned (not the generalization). We will return to this key point later.

•Consider the 2-class problem where D = {x_i} is the training set with labels y_i = ±1, generated by the unknown target function F(x).
•If F incorporates noise (error), then the Bayes error will be different from zero.
•Now, let H be the (discrete) set of hypotheses, and h(x) ∈ H with prior P(h).
•The probability for an algorithm to yield hypothesis h when trained on D is given by P(h|D).
•The natural error measure is the expected value of the error given D, summed over all possible h:

  E[error | D] = \sum_{h,F} \sum_{x \notin D} P(x) \, [1 - \delta(F(x), h(x))] \, P(h | D) \, P(F | D),

  where \delta(\cdot,\cdot) is the Kronecker delta function.

•The expected off-training-set classification error for algorithm k is:

  E_k[error | F, n] = \sum_{x \notin D} P(x) \, [1 - \delta(F(x), h(x))] \, P_k(h(x) | D).

Theorem: No Free Lunch

•For any two learning algorithms P_1(h|D) and P_2(h|D), the following statements are true, independently of the sampling distribution P(x) and the number n of training samples:
1. Uniformly averaged over all target functions F, E_1[error | F, n] - E_2[error | F, n] = 0.
2. For any fixed training set D, uniformly averaged over F, E_1[error | F, D] - E_2[error | F, D] = 0.
3. Uniformly averaged over all priors P(F), E_1[error | n] - E_2[error | n] = 0.
4. For any fixed training set D, uniformly averaged over P(F), E_1[error | D] - E_2[error | D] = 0.

•Part 1 says that, uniformly averaged over all target functions, the expected off-training-set error is the same for all learning algorithms. I.e., if all target functions are equally likely, the "good" algorithms will not outperform the "bad" ones.
•Part 2 says that even if we know D, the off-training-set error averaged over all target functions is the same.
•Parts 3 & 4 concern nonuniform target function distributions.

[Figure: Problem space]

•No system can perform well on all possible problems. There is always a trade-off.
•While our hope is that we will never have to use algorithm A for certain problems, we can only "hope" that this will be so.
•This stresses the point that the assumptions and prior knowledge we have about our problem are what make the difference. This is what a paper should be all about.

[Figure: Example 1]

Example 2

  x     F    h1   h2
  000    1    1    1   (training)
  001   -1   -1   -1   (training)
  010    1    1    1   (training)
  011   -1    1   -1
  100    1    1   -1
  101   -1    1   -1
  110    1    1   -1
  111    1    1   -1

Ugly Duckling Theorem

•Similar to the "no free lunch" theorem, but for features.
•In the absence of assumptions, there is no privileged or "best" set of features.
•The notion of "similarity" between patterns is also determined by the assumptions – which may or may not be correct.
•Predicates: any combination of n patterns (n = 1, 2, …).
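•To make the theorem concrete, the sketch below (an added illustration, not part of the original slides; encoding patterns as bit strings is the only assumption) enumerates every target function F consistent with the three training rows above. Any hypothesis that fits D, including h1 and h2 from the table, attains exactly the same average off-training-set accuracy of 0.5.

```python
from itertools import product

# Toy check of the NFL statement for the 3-bit example above.
patterns = [format(i, "03b") for i in range(8)]
train = {"000": 1, "001": -1, "010": 1}                # training set D and its labels
off_train = [p for p in patterns if p not in train]    # the five off-training-set points

# The two hypotheses from the table; both agree with D but differ off it.
h1 = {"000": 1, "001": -1, "010": 1, "011": 1, "100": 1, "101": 1, "110": 1, "111": 1}
h2 = {"000": 1, "001": -1, "010": 1, "011": -1, "100": -1, "101": -1, "110": -1, "111": -1}

def mean_off_training_accuracy(h):
    """Average off-training-set accuracy of h, uniformly over every target
    function F that is consistent with the training labels."""
    accuracies = []
    for labels in product([-1, 1], repeat=len(off_train)):
        F = dict(train, **dict(zip(off_train, labels)))
        accuracies.append(sum(h[p] == F[p] for p in off_train) / len(off_train))
    return sum(accuracies) / len(accuracies)

print(mean_off_training_accuracy(h1))   # 0.5
print(mean_off_training_accuracy(h2))   # 0.5 -> neither hypothesis wins on average
```

•Each off-training-set point is matched by exactly half of the consistent target functions, which is why the average is 0.5 for any hypothesis.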

Venn diagrams

•Each pattern represents a d-tuple of binary features f_i.

Rank

•"The rank r of a predicate is the number of the simplest or indivisible elements it contains"; e.g., x1, x2, x1 OR x2, x1 AND x2, etc.

  Rank r = 1:
  X1: f1 AND NOT f2
  X2: f1 AND f2
  X3: f2 AND NOT f1
  X4: NOT (f1 OR f2)
  \binom{4}{1} = 4 predicates.

  Rank r = 2:
  X1 OR X2: f1
  X1 OR X3: f1 XOR f2
  X1 OR X4: NOT f2
  X2 OR X3: f2
  X2 OR X4: NOT (f1 AND f2)
  X3 OR X4: NOT f1
  \binom{4}{2} = 6 predicates.

•Total number of predicates: \sum_{r=0}^{n} \binom{n}{r} = 2^n.

Example

•f1 = blind in the right eye; x1 = {1, 0}.
•f2 = blind in the left eye; x2 = {0, 1}.
•A totally blind person is then equally dissimilar to a normally sighted one as x1 is to x2.
•Without prior knowledge, we obtain unsatisfactory results.

       f1   f2   f1'  f2'
  x1    0    0    0    1
  x2    0    1    0    0
  x3    1    0    1    0
  x4    1    1    1    1

  f1' = blind in the right eye.
  f2' = same in both eyes.

Similarity

•We can use the number of shared predicates rather than the number of shared features.
•The number of predicates shared by two distinct patterns is:

  \sum_{r=2}^{d} \binom{d-2}{r-2} = 2^{d-2}.

•Note that this result is independent of the choice of the x_i's.
•Any two distinct patterns are "equally similar."

Theorem: Ugly Duckling

•Given that we use a finite set of predicates that enables us to distinguish any two patterns under consideration, the number of predicates shared by any two such patterns is constant and independent of the choice of those patterns. Furthermore, if pattern similarity is based on the total number of predicates shared by two patterns, then any two patterns are "equally similar."
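•A brute-force check of this count (an added sketch; d = 4 patterns is an arbitrary toy choice): every pair of distinct patterns is contained in exactly 2^(d-2) of the predicates, here taken as the non-empty subsets of the d patterns.

```python
from itertools import combinations

# Count, for each pair of distinct patterns, how many predicates (subsets of the
# d patterns) contain both of them. The ugly duckling result says this count is
# the same for every pair: 2^(d-2).
d = 4
patterns = list(range(d))

def shared_predicates(a, b):
    return sum(1
               for r in range(1, d + 1)
               for subset in combinations(patterns, r)
               if a in subset and b in subset)

print({(a, b): shared_predicates(a, b) for a, b in combinations(patterns, 2)})
# -> every pair shares 2^(d-2) = 4 predicates
```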

Similarity

•This theorem tells us that even the apparently simple notion of similarity between patterns is fundamentally based on implicit assumptions about the problem domain.

Minimum Description Length

•Obviously, we can (efficiently) represent our data in many ways.
•Nevertheless, one is always interested in finding the smallest representation.
•Occam's razor: "entities" should not be multiplied beyond necessity => in PR: one should not use classifiers that are more complicated than "necessary" (where "necessary" refers to the quality of fit).

Overfitting avoidance

•If there are no problem-independent reasons to prefer one algorithm over another, why should we prefer the "simplest" one?
•Obviously, there are problems for which overfitting avoidance is bad.
•In general, though, the fewer the features (or parameters), the lower the probability of error.
•Overfitting avoidance can be seen as eliminating the noisy features.

Algorithmic complexity

•The algorithmic (Kolmogorov) complexity is the inherent complexity of a binary string.
•If sender and receiver share a common language L such that L(y) = x, then the cost to transmit x is |y|.
•In general, an abstract computer takes y as input and outputs x. The algorithmic complexity of x is defined as the length of the shortest program that computes x and halts:

  K(x) = \min_{U(y) = x} |y|,   with U an (abstract) Turing machine.

Example

•If x consists of n 1s, we need K(x) = \log_2 n bits (to encode n).
•To represent a string generated by a short fixed rule (such as the digits of a mathematical constant), K(x) is a small constant.
•To represent an arbitrary string of n bits, K(x) = n.

MDL

•In general, the members of a class have common and different features.
•We want to learn the "essential" ones.
•We do not want to keep redundant (overfit) or noisy features.
•Minimize the sum of the model's algorithmic complexity and the description length of the training data given the model:

  K(h, D) = K(h) + K(D \text{ using } h).
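•K(x) itself is not computable, but the length of a compressed encoding gives a crude, practical upper-bound proxy for it. The sketch below (an added illustration, not from the slides) contrasts a highly regular string with an incompressible one.

```python
import os
import zlib

# Compression length as a rough proxy for algorithmic complexity.
n = 10_000
all_ones = b"1" * n            # highly regular: a description like "n ones" suffices
random_bytes = os.urandom(n)   # incompressible with overwhelming probability

print(len(zlib.compress(all_ones)))      # a few dozen bytes (roughly log n plus overhead)
print(len(zlib.compress(random_bytes)))  # about n bytes: no shorter description exists
```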

•When n → ∞, it can be shown that the MDL solution converges to the "ideal" (true) model.
•However, we may not be clever enough to find that "best" representation in the finite case.
•Variations of MDL use a weighted version of the equation shown above.

Relation to Bayes

  P(h | D) = \frac{P(h) P(D | h)}{P(D)}.

•The optimal hypothesis h* is:

  h^* = \arg\max_h [P(h) P(D | h)] = \arg\max_h [\log_2 P(h) + \log_2 P(D | h)].

Mixture Models and MDL

•The mixture of models introduced in Section 5 requires us to define the number of clusters C.
•The a posteriori probability cannot be used to find C, because it increases with C.
•MDL can help us select the "best" C.
•To do this we need to incorporate a penalty term, which prevents us from increasing C to unnecessarily large values.

•One early approach was based on the idea of an information criterion:

  AIC(C, \theta) = -2 \log p_y(y | C, \theta) + 2L,

  where, for Gaussian components (mean & covariance matrix) in M dimensions, e.g.,

  L = C\left(1 + M + \frac{(M+1)M}{2}\right) - 1.

•Unfortunately, AIC does not lead to a consistent estimator.
•To solve this we use MDL. Rissanen developed an approximate expression:

  MDL(C, \theta) = -\log p(y | C, \theta) + \frac{1}{2} L \log(NM).
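•A sketch of how this criterion can be used to pick C (the synthetic data and the use of scikit-learn's GaussianMixture are assumptions made for illustration only):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Choose the number of clusters C by minimizing
# MDL(C, theta) = -log p(y | C, theta) + (1/2) L log(NM), with L as above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(200, 2))
               for loc in ([0, 0], [3, 3], [0, 4])])   # toy data with 3 clusters
N, M = X.shape

def mdl(X, C):
    gm = GaussianMixture(n_components=C, covariance_type="full", random_state=0).fit(X)
    neg_log_lik = -gm.score(X) * N                     # -log p(y | C, theta)
    L = C * (1 + M + (M + 1) * M / 2) - 1              # number of free parameters
    return neg_log_lik + 0.5 * L * np.log(N * M)

scores = {C: mdl(X, C) for C in range(1, 7)}
print(min(scores, key=scores.get))                     # expected to select C = 3 here
```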

•For N samples, this gives us the following criterion (to be minimized):

  MDL(C, \theta) = -\sum_{n=1}^{N} \log \sum_{i=1}^{C} \pi_i \, p(y_n | i, \theta) + \frac{1}{2} L \log(NM).

Bias and Variance

•How can we evaluate the quality and precision of a classifier on a particular problem?
•Bias measures the accuracy of the match: high bias implies a poor match.
•Variance measures the precision of the match: high variance implies a weak match.
•Bias and variance can be studied in regression and in classification. The regression case is much simpler, though.

Bias & variance for regression

•We create an estimate g(x; D) of the true but unknown function F(x).
•For some training sets D this approximation will be excellent, while for others it will be poor.
•The mean-square deviation from the true value decomposes as:

  E_D[(g(x; D) - F(x))^2] = (E_D[g(x; D) - F(x)])^2 + E_D[(g(x; D) - E_D[g(x; D)])^2],

  i.e., bias^2 + variance.

•Algorithms with more parameters (i.e., more flexibility) tend to have lower bias but higher variance: they fit the data very well.
•Simpler algorithms tend to have higher bias but lower variance (they are more predictable).
•It is not always so simple, though.
•The best way to obtain both low bias and low variance is to have some knowledge of the problem (i.e., of the target function).
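•The decomposition can be estimated by simulation. The sketch below (an added illustration; the sine target, noise level, and polynomial learners are all assumed) averages over many training sets D and reports bias² and variance for a rigid and a flexible model.

```python
import numpy as np

# Estimate bias^2 and variance of g(x; D) by averaging over many training sets D.
rng = np.random.default_rng(1)
F = np.sin                              # the "true" target function (assumed)
x_eval = np.linspace(0, np.pi, 50)      # points at which bias/variance are measured

def bias_variance(degree, n=20, trials=500, noise=0.2):
    preds = np.empty((trials, x_eval.size))
    for t in range(trials):
        x = rng.uniform(0, np.pi, n)                    # a fresh training set D
        y = F(x) + rng.normal(0, noise, n)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_eval)
    bias2 = np.mean((preds.mean(axis=0) - F(x_eval)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias2, variance

print("degree 1:", bias_variance(1))    # rigid model: higher bias, lower variance
print("degree 9:", bias_variance(9))    # flexible model: lower bias, higher variance
```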

Resampling for Estimation

•How can one calculate the bias and variance of an estimator when the target function is unknown?
•There are two general methods to do this:
  –The leave-one-out procedure (also known as the jackknife).
  –The bootstrap.

Jackknife

•It is easy to calculate the sample mean and the sample variance:

  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i,  \quad  \hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2.

•But these formulas cannot be used for other estimates, such as the median or the mode.
•To do this we compute the mean leaving the jth sample out:

  \mu_{(j)} = \frac{1}{n-1} \sum_{i \neq j} x_i.

•The jackknife mean is then given by:

  \mu_{(\cdot)} = \frac{1}{n} \sum_{j=1}^{n} \mu_{(j)},

  and its variance by:

  Var[\hat{\mu}] = \frac{n-1}{n} \sum_{j=1}^{n} (\mu_{(j)} - \mu_{(\cdot)})^2.

•The same construction applies to any estimator \hat{\theta}.
•Jackknife bias estimate:

  bias_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}).

•Jackknife variance estimate:

  Var_{jack}[\hat{\theta}] = \frac{n-1}{n} \sum_{j=1}^{n} (\hat{\theta}_{(j)} - \hat{\theta}_{(\cdot)})^2.

Example: mode

•D = {0, 10, 10, 10, 20, 20}, n = 6, and \hat{\theta} = 10 (the mode).
•The leave-one-out modes are 10, 15, 15, 15, 10, 10, so:

  \hat{\theta}_{(\cdot)} = \frac{1}{n} \sum_{i=1}^{n} \hat{\theta}_{(i)} = \frac{1}{6}(10 + 15 + 15 + 15 + 10 + 10) = 12.5.

•Jackknife bias estimate:

  bias_{jack} = (n-1)(\hat{\theta}_{(\cdot)} - \hat{\theta}) = 5(12.5 - 10) = 12.5.

•Jackknife variance estimate:

  Var_{jack}[\hat{\theta}] = \frac{n-1}{n} \sum_{i=1}^{n} (\hat{\theta}_{(i)} - \hat{\theta}_{(\cdot)})^2 = \frac{5}{6}\left[(10 - 12.5)^2 + 3(15 - 12.5)^2 + 2(10 - 12.5)^2\right] = 31.25,

  i.e., a standard deviation of about ±5.6.

Bootstrap

•Similar to the previous method, only that we randomly draw n' samples (with replacement) B times:

  \hat{\theta}^{(\cdot)} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)}.

•Bootstrap bias estimate:

  bias_{boot} = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}^{(b)} - \hat{\theta} = \hat{\theta}^{(\cdot)} - \hat{\theta}.

•Bootstrap variance estimate:

  Var_{boot}[\hat{\theta}] = \frac{1}{B} \sum_{b=1}^{B} (\hat{\theta}^{(b)} - \hat{\theta}^{(\cdot)})^2.

Stability of the Classifier

•A classifier is unstable if small changes in D result in different classifiers with large differences in classification accuracy.
•The idea of resampling (to be discussed next) is to use several datasets to find (select) a more stable solution.
•We do this by combining several component classifiers.
•There are no fully convincing theoretical results to prove this. A recent result relates stability to the generalization of an algorithm.

Generalization

•Definition: the empirical error (i.e., the performance on the training examples) must be a good indicator of the expected error (i.e., the performance on future samples).
•Empirical error:

  E_S(f) = \frac{1}{n} \sum_{i=1}^{n} V(f, z_i),

  where V is the loss function.
•Expected error:

  E(f) = \int_z V(f, z) \, d\mu(z).

•Mapping: a learning algorithm is a mapping

  L : \bigcup_{n \geq 1} Z^n \to H,

  where H is the hypothesis space (the set of possible functions).
•We have seen that, without restrictions on H, it is impossible to guarantee generalization.
•Stability of an algorithm is a way to see this.
•Cross-validation leave-one-out (CVloo) stability: L is distribution-independent, CVloo stable if, uniformly over all pdfs \mu,

  \lim_{n \to \infty} \sup_{i=1,\dots,n} |V(f_{S_i}, z_i) - V(f_S, z_i)| = 0 \text{ in probability},

  where S_i denotes S with the ith sample left out. ("In probability" means that the probability, under \mu, of the samples for which the equation does not hold is approximately 0.)

Sufficient condition: CVEEEloo stability

•L is distribution-independent, CVEEEloo stable for all \mu if:
  1. L is CVloo stable;
  2. \lim_{n \to \infty} \sup_{i=1,\dots,n} |E(f_S) - E(f_{S_i})| = 0 in probability;
  3. \lim_{n \to \infty} \sup_{i=1,\dots,n} |E_S(f_S) - E_{S_i}(f_{S_i})| = 0 in probability.
•If these conditions hold and the loss function is bounded, then f_S generalizes.
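•An empirical probe of CVloo stability (an added sketch; ridge regression with squared loss is an assumed, convenient learner): the supremum over i of |V(f_{S_i}, z_i) - V(f_S, z_i)| shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_fit(X, y, lam=1.0):
    """Return the prediction function of ridge regression trained on (X, y)."""
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return lambda Xq: Xq @ w

def cvloo_gap(n, d=3, lam=1.0, noise=0.1):
    X = rng.normal(size=(n, d))
    y = X @ np.arange(1, d + 1) + rng.normal(0, noise, n)   # synthetic regression data
    f_S = ridge_fit(X, y, lam)
    gaps = []
    for i in range(n):
        f_Si = ridge_fit(np.delete(X, i, axis=0), np.delete(y, i), lam)
        v_S = (f_S(X[i]) - y[i]) ** 2       # V(f_S, z_i)
        v_Si = (f_Si(X[i]) - y[i]) ** 2     # V(f_Si, z_i)
        gaps.append(abs(v_Si - v_S))
    return max(gaps)                        # sup over i

for n in (50, 200, 800):
    print(n, cvloo_gap(n))                  # the gap decreases with n (stability)
```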

Resampling for Classification

•Arcing (adaptive reweighting and combining): the idea of reusing (or selecting specific) data to improve upon the current classification results.
•Bagging (bootstrap aggregation): uses many subsets of n' samples (n' < n) drawn from D with replacement to train different component classifiers.
•Classification is based on the vote given by each of the component classifiers – which are usually of the same form, e.g., perceptrons.

Boosting

1. We randomly select n1 (n1 < n) samples from D; we call this set D1.
2. We train the first classifier, C1, with D1.
3. We now create a second training set, D2, in which half of the samples are correctly classified by C1 and half are not; D1 ∩ D2 = ∅.
4. We train C2 with D2. (Note that D2 is complementary to D1.)
• Classification is based on the votes given by C1 and C2.
5. We now create D3 with those n3 samples that are not consistently classified (i.e., that receive different votes) by C1 and C2.
6. We train C3 with D3.
• Classification:
  – If C1 and C2 agree on the classification of the test vector, we output that class.
  – Otherwise, we output the class given by C3.

Notes on boosting

•The component classifier, Ci, needs only be a weak learner – an algorithm with results above chance.
•The classification improves as we add component classifiers.
•There is no general way to select n1. Ideally one might want n1 = n2 = n3 ≈ n/3, but there is no way to guarantee this. In practice we might need to run the algorithm several times to optimize n1.

AdaBoost ("adaptive boosting")

•Allows us to keep adding (weak) learners.
•Each sample in D receives a weighting factor W_k(i). At each iteration k we randomly select n' samples according to W_k, train a component classifier h_k with weighted training error E_k, and update:

  \alpha_k = \frac{1}{2} \ln\frac{1 - E_k}{E_k},

  W_{k+1}(i) = \frac{W_k(i)}{Z_k} \times \begin{cases} e^{-\alpha_k} & \text{if } x_i \text{ is correctly classified,} \\ e^{+\alpha_k} & \text{otherwise,} \end{cases}

  where Z_k is a normalizing constant. A sample's weight thus decreases when it is classified correctly and increases otherwise.

•AdaBoost focuses on the most informative or most difficult patterns.
•For each group of n' samples, we train a component classifier.
•We can keep adding learners until the training error falls below a pre-defined threshold; the ensemble training error is bounded by

  E = \prod_{k=1}^{k_{max}} 2\sqrt{E_k (1 - E_k)} < \exp\left(-2 \sum_{k=1}^{k_{max}} G_k^2\right),

  where E_k = 1/2 - G_k.

•The final classification function is:

  g(x) = \sum_{k=1}^{k_{max}} \alpha_k h_k(x).

•The classification decision is given by sgn[g(x)].
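•A compact sketch of the AdaBoost loop above (an added illustration: decision stumps serve as the weak learners, and the stumps are trained on the weighted samples directly rather than by resampling n' points according to W_k; both variants are standard).

```python
import numpy as np

rng = np.random.default_rng(0)

def train_stump(X, y, w):
    """Pick the (feature, threshold, sign) decision stump with smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    err, j, thr, sign = best
    return err, lambda Xq, j=j, thr=thr, sign=sign: sign * np.where(Xq[:, j] > thr, 1, -1)

def adaboost(X, y, k_max=20):
    n = len(y)
    w = np.full(n, 1.0 / n)                       # W_1(i) = 1/n
    alphas, learners = [], []
    for _ in range(k_max):
        E_k, h_k = train_stump(X, y, w)
        E_k = max(E_k, 1e-10)                     # guard against a perfect stump
        alpha_k = 0.5 * np.log((1 - E_k) / E_k)   # alpha_k = (1/2) ln((1 - E_k)/E_k)
        w = w * np.exp(-alpha_k * y * h_k(X))     # e^{-alpha_k} if correct, e^{+alpha_k} if not
        w = w / w.sum()                           # divide by the normalizing constant Z_k
        alphas.append(alpha_k)
        learners.append(h_k)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, learners)))

# Toy 2-class problem with labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] ** 2 > 0.5, 1, -1)
g = adaboost(X, y)
print("training accuracy:", np.mean(g(X) == y))
```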

AdaBoost-based Feature Selection

•The same idea defined above can be used to select those features that best classify (discriminate) our sample vectors.
•It is only applicable to the 2-class problem, though.
•The idea is that at each iteration t we select the feature f_t associated with the smallest classification error.

Algorithm (labels y_i ∈ {0, 1}; m samples in class 1, l samples in class 2):
• Initialize the weights: w_{1,i} = 1/(2m) if y_i = 1, and w_{1,i} = 1/(2l) if y_i = 0.
• For t = 1, …, T:
  – Normalize the weights: w_{t,i} ← w_{t,i} / \sum_{j=1}^{n} w_{t,j}.
  – Train a classifier h_j for each of the p features.
  – Error: E_j = \sum_i w_i |h_j(x_i) - y_i|.
  – Choose the classifier h_t with the smallest error E_t.
  – Update the weights: w_{t+1,i} = w_{t,i} \beta_t^{1 - e_i}, where e_i = 0 if x_i is correctly classified, e_i = 1 otherwise, and \beta_t = \frac{E_t}{1 - E_t}.
•The final strong classifier is:

  h(x) = 1 \text{ if } \sum_{t=1}^{T} \alpha_t h_t(x) \geq \frac{1}{2} \sum_{t=1}^{T} \alpha_t, \text{ and } 0 \text{ otherwise},

  where \alpha_t = \log\frac{1}{\beta_t}.

Learning with queries

•The previous methods are supervised (i.e., the data are labeled).
•Many applications require unsupervised techniques, though. We assume that our data can be labeled (by an oracle), but there is a cost associated with this; e.g., handwritten text.
•Cost-based learning: the goal is to minimize the overall cost – classification accuracy & labeling.
•Confidence-based query selection: select those patterns whose discriminant value is most ambiguous (≈ ½), i.e., those closest to the current decision boundary.

•Voting-based or committee-based query selection (multiclass): select the patterns that yield the greatest disagreement among the k discriminant functions.
•Advantages:
  –We need not guess (or learn) the form of the underlying distribution of the data; instead, we only need to estimate the classification boundary.
  –We need not test our classifier with data drawn from the same distribution.
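•A minimal sketch of confidence-based query selection (the logistic model, the pool, and the use of scikit-learn are all assumptions for illustration): the patterns whose discriminant value is closest to ½ are the ones sent to the oracle.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 2))                 # small labeled seed set
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.normal(size=(500, 2))                   # large unlabeled pool

clf = LogisticRegression().fit(X_labeled, y_labeled)
p = clf.predict_proba(X_pool)[:, 1]                  # discriminant value in [0, 1]
query_idx = np.argsort(np.abs(p - 0.5))[:10]         # the 10 most ambiguous patterns
print(query_idx)                                     # ask the oracle to label these
```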
