Machine Learning - Computational Learning Theory
Example Trace

Oracle: 1, 3, 6, 11, ...
Learner: 0, 2, 5, 10, ...
Hypotheses considered: natural, odd int, pos int, h: h(n) = h(n−1) + n + 1

PAC model: only requires learning a Probably Approximately Correct concept, i.e. learn a decent approximation most of the time.
Requires polynomial sample complexity and computational complexity.
Cannot Learn Exact Concepts from Limited Data, Only Approximations
Cannot Learn Even Approximate Concepts from Pathological Training Sets
[Figures: instance spaces with labeled positive and negative examples illustrating each limitation]
ε-Exhausted Version Space

The version space, VS_{H,D}, is said to be ε-exhausted iff every hypothesis in it has true error less than or equal to ε.
In other words, there are enough training examples to guarantee that any consistent hypothesis has error at most ε.
One can never be sure that the version space is ε-exhausted, but one can bound the probability that it is not.
Theorem 7.1 (Haussler, 1988): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples for some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_{H,D} is not ε-exhausted is less than or equal to:

    |H| e^(−εm)

Proof

Let H_bad = {h_1, ..., h_k} be the subset of H with error > ε. The VS is not ε-exhausted if any of these are consistent with all m examples.
A single h_i ∈ H_bad is consistent with one example with probability:

    P(consist(h_i, e_j)) ≤ (1 − ε)

A single h_i ∈ H_bad is consistent with all m independent random examples with probability:

    P(consist(h_i, D)) ≤ (1 − ε)^m

The probability that any h_i ∈ H_bad is consistent with all m examples is:

    P(consist(H_bad, D)) = P(consist(h_1, D) ∨ ... ∨ consist(h_k, D))

Bounding the probability of this disjunction by the sum of the individual probabilities gives:

    P(consist(H_bad, D)) ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m ≤ |H| e^(−εm)

since k ≤ |H| and (1 − ε) ≤ e^(−ε).
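To make the bound concrete, here is a small Monte Carlo sketch (the function name and parameter values are mine, not from the slides): it estimates how often at least one of k bad hypotheses, each taken to have true error exactly ε, stays consistent with m independent examples, and compares that estimate against the union bound k(1 − ε)^m and against |H| e^(−εm).

```python
import math
import random

def prob_not_exhausted_mc(k_bad, eps, m, trials=10_000):
    """Monte Carlo estimate of the probability that at least one of k_bad
    hypotheses, each with true error exactly eps, is consistent with all m
    independent random examples.  Consistency events are simulated as
    independent across hypotheses; the union bound needs no such assumption."""
    hits = 0
    for _ in range(trials):
        if any(all(random.random() > eps for _ in range(m)) for _ in range(k_bad)):
            hits += 1
    return hits / trials

k_bad, eps, m = 50, 0.1, 60
print(prob_not_exhausted_mc(k_bad, eps, m))   # empirical estimate, roughly 0.086
print(k_bad * (1 - eps) ** m)                 # union bound k(1 - eps)^m, about 0.090
print(k_bad * math.exp(-eps * m))             # |H| e^(-eps m), about 0.124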
Therefore, any consistent learner, given at least

    (1/ε)(ln(1/δ) + ln|H|)

examples will produce a result that is PAC.
Just need to determine the size of a hypothesis space to instantiate this result for learning specific classes of concepts.
This gives a sufficient number of examples for PAC learning, but not a necessary number. Several approximations, like that used to bound the probability of a disjunction, make this a gross over-estimate in practice.

Consider conjunctions over n boolean features. There are 3^n of these, since each feature can appear positively, appear negatively, or not appear in a given conjunction. Therefore |H| = 3^n, so a sufficient number of examples to learn a PAC concept is:

    (1/ε)(ln(1/δ) + ln 3^n) = (1/ε)(ln(1/δ) + n ln 3)

Concrete examples:
ε = δ = 0.05, n = 10 gives 280 examples
δ = 0.01, ε = 0.05, n = 10 gives 312 examples
ε = δ = 0.01, n = 10 gives 1,560 examples
ε = δ = 0.01, n = 50 gives 5,954 examples
The result holds for any consistent learner, including Find-S.
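A short Python sketch of the bound above (the function name is mine); plugging in ln|H| = n ln 3 for boolean conjunctions reproduces the concrete example counts listed above.

```python
import math

def pac_sample_size(ln_H, eps, delta):
    """Sufficient training-set size for any consistent learner:
    m >= (1/eps) * (ln(1/delta) + ln|H|)."""
    return math.ceil((math.log(1.0 / delta) + ln_H) / eps)

# Conjunctions over n boolean features: |H| = 3^n, so ln|H| = n * ln(3).
for eps, delta, n in [(0.05, 0.05, 10), (0.05, 0.01, 10), (0.01, 0.01, 10), (0.01, 0.01, 50)]:
    print(f"eps={eps}, delta={delta}, n={n}: {pac_sample_size(n * math.log(3), eps, delta)}")
# -> 280, 312, 1560, 5954, matching the concrete examples above
```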
Sample Complexity of Learning Arbitrary Boolean Functions

Consider any boolean function over n boolean features, such as the hypothesis space of DNF formulas or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is:

    (1/ε)(ln(1/δ) + ln 2^(2^n)) = (1/ε)(ln(1/δ) + 2^n ln 2)

Concrete examples:
ε = δ = 0.05, n = 10 gives 14,256 examples
ε = δ = 0.05, n = 20 gives 14,536,410 examples
ε = δ = 0.05, n = 50 gives 1.561×10^16 examples

Other Concept Classes

k-term DNF: disjunctions of at most k unbounded conjunctive terms: T_1 ∨ T_2 ∨ ... ∨ T_k; ln(|H|) = O(kn)
k-DNF: disjunctions of any number of terms, each limited to at most k literals: (L_1 ∧ L_2 ∧ ... ∧ L_k) ∨ (M_1 ∧ M_2 ∧ ... ∧ M_k) ∨ ...; ln(|H|) = O(n^k)
k-clause CNF: conjunctions of at most k unbounded disjunctive clauses: C_1 ∧ C_2 ∧ ... ∧ C_k; ln(|H|) = O(kn)
k-CNF: conjunctions of any number of clauses, each limited to at most k literals: (L_1 ∨ L_2 ∨ ... ∨ L_k) ∧ (M_1 ∨ M_2 ∨ ... ∨ M_k) ∧ ...; ln(|H|) = O(n^k)

Counting the ways of choosing k items from n (all are O(n^k)):
k-samples (ordered, repetition allowed): n^k
k-permutations (ordered, no repetition): n!/(n − k)!
k-selections (unordered, repetition allowed): (n + k − 1)!/(k!(n − 1)!)
k-combinations (unordered, no repetition): n!/(k!(n − k)!)

Learning a k-CNF concept reduces to learning a conjunction:
data for a k-CNF concept → construct all disjunctive features with at most k literals → expanded data with O(n^k) new features → Find-S → k-CNF formula
The sample complexity of learning k-DNF and k-CNF is O(n^k).
Training on O(n^k) examples with O(n^k) features takes O(n^(2k)) time.
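A minimal sketch of this reduction, assuming boolean examples given as 0/1 tuples (the clause encoding and function names are my own): enumerate every disjunctive clause of at most k literals, re-describe each example by the truth values of those clauses, and run Find-S over the new features, i.e. keep exactly the clauses true in every positive example; their conjunction is the learned k-CNF hypothesis.

```python
from itertools import combinations

def all_clauses(n, k):
    """All disjunctions of at most k literals over n boolean features.
    A literal is (feature_index, wanted_value); a clause is a tuple of literals."""
    literals = [(i, v) for i in range(n) for v in (0, 1)]
    clauses = []
    for size in range(1, k + 1):
        for combo in combinations(literals, size):
            if len({i for i, _ in combo}) == size:   # skip clauses with x and not-x
                clauses.append(combo)
    return clauses                                    # O(n^k) clauses

def expand(example, clauses):
    """Truth value of every clause on one example (the O(n^k) new features)."""
    return [any(example[i] == v for i, v in clause) for clause in clauses]

def find_s_kcnf(positive_examples, n, k):
    """Find-S over the expanded features: the most specific conjunction keeps
    exactly the clauses satisfied by every positive example."""
    clauses = all_clauses(n, k)
    kept = set(range(len(clauses)))
    for ex in positive_examples:
        feats = expand(ex, clauses)
        kept = {j for j in kept if feats[j]}
    return [clauses[j] for j in sorted(kept)]

# Example: positives of the 2-CNF (x0 or x1) and (not x2) over 3 features.
# Find-S returns the most specific hypothesis, so clauses implied by (not x2)
# are retained alongside (x0 or x1) and (not x2).
positives = [(1, 0, 0), (0, 1, 0), (1, 1, 0)]
for clause in find_s_kcnf(positives, n=3, k=2):
    print(clause)
```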
Infinite Hypothesis Spaces

The preceding analysis was restricted to finite hypothesis spaces.
Some infinite hypothesis spaces (such as those including real-valued thresholds or parameters) are more expressive than others.
Compare a rule allowing one threshold on a continuous feature (length < 3cm) vs. one allowing two thresholds (1cm < length < 3cm).
Need some measure of the expressiveness of infinite hypothesis spaces.
The Vapnik-Chervonenkis (VC) dimension provides just such a measure, denoted VC(H).
Analogous to ln|H|, there are bounds for sample complexity using VC(H).

Shattering Instances

A hypothesis space is said to shatter a set of instances iff for every partition of the instances into positive and negative, there is a hypothesis that produces that partition.
For example, consider 2 instances, x and y, described using a single real-valued feature, being shattered by intervals.
[Figure: intervals realizing each of the four labelings: + {x, y}; + {x}, − {y}; + {y}, − {x}; − {x, y}]
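As a concrete check of the definition, here is a brute-force sketch (the names are mine) that asks whether single closed intervals on the real line shatter a given set of 1-D points. Two points can be shattered, three cannot, which is why the VC dimension of intervals is 2.

```python
from itertools import combinations

def intervals_shatter(points):
    """True iff every subset of the points equals the set of points falling
    inside some closed interval [a, b].  The empty subset is always realizable
    by an interval placed away from all points, so only nonempty subsets are
    checked; using the tightest interval around a subset is sufficient."""
    pts = sorted(points)
    for r in range(1, len(pts) + 1):
        for subset in combinations(pts, r):
            a, b = min(subset), max(subset)          # tightest candidate interval
            if {p for p in pts if a <= p <= b} != set(subset):
                return False
    return True

print(intervals_shatter([1.0, 2.0]))        # True  -- two points are shattered
print(intervals_shatter([1.0, 2.0, 3.0]))   # False -- {1.0, 3.0} cannot be isolated
```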
Upper Bound on Sample Complexity with VC

Using VC dimension as a measure of expressiveness, the following number of examples has been shown to be sufficient for PAC learning (Blumer et al., 1989):

    (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))

Compared to the previous result using ln|H|, this bound has some extra constants and an extra log2(1/ε) factor. Since VC(H) ≤ log2|H|, this can provide a tighter upper bound on the number of examples needed for PAC learning.

Conjunctive Learning with Continuous Features

Consider learning axis-parallel hyper-rectangles, conjunctions of intervals on n continuous features, e.g.:

    1.2 ≤ length ≤ 10.5 ∧ 2.4 ≤ weight ≤ 5.7

Since VC(H) = 2n, the sample complexity is

    (1/ε)(4 log2(2/δ) + 16n log2(13/ε))

Since the most-specific conjunctive algorithm can easily find the tightest interval along each dimension that covers all of the positive instances (f_min ≤ f ≤ f_max) and runs in linear time, O(|D|n), axis-parallel hyper-rectangles are PAC learnable.
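A brief sketch of both pieces under the setup above (function names are mine): compute the Blumer et al. bound for VC(H) = 2n, and learn the most-specific axis-parallel hyper-rectangle by taking the tightest interval per feature over the positive examples, the O(|D|n) procedure just described.

```python
import math

def vc_sample_size(vc_dim, eps, delta):
    """Blumer et al. (1989) sufficient sample size:
    (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps)

def most_specific_rectangle(positives):
    """Tightest axis-parallel hyper-rectangle containing all positive examples:
    one [f_min, f_max] interval per feature, found in O(|D| n) time."""
    n = len(positives[0])
    lows = [min(x[i] for x in positives) for i in range(n)]
    highs = [max(x[i] for x in positives) for i in range(n)]
    return lows, highs

def inside(rect, x):
    lows, highs = rect
    return all(lo <= xi <= hi for lo, xi, hi in zip(lows, x, highs))

# Two continuous features (length, weight), so VC(H) = 2n = 4.
print(vc_sample_size(vc_dim=4, eps=0.05, delta=0.05))   # sufficient number of examples
rect = most_specific_rectangle([(1.2, 2.4), (10.5, 5.7), (4.0, 3.0)])
print(rect)                          # ([1.2, 2.4], [10.5, 5.7])
print(inside(rect, (5.0, 4.0)))      # True
```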
Interpretation of Occam's Razor Result

Since the encoding is unconstrained, it fails to provide any meaningful definition of simplicity.
The hypothesis space could be any sufficiently small space, such as the 2^n most complex boolean functions, where the complexity of a function is the size of its smallest DNF representation.
Assumes that the correct concept (or a close approximation) is actually in the hypothesis space, so assumes a priori that the concept is simple.
Does not provide a theoretical justification of Occam's Razor as it is normally interpreted.

COLT Conclusions

The PAC framework provides a theoretical framework for analyzing the effectiveness of learning algorithms.
The sample complexity for any consistent learner using some hypothesis space, H, can be determined from a measure of its expressiveness, |H| or VC(H), quantifying bias and relating it to generalization.
If sample complexity is tractable, then the computational complexity of finding a consistent hypothesis in H governs its PAC learnability.
Constant factors are more important in sample complexity than in computational complexity, since our ability to gather data is generally not growing exponentially.
Experimental results suggest that theoretical sample complexity bounds over-estimate the number of training instances needed in practice, since they are worst-case upper bounds.