Bayesian Learning
Berrin Yanikoglu
Oct 2010
Basic Probability
Probability Theory
Marginal probability of X: P(X)
Conditional probability of Y given X: P(Y | X) = P(X, Y) / P(X)
Sum Rule: P(X) = Σ_Y P(X, Y)
Product Rule: P(X, Y) = P(Y | X) P(X)
Bayes Theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
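For reference, here is how the theorem follows from the product rule and the symmetry P(X, Y) = P(Y, X) (a standard one-line derivation, not on the original slide):

```latex
% Bayes' theorem from the product rule
P(X, Y) = P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y)
\;\Rightarrow\;
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},
\qquad
P(X) = \sum_{Y} P(X \mid Y)\, P(Y).
```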
Bayesian Decision
Consider the task of classifying a certain fruit as orange (C1) or tangerine (C2) based on its measurements x. In this case we are interested in P(Ci | x): how likely is it to be an orange or a tangerine, given its features?
If you have not seen x but still have to decide on its class, Bayesian decision theory says that we should decide using the prior probabilities of the classes:
Choose C1 if P(C1) > P(C2) (the prior probabilities); choose C2 otherwise.
Bayesian Decision
2) What if you also have one measured feature x for your instance, e.g. P(C2 | x = 70)?
[Figure: values of the measured feature x, plotted on an axis from 10 to 90]
Definition of probabilities
27 samples in C2, 19 samples in C1; 46 samples in total.
From these counts, the priors are estimated as P(C1) = 19/46 ≈ 0.41 and P(C2) = 27/46 ≈ 0.59.
Bayesian Decision
Histogram representation better highlights the decision problem.
You would minimize the number of misclassifications if you choose the class that has the maximum posterior probability: choose C1 if P(C1 | x) > P(C2 | x), and C2 otherwise.
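Below is a minimal Python sketch of this decision rule. The per-bin feature counts are made up for illustration (only the class totals of 19 and 27 samples match the slide); the posterior is computed with Bayes' theorem exactly as described above.

```python
import numpy as np

# Hypothetical per-bin counts of the measured feature x for each class.
# These numbers are illustrative only; a real histogram comes from training data.
bins = [10, 30, 50, 70, 90]                    # bin centers for x
counts_c1 = np.array([1, 4, 8, 5, 1])          # 19 samples in C1
counts_c2 = np.array([6, 9, 7, 4, 1])          # 27 samples in C2

priors = {"C1": 19 / 46, "C2": 27 / 46}        # priors from the sample counts

# Class-conditional likelihoods P(x = bin | Ck) estimated from the histograms.
p_x_given_c1 = counts_c1 / counts_c1.sum()
p_x_given_c2 = counts_c2 / counts_c2.sum()

def posteriors(bin_index):
    """Return (P(C1|x), P(C2|x)) for the given bin via Bayes' theorem."""
    joint_c1 = p_x_given_c1[bin_index] * priors["C1"]
    joint_c2 = p_x_given_c2[bin_index] * priors["C2"]
    evidence = joint_c1 + joint_c2             # P(x), by the sum rule
    return joint_c1 / evidence, joint_c2 / evidence

# Decide the class for x = 70 (the bin centered at 70).
p_c1, p_c2 = posteriors(bins.index(70))
print("P(C1|x=70)=%.3f  P(C2|x=70)=%.3f  -> choose %s"
      % (p_c1, p_c2, "C1" if p_c1 > p_c2 else "C2"))
```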
Example to Work on
Probability Densities
Cumulative Probability
Probability Densities
P(x ∈ [a, b]) = 1 if the interval [a, b] corresponds to the whole of X-space.
Note that to be proper, we use upper-case letters for probabilities and
lower-case letters for probability densities.
For continuous variables, the class-conditional probabilities introduced
above become class-conditional probability density functions, which we
write in the form p(x|Ck).
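For reference, the standard relations behind these statements (spelled out here, not taken verbatim from the slide):

```latex
% Interval probability, cumulative distribution, and normalization for a density p(x)
P(x \in [a, b]) = \int_a^b p(x)\, dx,
\qquad
P(x \le z) = \int_{-\infty}^{z} p(x)\, dx,
\qquad
\int_{-\infty}^{\infty} p(x)\, dx = 1 .
```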
Multiple attributes
If there are d variables/attributes x1, ..., xd, we may group them into a vector x = [x1, ..., xd]^T corresponding to a point in a d-dimensional space.
The distribution of values of x can be described by a probability density function p(x), such that the probability of x lying in a region R of the d-dimensional space is given by the integral shown below.
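The referenced formula, in its standard form:

```latex
% Probability assigned by the density p(x) to a region R of the d-dimensional space
P(\mathbf{x} \in R) = \int_R p(\mathbf{x})\, d\mathbf{x},
\qquad
\int p(\mathbf{x})\, d\mathbf{x} = 1 .
```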
Decision Regions
Assign a feature vector x to Ck if Ck = argmax_j P(Cj | x).
Equivalently, assign x to Ck if p(x | Ck) P(Ck) > p(x | Cj) P(Cj) for all j ≠ k.
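The equivalence follows from Bayes' theorem, because the evidence p(x) is the same for every class:

```latex
P(C_k \mid x) > P(C_j \mid x)
\;\Longleftrightarrow\;
\frac{p(x \mid C_k)\, P(C_k)}{p(x)} > \frac{p(x \mid C_j)\, P(C_j)}{p(x)}
\;\Longleftrightarrow\;
p(x \mid C_k)\, P(C_k) > p(x \mid C_j)\, P(C_j).
```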
Discriminant Functions
Although we have focused on probability distribution functions, the
decision on class membership in our classifiers has been based
solely on the relative sizes of the probabilities.
This observation allows us to reformulate the classification process in terms of a set of discriminant functions y1(x), ..., yc(x), such that an input vector x is assigned to class Ck if yk(x) > yj(x) for all j ≠ k.
We can use any monotonic function of yk(x) that simplifies calculations, since a monotonic transformation does not change the ordering of the yk.
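A standard choice of such a transformation, added here for concreteness (the logarithm is monotonic, so the argmax is unchanged):

```latex
y_k(x) = p(x \mid C_k)\, P(C_k)
\;\longrightarrow\;
y_k(x) = \ln p(x \mid C_k) + \ln P(C_k).
```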
Classification Paradigms
In fact, we can categorize three fundamental approaches to classification:
Generative models: model p(x|Ck) and P(Ck) separately and use Bayes' theorem to find the posterior probabilities P(Ck|x).
E.g. Naive Bayes, Gaussian mixture models, hidden Markov models.
Discriminative models: determine P(Ck|x) directly and use it in the decision.
E.g. Linear discriminant analysis, SVMs, NNs.
Discriminant functions: find a discriminant function f that maps x onto a class label directly, without calculating probabilities.
Advantages? Disadvantages?
Given an instance with attribute values a1, a2, ..., an, we need P(a1, a2, ..., an | vj) for each class vj.
Naive Bayesian approach: we assume that the attribute values are conditionally independent given the class vj, so that
P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)
Naive Bayes Classifier:
vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
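Below is a minimal Python sketch of such a classifier for discrete attributes, under the stated conditional-independence assumption. The attribute names and toy data are invented for illustration, and add-one smoothing is used so unseen attribute values do not zero out the product.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label).
    Returns class counts and per-class, per-attribute value counts."""
    class_counts = Counter(label for _, label in examples)
    # value_counts[label][i][v] = #examples of class `label` with attribute i == v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            value_counts[label][i][value] += 1
    return class_counts, value_counts

def classify(attrs, class_counts, value_counts):
    """Return argmax_vj  P(vj) * prod_i P(ai | vj), with add-one smoothing."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / total                      # prior P(vj)
        for i, value in enumerate(attrs):
            counts_i = value_counts[label][i]
            n_values = len(counts_i) + 1             # crude domain-size guess for smoothing
            score *= (counts_i[value] + 1) / (n_label + n_values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: attributes (outlook, windy) -> class "play?"
data = [(("sunny", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "yes"), "yes"), (("rain", "no"), "no")]
cc, vc = train_naive_bayes(data)
print(classify(("sunny", "yes"), cc, vc))    # prints "yes" for this toy data
```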
Independence
If P(X, Y) = P(X) P(Y), the random variables X and Y are said to be independent.
Since P(X, Y) = P(X | Y) P(Y) by the product rule, an equivalent definition is P(X | Y) = P(X).
Independence and conditional independence are important because
they significantly reduce the number of parameters needed and
reduce computation time.
Consider estimating the joint probability distribution of two random variables A and B: if each takes one of k values, the full joint P(A, B) requires k^2 - 1 parameters, whereas under independence the two marginals need only 2(k - 1).
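A quick parameter count for two discrete variables, each taking k values (standard counting, added for illustration):

```latex
% Parameters needed for the full joint vs. the independent factorization
\underbrace{k^2 - 1}_{\text{full joint } P(A,B)}
\quad\text{vs.}\quad
\underbrace{2(k - 1)}_{\text{independent } P(A)P(B)}
\qquad\text{e.g. } k = 10:\; 99 \text{ vs. } 18 .
```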
Conditional Independence
We say that X is conditionally independent of Y given Z if the
probability distribution governing X is independent of the value of Y
given a value for Z.
∀ (xi, yj, zk): P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
Or simply: P(X|Y,Z)=P(X|Z)
Using Bayes' theorem, we can also show that P(X, Y | Z) = P(X | Z) P(Y | Z), since:
P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)
For all i ≠ j: P(Fi | C, Fj) = P(Fi | C)
(i.e. each feature Fi is conditionally independent of any other feature Fj, given the class C)
Illustrative Example
Document Classification
Given a document, find its class (e.g. headlines, sports,
economics, fashion)
We assume the document is a bag-of-words.
P(d | c) = P(t1, t2, ..., tnd | c) = ∏_{1 ≤ k ≤ nd} P(tk | c)

cMAP = argmax_{c ∈ C} P(c) ∏_{1 ≤ k ≤ nd} P(tk | c)
Smoothing
For each term, t, we need to estimate P(t|c)
Maximum-likelihood estimate from the training data:
P(t | c) = Tct / Σ_{t' ∈ V} Tct'
where Tct is the number of occurrences of term t in training documents of class c, and V is the vocabulary.
With add-one (Laplace) smoothing, to avoid zero estimates for unseen terms:
P(t | c) = (Tct + 1) / Σ_{t' ∈ V} (Tct' + 1) = (Tct + 1) / (Σ_{t' ∈ V} Tct' + |V|)
Example: two topic classes, c = "China" and ¬c = "not China".

Training set (label: c = China?):
docID 1: ...                           Yes
docID 2: ...                           Yes
docID 3: Chinese Macao                 Yes
docID 4: Tokyo Japan Chinese           No

Test set:
docID 5: Chinese Chinese Chinese Tokyo Japan

Priors estimated from the document counts: P(c) = 3/4, P(¬c) = 1/4
Classifying the test document d5 = "Chinese Chinese Chinese Tokyo Japan":

Probability estimation (add-one smoothing; class c contains 8 term tokens, 5 of them "Chinese", and the vocabulary has 6 terms):
P(Chinese | c) = (5 + 1) / (8 + 6) = 3/7
P(Tokyo | c) = P(Japan | c) = (0 + 1) / (8 + 6) = 1/14
P(Chinese | ¬c) = (1 + 1) / (3 + 6) = 2/9
P(Tokyo | ¬c) = P(Japan | ¬c) = (1 + 1) / (3 + 6) = 2/9

Classification, using P(c | d) ∝ P(c) ∏_{1 ≤ k ≤ nd} P(tk | c):
P(c | d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(¬c | d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
⇒ d5 is assigned to class c (China).
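Below is a minimal Python sketch of this multinomial Naive Bayes computation with add-one smoothing. The first two training documents are placeholders (tokens "w1" and "w2" are invented) chosen only so that the class-c counts match the figures above: 8 tokens in class c, 5 of them "Chinese", vocabulary size 6.

```python
from collections import Counter

# Training data for the worked example; "w1" and "w2" are placeholder tokens.
train = [
    (["Chinese", "Chinese", "w1"], "c"),          # placeholder document
    (["Chinese", "Chinese", "w2"], "c"),          # placeholder document
    (["Chinese", "Macao"], "c"),
    (["Tokyo", "Japan", "Chinese"], "not_c"),
]
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

vocab = {t for doc, _ in train for t in doc}       # |V| = 6
doc_counts = Counter(label for _, label in train)  # 3 documents in c, 1 in not_c
token_counts = {label: Counter() for label in doc_counts}
for doc, label in train:
    token_counts[label].update(doc)

def p_term(t, label):
    """Add-one smoothed estimate P(t|c) = (T_ct + 1) / (sum_t' T_ct' + |V|)."""
    counts = token_counts[label]
    return (counts[t] + 1) / (sum(counts.values()) + len(vocab))

def score(doc, label):
    """Unnormalized posterior P(c) * prod_k P(t_k | c)."""
    s = doc_counts[label] / sum(doc_counts.values())
    for t in doc:
        s *= p_term(t, label)
    return s

print(p_term("Chinese", "c"))                 # 6/14 = 3/7
print(score(test_doc, "c"))                   # ~0.0003
print(score(test_doc, "not_c"))               # ~0.0001
print(max(doc_counts, key=lambda lab: score(test_doc, lab)))   # "c"
```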
Summary: Miscellaneous
Naive Bayes is linear in the time it takes to scan the data.
When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we work with sums of logarithms:
cMAP = argmax_{c ∈ C} [ log P(c) + Σ_{1 ≤ k ≤ nd} log P(tk | c) ]
Mitchell Chp.6
Maximum Likelihood (ML) & Maximum A Posteriori (MAP) Hypotheses
Bayes Theorem: P(h | D) = P(D | h) P(h) / P(D), for a hypothesis h and observed training data D.
Choosing Hypotheses
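For reference, the standard definitions from Mitchell Chp. 6 that this slide refers to:

```latex
% MAP hypothesis: most probable hypothesis given the data D
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)

% ML hypothesis: assumes all hypotheses are a priori equally probable
h_{ML} = \arg\max_{h \in H} P(D \mid h)
```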
No other classifier using the same hypothesis space and the same prior knowledge can outperform this method on average.
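If, as in Mitchell Chp. 6, "this method" refers to the Bayes optimal classifier, its decision rule is:

```latex
% Bayes optimal classification: weight each hypothesis by its posterior
\arg\max_{v_j \in V} \; \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)
```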