Bayesian Learning
Berrin Yanikoglu
Oct 2010
Basic Probability
Probability Theory
Marginal probability of X: P(X)
Conditional probability of Y given X: P(Y | X) = P(X, Y) / P(X)
Sum Rule: P(X) = Σ_Y P(X, Y)
Product Rule: P(X, Y) = P(Y | X) P(X)
Bayes Theorem: P(Y | X) = P(X | Y) P(Y) / P(X)
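For reference, here is how the theorem follows from the product rule and the symmetry P(X, Y) = P(Y, X) (a standard one-line derivation, not on the original slide):

```latex
% Bayes' theorem from the product rule
P(X, Y) = P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y)
\;\Rightarrow\;
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)},
\qquad
P(X) = \sum_{Y} P(X \mid Y)\, P(Y).
```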
Bayesian Decision
Consider the task of classifying a certain fruit as orange (C1) or tangerine (C2) based on its measurements x. In this case we are interested in P(Ci | x): how likely is it to be an orange or a tangerine, given its features?
If you have not seen x but still have to decide on its class, Bayesian decision theory says that we should decide using the prior probabilities of the classes:
Choose C1 if P(C1) > P(C2) (the prior probabilities); choose C2 otherwise.
Bayesian Decision
2) What if you also have one measured feature x for your instance, e.g. P(C2 | x = 70)?
[Figure: values of the measured feature x, plotted on an axis from 10 to 90]
Definition of probabilities
27 samples in C2, 19 samples in C1; 46 samples in total.
From these counts, the priors are estimated as P(C1) = 19/46 ≈ 0.41 and P(C2) = 27/46 ≈ 0.59.
Bayesian Decision
Histogram representation better highlights the decision problem.
You would minimize the number of misclassifications if you choose the class that has the maximum posterior probability: choose C1 if P(C1 | x) > P(C2 | x), and C2 otherwise.
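Below is a minimal Python sketch of this decision rule. The per-bin feature counts are made up for illustration (only the class totals of 19 and 27 samples match the slide); the posterior is computed with Bayes' theorem exactly as described above.

```python
import numpy as np

# Hypothetical per-bin counts of the measured feature x for each class.
# These numbers are illustrative only; a real histogram comes from training data.
bins = [10, 30, 50, 70, 90]                    # bin centers for x
counts_c1 = np.array([1, 4, 8, 5, 1])          # 19 samples in C1
counts_c2 = np.array([6, 9, 7, 4, 1])          # 27 samples in C2

priors = {"C1": 19 / 46, "C2": 27 / 46}        # priors from the sample counts

# Class-conditional likelihoods P(x = bin | Ck) estimated from the histograms.
p_x_given_c1 = counts_c1 / counts_c1.sum()
p_x_given_c2 = counts_c2 / counts_c2.sum()

def posteriors(bin_index):
    """Return (P(C1|x), P(C2|x)) for the given bin via Bayes' theorem."""
    joint_c1 = p_x_given_c1[bin_index] * priors["C1"]
    joint_c2 = p_x_given_c2[bin_index] * priors["C2"]
    evidence = joint_c1 + joint_c2             # P(x), by the sum rule
    return joint_c1 / evidence, joint_c2 / evidence

# Decide the class for x = 70 (the bin centered at 70).
p_c1, p_c2 = posteriors(bins.index(70))
print("P(C1|x=70)=%.3f  P(C2|x=70)=%.3f  -> choose %s"
      % (p_c1, p_c2, "C1" if p_c1 > p_c2 else "C2"))
```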
Example to Work on
Probability Densities
Cumulative Probability
Probability Densities
P(x ∈ [a, b]) = 1 if the interval [a, b] corresponds to the whole of X-space.
Note that to be proper, we use upper-case letters for probabilities and
lower-case letters for probability densities.
For continuous variables, the class-conditional probabilities introduced
above become class-conditional probability density functions, which we
write in the form p(x|Ck).
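For reference, the standard relations behind these statements (spelled out here, not taken verbatim from the slide):

```latex
% Interval probability, cumulative distribution, and normalization for a density p(x)
P(x \in [a, b]) = \int_a^b p(x)\, dx,
\qquad
P(x \le z) = \int_{-\infty}^{z} p(x)\, dx,
\qquad
\int_{-\infty}^{\infty} p(x)\, dx = 1 .
```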
Multiple attributes
If there are d variables/attributes x1, ..., xd, we may group them into a vector x = [x1, ..., xd]^T corresponding to a point in a d-dimensional space.
The distribution of values of x can be described by a probability density function p(x), such that the probability of x lying in a region R of the d-dimensional space is given by the integral shown below.
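The referenced formula, in its standard form:

```latex
% Probability assigned by the density p(x) to a region R of the d-dimensional space
P(\mathbf{x} \in R) = \int_R p(\mathbf{x})\, d\mathbf{x},
\qquad
\int p(\mathbf{x})\, d\mathbf{x} = 1 .
```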
Decision Regions
Assign a feature vector x to Ck if Ck = argmax_j P(Cj | x).
Equivalently, assign x to Ck if p(x | Ck) P(Ck) > p(x | Cj) P(Cj) for all j ≠ k.
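The equivalence follows from Bayes' theorem, because the evidence p(x) is the same for every class:

```latex
P(C_k \mid x) > P(C_j \mid x)
\;\Longleftrightarrow\;
\frac{p(x \mid C_k)\, P(C_k)}{p(x)} > \frac{p(x \mid C_j)\, P(C_j)}{p(x)}
\;\Longleftrightarrow\;
p(x \mid C_k)\, P(C_k) > p(x \mid C_j)\, P(C_j).
```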
Discriminant Functions
Although we have focused on probability distribution functions, the
decision on class membership in our classifiers has been based
solely on the relative sizes of the probabilities.
This observation allows us to reformulate the classification process in terms of a set of discriminant functions y1(x), ..., yc(x), such that an input vector x is assigned to class Ck if yk(x) > yj(x) for all j ≠ k.
We can use any monotonic function of yk(x) that simplifies calculations, since a monotonic transformation does not change the ordering of the yk.
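A standard choice of such a transformation, added here for concreteness (the logarithm is monotonic, so the argmax is unchanged):

```latex
y_k(x) = p(x \mid C_k)\, P(C_k)
\;\longrightarrow\;
y_k(x) = \ln p(x \mid C_k) + \ln P(C_k).
```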
Classification Paradigms
In fact, we can categorize three fundamental approaches to classification:
Generative models: model p(x|Ck) and P(Ck) separately and use Bayes' theorem to find the posterior probabilities P(Ck|x).
E.g. Naive Bayes, Gaussian mixture models, hidden Markov models.
Discriminative models: determine P(Ck|x) directly and use it in the decision.
E.g. Linear discriminant analysis, SVMs, NNs.
Discriminant functions: find a discriminant function f that maps x onto a class label directly, without calculating probabilities.
Advantages? Disadvantages?
Given an instance with attribute values a1, a2, ..., an, we need P(a1, a2, ..., an | vj) for each class vj.
Naive Bayesian approach: we assume that the attribute values are conditionally independent given the class vj, so that
P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)
Naive Bayes Classifier:
vNB = argmax_{vj ∈ V} P(vj) ∏_i P(ai | vj)
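Below is a minimal Python sketch of such a classifier for discrete attributes, under the stated conditional-independence assumption. The attribute names and toy data are invented for illustration, and add-one smoothing is used so unseen attribute values do not zero out the product.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label).
    Returns class counts and per-class, per-attribute value counts."""
    class_counts = Counter(label for _, label in examples)
    # value_counts[label][i][v] = #examples of class `label` with attribute i == v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, label in examples:
        for i, value in enumerate(attrs):
            value_counts[label][i][value] += 1
    return class_counts, value_counts

def classify(attrs, class_counts, value_counts):
    """Return argmax_vj  P(vj) * prod_i P(ai | vj), with add-one smoothing."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / total                      # prior P(vj)
        for i, value in enumerate(attrs):
            counts_i = value_counts[label][i]
            n_values = len(counts_i) + 1             # crude domain-size guess for smoothing
            score *= (counts_i[value] + 1) / (n_label + n_values)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: attributes (outlook, windy) -> class "play?"
data = [(("sunny", "no"), "yes"), (("rain", "yes"), "no"),
        (("sunny", "yes"), "yes"), (("rain", "no"), "no")]
cc, vc = train_naive_bayes(data)
print(classify(("sunny", "yes"), cc, vc))    # prints "yes" for this toy data
```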
Independence
If P(X, Y) = P(X) P(Y), the random variables X and Y are said to be independent.
Since P(X, Y) = P(X | Y) P(Y) by the product rule, an equivalent definition is P(X | Y) = P(X).
Independence and conditional independence are important because
they significantly reduce the number of parameters needed and
reduce computation time.
Consider estimating the joint probability distribution of two random variables A and B: if each takes one of k values, the full joint P(A, B) requires k^2 - 1 parameters, whereas under independence the two marginals need only 2(k - 1).
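A quick parameter count for two discrete variables, each taking k values (standard counting, added for illustration):

```latex
% Parameters needed for the full joint vs. the independent factorization
\underbrace{k^2 - 1}_{\text{full joint } P(A,B)}
\quad\text{vs.}\quad
\underbrace{2(k - 1)}_{\text{independent } P(A)P(B)}
\qquad\text{e.g. } k = 10:\; 99 \text{ vs. } 18 .
```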
Conditional Independence
We say that X is conditionally independent of Y given Z if the
probability distribution governing X is independent of the value of Y
given a value for Z.
∀ (xi, yj, zk): P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)
Or simply: P(X|Y,Z)=P(X|Z)
Using Bayes' theorem, we can also show that P(X, Y | Z) = P(X | Z) P(Y | Z), since:
P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)
For all i ≠ j: P(Fi | C, Fj) = P(Fi | C)
(i.e. each feature Fi is conditionally independent of any other feature Fj, given the class C)
Illustrative Example
Document Classification
Given a document, find its class (e.g. headlines, sports,
economics, fashion)
We assume the document is a bag-of-words.
P(d | c) = P(t1, t2, ..., tnd | c) = ∏_{1 ≤ k ≤ nd} P(tk | c)

cMAP = argmax_{c ∈ C} P(c) ∏_{1 ≤ k ≤ nd} P(tk | c)
Smoothing
For each term, t, we need to estimate P(t|c)
Maximum-likelihood estimate from the training data:
P(t | c) = Tct / Σ_{t' ∈ V} Tct'
where Tct is the number of occurrences of term t in training documents of class c, and V is the vocabulary.
With add-one (Laplace) smoothing, to avoid zero estimates for unseen terms:
P(t | c) = (Tct + 1) / Σ_{t' ∈ V} (Tct' + 1) = (Tct + 1) / (Σ_{t' ∈ V} Tct' + |V|)
Example: two topic classes, c = "China" and ¬c = "not China".

Training set (label: c = China?):
docID 1: ...                           Yes
docID 2: ...                           Yes
docID 3: Chinese Macao                 Yes
docID 4: Tokyo Japan Chinese           No

Test set:
docID 5: Chinese Chinese Chinese Tokyo Japan

Priors estimated from the document counts: P(c) = 3/4, P(¬c) = 1/4
Classifying the test document d5 = "Chinese Chinese Chinese Tokyo Japan":

Probability estimation (add-one smoothing; class c contains 8 term tokens, 5 of them "Chinese", and the vocabulary has 6 terms):
P(Chinese | c) = (5 + 1) / (8 + 6) = 3/7
P(Tokyo | c) = P(Japan | c) = (0 + 1) / (8 + 6) = 1/14
P(Chinese | ¬c) = (1 + 1) / (3 + 6) = 2/9
P(Tokyo | ¬c) = P(Japan | ¬c) = (1 + 1) / (3 + 6) = 2/9

Classification, using P(c | d) ∝ P(c) ∏_{1 ≤ k ≤ nd} P(tk | c):
P(c | d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(¬c | d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
⇒ d5 is assigned to class c (China).
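Below is a minimal Python sketch of this multinomial Naive Bayes computation with add-one smoothing. The first two training documents are placeholders (tokens "w1" and "w2" are invented) chosen only so that the class-c counts match the figures above: 8 tokens in class c, 5 of them "Chinese", vocabulary size 6.

```python
from collections import Counter

# Training data for the worked example; "w1" and "w2" are placeholder tokens.
train = [
    (["Chinese", "Chinese", "w1"], "c"),          # placeholder document
    (["Chinese", "Chinese", "w2"], "c"),          # placeholder document
    (["Chinese", "Macao"], "c"),
    (["Tokyo", "Japan", "Chinese"], "not_c"),
]
test_doc = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

vocab = {t for doc, _ in train for t in doc}       # |V| = 6
doc_counts = Counter(label for _, label in train)  # 3 documents in c, 1 in not_c
token_counts = {label: Counter() for label in doc_counts}
for doc, label in train:
    token_counts[label].update(doc)

def p_term(t, label):
    """Add-one smoothed estimate P(t|c) = (T_ct + 1) / (sum_t' T_ct' + |V|)."""
    counts = token_counts[label]
    return (counts[t] + 1) / (sum(counts.values()) + len(vocab))

def score(doc, label):
    """Unnormalized posterior P(c) * prod_k P(t_k | c)."""
    s = doc_counts[label] / sum(doc_counts.values())
    for t in doc:
        s *= p_term(t, label)
    return s

print(p_term("Chinese", "c"))                 # 6/14 = 3/7
print(score(test_doc, "c"))                   # ~0.0003
print(score(test_doc, "not_c"))               # ~0.0001
print(max(doc_counts, key=lambda lab: score(test_doc, lab)))   # "c"
```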
Summary: Miscellaneous
Naive Bayes is linear in the time it takes to scan the data.
When we have many terms, the product of probabilities will cause a floating-point underflow; therefore we work with sums of logarithms:
cMAP = argmax_{c ∈ C} [ log P(c) + Σ_{1 ≤ k ≤ nd} log P(tk | c) ]
Mitchell Chp.6
Maximum Likelihood (ML) & Maximum A Posteriori (MAP) Hypotheses
Bayes Theorem: P(h | D) = P(D | h) P(h) / P(D), for a hypothesis h and observed training data D.
Choosing Hypotheses
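For reference, the standard definitions from Mitchell Chp. 6 that this slide refers to:

```latex
% MAP hypothesis: most probable hypothesis given the data D
h_{MAP} = \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\, P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\, P(h)

% ML hypothesis: assumes all hypotheses are a priori equally probable
h_{ML} = \arg\max_{h \in H} P(D \mid h)
```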
No other classifier using the same hypothesis space and the same prior knowledge can outperform this method on average.
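If, as in Mitchell Chp. 6, "this method" refers to the Bayes optimal classifier, its decision rule is:

```latex
% Bayes optimal classification: weight each hypothesis by its posterior
\arg\max_{v_j \in V} \; \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)
```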