
Data Mining

Classification: Alternative Techniques

Bayesian Classifiers

Introduction to Data Mining, 2nd Edition


by
Tan, Steinbach, Karpatne, Kumar

Bayes Classifier

• A probabilistic framework for solving classification problems

• Conditional probability:

  $$P(Y \mid X) = \frac{P(X, Y)}{P(X)}, \qquad P(X \mid Y) = \frac{P(X, Y)}{P(Y)}$$

• Bayes theorem:

  $$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
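As a quick numeric sanity check of the theorem, here is a minimal Python sketch; the probability values are invented purely for illustration, not taken from the slides:

```python
# Bayes theorem: P(Y|X) = P(X|Y) * P(Y) / P(X).
# Illustrative values only.
p_x_given_y = 0.8
p_y = 0.3
p_x = 0.4

p_y_given_x = p_x_given_y * p_y / p_x
print(p_y_given_x)  # 0.6
```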

Using Bayes Theorem for Classification

• Consider each attribute and the class label as random variables

• Given a record with attributes (X1, X2, …, Xd), the goal is to
  predict the class Y
  – Specifically, we want to find the value of Y that maximizes
    P(Y | X1, X2, …, Xd)

• Can we estimate P(Y | X1, X2, …, Xd) directly from data?

  Training data:

  Tid  Refund  Marital Status  Taxable Income  Evade
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes


Using Bayes Theorem for Classification

• Approach:
  – Compute the posterior probability P(Y | X1, X2, …, Xd) using
    Bayes theorem:

    $$P(Y \mid X_1, X_2, \ldots, X_d) = \frac{P(X_1, X_2, \ldots, X_d \mid Y)\,P(Y)}{P(X_1, X_2, \ldots, X_d)}$$

  – Maximum a posteriori (MAP): choose the value of Y that
    maximizes P(Y | X1, X2, …, Xd)

  – Since the denominator P(X1, X2, …, Xd) is the same for every
    class, this is equivalent to choosing the value of Y that
    maximizes P(X1, X2, …, Xd | Y) P(Y), as made explicit below
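Spelled out, the cancellation of the evidence term is:

$$\arg\max_Y P(Y \mid X_1, \ldots, X_d) \;=\; \arg\max_Y \frac{P(X_1, \ldots, X_d \mid Y)\,P(Y)}{P(X_1, \ldots, X_d)} \;=\; \arg\max_Y P(X_1, \ldots, X_d \mid Y)\,P(Y)$$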

• How to estimate P(X1, X2, …, Xd | Y)?


Example Data
Given a test record:

  X = (Refund = No, Marital Status = Divorced, Income = 120K)

• We need to estimate P(Evade = Yes | X) and P(Evade = No | X)

• In the following we will abbreviate Evade = Yes as Yes, and
  Evade = No as No

(Training data as in the table above.)


Conditional Independence

• X and Y are conditionally independent given Z if
  P(X | Y, Z) = P(X | Z)

• Example: arm length and reading skills
  – A young child has shorter arms and more limited reading
    skills than an adult
  – If age is fixed, there is no apparent relationship between
    arm length and reading skills
  – Arm length and reading skills are conditionally independent
    given age


Naïve Bayes Classifier

• Assume independence among the attributes Xi when the class is
  given:
  – P(X1, X2, …, Xd | Yj) = P(X1 | Yj) P(X2 | Yj) … P(Xd | Yj)

  – Now we can estimate P(Xi | Yj) for all Xi and Yj
    combinations from the training data

  – A new point is classified as Yj if P(Yj) ∏i P(Xi | Yj) is
    maximal, as in the sketch below
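A minimal Python sketch of this decision rule, assuming categorical attributes and probability tables estimated beforehand; the function and dictionary names are illustrative, not from the slides:

```python
# Naive Bayes decision rule: pick the class maximizing P(y) * prod_i P(x_i | y).
# priors: {class: P(class)}
# cond:   {(attribute, value, class): P(attribute = value | class)}

def naive_bayes_predict(record, priors, cond):
    best_class, best_score = None, -1.0
    for y, prior in priors.items():
        score = prior
        for attr, value in record.items():
            # Unseen (attribute, value, class) combinations get probability 0
            score *= cond.get((attr, value, y), 0.0)
        if score > best_score:
            best_class, best_score = y, score
    return best_class
```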

Naïve Bayes on Example Data
Given a test record:

  X = (Refund = No, Marital Status = Divorced, Income = 120K)

Under the independence assumption:

  P(X | Yes) = P(Refund = No | Yes)
             × P(Divorced | Yes)
             × P(Income = 120K | Yes)

  P(X | No)  = P(Refund = No | No)
             × P(Divorced | No)
             × P(Income = 120K | No)

(Training data as in the table above.)


Estimate Probabilities from Data


• P(y) = fraction of training instances of class y
  – e.g., P(No) = 7/10, P(Yes) = 3/10

• For categorical attributes:

  $$P(X_i = c \mid y) = \frac{n_c}{n}$$

  – where n_c is the number of instances having attribute value
    Xi = c and belonging to class y, and n is the number of
    instances of class y
  – Examples (from the table above):
    P(Status = Married | No) = 4/7
    P(Refund = Yes | Yes) = 0
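A small sketch of these counting estimates in Python, using the training table from the slides:

```python
# Estimate P(Marital Status = s | class) by counting, using the slide's table.
records = [  # (Refund, Marital Status, Taxable Income in K, Evade)
    ("Yes", "Single", 125, "No"),   ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),     ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),  ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),    ("No", "Single", 90, "Yes"),
]

def p_status_given_class(status, cls):
    in_class = [r for r in records if r[3] == cls]
    return sum(r[1] == status for r in in_class) / len(in_class)

print(p_status_given_class("Married", "No"))  # 4/7 ≈ 0.571
```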

Estimate Probabilities from Data

• For continuous attributes:

  – Discretization: partition the range into bins
    ◦ Replace the continuous value with the bin value
      (the attribute changes from continuous to ordinal)

  – Probability density estimation:
    ◦ Assume the attribute follows a normal distribution
    ◦ Use the data to estimate the parameters of the
      distribution (e.g., mean and standard deviation)
    ◦ Once the probability distribution is known, use it to
      estimate the conditional probability P(Xi | Y)


Estimate Probabilities from Data


• Normal distribution:

  $$P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}}\; e^{-\frac{(X_i - \mu_{ij})^2}{2\sigma_{ij}^2}}$$

  – One distribution for each (Xi, Yj) pair

• For (Income, Class = No), using the training table above:
  – sample mean = 110
  – sample variance = 2975

  $$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)}\; e^{-\frac{(120 - 110)^2}{2\,(2975)}} = 0.0072$$
Example of Naïve Bayes Classifier
Given a test record:

  X = (Refund = No, Marital Status = Divorced, Income = 120K)

Naïve Bayes probability estimates:

  P(Refund = Yes | No) = 3/7          P(Refund = Yes | Yes) = 0
  P(Refund = No | No)  = 4/7          P(Refund = No | Yes)  = 1

  P(Marital Status = Single | No)    = 2/7
  P(Marital Status = Divorced | No)  = 1/7
  P(Marital Status = Married | No)   = 4/7
  P(Marital Status = Single | Yes)   = 2/3
  P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes)  = 0

  For Taxable Income:
    If class = No:  sample mean = 110, sample variance = 2975
    If class = Yes: sample mean = 90,  sample variance = 25

• P(X | No) = P(Refund = No | No) × P(Divorced | No) × P(Income = 120K | No)
            = 4/7 × 1/7 × 0.0072 = 0.0006

• P(X | Yes) = P(Refund = No | Yes) × P(Divorced | Yes) × P(Income = 120K | Yes)
             = 1 × 1/3 × 1.2 × 10⁻⁹ = 4 × 10⁻¹⁰

Since P(X | No) P(No) > P(X | Yes) P(Yes),
P(No | X) > P(Yes | X)  =>  Class = No
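A self-contained sketch reproducing this computation end to end (all numbers from the slides):

```python
# Full naive Bayes computation for X = (Refund=No, Divorced, Income=120K).
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

p_no, p_yes = 7 / 10, 3 / 10  # class priors

px_no = (4 / 7) * (1 / 7) * gaussian_pdf(120, 110, 2975)  # ≈ 0.0006
px_yes = 1.0 * (1 / 3) * gaussian_pdf(120, 90, 25)        # ≈ 4e-10

print("No" if px_no * p_no > px_yes * p_yes else "Yes")   # -> No
```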


The Naïve Bayes classifier can make decisions with partial
information about the attributes in the test record, using the
probability estimates from the previous slide.

Even in the absence of information about any attribute, we can
use the prior probabilities of the class variable:
  P(Yes) = 3/10
  P(No)  = 7/10

If we only know that Marital Status = Divorced, then:
  P(Yes | Divorced) = 1/3 × 3/10 / P(Divorced)
  P(No | Divorced)  = 1/7 × 7/10 / P(Divorced)

If we also know that Refund = No, then:
  P(Yes | Refund = No, Divorced) =
      1 × 1/3 × 3/10 / P(Divorced, Refund = No)
  P(No | Refund = No, Divorced) =
      4/7 × 1/7 × 7/10 / P(Divorced, Refund = No)

If we also know that Taxable Income = 120, then:
  P(Yes | Refund = No, Divorced, Income = 120) =
      1.2 × 10⁻⁹ × 1 × 1/3 × 3/10 / P(Divorced, Refund = No, Income = 120)
  P(No | Refund = No, Divorced, Income = 120) =
      0.0072 × 4/7 × 1/7 × 7/10 / P(Divorced, Refund = No, Income = 120)
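A short sketch of the middle case: the shared denominator cancels when comparing classes, so unnormalized scores suffice (values from the slide):

```python
# Unnormalized class scores when only Refund = No and Marital Status =
# Divorced are known; P(Divorced, Refund = No) cancels in the comparison.
scores = {
    "Yes": 1.0 * (1 / 3) * (3 / 10),      # = 0.100
    "No":  (4 / 7) * (1 / 7) * (7 / 10),  # ≈ 0.057
}
print(max(scores, key=scores.get))  # -> Yes
```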
Issues with Naïve Bayes Classifier
Given a test record:

  X = (Married)

Using the estimates from the previous slides:
  P(Yes) = 3/10
  P(No)  = 7/10
  P(Marital Status = Married | Yes) = 0
  P(Marital Status = Married | No)  = 4/7

  P(Yes | Married) = 0 × 3/10 / P(Married)
  P(No | Married)  = 4/7 × 7/10 / P(Married)

The posterior for Yes is zero solely because Married never occurs
with class Yes in the training data.


Issues with Naïve Bayes Classifier


Consider the training data with the record Tid = 7 deleted:

  Tid  Refund  Marital Status  Taxable Income  Evade
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Naïve Bayes probability estimates:

  P(Refund = Yes | No) = 2/6          P(Refund = Yes | Yes) = 0
  P(Refund = No | No)  = 4/6          P(Refund = No | Yes)  = 1

  P(Marital Status = Single | No)    = 2/6
  P(Marital Status = Divorced | No)  = 0
  P(Marital Status = Married | No)   = 4/6
  P(Marital Status = Single | Yes)   = 2/3
  P(Marital Status = Divorced | Yes) = 1/3
  P(Marital Status = Married | Yes)  = 0/3

  For Taxable Income:
    If class = No:  sample mean = 91, sample variance = 685
    If class = Yes: sample mean = 90, sample variance = 25

Given X = (Refund = Yes, Divorced, 120K):

  P(X | No)  = 2/6 × 0 × 0.0083 = 0
  P(X | Yes) = 0 × 1/3 × 1.2 × 10⁻⁹ = 0

Naïve Bayes will not be able to classify X as Yes or No!

Issues with Naïve Bayes Classifier

• If one of the conditional probabilities is zero, then the
  entire product becomes zero

• Need to use estimates of conditional probabilities other than
  simple fractions

• Probability estimation:

  $$\text{original:}\qquad P(X_i = c \mid y) = \frac{n_c}{n}$$

  $$\text{Laplace estimate:}\qquad P(X_i = c \mid y) = \frac{n_c + 1}{n + v}$$

  $$\text{m-estimate:}\qquad P(X_i = c \mid y) = \frac{n_c + m\,p}{n + m}$$

  where:
    n:   number of training instances belonging to class y
    n_c: number of instances with Xi = c and Y = y
    v:   total number of attribute values that Xi can take
    p:   initial (prior) estimate of P(Xi = c | y)
    m:   hyper-parameter for our confidence in p
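A minimal sketch of the Laplace estimate, applied to the zero that broke the previous example (n and v read off the slides):

```python
# Laplace-smoothed estimate (n_c + 1) / (n + v): never zero for unseen values.
# With Tid = 7 deleted, class No has n = 6 instances and Marital Status
# takes v = 3 values, so P(Divorced | No) becomes 1/9 instead of 0.

def laplace_estimate(n_c, n, v):
    return (n_c + 1) / (n + v)

print(laplace_estimate(0, 6, 3))  # ≈ 0.111
```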


Example of Naïve Bayes Classifier

A: attributes    M: mammals    N: non-mammals

  Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
  human          yes         no       no             yes        mammals
  python         no          no       no             no         non-mammals
  salmon         no          no       yes            no         non-mammals
  whale          yes         no       yes            no         mammals
  frog           no          no       sometimes      yes        non-mammals
  komodo         no          no       no             yes        non-mammals
  bat            yes         yes      no             yes        mammals
  pigeon         no          yes      no             yes        non-mammals
  cat            yes         no       no             yes        mammals
  leopard shark  yes         no       yes            no         non-mammals
  turtle         no          no       sometimes      yes        non-mammals
  penguin        no          no       sometimes      yes        non-mammals
  porcupine      yes         no       no             yes        mammals
  eel            no          no       yes            no         non-mammals
  salamander     no          no       sometimes      yes        non-mammals
  gila monster   no          no       no             yes        non-mammals
  platypus       no          no       no             yes        mammals
  owl            no          yes      no             yes        non-mammals
  dolphin        yes         no       yes            no         mammals
  eagle          no          yes      no             yes        non-mammals

Test record:

  Give Birth  Can Fly  Live in Water  Have Legs  Class
  yes         no       yes            no         ?

$$P(A \mid M) = \frac{6}{7} \times \frac{6}{7} \times \frac{2}{7} \times \frac{2}{7} = 0.06$$

$$P(A \mid N) = \frac{1}{13} \times \frac{10}{13} \times \frac{3}{13} \times \frac{4}{13} = 0.0042$$

$$P(A \mid M)\,P(M) = 0.06 \times \frac{7}{20} = 0.021$$

$$P(A \mid N)\,P(N) = 0.004 \times \frac{13}{20} = 0.0027$$

Since P(A | M) P(M) > P(A | N) P(N), the test record is
classified as Mammals.
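A sketch checking the arithmetic with exact fractions:

```python
# Reproduce the mammals example with the slide's counts.
from fractions import Fraction as F

p_a_given_m = F(6, 7) * F(6, 7) * F(2, 7) * F(2, 7)        # ≈ 0.06
p_a_given_n = F(1, 13) * F(10, 13) * F(3, 13) * F(4, 13)   # ≈ 0.0042

score_m = p_a_given_m * F(7, 20)    # P(A|M) P(M) ≈ 0.021
score_n = p_a_given_n * F(13, 20)   # P(A|N) P(N) ≈ 0.0027
print("Mammals" if score_m > score_n else "Non-mammals")   # -> Mammals
```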

Naïve Bayes (Summary)

• Robust to isolated noise points

• Handles missing values by ignoring the instance during
  probability estimate calculations

• Robust to irrelevant attributes

• Redundant and correlated attributes violate the class
  conditional independence assumption
  – Use other techniques such as Bayesian Belief Networks (BBN)


Naïve Bayes

• How does Naïve Bayes perform on the following dataset?
  [Dataset figure omitted in this text version]

• Conditional independence of the attributes is violated

Bayesian Belief Networks

• Provides a graphical representation of probabilistic
  relationships among a set of random variables

• Consists of:
  – A directed acyclic graph (DAG), e.g., nodes A and B each
    with an arc into node C
    ◦ A node corresponds to a variable
    ◦ An arc corresponds to a dependence relationship between
      a pair of variables
  – A probability table associating each node with its
    immediate parents


Conditional Independence

(Example diagram: D is a parent of C, A is a child of C,
B is a descendant of D, and D is an ancestor of A.)

• A node in a Bayesian network is conditionally independent of
  all of its nondescendants if its parents are known
Conditional Independence

• Naïve Bayes assumption: each attribute Xi is conditionally
  independent of the other attributes given the class, i.e.,
  P(X1, X2, …, Xd | Y) = P(X1 | Y) P(X2 | Y) … P(Xd | Y),
  which corresponds to a network where the class node is the
  sole parent of every attribute node

Probability Tables

• If X does not have any parents, the table contains the prior
  probability P(X)

• If X has only one parent Y, the table contains the conditional
  probability P(X | Y)

• If X has multiple parents (Y1, Y2, …, Yk), the table contains
  the conditional probability P(X | Y1, Y2, …, Yk)

Example of Bayesian Belief Network

Network structure: Exercise and Diet are parents of Heart
Disease; Heart Disease is the parent of both Chest Pain and
Blood Pressure.

  P(Exercise = Yes) = 0.7          P(Diet = Healthy)   = 0.25
  P(Exercise = No)  = 0.3          P(Diet = Unhealthy) = 0.75

  P(Heart Disease | Exercise, Diet):

            E=Yes      E=Yes        E=No       E=No
            D=Healthy  D=Unhealthy  D=Healthy  D=Unhealthy
    HD=Yes  0.25       0.45         0.55       0.75
    HD=No   0.75       0.55         0.45       0.25

  P(Chest Pain | Heart Disease):   P(Blood Pressure | Heart Disease):

            HD=Yes  HD=No                    HD=Yes  HD=No
    CP=Yes  0.8     0.01           BP=High   0.85    0.2
    CP=No   0.2     0.99           BP=Low    0.15    0.8


Example of Inferencing using BBN

• Given: X = (E = No, D = Healthy, CP = Yes, BP = High)
  – Compute P(HD | E, D, CP, BP)

• For HD = Yes:
    P(HD = Yes | E = No, D = Healthy) = 0.55
    P(CP = Yes | HD = Yes) = 0.8
    P(BP = High | HD = Yes) = 0.85
  – P(HD = Yes | E = No, D = Healthy, CP = Yes, BP = High)
    ∝ 0.55 × 0.8 × 0.85 = 0.374

• For HD = No:
    P(HD = No | E = No, D = Healthy) = 0.45
    P(CP = Yes | HD = No) = 0.01
    P(BP = High | HD = No) = 0.2
  – P(HD = No | E = No, D = Healthy, CP = Yes, BP = High)
    ∝ 0.45 × 0.01 × 0.2 = 0.0009

=> Classify X as HD = Yes
