Naive Bayes Classification
Bina Nusantara
Topic
• Bayes Theorem
• Naïve Bayesian Classification
• Model Evaluation and Selection
Bayes Theorem
P(C | A) = \frac{P(A | C)\,P(C)}{P(A)}
Example of Bayes Theorem
• Given:
– A doctor knows that meningitis causes stiff neck 50% of the time
– Prior probability of any patient having meningitis is 1/50,000
– Prior probability of any patient having stiff neck is 1/20
P(M | S) = \frac{P(S | M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002
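A minimal Python sketch of the same calculation, using the numbers from the example above:

# Bayes theorem applied to the meningitis / stiff-neck example
p_s_given_m = 0.5        # P(S|M): stiff neck given meningitis
p_m = 1 / 50000          # P(M): prior probability of meningitis
p_s = 1 / 20             # P(S): prior probability of stiff neck

p_m_given_s = p_s_given_m * p_m / p_s   # P(M|S) by Bayes theorem
print(p_m_given_s)                      # 0.0002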
Bayesian Classifiers
• Approach:
– compute the posterior probability P(C | A1, A2, …, An) for all
values of C using the Bayes theorem
P(C | A_1, A_2, \ldots, A_n) = \frac{P(A_1, A_2, \ldots, A_n | C)\,P(C)}{P(A_1, A_2, \ldots, A_n)}
– Under the naïve assumption of class-conditional independence, the likelihood factorizes as P(A_1, A_2, \ldots, A_n | C) = \prod_i P(A_i | C), so only the individual conditional probabilities P(A_i | C) need to be estimated.
Avoiding the 0-Probability Problem
• If any conditional probability estimate is zero (e.g., an attribute value that never occurs with a class), the whole product P(X|C) becomes zero. The standard remedy is the Laplacian correction: add 1 to every count so that no estimate is exactly zero; a sketch follows below.
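A small Python sketch of the Laplacian correction; the helper name smoothed_likelihoods and the illustrative counts (outlook values for class n in the play-tennis data shown later) are mine, not from the slides:

from collections import Counter

def smoothed_likelihoods(values, observed_counts, laplace=1):
    """Estimate P(value | class) with add-`laplace` smoothing so that
    no conditional probability is exactly zero."""
    total = sum(observed_counts.get(v, 0) for v in values)
    denom = total + laplace * len(values)
    return {v: (observed_counts.get(v, 0) + laplace) / denom for v in values}

# Outlook counts for class n in the play-tennis data: overcast never occurs
counts_n = Counter({"sunny": 3, "rain": 2})          # overcast = 0
print(smoothed_likelihoods(["sunny", "overcast", "rain"], counts_n))
# P(overcast|n) becomes 1/8 instead of 0, so the product P(X|n) cannot vanish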
Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most cases
• Disadvantages
– The assumption of class-conditional independence rarely holds exactly, which can cost accuracy
– In practice, dependencies exist among variables
• E.g., in hospital data a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) depend on one another
• Such dependencies cannot be modeled by a naïve Bayesian classifier
• How to deal with these dependencies? Bayesian belief networks (Chapter 9)
Play-tennis example: estimating P(xi|C)

Training data:

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Class priors: P(p) = 9/14, P(n) = 5/14

Conditional probability estimates:
outlook:     P(sunny|p) = 2/9      P(sunny|n) = 3/5
             P(overcast|p) = 4/9   P(overcast|n) = 0
             P(rain|p) = 3/9       P(rain|n) = 2/5
temperature: P(hot|p) = 2/9        P(hot|n) = 2/5
             P(mild|p) = 4/9       P(mild|n) = 2/5
             P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:    P(high|p) = 3/9       P(high|n) = 4/5
             P(normal|p) = 6/9     P(normal|n) = 2/5
windy:       P(true|p) = 3/9       P(true|n) = 3/5
             P(false|p) = 6/9      P(false|n) = 2/5
Likelihood Table
Example:
The posterior probability is calculated by first constructing a frequency table for each attribute against the target, then transforming the frequency tables into likelihood tables, and finally using the naive Bayes equation to compute the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
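A short Python sketch of this frequency-table-to-likelihood-table procedure on the play-tennis data above; the variable names are illustrative:

from collections import defaultdict

# (outlook, temperature, humidity, windy, class) rows from the table above
data = [
    ("sunny","hot","high","false","N"), ("sunny","hot","high","true","N"),
    ("overcast","hot","high","false","P"), ("rain","mild","high","false","P"),
    ("rain","cool","normal","false","P"), ("rain","cool","normal","true","N"),
    ("overcast","cool","normal","true","P"), ("sunny","mild","high","false","N"),
    ("sunny","cool","normal","false","P"), ("rain","mild","normal","false","P"),
    ("sunny","mild","normal","true","P"), ("overcast","mild","high","true","P"),
    ("overcast","hot","normal","false","P"), ("rain","mild","high","true","N"),
]
attributes = ["outlook", "temperature", "humidity", "windy"]

# Frequency tables: freq[attribute][(value, class)] = count
freq = defaultdict(lambda: defaultdict(int))
class_count = defaultdict(int)
for *values, cls in data:
    class_count[cls] += 1
    for attr, val in zip(attributes, values):
        freq[attr][(val, cls)] += 1

# Likelihood tables: P(value | class) = count / class frequency
likelihood = {
    attr: {vc: n / class_count[vc[1]] for vc, n in table.items()}
    for attr, table in freq.items()
}
prior = {cls: n / len(data) for cls, n in class_count.items()}

print(prior)                                   # N: 5/14, P: 9/14
print(likelihood["outlook"][("sunny", "P")])   # 2/9 ≈ 0.222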
Play-tennis example: classifying
• An unseen sample X = <rain, hot, high, false>
• P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 ≈ 0.010582
• P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 ≈ 0.018286
• Since P(X|n)·P(n) > P(X|p)·P(p), sample X is classified as class n (do not play tennis)
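The same classification step in Python, plugging in the likelihood and prior values listed above:

# Posterior scores (up to the common denominator P(X)) for X = <rain, hot, high, false>
p_x_given_p = (3/9) * (2/9) * (3/9) * (6/9)   # P(rain|p)·P(hot|p)·P(high|p)·P(false|p)
p_x_given_n = (2/5) * (2/5) * (4/5) * (2/5)   # P(rain|n)·P(hot|n)·P(high|n)·P(false|n)

score_p = p_x_given_p * 9/14                  # P(X|p)·P(p) ≈ 0.010582
score_n = p_x_given_n * 5/14                  # P(X|n)·P(n) ≈ 0.018286

print("predicted class:", "p" if score_p > score_n else "n")   # n (don't play)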
                         PREDICTED CLASS
                         Class=Yes   Class=No
ACTUAL CLASS  Class=Yes  a (TP)      b (FN)
              Class=No   c (FP)      d (TN)
Metrics for Evaluating Classifier Performance – Example (1)
• Precision is the proportion of data points predicted as "+" that are truly "+": Precision = TP / (TP + FP) = a / (a + c)
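A minimal Python sketch of the common metrics computed from the confusion-matrix counts a (TP), b (FN), c (FP), d (TN); the example counts are hypothetical:

def classification_metrics(tp, fn, fp, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # of everything predicted "+", how much is truly "+"
    recall = tp / (tp + fn)      # of everything truly "+", how much was found
    return accuracy, precision, recall

# Hypothetical counts: a=40 TP, b=10 FN, c=5 FP, d=45 TN
print(classification_metrics(tp=40, fn=10, fp=5, tn=45))   # (0.85, 0.888..., 0.8)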
Metrics for Evaluating Classifier Performance – Example (2)
http://www.clips.uantwerpen.be/~vincent/pdf/microaverage.pdf
Macro-averaged measure
• L = {λj : j = 1...q} is the set of all labels. [. . . ]
• Consider a binary evaluation measure B(tp, tn, fp, fn) that
is calculated based on the number of true positives (tp),
true negatives (tn), false positives (fp) and false negatives
(fn).
• Let tpλ, fpλ, tnλ and fnλ be the number of true positives,
false positives, true negatives and false negatives after
binary evaluation for a label λ. [. . . ]
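The macro-averaged measure is then B_macro = (1/q) Σ_λ B(tp_λ, tn_λ, fp_λ, fn_λ), i.e., B is computed per label and the results are averaged. A minimal Python sketch, with invented per-label counts and precision as the binary measure B:

def macro_average(per_label_counts, measure):
    """Apply a binary measure B(tp, tn, fp, fn) to each label, then average."""
    scores = [measure(tp, tn, fp, fn) for (tp, tn, fp, fn) in per_label_counts]
    return sum(scores) / len(scores)

def precision(tp, tn, fp, fn):
    return tp / (tp + fp) if (tp + fp) else 0.0

# (tp, tn, fp, fn) per label, three hypothetical labels
counts = [(30, 50, 10, 10), (5, 80, 5, 10), (20, 60, 20, 0)]
print(macro_average(counts, precision))   # mean of 0.75, 0.5, 0.5 ≈ 0.583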
Holdout Method and Random Subsampling
• The holdout method is what we have alluded to so far in our discussions about accuracy.
– In this method, the given data are randomly partitioned into two independent sets, a training set
and a test set.
– Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is allocated to the test set.
– The training set is used to derive the model.
– The model’s accuracy is then estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.
• Random subsampling is a variation of the holdout method in which the holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each iteration.
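A plain-Python sketch of the holdout split and its repeated (random subsampling) variant; the evaluate callback is an assumption of this sketch and stands for whatever train-and-score routine is being assessed:

import random

def holdout_split(data, train_fraction=2/3, seed=None):
    """Randomly partition `data` into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]                  # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def random_subsampling(data, k, evaluate):
    """Repeat the holdout method k times and average the accuracies."""
    accuracies = []
    for i in range(k):
        train, test = holdout_split(data, seed=i)
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k

# Example use (hypothetical): random_subsampling(rows, k=10, evaluate=train_and_score)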
Cross-Validation
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or “folds,” D1, D2, …, Dk, each of approximately equal size.
• Training and testing is performed k times. In iteration i,
partition Di is reserved as the test set, and the remaining
partitions are collectively used to train the model.
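A minimal sketch of k-fold cross-validation following this description; as above, the evaluate callback is a placeholder for the actual train-and-score routine:

import random

def k_fold_cross_validation(data, k, evaluate, seed=0):
    """Split `data` into k folds; each fold serves once as the test set."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]           # k roughly equal folds
    accuracies = []
    for i in range(k):
        test = folds[i]                                   # fold Di is held out
        train = [x for j, f in enumerate(folds) for x in f if j != i]
        accuracies.append(evaluate(train, test))
    return sum(accuracies) / k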
Cross-validation
Strategy:
1. Divide up the dataset into two non-overlapping subsets
2. One subset is called the “test” and the other the “training”
3. Build the model using the “training” dataset
4. Obtain predictions of the “test” set
5. Utilize the “test” set predictions to calculate all the
performance metrics
Typically cross-validation is performed for multiple iterations,
selecting a different non-overlapping test and training set each time
Types of Cross-validation
Other Resources
• Performance Metrics for Graph Mining Tasks, Oak Ridge National Laboratory
• http://www.saedsayad.com/naive_bayesian.htm
Example
• https://www.geeksforgeeks.org/naive-bayes-classifiers/
• https://www.machinelearningplus.com/predictive-modeling/how-naive-bayes-algorithm-works-with-example-and-full-code/
• https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/