Module 2 - Bayesian Learning


Bayesian Learning

Some terms to understand before delving into Bayesian learning:


Random variable (Stochastic variable) — In statistics, the random variable is a variable whose
possible values are a result of a random event. Therefore, each possible value of a random variable has
some probability attached to it to represent the likelihood of those values.
Probability distribution — The function that defines the probability of different outcomes/values of
a random variable. The continuous probability distributions are described using probability density
functions whereas discrete probability distributions can be represented using probability mass functions.
Conditional probability — This is a measure of probability P(A|B) of an event A given that another
event B has occurred.
Joint probability distribution — The probability distribution over two or more random variables taken together, giving the probability of each combination of their values.
Imagine a situation where your friend gives you a new coin and asks you about the fairness of the coin (or the
probability of observing heads) without even flipping the coin once. In fact, you are also aware that
your friend has not made the coin biased. In general, you have seen that coins are fair, so you expect
the probability of observing heads to be 0.5. In the absence of any observations, you assess the fairness
of the coin using only your past experiences or observations with coins.
Suppose that you are allowed to flip the coin 10 times in order to determine the fairness of the coin.
Your observations from the experiment will fall under one of the following cases:
Case 1: observing 5 heads and 5 tails.
Case 2: observing h heads and 10−h tails, where h ≠ 5 (i.e., an unequal number of heads and tails).
If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the
probability of observing heads is 0.5 with more confidence.
If case 2 is observed, you can either:
1. Neglect your prior beliefs since now you have new data and decide the probability of observing heads
is h/10 by solely depending on recent observations.
2. Adjust your belief accordingly to the value of h that you have just observed, and decide the probability
of observing heads using your recent observations.
The first method is the frequentist approach, where we omit our prior beliefs when making
decisions. However, the second method seems more sensible, because 10 coin flips are insufficient
to determine the fairness of a coin.
Therefore, we can make better decisions by combining our recent observations with the beliefs that we
have gained through our past experiences. This way of thinking, which combines our most recent
observations with our prior beliefs, is known as Bayesian thinking.
Moreover, assume that your friend allows you to conduct another 10 coin flips. Then, we can use these
new observations to further update our beliefs. As we gain more data, we can incrementally update our
beliefs increasing the certainty of our conclusions. This is known as incremental learning, where you
update your knowledge incrementally with new evidence.
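
To make this concrete, the following is a minimal Python sketch of such incremental updating, assuming a Beta prior over the probability of heads (the Beta-Binomial model and the specific numbers here are illustrative assumptions, not values given in the text):

    # Minimal sketch of incremental Bayesian updating for the coin example,
    # assuming a conjugate Beta(a, b) prior over the probability of heads.

    def update_beta(a, b, heads, tails):
        """Beta(a, b) prior + observed heads/tails -> Beta posterior."""
        return a + heads, b + tails

    a, b = 5, 5                      # prior belief: the coin is roughly fair
    a, b = update_beta(a, b, 6, 4)   # first 10 flips: 6 heads, 4 tails (assumed data)
    a, b = update_beta(a, b, 7, 3)   # next 10 flips: 7 heads, 3 tails (assumed data)

    posterior_mean = a / (a + b)     # point estimate of P(heads)
    print(f"Posterior mean of P(heads) = {posterior_mean:.3f}")

Each call reuses the previous posterior as the new prior, which is exactly the incremental learning described above.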
Bayesian learning comes into play on such occasions, where we are unable to use frequentist statistics
due to the drawbacks discussed above. Bayesian learning addresses these drawbacks and even provides
additional capabilities (such as incremental updates of the posterior) when testing a hypothesis or
estimating the unknown parameters of a machine learning model.
Bayesian learning uses Bayes' theorem to determine the conditional probability of a hypothesis given
some evidence or observations.
Bayes' Theorem : Bayes' theorem describes how the conditional probability of an event or a hypothesis
can be computed using evidence and prior knowledge. It is similar to concluding that our code has no
bugs given the evidence that it has passed all the test cases, combined with our prior belief that we have
rarely observed bugs in our code. However, this intuition goes beyond the simple case of a single
hypothesis to situations where multiple events or hypotheses are involved (let us not worry about this
for the moment). Bayes' theorem is given by:
P(θ|X) = P(X|θ) · P(θ) / P(X)

where P(θ|X) is the posterior probability, P(θ) is the prior probability, P(X) is the evidence (the
probability of the data), and P(X|θ) is the likelihood, i.e., the probability of observing the evidence X
given the hypothesis θ.

Consider the hypothesis that there are no bugs in our code. Let θ denote the event that our code is
bug-free and X denote the event that it passes all the test cases.
P(θ) — Prior Probability is the probability of the hypothesis θ being true before applying the Bayes'
theorem. Prior represents the beliefs that we have gained through past experience, which refers to either
common sense or an outcome of Bayes' theorem for some past observations.
For the example given, prior probability denotes the probability of observing no bugs in our code.
However, since this is the first time we are applying Bayes' theorem, we have to decide the priors using
other means (otherwise we could use the previous posterior as the new prior). Let us assume that it is
very unlikely to find bugs in our code because rarely have we observed bugs in our code in the past.
With our past experience of observing few bugs in our code, we can assign the prior P(θ) a relatively
high value. However, for now, let us simply assume that P(θ) = p.
P(X|θ) — Likelihood is the conditional probability of the evidence given a hypothesis. The likelihood
is mainly related to our observations or the data we have. If it is given that our code is bug-free, then
the probability of our code passing all test cases is given by the likelihood. Assuming we have
implemented these test cases correctly, if no bug is present in our code, then it should pass all the test
cases. Therefore, the likelihood P(X|θ) = 1.
P(X) — The evidence term denotes the probability of the evidence or data. It can be expressed as a sum
(or integral) over all possible hypotheses of the prior probability of each hypothesis weighted by its
likelihood. Therefore, we can write P(X) as:

P(X) = Σ_{θi ∈ Θ} P(X|θi) · P(θi)

where Θ is the set of all the hypotheses.
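
As a rough numeric sketch of how these terms combine, the Python snippet below evaluates the posterior for the bug-free-code example; the prior p and the probability that buggy code still passes the tests are assumed values chosen only for illustration:

    # Bayes' theorem for the "bug-free code" example (illustrative numbers).
    p_prior = 0.9          # P(theta): prior belief that the code is bug-free (assumed)
    lik_bug_free = 1.0     # P(X | theta): bug-free code passes all test cases
    lik_buggy = 0.4        # P(X | not theta): buggy code may still pass (assumed)

    # Evidence: sum over both hypotheses, weighted by their priors.
    evidence = lik_bug_free * p_prior + lik_buggy * (1 - p_prior)

    posterior = lik_bug_free * p_prior / evidence   # P(theta | X)
    print(f"P(bug-free | all tests pass) = {posterior:.3f}")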

Maximum a Posteriori (MAP) : We can use MAP to determine the valid hypothesis from a set of
hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is
considered the valid hypothesis. Therefore, we can express the hypothesis θMAP concluded using MAP
as follows:

θMAP = argmaxθi P(θi|X) = argmaxθi P(X|θi) · P(θi) / P(X)

The argmaxθi operator selects the hypothesis θi that maximizes the posterior probability P(θi|X). Since
P(X) is the same for every hypothesis, it can be dropped when finding the maximum.
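
A small Python sketch of MAP selection over a finite set of hypotheses is shown below; the priors and likelihoods are made-up numbers continuing the bug-free-code example:

    # MAP: pick the hypothesis with the largest posterior probability.
    hypotheses = {
        "bug-free": {"prior": 0.9, "likelihood": 1.0},   # P(theta), P(X | theta) (assumed)
        "buggy":    {"prior": 0.1, "likelihood": 0.4},
    }

    # P(X) is identical for every hypothesis, so maximizing the posterior is the
    # same as maximizing likelihood * prior.
    theta_map = max(hypotheses,
                    key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
    print("MAP hypothesis:", theta_map)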

Naive Bayes :
It is a simple technique for constructing classifiers: models that assign class labels to problem instances,
represented as vectors of feature values, where the class labels are drawn from some finite set. There is
not a single algorithm for training such classifiers, but a family of algorithms based on a common
principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the
value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
A naive Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of any possible correlations between the color, roundness, and
diameter features. For some types of probability models, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood; in other words, one can work with the
naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite
their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked
quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification
problem
showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that
Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An
advantage of naive Bayes is that it requires only a small amount of training data to estimate the
parameters necessary for classification.
Even if these features depend on each other or on the existence of the other features, each of them
contributes independently to the probability that the fruit is an apple, and that is why the method is
known as 'naive'.
The Naive Bayes model is easy to build and is particularly useful for very large data sets. Along with its
simplicity, Naive Bayes can perform surprisingly well compared with far more sophisticated classification
methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

where c is a class and x is the vector of feature values.
How does the Naive Bayes algorithm work?


Let’s understand it using an example.
Below we have a training data set of weather conditions and the corresponding target variable 'Play'
(indicating whether a game was played). Now, we need to classify whether players will play or not based
on the weather condition.
Let's follow the steps below to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by computing the probabilities, e.g., the probability of Overcast is
4/14 ≈ 0.29 and the probability of playing is 9/14 ≈ 0.64.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.

Problem:
Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Solution : Here we have


P(Sunny | Yes) = 3/9 ≈ 0.33
P(Sunny) = 5/14 ≈ 0.36
P(Yes) = 9/14 ≈ 0.64

Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60, which is higher than P(No | Sunny) ≈ 0.40, so we
predict that players will play when the weather is sunny.
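
The same calculation can be reproduced from raw counts, as in the short Python sketch below; the counts follow from the probabilities quoted above (14 records in total, of which 9 are "Yes" and 5 are "Sunny", with 3 records that are both "Sunny" and "Yes"):

    # Reproduce P(Yes | Sunny) and P(No | Sunny) from the counts implied above.
    total = 14
    yes, no = 9, 5
    sunny = 5
    sunny_and_yes, sunny_and_no = 3, 2

    p_sunny = sunny / total                                    # 5/14 ~ 0.36
    p_yes_given_sunny = (sunny_and_yes / yes) * (yes / total) / p_sunny
    p_no_given_sunny = (sunny_and_no / no) * (no / total) / p_sunny

    print(f"P(Yes | Sunny) = {p_yes_given_sunny:.2f}")         # ~0.60
    print(f"P(No  | Sunny) = {p_no_given_sunny:.2f}")          # ~0.40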

Naive Bayes uses a similar approach to predict the probabilities of different classes based on various
attributes. This algorithm is mostly used in text classification and in problems having multiple
classes.

What are the Pros and Cons of Naive Bayes?


Pros: It is easy and fast to predict the class of a test data set, and it also performs well in multi-class
prediction. When the assumption of independence holds, a Naive Bayes classifier performs better
compared with other models such as logistic regression, and it needs less training data. It performs
well with categorical input variables compared with numerical variables. For numerical variables, a
normal distribution is assumed (a bell curve, which is a strong assumption).
Cons: If a categorical variable has a category in the test data set that was not observed in the training
data set, the model will assign it a zero probability and will be unable to make a prediction. This is often
known as the "zero frequency" problem. To solve it, we can use a smoothing technique; one of the
simplest is Laplace smoothing (also called Laplace estimation). Naive Bayes is also known to be a poor
probability estimator, so its predicted probabilities should not be taken too seriously. Another limitation
of Naive Bayes is the assumption of independent predictors; in real life, it is almost impossible to find a
set of predictors that are completely independent.
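
As an illustration of how Laplace smoothing removes zero probabilities, here is a minimal Python sketch; the category counts (including the unseen "Foggy" value) are invented purely for this example:

    # Add-one (Laplace) smoothing for the zero-frequency problem.
    def smoothed_prob(count, class_total, n_categories, alpha=1):
        """P(feature value | class) with add-alpha smoothing."""
        return (count + alpha) / (class_total + alpha * n_categories)

    # Suppose "Foggy" never occurs together with class "Yes" in the training data.
    counts_given_yes = {"Sunny": 3, "Overcast": 4, "Rainy": 2, "Foggy": 0}
    class_total = sum(counts_given_yes.values())     # 9
    n_categories = len(counts_given_yes)             # 4

    for value, count in counts_given_yes.items():
        p = smoothed_prob(count, class_total, n_categories)
        print(f"P({value} | Yes) = {p:.3f}")          # "Foggy" no longer gets probability 0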

Applications of Naive Bayes Algorithms :

Real-time prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for
making predictions in real time.
Multi-class prediction: This algorithm is also well known for multi-class prediction; we can predict the
probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text
classification (because of their good results in multi-class problems and the independence assumption)
and often achieve a higher success rate there than other algorithms. As a result, Naive Bayes is widely
used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to
identify positive and negative customer sentiment).
Recommendation systems: A Naive Bayes classifier combined with collaborative filtering can build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource.

What is a Bayesian Belief Network?


A Bayesian Belief Network (also called a Bayesian Network or Belief Network) is a Probabilistic Graphical
Model (PGM) that represents conditional dependencies between random variables through a Directed
Acyclic Graph (DAG). Bayesian networks are applied in many fields, for example disease diagnosis,
optimized web search, spam filtering, gene regulatory networks, and so on, and the list can be extended.
The main objective of these networks is to understand the structure of causal relations. To clarify this,
let's consider a disease diagnosis problem. Given symptoms and the diseases they result from, we
construct our belief network, and when a new patient arrives we can infer which disease or diseases the
patient may have by computing a probability for each disease. Similarly, such causal relations can be
constructed for other problems, and inference techniques can be applied to obtain interesting results.
The probabilities in a belief network are calculated with the following factorization of the joint
distribution:

P(X1, X2, …, Xn) = ∏i P(Xi | Parents(Xi))

As you can see from the formula, to calculate the joint distribution we need the conditional probabilities
indicated by the network. Furthermore, once we have the joint distribution, we can start to ask
interesting questions.
For example, we can ask for the probability of "RAIN" given that "SEASON" is "WINTER" and "DOG BARK"
is "TRUE".
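
The sketch below shows how such a query can be answered once the network's conditional probability tables are known. The chain SEASON → RAIN → DOG BARK and every number in it are assumptions made only for illustration, since the original figure is not reproduced here:

    # Joint probability of a tiny (assumed) network: SEASON -> RAIN -> DOG_BARK.
    p_season = {"WINTER": 0.25, "OTHER": 0.75}            # P(SEASON)              (assumed)
    p_rain_given_season = {"WINTER": 0.6, "OTHER": 0.2}   # P(RAIN=True | SEASON)  (assumed)
    p_bark_given_rain = {True: 0.8, False: 0.3}           # P(DOG_BARK=True | RAIN) (assumed)

    def joint(season, rain, bark):
        """P(SEASON, RAIN, DOG_BARK) = P(SEASON) * P(RAIN | SEASON) * P(DOG_BARK | RAIN)."""
        p_r = p_rain_given_season[season] if rain else 1 - p_rain_given_season[season]
        p_b = p_bark_given_rain[rain] if bark else 1 - p_bark_given_rain[rain]
        return p_season[season] * p_r * p_b

    # P(RAIN = True | SEASON = WINTER, DOG_BARK = True), obtained by normalizing over RAIN.
    num = joint("WINTER", True, True)
    den = num + joint("WINTER", False, True)
    print(f"P(RAIN | WINTER, DOG BARK) = {num / den:.3f}")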

What is Expectation-Maximization algorithm?


The Expectation-Maximization (EM) algorithm is used to find local maximum-likelihood estimates of the
parameters of a statistical model in cases where latent variables are involved or the data is missing or
incomplete. Latent variables are variables that are not directly observable and are instead inferred from
the values of other observed variables; EM can estimate them provided the general form of the
probability distribution governing those latent variables is known. This algorithm is at the base of many
unsupervised clustering algorithms in the field of ML.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation (E) step
is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

The essence of the Expectation-Maximization algorithm is to use the available observed data to estimate
the missing data and then use that completed data to update the values of the parameters. Let us
understand the EM algorithm in detail. Initially, a set of starting values for the parameters is chosen.
A set of incomplete observed data is given to the system with the assumption that the observed data
comes from a specific model. The next step is known as “Expectation” – step or E-step. In this step,
we use the observed data in order to estimate or guess the values of the missing or incomplete data. It
is basically used to update the variables.
The next step is known as the "Maximization" step or M-step. In this step, we use the complete data
generated in the preceding "Expectation" step to update the values of the parameters; it is basically
used to update the hypothesis. Finally, in the fourth step, we check whether the values are converging;
if they are, we stop, otherwise we repeat the E-step and M-step until convergence occurs.
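
One common concrete instance of this procedure is fitting a two-component Gaussian mixture, which the text does not specify but which illustrates the E-step/M-step loop well. The Python sketch below assumes NumPy is available; the data and starting values are illustrative, and a fixed number of iterations stands in for a proper convergence check:

    import numpy as np

    # Synthetic 1-D data drawn from two Gaussians (illustrative only).
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

    # Step 1: initial guesses for the parameters.
    mu = np.array([1.0, 4.0])
    sigma = np.array([1.0, 1.0])
    weight = np.array([0.5, 0.5])

    def gaussian(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    for _ in range(50):                      # step 4: repeat (fixed iteration count here)
        # E-step: estimate the (missing) component membership of each point.
        resp = weight * gaussian(data[:, None], mu, sigma)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update the parameters using the "completed" data.
        n_k = resp.sum(axis=0)
        mu = (resp * data[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_k)
        weight = n_k / len(data)

    print("means:", mu.round(2), "std devs:", sigma.round(2), "weights:", weight.round(2))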
[Flow chart of the EM algorithm: initialize parameters → E-step → M-step → check for convergence → repeat until converged.]
Usage of EM algorithm –
1. It can be used to fill in missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used to estimate the parameters of a Hidden Markov Model (HMM).
4. It can be used to discover the values of latent variables.
Advantages of EM algorithm –
1. The likelihood is guaranteed not to decrease with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms of implementation.
3. Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm –
1. It has slow convergence.
2. It converges only to a local optimum.
3. It requires both the forward and backward probabilities (numerical optimization requires only the
forward probability).
