Module 2 - Bayesian Learning
Bayes' theorem is stated as P(θ|X) = P(X|θ) * P(θ) / P(X),
where P(θ|X) = posterior probability; P(θ) = prior probability; P(X) = probability of the evidence (data) X; and P(X|θ) = likelihood, i.e. the probability of observing the data X given the hypothesis θ.
Consider the hypothesis that there are no bugs in our code. θ and X denote that our code is bug-free
and passes all the test cases, respectively.
P(θ) — Prior Probability is the probability of the hypothesis θ being true before applying Bayes'
theorem. The prior represents the beliefs that we have gained through past experience, which come from
either common sense or the outcome of Bayes' theorem applied to some past observations.
For the example given, prior probability denotes the probability of observing no bugs in our code.
However, since this is the first time we are applying Bayes' theorem, we have to decide the prior by
other means (if we had applied the theorem before, we could use the previous posterior as the new prior).
Let us assume that it is very unlikely to find bugs in our code, because we have rarely observed bugs in
our code in the past. Given this past experience of observing few bugs, we can assign a high value to
our prior P(θ). However, for now, let us simply denote it as P(θ) = p.
P(X|θ) — Likelihood is the conditional probability of the evidence given a hypothesis. The likelihood
is mainly related to our observations or the data we have. If it is given that our code is bug-free, then
the probability of our code passing all test cases is given by the likelihood. Assuming we have
implemented these test cases correctly, if no bug is present in our code, then it should pass all the test
cases. Therefore, the likelihood P(X|θ) = 1.
P(X) — Evidence term denotes the probability of the evidence or data. It can be expressed as a sum
(or integral) over all possible hypotheses of the likelihood of the evidence under each hypothesis,
weighted by that hypothesis's prior. Therefore, we can write P(X) as:
P(X) = P(X|θ) * P(θ) + P(X|¬θ) * P(¬θ)
where ¬θ denotes the hypothesis that our code does contain bugs.
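The pieces above can be put together in a few lines of Python. This is only a toy sketch: the prior p = 0.4 and the likelihood of buggy code still passing all tests (0.5) are assumed numbers chosen for illustration, not values given in the text.

p = 0.4                      # prior P(theta): probability the code is bug-free (assumed)
likelihood_bugfree = 1.0     # P(X | theta): bug-free code passes every test case
likelihood_buggy = 0.5       # P(X | not theta): assumed value for illustration

# Evidence: P(X) = P(X|theta) P(theta) + P(X|not theta) P(not theta)
evidence = likelihood_bugfree * p + likelihood_buggy * (1 - p)

# Posterior: P(theta|X) = P(X|theta) P(theta) / P(X)
posterior = likelihood_bugfree * p / evidence
print(f"P(theta | X) = {posterior:.3f}")   # 0.571 with the assumed numbers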
Maximum a Posteriori (MAP): We can use MAP to determine the most probable hypothesis from a set of
hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is
considered the valid hypothesis. Therefore, we can express the hypothesis θMAP obtained
using MAP as follows:
θMAP = argmaxθi P(θi | X) = argmaxθi [P(X | θi) * P(θi) / P(X)] = argmaxθi P(X | θi) * P(θi)
(the evidence P(X) is the same for every hypothesis, so it can be dropped from the maximization).
The argmaxθi operator picks out the event or hypothesis θi that maximizes the posterior probability
P(θi | X).
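A tiny Python sketch of MAP selection follows; the two hypotheses and their prior and likelihood values are assumed numbers used only for illustration.

# Candidate hypotheses with assumed priors P(theta_i) and likelihoods P(X | theta_i).
priors = {"bug_free": 0.4, "buggy": 0.6}
likelihoods = {"bug_free": 1.0, "buggy": 0.5}

# P(X) is identical for every hypothesis, so maximizing P(X|theta_i) * P(theta_i) is enough.
unnormalised = {h: likelihoods[h] * priors[h] for h in priors}
theta_map = max(unnormalised, key=unnormalised.get)
print(theta_map)   # "bug_free" here, since 1.0 * 0.4 > 0.5 * 0.6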
Naive Bayes:
It is a simple technique for constructing classifiers: models that assign class labels to problem instances,
represented as vectors of feature values, where the class labels are drawn from some finite set. There is
not a single algorithm for training such classifiers, but a family of algorithms based on a common
principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the
value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
A naive Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of any possible correlations between the color, roundness, and
diameter features. For some types of probability models, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood; in other words, one can work with the
naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite
their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked
quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification
problem
showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that
Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An
advantage of naive Bayes is that it only requires a small amount of training data to estimate the
parameters necessary for classification.
Returning to the fruit example: even if the color, roundness, and diameter features depend on each other
or on the existence of the other features, each of them is treated as contributing independently to the
probability that the fruit is an apple, and that is why the method is called ‘naive’.
The Naive Bayes model is easy to build and particularly useful for very large data sets. Along with its
simplicity, Naive Bayes can perform surprisingly well compared with far more sophisticated classification
methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c), where c is the class and x is the attribute (predictor). Look at the equation below:
P(c|x) = P(x|c) * P(c) / P(x)
Under the naive independence assumption, for attributes x1, ..., xn this extends to
P(c|x1, ..., xn) ∝ P(c) * P(x1|c) * ... * P(xn|c).
Problem:
Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
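A short Python sketch of this calculation follows. The weather/play records below are made-up counts used only for illustration; any similar frequency table would do.

# Made-up (Weather, Play) training records used only to illustrate the calculation.
records = [
    ("Sunny", "No"), ("Overcast", "Yes"), ("Rainy", "Yes"), ("Sunny", "Yes"),
    ("Sunny", "Yes"), ("Overcast", "Yes"), ("Rainy", "No"), ("Rainy", "No"),
    ("Sunny", "Yes"), ("Rainy", "Yes"), ("Sunny", "No"), ("Overcast", "Yes"),
    ("Overcast", "Yes"), ("Rainy", "No"),
]

n = len(records)
n_yes = sum(1 for _, play in records if play == "Yes")
p_yes = n_yes / n                                                     # P(Yes)
p_sunny = sum(1 for w, _ in records if w == "Sunny") / n              # P(Sunny)
p_sunny_given_yes = (sum(1 for w, play in records
                         if w == "Sunny" and play == "Yes") / n_yes)  # P(Sunny | Yes)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(f"P(Yes | Sunny) = {p_yes_given_sunny:.2f}")   # 0.60 with these counts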
Naive Bayes uses a similar method to predict the probability of different classes based on various
attributes. The algorithm is mostly used in text classification and in problems with multiple
classes.
Real-time Prediction: Naive Bayes is an eager learning classifier and it is certainly fast. Thus, it can be
used for making predictions in real time.
Multi-class Prediction: This algorithm is also well known for its multi-class prediction capability; we
can predict the probability of each class of the target variable.
Text classification / Spam Filtering / Sentiment Analysis: Naive Bayes classifiers are widely used in text
classification (owing to their good results on multi-class problems and the independence assumption) and
often achieve a higher success rate than other algorithms. As a result, Naive Bayes is widely used in spam
filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to identify positive
and negative customer sentiment); a small sketch follows after these examples.
Recommendation System: A Naive Bayes classifier and collaborative filtering together build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource or not.
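The sketch below illustrates the spam-filtering use case. It assumes the scikit-learn library is available (CountVectorizer and MultinomialNB); the four training messages and their labels are made up purely for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up training set: label 1 = spam, 0 = not spam.
messages = [
    "win a free prize now", "limited offer click here",
    "meeting at ten tomorrow", "please review the attached report",
]
labels = [1, 1, 0, 0]

# Bag-of-words features; each word count is treated as an independent feature given the class.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

model = MultinomialNB()          # multinomial Naive Bayes suits word-count features
model.fit(X, labels)

test = vectorizer.transform(["click here to win a prize"])
print(model.predict(test))       # expected: [1] (spam) on this toy data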
As you would understand from the formula, to be able to calculate the joint distribution we need the
conditional probabilities indicated by the network. Furthermore, once we have the joint distribution,
we can start to ask interesting questions.
For example, in the first example, we ask for the probability of “RAIN” given that “SEASON” is “WINTER”
and “DOG BARK” is “TRUE”.
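A minimal Python sketch of this kind of query follows. The network structure (SEASON → RAIN → DOG BARK) and all of the probability tables below are assumed for illustration; they are not the tables from the original figure.

# Assumed toy network SEASON -> RAIN -> DOG_BARK with made-up probability tables.
p_season = {"WINTER": 0.5, "SUMMER": 0.5}                 # P(SEASON)
p_rain_given_season = {"WINTER": 0.7, "SUMMER": 0.2}      # P(RAIN = True | SEASON)
p_bark_given_rain = {True: 0.8, False: 0.3}               # P(DOG_BARK = True | RAIN)

def joint(season, rain, bark):
    """P(SEASON, RAIN, DOG_BARK) via the chain rule implied by the network."""
    p = p_season[season]
    p *= p_rain_given_season[season] if rain else 1 - p_rain_given_season[season]
    p *= p_bark_given_rain[rain] if bark else 1 - p_bark_given_rain[rain]
    return p

# P(RAIN = True | SEASON = WINTER, DOG_BARK = True) by enumerating the joint distribution:
numerator = joint("WINTER", True, True)
denominator = numerator + joint("WINTER", False, True)
print(f"P(RAIN | WINTER, BARK) = {numerator / denominator:.2f}")   # 0.86 with these tables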
The essence of the Expectation-Maximization (EM) algorithm is to use the available observed data of the
dataset to estimate the missing data, and then to use that completed data to update the values of the
parameters. Let us understand the EM algorithm in detail. Initially, a set of initial values of the
parameters is considered. A set of incomplete observed data is given to the system, with the assumption
that the observed data comes from a specific model. The next step is known as the “Expectation” step, or
E-step. In this step, we use the observed data to estimate, or guess, the values of the missing or
incomplete data; it is basically used to update the latent variables.
The next step is known as the “Maximization” step, or M-step. In this step, we use the complete data
generated in the preceding Expectation step to update the values of the parameters; it is basically used
to update the hypothesis. Finally, in the fourth step, it is checked whether the values are converging or
not; if yes, we stop, otherwise we repeat step 2 and step 3, i.e. the Expectation step and the
Maximization step, until convergence occurs.
Flow chart of the EM algorithm: initialize the parameters → E-step → M-step → check convergence → stop if converged, otherwise repeat the E-step and M-step.
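A compact Python sketch of these steps for a one-dimensional mixture of two Gaussians follows. The synthetic data, the starting values, and the use of NumPy are assumptions made purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two Gaussian components; the component assignments are the "missing" data.
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])

# Step 1: initial guesses for the parameters (component means and mixing weights, unit variance).
mu = np.array([1.0, 4.0])
weights = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: estimate the missing data, i.e. each point's responsibility under each component.
    dens = np.exp(-0.5 * (data[:, None] - mu) ** 2) / np.sqrt(2 * np.pi)
    resp = weights * dens
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: use the completed data (responsibilities) to update the parameters.
    new_mu = (resp * data[:, None]).sum(axis=0) / resp.sum(axis=0)
    weights = resp.mean(axis=0)

    # Step 4: check for convergence; stop if the means no longer change, otherwise repeat.
    if np.allclose(new_mu, mu, atol=1e-6):
        mu = new_mu
        break
    mu = new_mu

print(mu)       # close to the true component means 0 and 5
print(weights)  # close to the true mixing weights 0.5 and 0.5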
Usage of EM algorithm –
1. It can be used to fill in missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used for estimating the parameters of a Hidden Markov Model (HMM).
4. It can be used for discovering the values of latent variables.
Advantages of EM algorithm –
1. It is always guaranteed that likelihood will increase with each iteration.
2. The E-step and M-step are often easy to implement for many problems.
3. Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm –
1. It has slow convergence.
2. It converges only to a local optimum.
3. It requires both the forward and backward probabilities (numerical optimization requires only the
forward probability).