Module 2 - Bayesian Learning


Bayesian Learning

Some terms to understand before delving into Bayesian learning:


Random variable (Stochastic variable) — In statistics, the random variable is a variable whose
possible values are a result of a random event. Therefore, each possible value of a random variable has
some probability attached to it to represent the likelihood of those values.
Probability distribution — The function that defines the probability of different outcomes/values of
a random variable. The continuous probability distributions are described using probability density
functions whereas discrete probability distributions can be represented using probability mass functions.
Conditional probability — This is a measure of probability P(A|B) of an event A given that another
event B has occurred.
Joint probability distribution — The probability distribution over two or more random variables taken together, giving the probability of each combination of their values.
Imagine a situation where your friend gives you a new coin and asks you about the fairness of the coin (or the
probability of observing heads) without even flipping the coin once. In fact, you are also aware that
your friend has not made the coin biased. In general, you have seen that coins are fair, so you expect
the probability of observing heads to be 0.5. In the absence of any observations, you assess the fairness
of the coin using only your past experiences or observations with coins.
Suppose that you are allowed to flip the coin 10 times in order to determine the fairness of the coin.
Your observations from the experiment will fall under one of the following cases:
Case 1: observing 5 heads and 5 tails.
Case 2: observing h heads and 10−h tails, where h ≠ 5 (i.e., an unequal number of heads and tails).
If case 1 is observed, you are now more certain that the coin is a fair coin, and you will decide that the
probability of observing heads is 0.5 with more confidence.
If case 2 is observed, you can either:
1. Neglect your prior beliefs since now you have new data and decide the probability of observing heads
is h/10 by solely depending on recent observations.
2. Adjust your belief accordingly to the value of h that you have just observed, and decide the probability
of observing heads using your recent observations.
The first method is the frequentist approach, where we omit our prior beliefs when making
decisions. However, the second method seems more sensible, because 10 coin flips are insufficient
to determine the fairness of a coin.
Therefore, we can make better decisions by combining our recent observations with the beliefs that we
have gained through our past experiences. This way of thinking, which combines our most recent
observations with our prior beliefs, is known as Bayesian thinking.
Moreover, assume that your friend allows you to conduct another 10 coin flips. Then, we can use these
new observations to further update our beliefs. As we gain more data, we can incrementally update our
beliefs increasing the certainty of our conclusions. This is known as incremental learning, where you
update your knowledge incrementally with new evidence.
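
To make this concrete, the following is a minimal Python sketch of such incremental updating, assuming a Beta prior over the probability of heads (the Beta-Binomial model and the specific numbers here are illustrative assumptions, not values given in the text):

    # Minimal sketch of incremental Bayesian updating for the coin example,
    # assuming a conjugate Beta(a, b) prior over the probability of heads.

    def update_beta(a, b, heads, tails):
        """Beta(a, b) prior + observed heads/tails -> Beta posterior."""
        return a + heads, b + tails

    a, b = 5, 5                      # prior belief: the coin is roughly fair
    a, b = update_beta(a, b, 6, 4)   # first 10 flips: 6 heads, 4 tails (assumed data)
    a, b = update_beta(a, b, 7, 3)   # next 10 flips: 7 heads, 3 tails (assumed data)

    posterior_mean = a / (a + b)     # point estimate of P(heads)
    print(f"Posterior mean of P(heads) = {posterior_mean:.3f}")

Each call reuses the previous posterior as the new prior, which is exactly the incremental learning described above.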
Bayesian learning comes into play on such occasions, where we are unable to use frequentist statistics
due to the drawbacks discussed above. Bayesian learning addresses these drawbacks and even provides
additional capabilities (such as incremental updates of the posterior) when testing a hypothesis or
estimating the unknown parameters of a machine learning model.
Bayesian learning uses Bayes' theorem to determine the conditional probability of a hypothesis given
some evidence or observations.
Bayes' Theorem : Bayes' theorem describes how the conditional probability of an event or a hypothesis
can be computed using evidence and prior knowledge. It is similar to concluding that our code has no
bugs given the evidence that it has passed all the test cases, combined with our prior belief that we have
rarely observed bugs in our code. However, this intuition goes beyond the simple case of a single
hypothesis to situations where multiple events or hypotheses are involved (let us not worry about this
for the moment). Bayes' theorem is given by:
P(θ|X) = P(X|θ) · P(θ) / P(X)

where P(θ|X) is the posterior probability, P(θ) is the prior probability, P(X) is the evidence (the
probability of the data), and P(X|θ) is the likelihood, i.e., the probability of observing the evidence X
given the hypothesis θ.

Consider the hypothesis that there are no bugs in our code. Let θ denote the event that our code is
bug-free and X denote the event that it passes all the test cases.
P(θ) — Prior Probability is the probability of the hypothesis θ being true before applying the Bayes'
theorem. Prior represents the beliefs that we have gained through past experience, which refers to either
common sense or an outcome of Bayes' theorem for some past observations.
For the example given, prior probability denotes the probability of observing no bugs in our code.
However, since this is the first time we are applying Bayes' theorem, we have to decide the priors using
other means (otherwise we could use the previous posterior as the new prior). Let us assume that it is
very unlikely to find bugs in our code because rarely have we observed bugs in our code in the past.
With our past experience of observing few bugs in our code, we can assign the prior P(θ) a relatively
high value. However, for now, let us simply assume that P(θ) = p.
P(X|θ) — Likelihood is the conditional probability of the evidence given a hypothesis. The likelihood
is mainly related to our observations or the data we have. If it is given that our code is bug-free, then
the probability of our code passing all test cases is given by the likelihood. Assuming we have
implemented these test cases correctly, if no bug is present in our code, then it should pass all the test
cases. Therefore, the likelihood P(X|θ) = 1.
P(X) — The evidence term denotes the probability of the evidence or data. It can be expressed as a sum
(or integral) over all possible hypotheses of the prior probability of each hypothesis weighted by its
likelihood. Therefore, we can write P(X) as:

P(X) = Σ_{θi ∈ Θ} P(X|θi) · P(θi)

where Θ is the set of all the hypotheses.
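
As a rough numeric sketch of how these terms combine, the Python snippet below evaluates the posterior for the bug-free-code example; the prior p and the probability that buggy code still passes the tests are assumed values chosen only for illustration:

    # Bayes' theorem for the "bug-free code" example (illustrative numbers).
    p_prior = 0.9          # P(theta): prior belief that the code is bug-free (assumed)
    lik_bug_free = 1.0     # P(X | theta): bug-free code passes all test cases
    lik_buggy = 0.4        # P(X | not theta): buggy code may still pass (assumed)

    # Evidence: sum over both hypotheses, weighted by their priors.
    evidence = lik_bug_free * p_prior + lik_buggy * (1 - p_prior)

    posterior = lik_bug_free * p_prior / evidence   # P(theta | X)
    print(f"P(bug-free | all tests pass) = {posterior:.3f}")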

Maximum a Posteriori (MAP) : We can use MAP to determine the valid hypothesis from a set of
hypotheses. According to MAP, the hypothesis that has the maximum posterior probability is
considered the valid hypothesis. Therefore, we can express the hypothesis θMAP concluded using MAP
as follows:

θMAP = argmaxθi P(θi|X) = argmaxθi P(X|θi) · P(θi) / P(X)

The argmaxθi operator selects the hypothesis θi that maximizes the posterior probability P(θi|X). Since
P(X) is the same for every hypothesis, it can be dropped when finding the maximum.
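
A small Python sketch of MAP selection over a finite set of hypotheses is shown below; the priors and likelihoods are made-up numbers continuing the bug-free-code example:

    # MAP: pick the hypothesis with the largest posterior probability.
    hypotheses = {
        "bug-free": {"prior": 0.9, "likelihood": 1.0},   # P(theta), P(X | theta) (assumed)
        "buggy":    {"prior": 0.1, "likelihood": 0.4},
    }

    # P(X) is identical for every hypothesis, so maximizing the posterior is the
    # same as maximizing likelihood * prior.
    theta_map = max(hypotheses,
                    key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])
    print("MAP hypothesis:", theta_map)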

Naive Bayes :
It is a simple technique for constructing classifiers: models that assign class labels to problem instances,
represented as vectors of feature values, where the class labels are drawn from some finite set. There is
not a single algorithm for training such classifiers, but a family of algorithms based on a common
principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the
value of any other feature, given the class variable.
For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter.
A naive Bayes classifier considers each of these features to contribute independently to the probability
that this fruit is an apple, regardless of any possible correlations between the color, roundness, and
diameter features. For some types of probability models, naive Bayes classifiers can be trained very
efficiently in a supervised learning setting. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood; in other words, one can work with the
naive Bayes model without accepting Bayesian probability or using any Bayesian methods. Despite
their naive design and apparently oversimplified assumptions, naive Bayes classifiers have worked
quite well in many complex real-world situations. In 2004, an analysis of the Bayesian classification
problem
showed that there are sound theoretical reasons for the apparently implausible efficacy of naive Bayes
classifiers. Still, a comprehensive comparison with other classification algorithms in 2006 showed that
Bayes classification is outperformed by other approaches, such as boosted trees or random forests. An
advantage of naive Bayes is that it requires only a small amount of training data to estimate the
parameters necessary for classification.
Even if these features depend on each other or on the existence of the other features, each of them
contributes independently to the probability that the fruit is an apple, and that is why the method is
known as 'naive'.
The Naive Bayes model is easy to build and is particularly useful for very large data sets. Along with its
simplicity, Naive Bayes can perform surprisingly well compared with far more sophisticated classification
methods. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c):

P(c|x) = P(x|c) · P(c) / P(x)

where c is a class and x is the vector of feature values.
How does the Naive Bayes algorithm work?


Let’s understand it using an example.
Below we have a training data set of weather conditions and the corresponding target variable 'Play'
(indicating whether a game was played). Now, we need to classify whether players will play or not based
on the weather condition.
Let's follow the steps below to perform it.
Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by computing the probabilities, e.g., the probability of Overcast is
4/14 ≈ 0.29 and the probability of playing is 9/14 ≈ 0.64.
Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The
class with the highest posterior probability is the outcome of the prediction.

Problem:
Players will play if the weather is sunny. Is this statement correct?
We can solve it using the method of posterior probability discussed above:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)

Solution : Here we have


P(Sunny | Yes) = 3/9 ≈ 0.33
P(Sunny) = 5/14 ≈ 0.36
P(Yes) = 9/14 ≈ 0.64

Now, P(Yes | Sunny) = 0.33 × 0.64 / 0.36 ≈ 0.60, which is higher than P(No | Sunny) ≈ 0.40, so we
predict that players will play when the weather is sunny.
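
The same calculation can be reproduced from raw counts, as in the short Python sketch below; the counts follow from the probabilities quoted above (14 records in total, of which 9 are "Yes" and 5 are "Sunny", with 3 records that are both "Sunny" and "Yes"):

    # Reproduce P(Yes | Sunny) and P(No | Sunny) from the counts implied above.
    total = 14
    yes, no = 9, 5
    sunny = 5
    sunny_and_yes, sunny_and_no = 3, 2

    p_sunny = sunny / total                                    # 5/14 ~ 0.36
    p_yes_given_sunny = (sunny_and_yes / yes) * (yes / total) / p_sunny
    p_no_given_sunny = (sunny_and_no / no) * (no / total) / p_sunny

    print(f"P(Yes | Sunny) = {p_yes_given_sunny:.2f}")         # ~0.60
    print(f"P(No  | Sunny) = {p_no_given_sunny:.2f}")          # ~0.40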

Naive Bayes uses a similar approach to predict the probabilities of different classes based on various
attributes. This algorithm is mostly used in text classification and in problems having multiple
classes.

What are the Pros and Cons of Naive Bayes?


Pros: It is easy and fast to predict the class of a test data set, and it also performs well in multi-class
prediction. When the assumption of independence holds, a Naive Bayes classifier performs better
compared with other models such as logistic regression, and it needs less training data. It performs
well with categorical input variables compared with numerical variables. For numerical variables, a
normal distribution is assumed (a bell curve, which is a strong assumption).
Cons: If a categorical variable has a category in the test data set that was not observed in the training
data set, the model will assign it a zero probability and will be unable to make a prediction. This is often
known as the "zero frequency" problem. To solve it, we can use a smoothing technique; one of the
simplest is Laplace smoothing (also called Laplace estimation). Naive Bayes is also known to be a poor
probability estimator, so its predicted probabilities should not be taken too seriously. Another limitation
of Naive Bayes is the assumption of independent predictors; in real life, it is almost impossible to find a
set of predictors that are completely independent.
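
As an illustration of how Laplace smoothing removes zero probabilities, here is a minimal Python sketch; the category counts (including the unseen "Foggy" value) are invented purely for this example:

    # Add-one (Laplace) smoothing for the zero-frequency problem.
    def smoothed_prob(count, class_total, n_categories, alpha=1):
        """P(feature value | class) with add-alpha smoothing."""
        return (count + alpha) / (class_total + alpha * n_categories)

    # Suppose "Foggy" never occurs together with class "Yes" in the training data.
    counts_given_yes = {"Sunny": 3, "Overcast": 4, "Rainy": 2, "Foggy": 0}
    class_total = sum(counts_given_yes.values())     # 9
    n_categories = len(counts_given_yes)             # 4

    for value, count in counts_given_yes.items():
        p = smoothed_prob(count, class_total, n_categories)
        print(f"P({value} | Yes) = {p:.3f}")          # "Foggy" no longer gets probability 0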

Applications of Naive Bayes Algorithms :

Real-time prediction: Naive Bayes is an eager learning classifier and it is very fast, so it can be used for
making predictions in real time.
Multi-class prediction: This algorithm is also well known for multi-class prediction; we can predict the
probability of multiple classes of the target variable.
Text classification / spam filtering / sentiment analysis: Naive Bayes classifiers are widely used in text
classification (because of their good results in multi-class problems and the independence assumption)
and often achieve a higher success rate there than other algorithms. As a result, Naive Bayes is widely
used in spam filtering (identifying spam e-mail) and sentiment analysis (in social media analysis, to
identify positive and negative customer sentiment).
Recommendation systems: A Naive Bayes classifier combined with collaborative filtering can build a
recommendation system that uses machine learning and data mining techniques to filter unseen
information and predict whether a user would like a given resource.

What is a Bayesian Belief Network?


A Bayesian Belief Network (also called a Bayesian Network or Belief Network) is a Probabilistic Graphical
Model (PGM) that represents conditional dependencies between random variables through a Directed
Acyclic Graph (DAG). Bayesian networks are applied in many fields, for example disease diagnosis,
optimized web search, spam filtering, gene regulatory networks, and so on, and the list can be extended.
The main objective of these networks is to understand the structure of causal relations. To clarify this,
let's consider a disease diagnosis problem. Given symptoms and the diseases they result from, we
construct our belief network, and when a new patient arrives we can infer which disease or diseases the
patient may have by computing a probability for each disease. Similarly, such causal relations can be
constructed for other problems, and inference techniques can be applied to obtain interesting results.
The probabilities in a belief network are calculated with the following factorization of the joint
distribution:

P(X1, X2, …, Xn) = ∏i P(Xi | Parents(Xi))

As you can see from the formula, to calculate the joint distribution we need the conditional probabilities
indicated by the network. Furthermore, once we have the joint distribution, we can start to ask
interesting questions.
For example, we can ask for the probability of "RAIN" given that "SEASON" is "WINTER" and "DOG BARK"
is "TRUE".
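
The sketch below shows how such a query can be answered once the network's conditional probability tables are known. The chain SEASON → RAIN → DOG BARK and every number in it are assumptions made only for illustration, since the original figure is not reproduced here:

    # Joint probability of a tiny (assumed) network: SEASON -> RAIN -> DOG_BARK.
    p_season = {"WINTER": 0.25, "OTHER": 0.75}            # P(SEASON)              (assumed)
    p_rain_given_season = {"WINTER": 0.6, "OTHER": 0.2}   # P(RAIN=True | SEASON)  (assumed)
    p_bark_given_rain = {True: 0.8, False: 0.3}           # P(DOG_BARK=True | RAIN) (assumed)

    def joint(season, rain, bark):
        """P(SEASON, RAIN, DOG_BARK) = P(SEASON) * P(RAIN | SEASON) * P(DOG_BARK | RAIN)."""
        p_r = p_rain_given_season[season] if rain else 1 - p_rain_given_season[season]
        p_b = p_bark_given_rain[rain] if bark else 1 - p_bark_given_rain[rain]
        return p_season[season] * p_r * p_b

    # P(RAIN = True | SEASON = WINTER, DOG_BARK = True), obtained by normalizing over RAIN.
    num = joint("WINTER", True, True)
    den = num + joint("WINTER", False, True)
    print(f"P(RAIN | WINTER, DOG BARK) = {num / den:.3f}")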

What is Expectation-Maximization algorithm?


The Expectation-Maximization (EM) algorithm is used to find local maximum-likelihood estimates of the
parameters of a statistical model in cases where latent variables are involved or the data is missing or
incomplete. Latent variables are variables that are not directly observable and are instead inferred from
the values of other observed variables; EM can estimate them provided the general form of the
probability distribution governing those latent variables is known. This algorithm is at the base of many
unsupervised clustering algorithms in the field of ML.
Algorithm:
1. Given a set of incomplete data, consider a set of starting parameters.
2. Expectation step (E – step): Using the observed available data of the dataset,
estimate (guess) the values of the missing data.
3. Maximization step (M – step): Complete data generated after the expectation (E) step
is used in order to update the parameters.
4. Repeat step 2 and step 3 until convergence.

The essence of the Expectation-Maximization algorithm is to use the available observed data to estimate
the missing data and then use that completed data to update the values of the parameters. Let us
understand the EM algorithm in detail. Initially, a set of starting values for the parameters is chosen.
A set of incomplete observed data is given to the system with the assumption that the observed data
comes from a specific model. The next step is known as “Expectation” – step or E-step. In this step,
we use the observed data in order to estimate or guess the values of the missing or incomplete data. It
is basically used to update the variables.
The next step is known as the "Maximization" step or M-step. In this step, we use the complete data
generated in the preceding "Expectation" step to update the values of the parameters; it is basically
used to update the hypothesis. Finally, in the fourth step, we check whether the values are converging;
if they are, we stop, otherwise we repeat the E-step and M-step until convergence occurs.
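
One common concrete instance of this procedure is fitting a two-component Gaussian mixture, which the text does not specify but which illustrates the E-step/M-step loop well. The Python sketch below assumes NumPy is available; the data and starting values are illustrative, and a fixed number of iterations stands in for a proper convergence check:

    import numpy as np

    # Synthetic 1-D data drawn from two Gaussians (illustrative only).
    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])

    # Step 1: initial guesses for the parameters.
    mu = np.array([1.0, 4.0])
    sigma = np.array([1.0, 1.0])
    weight = np.array([0.5, 0.5])

    def gaussian(x, m, s):
        return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

    for _ in range(50):                      # step 4: repeat (fixed iteration count here)
        # E-step: estimate the (missing) component membership of each point.
        resp = weight * gaussian(data[:, None], mu, sigma)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update the parameters using the "completed" data.
        n_k = resp.sum(axis=0)
        mu = (resp * data[:, None]).sum(axis=0) / n_k
        sigma = np.sqrt((resp * (data[:, None] - mu) ** 2).sum(axis=0) / n_k)
        weight = n_k / len(data)

    print("means:", mu.round(2), "std devs:", sigma.round(2), "weights:", weight.round(2))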
[Flow chart of the EM algorithm: initialize parameters → E-step → M-step → check for convergence → repeat until converged.]
Usage of EM algorithm –
1. It can be used to fill in missing data in a sample.
2. It can be used as the basis of unsupervised learning of clusters.
3. It can be used to estimate the parameters of a Hidden Markov Model (HMM).
4. It can be used to discover the values of latent variables.
Advantages of EM algorithm –
1. The likelihood is guaranteed not to decrease with each iteration.
2. The E-step and M-step are often pretty easy for many problems in terms of implementation.
3. Solutions to the M-step often exist in closed form.
Disadvantages of EM algorithm –
1. It has slow convergence.
2. It converges only to a local optimum.
3. It requires both the forward and backward probabilities (numerical optimization requires only the
forward probability).
