
EXPECTATION MAXIMIZATION (EM) ALGORITHM
Parameter estimation is a fundamental concept in statistics and
machine learning, involving the process of determining the values of
parameters that best describe a statistical model or distribution. The
goal is to find the most likely values for the parameters based on
observed data.
In real-world applications of machine learning, it is very common that many relevant features are available for learning, but only a small subset of them is observable.

So, for variables that are sometimes observable and sometimes not, we can use the instances in which the variable is observed for the purpose of learning, and then predict its value in the instances in which it is not observable.
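As a minimal sketch of this idea (not from the slides), suppose one variable is observed for some instances and missing for others; we can fit a model on the observed instances and then predict the variable where it is missing. The simulated data, the missingness pattern, and the use of LinearRegression below are illustrative assumptions.

# Fit on observed instances of a variable, predict it where it is unobserved.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # always-observed features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)
y[rng.random(100) < 0.3] = np.nan                  # this variable is sometimes unobserved

observed = ~np.isnan(y)
model = LinearRegression().fit(X[observed], y[observed])   # learn from observed instances
y_filled = y.copy()
y_filled[~observed] = model.predict(X[~observed])          # predict it where unobserved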
Maximum likelihood estimation is an approach to density estimation for a dataset by searching across probability distributions and their parameters.

It is a general and effective approach that underlies many machine learning algorithms, although it requires that the training dataset is complete, e.g. all relevant interacting random variables are present.

Maximum likelihood becomes intractable if there are variables that interact with those in the dataset but were hidden or not observed, so-called latent variables. (An intractable problem is one that can be solved in theory, e.g. given large but finite resources, especially time, but for which in practice any solution takes too many resources to be useful.)
MAXIMUM LIKELIHOOD ESTIMATION (MLE)

Maximum Likelihood Estimation (MLE) is a method used for estimating the parameters of a statistical model. The basic idea behind MLE is to find the parameter values that maximize the likelihood function, which measures how well the model explains the observed data.
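For a single Gaussian, the likelihood-maximizing parameters have well-known closed forms (the sample mean and sample standard deviation), which the following minimal sketch illustrates; the simulated data is an assumption for illustration only.

# MLE for a single Gaussian: the sample mean and standard deviation maximize the likelihood.
import numpy as np
from scipy.stats import norm

data = np.random.normal(loc=20, scale=5, size=1000)   # observed sample (assumed for illustration)

mu_hat = data.mean()        # MLE of the Gaussian mean
sigma_hat = data.std()      # MLE of the Gaussian standard deviation
log_likelihood = norm.logpdf(data, mu_hat, sigma_hat).sum()
print(mu_hat, sigma_hat, log_likelihood)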
The expectation-maximization algorithm is an approach for performing maximum likelihood estimation in the presence of latent variables.

It does this by first estimating the values for the latent variables, then optimizing the model, then repeating these two steps until convergence.

It is an effective and general approach and is most commonly used for density estimation with missing data, such as clustering algorithms like the Gaussian Mixture Model.
PROBLEM OF LATENT VARIABLES FOR MAXIMUM
LIKELIHOOD

A common modeling problem involves how to estimate a joint probability distribution for a dataset.

Density estimation involves selecting a probability distribution function and the parameters of that distribution that best explain the joint probability distribution of the observed data.

There are many techniques for solving this problem, although a common approach is called maximum likelihood estimation, or simply "maximum likelihood."

Maximum Likelihood Estimation involves treating the problem as an optimization or search problem, where we seek a set of parameters that results in the best fit for the joint probability of the data sample.
LATENT VARIABLES.

A limitation of maximum likelihood estimation is that it assumes that the dataset is complete, or fully observed.

This does not mean that the model has access to all data; instead, it assumes that all variables that are relevant to the problem are present.

This is not always the case. There may be datasets where only some of the relevant variables can be observed, and some cannot, and although they influence other random variables in the dataset, they remain hidden.

More generally, these unobserved or hidden variables are referred to as latent variables.
PLATO’S ALLEGORY OF THE CAVE

[Diagram: observed feature vectors (e.g. [8.67, 12.8564, 0.44875, 874.22, …] and [4.59, 13.2548, 1.14569, 148.25, …]) are the observable variables; the underlying source that generated them is the latent (hidden) variable.]

In the allegory, the prisoners have data they can observe, but that data is actually produced by something they cannot observe: a higher-level, abstract representation of the data. We want to learn something about that abstract representation.
LIMITATION OF MAXIMUM LIKELIHOOD
ESTIMATION

Conventional maximum likelihood estimation does not work well in the presence of latent variables.

If we have missing data and/or latent variables, then computing the [maximum likelihood] estimate becomes hard.

Instead, an alternate formulation of maximum likelihood is required for searching for the appropriate model parameters in the presence of latent variables.

The Expectation-Maximization algorithm is one such approach.
EXPECTATION-MAXIMIZATION ALGORITHM
The EM algorithm is an iterative approach that cycles between two modes. The first mode attempts to estimate the missing or latent variables, called the estimation-step or E-step.

The second mode attempts to optimize the parameters of the model to best explain the data, called the maximization-step or M-step.

E-Step. Estimate the missing variables in the dataset.

M-Step. Maximize the parameters of the model in the presence of the data.

The EM algorithm can be applied quite widely, although it is perhaps most well known in machine learning for use in unsupervised learning problems, such as density estimation and clustering.

Perhaps the most discussed application of the EM algorithm is for clustering with a mixture model.
GAUSSIAN MIXTURE MODEL AND THE EM
ALGORITHM

A mixture model is a model comprised of an unspecified combination of multiple probability distribution functions.

A statistical procedure or learning algorithm is used to estimate the parameters of the probability distributions to best fit the density of a given training dataset.

The Gaussian Mixture Model, or GMM for short, is a mixture model that uses a combination of Gaussian (Normal) probability distributions and requires the estimation of the mean and standard deviation parameters for each.

There are many techniques for estimating the parameters for a GMM, although a maximum likelihood estimate is perhaps the most common.
Consider the case where a dataset is comprised of many points that happen to be generated by two different processes.

The points for each process have a Gaussian probability distribution, but the data is combined and the distributions are similar enough that it is not obvious to which distribution a given point may belong.

The process used to generate each data point represents a latent variable, e.g. process 0 and process 1. It influences the data but is not observable.

As such, the EM algorithm is an appropriate approach to use to estimate the parameters of the distributions.
EM EXPLAINED
In the EM algorithm, the estimation-step would estimate a value for the process latent variable for each data point, and the maximization-step would optimize the parameters of the probability distributions in an attempt to best capture the density of the data.

The process is repeated until a good set of latent values and a maximum likelihood is achieved that fits the data.

E-Step. Estimate the expected value for each latent variable.

M-Step. Optimize the parameters of the distribution using maximum likelihood.

We can imagine how this optimization procedure could be constrained to just the distribution means, or generalized to a mixture of many different Gaussian distributions.
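To make the two steps concrete, the following is a minimal hand-rolled sketch of EM for a two-component, one-dimensional Gaussian mixture; the function name, the initialization strategy, and the fixed number of iterations are illustrative choices, not from the slides.

# Minimal EM for a two-component, 1D Gaussian mixture (illustrative sketch).
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_iter=50):
    # crude initial guesses for the means, standard deviations and mixing weights
    mus = np.array([x.min(), x.max()], dtype=float)
    sigmas = np.array([x.std(), x.std()])
    weights = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point
        dens = np.stack([w * norm.pdf(x, m, s)
                         for w, m, s in zip(weights, mus, sigmas)])
        resp = dens / dens.sum(axis=0)
        # M-step: re-estimate the parameters from the responsibility-weighted data
        nk = resp.sum(axis=1)
        mus = (resp @ x) / nk
        sigmas = np.sqrt((resp * (x - mus[:, None]) ** 2).sum(axis=1) / nk)
        weights = nk / len(x)
    return weights, mus, sigmas

Applied to data drawn from two overlapping Gaussians, the recovered means, standard deviations, and weights should approach those of the generating processes.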
On the other hand, the Expectation-Maximization algorithm can be used for latent variables (variables that are not directly observable and are instead inferred from the values of other observed variables) in order to predict their values, on the condition that the general form of the probability distribution governing those latent variables is known to us.

This algorithm is actually at the base of many unsupervised clustering algorithms in the field of machine learning.
ALGORITHM:

Given a set of incomplete data, consider a set of starting parameters.

Expectation step (E-step): using the observed available data of the dataset, estimate (guess) the values of the missing data.

Maximization step (M-step): the complete data generated after the expectation (E) step is used to update the parameters.

Repeat step 2 and step 3 until convergence.

The essence of the Expectation-Maximization algorithm is to use the available observed data of the dataset to estimate the missing data, and then use that estimated data to update the values of the parameters.
EM ALGORITHM EXPLAINED
Initially, a set of initial values of the parameters is considered. A set of incomplete observed data is given to the system with the assumption that the observed data comes from a specific model.

The next step is known as the "Expectation" step, or E-step. In this step, we use the observed data in order to estimate or guess the values of the missing or incomplete data. It is basically used to update the variables.

The next step is known as the "Maximization" step, or M-step. In this step, we use the complete data generated in the preceding "Expectation" step in order to update the values of the parameters. It is basically used to update the hypothesis.

Now, in the fourth step, it is checked whether the values are converging or not; if yes, then stop, otherwise repeat step 2 and step 3, i.e. the "Expectation" and "Maximization" steps, until convergence occurs. A generic sketch of this loop is given below.
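The following generic skeleton of the loop, including the convergence check in step 4, is only a sketch; the e_step and m_step callables are problem-specific placeholders, and the tolerance-based stopping rule is an illustrative assumption.

# Generic EM loop with a simple convergence check (step 4).
import numpy as np

def run_em(data, theta_init, e_step, m_step, tol=1e-6, max_iter=200):
    theta = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        expectations = e_step(data, theta)       # step 2: guess the missing/latent values
        theta_new = m_step(data, expectations)   # step 3: update the parameters (hypothesis)
        if np.max(np.abs(theta_new - theta)) < tol:   # step 4: converged?
            return theta_new
        theta = theta_new
    return theta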
FLOWCHART
EXAMPLE OF GAUSSIAN MIXTURE MODEL

We can make the application of the EM algorithm to a Gaussian Mixture Model concrete with a worked example.

First, let's contrive a problem where we have a dataset where points are generated from one of two Gaussian processes.

The points are one-dimensional, the mean of the first distribution is 20, the mean of the second distribution is 40, and both distributions have a standard deviation of 5.

We will draw 3,000 points from the first process and 7,000 points from the second process and mix them together.

We can then plot a histogram of the points to give an intuition for the dataset. We expect to see a bimodal distribution with a peak for each of the means of the two distributions.
EXAMPLE PLOT
# example of a bimodal sample constructed from two gaussian processes
from numpy import hstack
from numpy.random import normal
from matplotlib import pyplot
# generate a sample
X1 = normal(loc=20, scale=5, size=3000)
X2 = normal(loc=40, scale=5, size=7000)
X = hstack((X1, X2))
# plot the histogram
pyplot.hist(X, bins=50, density=True)
pyplot.show()
HISTOGRAM
Running the example creates the dataset and then creates a histogram plot for the data points.

The plot clearly shows the expected bimodal distribution, with a peak for the first process around 20 and a peak for the second process around 40.

We can see that for many of the points in the middle of the two peaks it is ambiguous as to which distribution they were drawn from.
We can model the problem of estimating the density of this dataset using a Gaussian Mixture Model.

The GaussianMixture scikit-learn class can be used to model this problem and estimate the parameters of the distributions using the expectation-maximization algorithm.

The class allows us to specify the suspected number of underlying processes used to generate the data via the n_components argument when defining the model. We will set this to 2 for the two processes or distributions.
If the number of processes was not known, a range of different numbers of components could be tested and the model with the best fit could be chosen, where models could be evaluated using scores such as the Akaike or Bayesian Information Criterion (AIC or BIC), as in the sketch below.

There are also many ways we can configure the model to incorporate other information we may know about the data, such as how to estimate initial values for the distributions.

In this case, we will randomly guess the initial parameters by setting the init_params argument to 'random'.
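As a minimal sketch of the model-selection idea mentioned above (not part of the original worked example), one could sweep n_components and keep the model with the lowest BIC; this assumes the sample X generated earlier.

# Illustrative sketch: choose the number of mixture components by BIC.
from sklearn.mixture import GaussianMixture

X2d = X.reshape((len(X), 1))      # scikit-learn expects a 2D array
bic_scores = {}
for k in range(1, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X2d)
    bic_scores[k] = gm.bic(X2d)   # lower BIC = better fit/complexity trade-off
best_k = min(bic_scores, key=bic_scores.get)
print(bic_scores, best_k)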
EXAMPLE CODE
...
# fit model
from sklearn.mixture import GaussianMixture
# reshape the 1D sample into a column vector, as scikit-learn expects 2D input
X = X.reshape((len(X), 1))
model = GaussianMixture(n_components=2, init_params='random')
model.fit(X)
Once the model is fit, we can access the learned parameters via attributes on the model, such as the means, covariances, mixing weights, and more.

More usefully, we can use the fit model to estimate the latent parameters for existing and new data points.

For example, we can estimate the latent variable for the points in the training dataset, and we would expect the first 3,000 points to belong to one process (e.g. value=1) and the next 7,000 data points to belong to a different process (e.g. value=0).
EXAMPLE CODE

...
# predict latent values
yhat = model.predict(X)
# check latent value for first few points
print(yhat[:100])
# check latent value for last few points
print(yhat[-100:])

Running the example fits the Gaussian mixture model on the prepared dataset using the EM algorithm. Once fit, the model is used to predict the latent variable values for the examples in the training dataset.
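As a small follow-up sketch (not from the slides), the learned parameters can also be inspected directly; the attribute names below are the standard scikit-learn ones.

# Inspect the learned parameters; the means should be close to 20 and 40,
# and the mixing weights close to 0.3 and 0.7 (in some order).
print(model.weights_)       # mixing weights
print(model.means_)         # component means
print(model.covariances_)   # component covariances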
USAGE OF EM ALGORITHM –

It can be used to fill the missing data in a sample.

It can be used as the basis of unsupervised learning of clusters.

It can be used for the purpose of estimating the parameters of the Hidden Markov Model (HMM).

It can be used for discovering the values of latent variables.
ADVANTAGES OF EM ALGORITHM –

It is always guaranteed that likelihood will increase with each iteration.

The E-step and M-step are often pretty easy for many problems in terms of implementation.

Solutions to the M-steps often exist in the closed form; for the Gaussian mixture case these closed-form updates are sketched below.
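As an illustration of that last point (notation mine, not from the slides): for a one-dimensional Gaussian mixture, with responsibilities r_{nk} computed in the E-step, the M-step updates have the closed forms

\[
\pi_k = \frac{1}{N}\sum_{n=1}^{N} r_{nk}, \qquad
\mu_k = \frac{\sum_{n=1}^{N} r_{nk}\, x_n}{\sum_{n=1}^{N} r_{nk}}, \qquad
\sigma_k^2 = \frac{\sum_{n=1}^{N} r_{nk}\,(x_n - \mu_k)^2}{\sum_{n=1}^{N} r_{nk}}.
\]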
DISADVANTAGES OF EM ALGORITHM –

It has slow convergence.

It converges to a local optimum only.

It requires both the forward and backward probabilities (numerical optimization requires only the forward probability).
