
[MRG+2] GSoC Final : Dirichlet Gaussian Mixture #7295


Closed
tguillemot wants to merge 12 commits from the GSoC-Dpgmm branch

Conversation

@tguillemot (Contributor) commented Aug 30, 2016

This closes #7377, closes #7115, closes #2473, closes #2454, closes #1764 and closes #1637.

This is the last PR needed to completely remove the old GMM classes.

Here you'll find the DirichletGaussianMixture class, along with its documentation, examples, and tests.

It will be easier to review once #6651 is merged (there will also be no conflicts then).

I have removed the example plot_gmm_sin.py because it wasn't showing the properties of DPGMM correctly (it modifies the covariance_type over the course of the experiment to obtain better results).

Instead, I prefer to introduce an example similar to the one I introduced for the BayesianGaussianMixture.
(screenshot: dpgmm)

@tguillemot changed the title from "[WIP] GSoC Final : Dirichlet Gaussian Mixture" to "[MRG] GSoC Final : Dirichlet Gaussian Mixture" on Aug 31, 2016

The examples above compare Gaussian mixtures models with fixed number of
components, to the Dirichlet Gaussian Mixtures models. **On the left** the GMM
Review comment (Member):

GMM -> Gaussian mixtures

@tguillemot (Contributor Author)

@TomDLT Thanks for this first round of review.

@ogrisel (Member) commented Aug 31, 2016

I am in favor of keeping the "sin" example (adapted to use the new class). While I agree that it is a weird and artificial dataset, I also appreciate the facts that:

  • it's bad to break sacred links on the Holy Web,
  • it's good to show that models can in practice still be used (and be somewhat useful) even if the data violates the assumptions of the underlying generative model.

@tguillemot (Contributor Author)

OK, fair enough. I've modified the example to plot something that makes sense (changing only the beta concentration prior, not the covariance_type).
(screenshot: gmm_sin)

This class doesn't require the user to choose the number of components:
at the expense of extra computational time, the user only needs to
specify a loose upper bound on this number and a concentration
parameter.
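To make the quoted paragraph concrete, here is a minimal sketch (not from this PR) assuming the BayesianGaussianMixture API that appears in the diff later in this thread; the data and parameter values are illustrative.

    # Sketch: the Dirichlet process prior only needs a loose upper bound on
    # the number of components; superfluous components get near-zero weights.
    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.RandomState(0)
    # Two well-separated blobs, but n_components is deliberately over-specified.
    X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

    dpgmm = BayesianGaussianMixture(
        n_components=10,  # loose upper bound
        weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=0.01,  # low value favors fewer components
        max_iter=500, random_state=0).fit(X)

    # Most of the 10 weights collapse toward zero; roughly two stay active.
    print(np.round(dpgmm.weights_, 3))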

.. |plot_gmm| image:: ../auto_examples/mixture/images/sphx_glr_plot_gmm_001.png
:target: ../auto_examples/mixture/plot_gmm.html
:scale: 48%
:scale: 31%
Review comment (Member):

I guess the rendered page is not as intended.

Reply from @tguillemot (Contributor Author):

I was wondering what the result would be, and it's ugly :).
I'll change that.

@TomDLT (Member) commented Sep 1, 2016

This seems pretty clean to me

@tguillemot force-pushed the GSoC-Dpgmm branch 2 times, most recently from bfedbe4 to 6ab1324 on September 2, 2016 at 08:20
@tguillemot (Contributor Author)

This is the new doc.

@agramfort @ogrisel Do you think we can merge this for 0.18?

The BIC criterion can be used to select the number of components in a Gaussian
Mixture in an efficient way. In theory, it recovers the true number of
components only in the asymptotic regime (i.e. if much data is available). Note
that using a :ref:`DirichletGaussianMixture <dpgmm>` avoids the specification of
@raghavrv (Member) commented Sep 2, 2016:

Since this is not a class ref, could we have this as Dirichlet Gaussian Mixture?
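For context, a minimal sketch (not part of this PR) of the BIC-based selection that the quoted paragraph contrasts with the Dirichlet prior; the toy data is an illustrative assumption.

    # Sketch: pick the number of components by minimizing the BIC.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

    bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in range(1, 7)]
    best_k = int(np.argmin(bics)) + 1  # lower BIC is better
    print("BIC selects", best_k, "components")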

model with the Dirichlet Process. In practice the approximate Dirichlet
Process inference algorithm uses a truncated distribution with a fixed
maximum number of components (called the Stick-breaking representation),
but almost always the number of components actually used depends on the
Review comment (Member):

Cut the sentence:

representation). The number of components actually used almost always depends on the data.
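To make "the number of components actually used" concrete, a small sketch reusing the dpgmm fit from the snippet earlier in this thread; the 1e-2 threshold is an arbitrary illustrative choice.

    # Sketch: count the components of the truncated model that are actually
    # used, i.e. those whose posterior weight is non-negligible.
    import numpy as np
    n_active = int(np.sum(dpgmm.weights_ > 1e-2))
    print(n_active, "active components out of", dpgmm.n_components)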

@ogrisel (Member) commented Sep 8, 2016

CI is broken because of a broken import and a PEP8 issue.

Also, could you please try to make the following test run in less than 1s by tweaking the training data size or the hyperparameters of the model?

sklearn.mixture.tests.test_bayesian_mixture.test_monotonic_likelihood: 4.5968s

@tguillemot force-pushed the GSoC-Dpgmm branch 2 times, most recently from 8f0544d to 1b45753 on September 9, 2016 at 13:14
@tguillemot (Contributor Author)

Thanks @ogrisel.
I've changed the doc and added the missing tests.

:class:`GaussianMixture` and :class:`BayesianGaussianMixture` to fit a
sine wave.

* See :ref:`sphx_glr_auto_example_mixture_plot_concentration_prior.py`
Review comment (Member):

This reference is broken.

@tguillemot (Contributor Author)

I have a problem with a ghost file in Travis. @ogrisel, can you clear the cache?

@@ -111,7 +118,14 @@ class BayesianGaussianMixture(BaseMixture):
'kmeans' : responsibilities are initialized using kmeans.
'random' : responsibilities are initialized randomly.

dirichlet_concentration_prior : float | None, optional.
weight_concentration_prior_type : {'dirichlet_process',
'dirichlet_distribution'}, defaults to 'full'.
Review comment (Member):

Two problems:

  • This new line breaks the sphinx rendering in classes.rst. I am not sure how this should be fixed.
  • The default value is 'dirichlet_process', not 'full'.

Reply from @tguillemot (Contributor Author):

I don't know how to fix the first problem, because if I put it on a single line PEP8 will not be happy.
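For readers following the API discussion, a sketch of the two values this parameter accepts (the 'full' in the docstring above is the typo being flagged; the actual default is 'dirichlet_process').

    from sklearn.mixture import BayesianGaussianMixture

    # Truncated "infinite" mixture via the stick-breaking representation
    # (the actual default).
    dp = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type='dirichlet_process')

    # Classical finite mixture with a symmetric Dirichlet prior on the weights.
    dd = BayesianGaussianMixture(
        n_components=10,
        weight_concentration_prior_type='dirichlet_distribution')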

@ogrisel (Member) commented Sep 9, 2016

I think the red Travis builds were caused by old cached versions of deleted Python modules and test files. I manually deleted the cache for this PR in Travis and relaunched the build to check whether that fixes it.

@ogrisel (Member) commented Sep 9, 2016

pyflakes has caught an unused variable:

./sklearn/mixture/tests/test_bayesian_mixture.py:403:19: F841 local variable 'n_features' is assigned to but never used
    n_components, n_features = 2 * rand_data.n_components, 2
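One possible fix (a sketch, not necessarily the change applied in this PR) is to keep only the value that is actually used:

    n_components = 2 * rand_data.n_components  # drop the unused n_features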

process prior, however, show that the model can either learn a global structure
for the data (small ``weight_concentration_prior``) or easily interpolate to
finding relevant local structure (large ``weight_concentration_prior``), never
falling into the problems shown by the ``GaussianMixture`` class.
@ogrisel (Member) commented Sep 9, 2016:

I don't agree with this analysis. Let me suggest the following instead:

This example demonstrates the behavior of Gaussian mixture models on data that was not generated by a mixture of Gaussian random variables. The dataset is formed by 100 points loosely spaced following a noisy sine curve. There is therefore no ground truth value for the number of Gaussian components.

The first model is a classical Gaussian Mixture Model with 10 components fit with the Expectation Maximization algorithm.

The second model is a Bayesian Gaussian Mixture Model with a Dirichlet process prior fit with variational inference. The low value of the concentration prior makes the model favor a lower number of active components. This model "decides" to focus its modeling power on the big picture of the structure of the dataset: groups of points with alternating directions modeled by non-spherical covariance matrices. Those alternating directions roughly capture the alternating nature of the original sine signal.

The third model is also a Bayesian Gaussian Mixture Model with a Dirichlet process prior, but this time the value of the concentration prior is higher, giving the model more liberty to model the finer-grained structure of the data. The result is a mixture with a larger number of active components that is similar to the first model, where we decided to fix the number of components to 10 arbitrarily.

Which model is the best is a matter of subjective judgement: do we want to favor models that only capture the big picture to summarize and explain most of the structure of the data while ignoring the details or do we prefer models that closely follow the high density regions of the signal?

The last two panels show how we can sample from the last two models. The resulting sample distributions do not look exactly like the original data distribution. The difference primarily stems from the approximation error we made by using a model that assumes that the data was generated by a finite number of Gaussian components instead of a continuous noisy sine curve.
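A condensed sketch of the three fits described above; the data generation and concentration values are illustrative assumptions, not the exact ones from plot_gmm_sin.py.

    import numpy as np
    from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

    rng = np.random.RandomState(0)
    n = 100
    t = 4 * np.pi / n * np.arange(n)
    X = np.column_stack([t, np.sin(t)]) + 0.1 * rng.randn(n, 2)  # noisy sine

    # 1) Classical GMM, number of components fixed arbitrarily to 10.
    gmm = GaussianMixture(n_components=10, covariance_type='full',
                          random_state=0).fit(X)

    # 2) Dirichlet process prior, low concentration: few active components.
    low = BayesianGaussianMixture(
        n_components=10, weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=1e-2, max_iter=500, random_state=0).fit(X)

    # 3) Same prior, higher concentration: more active components, close to 1).
    high = BayesianGaussianMixture(
        n_components=10, weight_concentration_prior_type='dirichlet_process',
        weight_concentration_prior=1e2, max_iter=500, random_state=0).fit(X)

    # The last two panels sample from the fitted Bayesian models.
    X_low, _ = low.sample(200)
    X_high, _ = high.sample(200)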

@ogrisel (Member) commented Sep 9, 2016

This time the Travis error is real, I think:

https://travis-ci.org/scikit-learn/scikit-learn/jobs/158774349#L2336

@ogrisel (Member) commented Sep 9, 2016

I think I am done with the review. +1 for merge once CI is green and my last comment on plot_gmm_sin.py has been taken into account. Thanks very much for bearing with me @tguillemot :)

@ogrisel (Member) commented Sep 10, 2016

I fixed the CI failure and addressed the doc of the example in #7386. If it's green, I will merge.

@ogrisel (Member) commented Sep 10, 2016

Merged as #7386 🍻

@ogrisel closed this Sep 10, 2016
@tguillemot (Contributor Author)

@ogrisel Sorry I had to go yesterday. Thanks for taking care of that.

@tguillemot (Contributor Author)

Thanks everyone for your reviews and help!
