[MRG+2] GSoC Final : Dirichlet Gaussian Mixture #7295
Conversation
The examples above compare Gaussian mixtures models with fixed number of components, to the Dirichlet Gaussian Mixtures models. **On the left** the GMM
GMM -> Gaussian mixtures
@TomDLT Thanks for this first round of review.

I am in favor of keeping the "sin" example (adapted to use the new class). While I agree that it is a weird and artificial dataset, I also appreciate the facts that:
This class doesn't require the user to choose the number of components, and at the expense of extra computational time the user only needs to specify a loose upper bound on this number and a concentration parameter.
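For readers skimming the diff, here is a minimal sketch of what this paragraph promises, using the `BayesianGaussianMixture` API this PR series introduces (the data and hyperparameter values are made up for illustration):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Two well-separated blobs, but we only give a loose upper bound of 10.
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

bgmm = BayesianGaussianMixture(
    n_components=10,                  # loose upper bound, not the true number
    weight_concentration_prior=0.01,  # small value favors few active components
    max_iter=500,
    random_state=0,
).fit(X)

# Components with non-negligible weight are the ones the model actually uses.
print("active components:", int(np.sum(bgmm.weights_ > 0.01)))
```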
.. |plot_gmm| image:: ../auto_examples/mixture/images/sphx_glr_plot_gmm_001.png
   :target: ../auto_examples/mixture/plot_gmm.html
-  :scale: 48%
+  :scale: 31%
I guess the rendered page is not as intended.
I was wondering what the result would be, and it's ugly :). I'll change that.
This seems pretty clean to me
Force-pushed from bfedbe4 to 6ab1324.
@agramfort @ogrisel Do you think we can merge that for 0.18???
The BIC criterion can be used to select the number of components in a Gaussian Mixture in an efficient way. In theory, it recovers the true number of components only in the asymptotic regime (i.e. if much data is available). Note that using a :ref:`DirichletGaussianMixture <dpgmm>` avoids the specification of
Since this is not a class ref, could we have this as "Dirichlet Gaussian Mixture"?
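For context, a minimal sketch of the BIC-based selection that the quoted passage describes: fit `GaussianMixture` for several values of `n_components` and keep the one with the lowest BIC (synthetic data, illustrative values):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(42)
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 4])  # 2 true components

# Lower BIC is better; it trades off likelihood against model complexity.
bics = [GaussianMixture(n_components=n, random_state=0).fit(X).bic(X)
        for n in range(1, 7)]
best_n = int(np.argmin(bics)) + 1
print("BIC selects", best_n, "components")  # expected: 2, given enough data
```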
model with the Dirichlet Process. In practice the approximate Dirichlet Process inference algorithm uses a truncated distribution with a fixed maximum number of components (called the Stick-breaking representation), but almost always the number of components actually used depends on the
Cut the sentence: "representation). The number of components actually used almost always depends on the data."
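As background for the stick-breaking discussion above, a back-of-the-envelope sketch of the truncated construction (not scikit-learn code, just the generative recipe for the mixing weights):

```python
import numpy as np

rng = np.random.RandomState(0)
alpha = 1.0  # concentration parameter
T = 10       # truncation level: the fixed maximum number of components

v = rng.beta(1.0, alpha, size=T)                        # stick fractions
remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
weights = v * remaining                                 # mixing weights

# With a small alpha most of the mass lands on the first few components,
# so the number of *active* components adapts even though T is fixed.
print(weights.round(3), "sum:", weights.sum().round(3))  # sum < 1 (truncation)
```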
CI is broken because of a broken import and a PEP8 issue. Also, could you please try to make the following test run in less than 1s by tweaking the training data size or the hyperparameters of the model?
Force-pushed from 8f0544d to 1b45753.
Thanks @ogrisel.
:class:`GaussianMixture` and :class:`BayesianGaussianMixture` to fit a sine wave.
* See :ref:`sphx_glr_auto_example_mixture_plot_concentration_prior.py`
This reference is broken.
I have a problem with a ghost file in Travis. @ogrisel can you remove the cache?
Force-pushed from 1b45753 to e9be320.
@@ -111,7 +118,14 @@ class BayesianGaussianMixture(BaseMixture):
     'kmeans' : responsibilities are initialized using kmeans.
     'random' : responsibilities are initialized randomly.

-    dirichlet_concentration_prior : float | None, optional.
+    weight_concentration_prior_type : {'dirichlet_process',
+        'dirichlet_distribution'}, defaults to 'full'.
Two problems:

- This new line breaks the sphinx rendering in classes.rst. I am not sure how this should be fixed.
- The default value is `'dirichlet_process'`, not `'full'`.
I don't know how to fix the first problem, because if I put it on a single line PEP8 will not be happy.
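For context, the two values of the parameter under discussion can be exercised like this (a minimal sketch against the merged API; the data and `n_components` are illustrative):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

X = np.random.RandomState(0).randn(200, 2)

# Truncated Dirichlet process prior (stick-breaking); the actual default.
dp = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_process").fit(X)

# Finite mixture with a symmetric Dirichlet prior on the weights.
dd = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type="dirichlet_distribution").fit(X)

print(dp.weights_.round(3))
print(dd.weights_.round(3))
```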
I think the red travis builds were caused by old cached versions of deleted python modules and test files. I manually deleted the cache for this PR in travis and relaunched the build to check if that fixes it.
pyflakes has caught an unused variable:
Force-pushed from e9be320 to 15afbec.
process prior, however, show that the model can either learn a global structure for the data (small ``weight_concentration_prior``) or easily interpolate to finding relevant local structure (large ``weight_concentration_prior``), never falling into the problems shown by the ``GaussianMixture`` class.
I don't agree with this analysis. Let me suggest the following instead:

This example demonstrates the behavior of Gaussian mixture models on data that was not generated by a mixture of Gaussian random variables. The dataset is formed by 100 points loosely spaced along a noisy sine curve. There is therefore no ground-truth value for the number of Gaussian components.

The first model is a classical Gaussian mixture model with 10 components fit with the Expectation-Maximization algorithm.

The second model is a Bayesian Gaussian mixture model with a Dirichlet process prior fit with variational inference. The low value of the concentration prior makes the model favor a lower number of active components. This model "decides" to focus its modeling power on the big picture of the structure of the dataset: groups of points with alternating directions modeled by non-spherical covariance matrices. Those alternating directions roughly capture the alternating nature of the original sine signal.

The third model is also a Bayesian Gaussian mixture model with a Dirichlet process prior, but this time the value of the concentration prior is higher, giving the model more liberty to model the finer-grained structure of the data. The result is a mixture with a larger number of active components that is similar to the first model, where we arbitrarily decided to fix the number of components to 10.

Which model is best is a matter of subjective judgement: do we want to favor models that only capture the big picture, summarizing and explaining most of the structure of the data while ignoring the details, or do we prefer models that closely follow the high-density regions of the signal?

The last two panels show how we can sample from the last two models. The resulting sample distributions do not look exactly like the original data distribution. The difference primarily stems from the approximation error we made by using a model that assumes the data was generated by a finite number of Gaussian components instead of a continuous noisy sine curve.
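A condensed sketch of the three models described above, fit on a noisy sine curve (the hyperparameter values are illustrative, chosen only to show the small-vs-large concentration effect):

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.RandomState(0)
t = 4 * np.pi * rng.rand(100)
X = np.column_stack([t, np.sin(t) + 0.15 * rng.randn(100)])  # noisy sine

# 1) classical EM fit with a fixed number of components
em = GaussianMixture(n_components=10, random_state=0).fit(X)

# 2) Dirichlet process prior, low concentration -> few active components
dp_low = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior=1e-2,
    max_iter=1000, random_state=0).fit(X)

# 3) same prior, high concentration -> many active components
dp_high = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior=1e2,
    max_iter=1000, random_state=0).fit(X)

for name, m in [("EM", em), ("DP low", dp_low), ("DP high", dp_high)]:
    print(name, "active components:", int(np.sum(m.weights_ > 0.01)))

# Sampling from a fitted Bayesian model, as in the last two panels:
X_sampled, _ = dp_high.sample(100)
```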
This time the travis error is for real I think: https://travis-ci.org/scikit-learn/scikit-learn/jobs/158774349#L2336
I think I am done with the review. +1 for merge once CI is green and my last comment on
I fixed the CI failure and addressed the doc of the example in #7386. If it's green, I will merge.
Merged as #7386 🍻
@ogrisel Sorry I had to go yesterday. Thanks for taking care of that.
Thanks everyone for your reviews and help!!!
This closes #7377, closes #7115, closes #2473, closes #2454, closes #1764 and closes #1637.

This is the last PR needed to completely remove the old GMM classes. Here you'll find the `DirichletGaussianMixture` class with the doc, examples and tests. It will be easier to review once #6651 is merged (there will also be no conflicts).

I have removed the example `plot_gmm_sin.py` because it wasn't showing the properties of DPGMM correctly for me (it modifies the `covariance_type` through the experiment to obtain better results). Instead, I prefer to introduce an example similar to the one I introduced for `BayesianGaussianMixture`.

