[RFC] GSoC2015 Improve GMM API #4802


Conversation

@xuewei4d commented Jun 2, 2015

GSoC 2015 project, Improve GMM API

  1. Completes the derivation report.
  2. Adds new classes: one abstract class _BasesMixture and three derived classes GaussianMixture, BayesianGaussianMixture, DirichletProcessGaussianMixture (a rough skeleton is sketched after this list).
  3. Decouples large functions, especially in DirichletProcessGaussianMixture and BayesianGaussianMixture
  4. Removes numerical stability fixes for HMM. It seems that whenever there is a numerical issue, the code always adds 10*EPS in the computation. I think in some cases there is a better way to address the problem, such as normalizing the extremely small variables earlier, or we can simply remove 10*EPS, which is only there for HMM.
  5. Writes updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel according to the report.
  6. Provides methods that allow users to initialize the model with user-provided data
  7. Corrects kmeans initialization. It is weird that, when using kmeans initialization, only the means are initialized; the weights and covariances are initialized by averaging.
  8. Writes several checking functions for the initialization data
  9. Adjusts the verbose messages. When verbose > 1, it displays the log-likelihood and the time used on the same line as the message Iteration x
  10. Removes fit_predict
  11. Adds a warning for params != 'wmc'
  12. Studies and contrasts the convergence of classical MLE / EM GMM with Bayesian GMM against the number of samples and the number of components
  13. Friendly warning and error messages, or automatic fixes where possible (e.g. random re-initialization of singular components)
  14. Examples that show how models can overfit, by comparing the likelihood on training and validation sets (normalized by the number of samples); for instance, extend the BIC score example with a cross-validated likelihood plot
  15. Testing on 1-D data
  16. Testing on degenerate cases
  17. [optional] Adds a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  18. [optional] ledoit_wolf covariance estimation or PCA(n_components).fit(X_train).get_covariance()
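A rough skeleton of the class hierarchy proposed in item 2 (the class names follow the description above; the abstract method names are illustrative only, not the final API):

```python
from abc import ABCMeta, abstractmethod


class _BasesMixture(metaclass=ABCMeta):
    """Shared EM machinery; concrete mixtures implement their own update rules."""

    @abstractmethod
    def _e_step(self, X):
        """Compute responsibilities of each sample under the current parameters."""

    @abstractmethod
    def _m_step(self, X, resp):
        """Update the model parameters from the responsibilities."""


class GaussianMixture(_BasesMixture):
    """Classical maximum-likelihood EM."""


class BayesianGaussianMixture(_BasesMixture):
    """Variational Bayesian estimation with conjugate priors."""


class DirichletProcessGaussianMixture(_BasesMixture):
    """Stick-breaking (Dirichlet process) variant."""
```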

@xuewei4d commented Jun 2, 2015

I have not yet made KDE be based on DensityMixin. @ogrisel @lesteve @amueller .
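For context, a DensityMixin along the lines discussed here could look roughly like the sketch below (an assumption for illustration, not the code in this PR; it presumes the estimator defines a score_samples method returning per-sample log-likelihoods):

```python
class DensityMixin:
    """Mixin providing a default score for density estimators (sketch)."""

    _estimator_type = "DensityEstimator"

    def score(self, X, y=None):
        # Average per-sample log-likelihood of X under the fitted model.
        return self.score_samples(X).mean()
```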

return logprob

def responsibility_samples(self, X):
Member

What is this method for? Why not just use predict_proba? We would rather not add too many functions to the API.

Contributor Author

Sorry. I didn't realize that we have predict_proba.

@amueller commented Jun 3, 2015

The thing with changing score_samples is that it is an incompatible change. So we have to either a) accept that we break user code (rather not), or b) rename the estimator to GaussianMixtureModel or something, keep the old class behaving like it did, deprecate it, and make the new class behave as we want.

@xuewei4d commented Jun 3, 2015

Sorry for my bone-headedness. I will build a new class GaussianMixtureModel. And we should add deprecation messages in score_samples, right?

@amueller commented Jun 3, 2015

Rename the old class to the new name, then implement GMM by inheriting from the new class and overwriting score_samples. Then deprecate the whole "old" GMM. At least I think that was the consensus.
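A minimal sketch of that rename-and-deprecate pattern (illustrative only; the method bodies and the warning text are assumptions, not the actual scikit-learn code):

```python
import warnings


class GaussianMixtureModel:
    """New canonical estimator with the desired score_samples contract."""

    def score_samples(self, X):
        # Return the per-sample log-likelihoods under the new API.
        raise NotImplementedError


class GMM(GaussianMixtureModel):
    """Old public class, kept temporarily for backward compatibility."""

    def __init__(self, *args, **kwargs):
        warnings.warn("GMM is deprecated; use GaussianMixtureModel instead.",
                      DeprecationWarning)
        super().__init__(*args, **kwargs)

    def score_samples(self, X):
        # Overwrite to preserve the historical return value while deprecated.
        raise NotImplementedError
```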

@xuewei4d commented Jun 4, 2015

Thanks for the advice. Did you mean to rename class GMM to class GaussianMixtureModel and implement GMM in the same file, i.e. gmm.py?

@amueller commented Jun 4, 2015

yes, that sounds good.

@xuewei4d force-pushed the gmm_api branch 3 times, most recently from 04a1df9 to 20bc698 on June 5, 2015 at 14:47
assert_warns_message(DeprecationWarning,
                     "The 'score_samples' method in GMM has been "
                     "deprecated in 0.17 and will be removed in 0.19."
                     "Use 'score_samples' method "
Member

the 'score_samples' method, the 'log_probability', the 'predict_proba' method, the 'responsibility'. Same in all deprecation warnings.

Also I think 'of each sample' sounds better than 'for each sample' in this context.

@xuewei4d changed the title from "[WIP] Improve GMM API" to "[MRG] Improve GMM API: new names and score_samples" on Jun 8, 2015
@xuewei4d changed the title from "[MRG] Improve GMM API: new names and score_samples" to "[WIP] Improve GMM API: new names and score_samples" on Jun 12, 2015
@xuewei4d force-pushed the gmm_api branch 3 times, most recently from 27557c6 to 5627394 on June 13, 2015 at 18:12
@xuewei4d

I am sorry I did not have a very clear picture in the beginning, and there is a lot of work in refactoring GMM. So I removed the GaussianMixtureModel I had created for the score inconsistency problem, and instead created a new class in a separate file, just as a prototype. Here are my thoughts.

  • added the parameters weights, means, covars in __init__ to allow users to initialize the model
  • added methods to check user-provided initialization data
  • added methods that compute the sufficient statistics of the data used in the M-step, namely N_k, \bar{x}_k, S_k (see the sketch after this comment)
  • added estimation functions as abstract methods to the base class _MixtureModelBase, so that GaussianMixtureModel, BayesianMixtureModel and DirichletProcessMixtureModel can implement their own updating functions
  • added DensityMixin to compute score for all density estimators

The code is in its earliest version. I would like to hear comments about the definitions of these classes. Thanks.
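A sketch of what such sufficient-statistics helpers compute for the 'full' covariance case (illustrative only; the function name and exact shapes are assumptions, not the PR's code):

```python
import numpy as np


def estimate_gaussian_suffstats(X, resp):
    """Sufficient statistics N_k, xbar_k, S_k of a Gaussian mixture.

    X has shape (n_samples, n_features); resp has shape
    (n_samples, n_components) and holds the E-step responsibilities.
    """
    nk = resp.sum(axis=0)                   # N_k; may need regularization if ~0
    means = resp.T @ X / nk[:, np.newaxis]  # xbar_k
    n_components, n_features = resp.shape[1], X.shape[1]
    covariances = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]
        covariances[k] = (resp[:, k] * diff.T) @ diff / nk[k]  # S_k
    return nk, means, covariances
```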

@xuewei4d force-pushed the gmm_api branch 3 times, most recently from ef19d4d to e88036c on June 14, 2015 at 22:55
@vene commented Jun 14, 2015

I may be a bit late to the party but if these classes get renamed, may I object to having "Model" in the name, and suggest simply GaussianMixture etc? I mean, most of our estimators are models of some sort, the word doesn't bring much, and the docstrings will still say "Gaussian Mixture Model" and "GMM" for searchability.

@xuewei4d

@vene from my perspective, the name 'GaussianMixtureModel' is OK, because in the literature everybody uses this name. The word 'model' is kind of a part of it.

@xuewei4d changed the title from "[WIP] Improve GMM API: new names and score_samples" to "[WIP] Improve GMM API" on Jun 15, 2015
@xuewei4d

I am confused by the recent PR #4593 @clorenz7 . How can a zero-filled responsibility be a kind of initialization? See Line571. The L1-norm of each responsibility must be 1. If we do need a quick initialization, we should initialize the model parameters weights_, means_, covariances_, not the responsibility.

        # self.n_iter == 0 occurs when using GMM within HMM
        # Need to make sure that there are responsibilities to output
        # Output zeros because it was just a quick initialization

I cannot find anywhere that needs the responsibilities, except self.fit_predict merged from PR #4593.
The HMM of sklearn is gone, and hmmlearn does not set n_iter=0. _fit is a private method, which should not be called by other modules, so returning the responsibilities is only for self.fit_predict. However, that method directly uses the responsibilities from _fit, which effectively skips the last EM iteration, because the returned responsibilities are computed with the parameters from the previous iteration.

And Line442 has a weird parameter do_prediction=False??

@clorenz7

Yes, do_prediction is not needed, I apologize for my oversight.

There was discussion in the PR pre-merge about the removal of HMM and the not-needed n_iter=0 case. It is not that another module would be calling self._fit, but that a module which subclassed from it would make use of it. The addition was largely defensive, since the responsibilities variable was only set inside the for loop over n_iter. The responsibilities being all zeros as a degenerate case seemed OK, since a GMM fit with n_iter=0 is already degenerate.

You are right that the responsibility return is only there for fit_predict, and that it skips the last EM iteration. I wrote a comment about that:

> Warning: due to the final maximization step in the EM algorithm, with low iterations the prediction may not be 100% accurate

During the development of the method, the feedback I received was that the idea of the fit_predict method is to save time over someone just calling `model.fit` and then `model.predict`.
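For illustration, a fit_predict that avoids the stale-responsibility issue would run one extra E-step after fitting; a minimal sketch (the helper name _estimate_responsibilities is hypothetical, not the actual API):

```python
class MixtureFitPredictSketch:
    """Assumes the estimator defines fit(X) and a private
    _estimate_responsibilities(X) that performs an E-step with the
    currently fitted parameters and returns an array of shape
    (n_samples, n_components)."""

    def fit_predict(self, X):
        self.fit(X)
        # Recompute responsibilities with the *final* parameters so the
        # labels are not one iteration stale, while still saving the cost
        # of a separate fit() followed by predict().
        resp = self._estimate_responsibilities(X)
        return resp.argmax(axis=1)
```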


@ogrisel commented Jun 18, 2015

+1 for simplifying the code by taking the removal of HMM into account and not dealing with the n_iter=0 case that should never appear.

@amueller

#5344 might be of interest

@@ -7,10 +7,18 @@
from .gmm import _validate_covars
from .dpgmm import DPGMM, VBGMM

from .gaussianmixture import GaussianMixture
from .bayesianmixture import BayesianGaussianMixture
from .dpgaussianmixture import DirichletProcessGaussianMixture
Member

I would prefer the files to be named "gaussian_mixture", "bayesian_mixture" and "dp_gaussian_mixture".
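With that renaming, the imports in the diff above would presumably become something like:

```python
from .gaussian_mixture import GaussianMixture
from .bayesian_mixture import BayesianGaussianMixture
from .dp_gaussian_mixture import DirichletProcessGaussianMixture
```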

@xuewei4d

Just renamed the files, fixed the relative import issue, and fixed PEP8 problems in the test files.

@raghavrv commented Dec 1, 2015

Could you rebase? :)

Adds the new classes

Updates new classes

Fixes bugs and typos

1. completes check functions
2. completes initialization functions
3. corrects the time to compute log-likelihood
4. makes verbose messages and its displaying more clear
5. renames modules
6. removes hmm numerical stability fixes '+ 10*EPS'
7. simplifies 'fit_predict'
8. PEP8

1. Rename
2. Adjusts the verbose messages

1. Fixes bugs of checking methods
2. Completes methods for 'diag' cases

1. Fixes RNG issue
2. Adds the initialization method name to the verbose message
3. Completes the updating functions for tied, spherical. Slightly optimized.

1. Improves the flow of ```fit```
2. Doc and test
3. Fixes bugs

1. Fixes bugs
2. Adds test cases

1. Adds more tests
2. Fixes bugs
3. Docs

1. Fixes bugs
2. Adds more tests
3. Docs

Minor fix

1. Refactor ```MixtureBase```
2. Adds various methods into ```BayesianGaussianMixture```
3. Removes ```fit_predict```
4. Fixes ```DensityMixin```

1. Fixes bugs
2. Complete BayesianGaussianMixture for 'full' case

1. Fixes bugs
2. Completes the computation of lower bound

Split gaussianmixture.py

Refactoring

1. snapshot method for debug
2. DirichletProcessGaussianMixture

BayesianGaussianMixture for tied covariance/precision

1 Bug fixes
2 ```BayesianGaussianMixture``` for diag covariance

lower bound for dpgmm

BayesianGaussianMixture for spherical covariance

AIC, BIC, score for GaussianMixture
score for BayesianGaussianMixture

fixes typos
rename parameters

cosmit
simplify _estimate_suffstat

COSMIT

bug fixes and test cases

bug fixes and test cases

bug fixes and test cases

test cases

test cases

test cases

test cases for dpgmm and rename prior parameters

verbose msg

PEP8 and PEP257

fixes score, replace lowerbound with log probabilities in BGM

remove comments

doc and fix bugs

fix some numerical issue

doc

@superbobry

I've looked through the current documentation (and derivation PDF) and I think it's missing the explanation of different covariance types.

We had a similar issue in hmmlearn recently, so feel free to use the explanation in hmmlearn/hmmlearn#60.

@xuewei4d

@superbobry Thanks!

tguillemot added a commit to tguillemot/scikit-learn that referenced this pull request Feb 19, 2016
…he GSoC 2015 (scikit-learn#4802).

Define new class _MixtureBase.

GMM derives from _MixtureBase.

Introduce reg_covar, max_iter to replace min_covar, n_iter.

Add _check_X function.

Add the DensityMixin class into base.py

Add new abstract methods _m_step, _e_step in _MixtureBase

Add new test file for MixtureBase class.
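For context on the reg_covar parameter mentioned above (introduced as the replacement for min_covar): it is a non-negative amount added to the diagonal of each covariance to keep it positive definite. A rough sketch of the idea, not the exact merged code:

```python
import numpy as np


def regularize_covariance(cov, reg_covar=1e-6):
    """Add reg_covar to the diagonal of a square covariance matrix so it
    stays positive definite even for nearly degenerate components (sketch)."""
    cov = np.array(cov, dtype=float, copy=True)
    cov.flat[:: cov.shape[0] + 1] += reg_covar  # strided view of the diagonal
    return cov
```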
@ogrisel commented Feb 22, 2016

@xuewei4d did you get a chance to figure out what is wrong with the gamma_prior in your implementation of DP mixture models? It does not work as expected (no impact on the number of components).

In your notebook, you should set the random_state=0 parameter of the model to make different runs easier to compare. Also, the debug snapshot should call .copy() on all the arrays it snapshots. However, this is not the cause of the problem.

mk_sol[k] = np.sum(np.square(sol))
temp2 = self.mu_beta_prior_ * self.lambda_nu_ * mk_sol
temp_mu = (.5 * np.sum(temp1 - temp2) +
           self._log_gaussian_norm_prior)
Contributor

My previous comment wasn't clear enough, so I give more details here to be sure we agree.
I thought it should be:

temp_mu = (.5 * np.sum(temp1 - temp2) +
           self.n_components * self._log_gaussian_norm_prior)

According to Line532, self._log_gaussian_norm_prior = D ln(beta_0 / (2 pi)).
According to formula 7.9 of your report, you compute sum_{k=1}^{K} D ln(beta_0 / (2 pi)).
Normally, on Line789 it should therefore be self.n_components * self._log_gaussian_norm_prior.
Am I right?
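Restating the reviewer's point in the report's notation (with beta_0 the prior scale and D the dimensionality): the prior normalization term does not depend on k, so

```latex
\sum_{k=1}^{K} D \ln\frac{\beta_0}{2\pi} \;=\; K \, D \ln\frac{\beta_0}{2\pi},
```

which is why the code needs the factor self.n_components in front of self._log_gaussian_norm_prior.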

Member

@xuewei4d if you have some spare time could you please confirm this?

Contributor Author

Oh, yes, @tguillemot . There should be self.n_components * before self._log_gaussian_norm_prior.

Contributor

OK, I just wanted to be sure I hadn't missed something. Thanks @xuewei4d .

@xuewei4d

I just came up with a possible reason why the lower bound of VBGMM goes up and down. In VBGMM and DPGMM, some clusters may vanish during fitting, so the actual number of clusters may not be the parameter self.n_components. Maybe the value should be adjusted when some clusters are gone; I am not quite sure. What I mean by 'the clusters are gone' is that these clusters do not have enough data points to make their covariances positive definite. @tguillemot

@superbobry

@xuewei4d in my experience an oscillating lower bound almost always means a bug in the M-step (or in the posterior update, in the case of VB-type algorithms).

> What I mean by 'the clusters are gone' is that these clusters do not have enough data points to make their covariances positive definite.

Wouldn't the covariance be completely defined by the prior in this case?

@amueller

This might actually provide a good reference: https://github.com/teodor-moldovan/dpcluster

@tguillemot

@amueller Thanks, I will have a look at that.

@ogrisel commented Sep 6, 2016

Closing in favor of #7295. Let's finish the review there.

@ogrisel closed this Sep 6, 2016