[RFC] GSoC2015 Improve GMM API #4802


Conversation

@xuewei4d commented Jun 2, 2015

GSoC 2015 project, Improve GMM API

  1. Completes the derivation report.
  2. Adds new classes: one abstract class _BasesMixture and three derived classes GaussianMixture, BayesianGaussianMixture, DirichletProcessGaussianMixture (a rough skeleton is sketched after this list).
  3. Decouples large functions, especially in DirichletProcessGaussianMixture and BayesianGaussianMixture
  4. Removes numerical stability fixes for HMM. It seems that whenever there is a numerical issue, the code always adds 10*EPS in the computation. I think in some cases there is a better way to address the problem, such as normalizing the extremely small variables earlier, or we can simply remove 10*EPS, which is only there for HMM.
  5. Writes updating functions for BayesianGaussianMixtureModel and DirichletProcessGaussianMixtureModel according to the report.
  6. Provides methods that allow users to initialize the model with user-provided data
  7. Corrects kmeans initialization. It is weird that, when using kmeans initialization, only the means are initialized; the weights and covariances are initialized by averaging.
  8. Writes several checking functions for the initialization data
  9. Adjusts the verbose messages. When verbose > 1, it displays the log-likelihood and the time used on the same line as the message Iteration x
  10. Removes fit_predict
  11. Adds a warning for params != 'wmc'
  12. Studies and contrasts the convergence of classical MLE / EM GMM with Bayesian GMM against the number of samples and the number of components
  13. Friendly warning and error messages, or automatic fixes where possible (e.g. random re-initialization of singular components)
  14. Examples that show how models can overfit, by comparing the likelihood on training and validation sets (normalized by the number of samples); for instance, extend the BIC score example with a cross-validated likelihood plot
  15. Testing on 1-D data
  16. Testing on degenerate cases
  17. [optional] Adds a partial_fit function for incremental / out-of-core fitting of (classical) GMM, for instance http://arxiv.org/abs/0712.4273
  18. [optional] ledoit_wolf covariance estimation or PCA(n_components).fit(X_train).get_covariance()
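A rough skeleton of the class hierarchy proposed in item 2 (the class names follow the description above; the abstract method names are illustrative only, not the final API):

```python
from abc import ABCMeta, abstractmethod


class _BasesMixture(metaclass=ABCMeta):
    """Shared EM machinery; concrete mixtures implement their own update rules."""

    @abstractmethod
    def _e_step(self, X):
        """Compute responsibilities of each sample under the current parameters."""

    @abstractmethod
    def _m_step(self, X, resp):
        """Update the model parameters from the responsibilities."""


class GaussianMixture(_BasesMixture):
    """Classical maximum-likelihood EM."""


class BayesianGaussianMixture(_BasesMixture):
    """Variational Bayesian estimation with conjugate priors."""


class DirichletProcessGaussianMixture(_BasesMixture):
    """Stick-breaking (Dirichlet process) variant."""
```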

@xuewei4d commented Jun 2, 2015

I have not yet made KDE be based on DensityMixin. @ogrisel @lesteve @amueller .
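For context, a DensityMixin along the lines discussed here could look roughly like the sketch below (an assumption for illustration, not the code in this PR; it presumes the estimator defines a score_samples method returning per-sample log-likelihoods):

```python
class DensityMixin:
    """Mixin providing a default score for density estimators (sketch)."""

    _estimator_type = "DensityEstimator"

    def score(self, X, y=None):
        # Average per-sample log-likelihood of X under the fitted model.
        return self.score_samples(X).mean()
```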

return logprob

def responsibility_samples(self, X):
Member

What is this method for? Why not just use predict_proba? We would rather not add too many functions to the API.

Contributor Author

Sorry. I didn't realize that we have predict_proba.

@amueller commented Jun 3, 2015

The thing with changing score_samples is that it is an incompatible change. So we have to either a) accept that we break user code (rather not), or b) rename the estimator to GaussianMixtureModel or something, keep the old class behaving like it did, deprecate it, and make the new class behave as we want.

@xuewei4d commented Jun 3, 2015

Sorry for my bone-headedness. I will build a new class GaussianMixtureModel. And we should add deprecation messages in score_samples, right?

@amueller commented Jun 3, 2015

Rename the old class to the new name, then implement GMM by inheriting from the new class and overwriting score_samples. Then deprecate the whole "old" GMM. At least I think that was the consensus.
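A minimal sketch of that rename-and-deprecate pattern (illustrative only; the method bodies and the warning text are assumptions, not the actual scikit-learn code):

```python
import warnings


class GaussianMixtureModel:
    """New canonical estimator with the desired score_samples contract."""

    def score_samples(self, X):
        # Return the per-sample log-likelihoods under the new API.
        raise NotImplementedError


class GMM(GaussianMixtureModel):
    """Old public class, kept temporarily for backward compatibility."""

    def __init__(self, *args, **kwargs):
        warnings.warn("GMM is deprecated; use GaussianMixtureModel instead.",
                      DeprecationWarning)
        super().__init__(*args, **kwargs)

    def score_samples(self, X):
        # Overwrite to preserve the historical return value while deprecated.
        raise NotImplementedError
```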

@xuewei4d commented Jun 4, 2015

Thanks for the advice. Did you mean to rename class GMM to class GaussianMixtureModel and implement GMM in the same file, i.e. gmm.py?

@amueller commented Jun 4, 2015

yes, that sounds good.

@xuewei4d force-pushed the gmm_api branch 3 times, most recently from 04a1df9 to 20bc698 on June 5, 2015 at 14:47
assert_warns_message(DeprecationWarning,
                     "The 'score_samples' method in GMM has been "
                     "deprecated in 0.17 and will be removed in 0.19."
                     "Use 'score_samples' method "
Member

the 'score_samples' method, the 'log_probability', the 'predict_proba' method, the 'responsibility'. Same in all deprecation warnings.

Also I think 'of each sample' sounds better than 'for each sample' in this context.

@xuewei4d changed the title from "[WIP] Improve GMM API" to "[MRG] Improve GMM API: new names and score_samples" on Jun 8, 2015
@xuewei4d changed the title from "[MRG] Improve GMM API: new names and score_samples" to "[WIP] Improve GMM API: new names and score_samples" on Jun 12, 2015
@xuewei4d force-pushed the gmm_api branch 3 times, most recently from 27557c6 to 5627394 on June 13, 2015 at 18:12
@xuewei4d

I am sorry I did not have a very clear picture in the beginning, and there is a lot of work in refactoring GMM. So I removed the GaussianMixtureModel I had created for the score inconsistency problem, and instead created a new class in a separate file, just as a prototype. Here are my thoughts.

  • added the parameters weights, means, covars in __init__ to allow users to initialize the model
  • added methods to check user-provided initialization data
  • added methods that compute the sufficient statistics of the data used in the M-step, namely N_k, \bar{x}_k, S_k (see the sketch after this comment)
  • added estimation functions as abstract methods to the base class _MixtureModelBase, so that GaussianMixtureModel, BayesianMixtureModel and DirichletProcessMixtureModel can implement their own updating functions
  • added DensityMixin to compute score for all density estimators

The code is in its earliest version. I would like to hear comments about the definitions of these classes. Thanks.
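A sketch of what such sufficient-statistics helpers compute for the 'full' covariance case (illustrative only; the function name and exact shapes are assumptions, not the PR's code):

```python
import numpy as np


def estimate_gaussian_suffstats(X, resp):
    """Sufficient statistics N_k, xbar_k, S_k of a Gaussian mixture.

    X has shape (n_samples, n_features); resp has shape
    (n_samples, n_components) and holds the E-step responsibilities.
    """
    nk = resp.sum(axis=0)                   # N_k; may need regularization if ~0
    means = resp.T @ X / nk[:, np.newaxis]  # xbar_k
    n_components, n_features = resp.shape[1], X.shape[1]
    covariances = np.empty((n_components, n_features, n_features))
    for k in range(n_components):
        diff = X - means[k]
        covariances[k] = (resp[:, k] * diff.T) @ diff / nk[k]  # S_k
    return nk, means, covariances
```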

@xuewei4d force-pushed the gmm_api branch 3 times, most recently from ef19d4d to e88036c on June 14, 2015 at 22:55
@vene commented Jun 14, 2015

I may be a bit late to the party but if these classes get renamed, may I object to having "Model" in the name, and suggest simply GaussianMixture etc? I mean, most of our estimators are models of some sort, the word doesn't bring much, and the docstrings will still say "Gaussian Mixture Model" and "GMM" for searchability.

@xuewei4d

@vene from my perspective, the name 'GaussianMixtureModel' is OK, because in the literature everybody uses this name. The word 'model' is kind of a part of it.

@xuewei4d changed the title from "[WIP] Improve GMM API: new names and score_samples" to "[WIP] Improve GMM API" on Jun 15, 2015
@xuewei4d

I am confused by the recent PR #4593 @clorenz7 . How can a zero-filled responsibility be a kind of initialization? See Line571. The L1-norm of each responsibility must be 1. If we do need a quick initialization, we should initialize the model parameters weights_, means_, covariances_, not the responsibility.

        # self.n_iter == 0 occurs when using GMM within HMM
        # Need to make sure that there are responsibilities to output
        # Output zeros because it was just a quick initialization

I cannot find anywhere that needs the responsibilities, except self.fit_predict merged from PR #4593.
The HMM of sklearn is gone, and hmmlearn does not set n_iter=0. _fit is a private method, which should not be called by other modules, so returning the responsibilities is only for self.fit_predict. However, that method directly uses the responsibilities from _fit, which effectively skips the last EM iteration, because the returned responsibilities are computed with the parameters from the previous iteration.

And Line442 has a weird parameter do_prediction=False??

@clorenz7

Yes, do_prediction is not needed, I apologize for my oversight.

There was discussion in the PR pre-merge about the removal of HMM and the not-needed n_iter=0 case. It is not that another module would be calling self._fit, but that a module which subclassed from it would make use of it. The addition was largely defensive, since the responsibilities variable was only set inside the for loop over n_iter. The responsibilities being all zeros as a degenerate case seemed OK, since a GMM fit with n_iter=0 is already degenerate.

You are right that the responsibility return is only there for fit_predict, and that it skips the last EM iteration. I wrote a comment about that:

> Warning: due to the final maximization step in the EM algorithm, with low iterations the prediction may not be 100% accurate

During the development of the method, the feedback I received was that the idea of the fit_predict method is to save time over someone just calling `model.fit` and then `model.predict`.
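For illustration, a fit_predict that avoids the stale-responsibility issue would run one extra E-step after fitting; a minimal sketch (the helper name _estimate_responsibilities is hypothetical, not the actual API):

```python
class MixtureFitPredictSketch:
    """Assumes the estimator defines fit(X) and a private
    _estimate_responsibilities(X) that performs an E-step with the
    currently fitted parameters and returns an array of shape
    (n_samples, n_components)."""

    def fit_predict(self, X):
        self.fit(X)
        # Recompute responsibilities with the *final* parameters so the
        # labels are not one iteration stale, while still saving the cost
        # of a separate fit() followed by predict().
        resp = self._estimate_responsibilities(X)
        return resp.argmax(axis=1)
```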


@ogrisel commented Jun 18, 2015

+1 for simplifying the code by taking the removal of HMM into account and not dealing with the n_iter=0 case that should never appear.

@amueller

#5344 might be of interest

@@ -7,10 +7,18 @@
from .gmm import _validate_covars
from .dpgmm import DPGMM, VBGMM

from .gaussianmixture import GaussianMixture
from .bayesianmixture import BayesianGaussianMixture
from .dpgaussianmixture import DirichletProcessGaussianMixture
Member

I would prefer the files to be named "gaussian_mixture", "bayesian_mixture" and "dp_gaussian_mixture".
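With that renaming, the imports in the diff above would presumably become something like:

```python
from .gaussian_mixture import GaussianMixture
from .bayesian_mixture import BayesianGaussianMixture
from .dp_gaussian_mixture import DirichletProcessGaussianMixture
```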

@xuewei4d

Just renamed the files, fixed the relative import issue, and fixed PEP8 problems in the test files.

@raghavrv commented Dec 1, 2015

Could you rebase? :)

Adds the new classes

Updates new classes

Fixes bugs and typos

1. completes check functions
2. completes initialization functions
3. corrects the time to compute log-likelihood
4. makes verbose messages and its displaying more clear
5. renames modules
6. removes hmm numerical stability fixes '+ 10*EPS'
7. simplifies 'fit_predict'
8. PEP8

1. Rename
2. Adjusts the verbose messages

1. Fixes bugs of checking methods
2. Completes methods for 'diag' cases

1. Fixes RNG issue
2. Adds the initialization method name to the verbose message
3. Completes the updating functions for tied, spherical. Slightly optimized.

1. Improves the flow of ```fit```
2. Doc and test
3. Fixes bugs

1. Fixes bugs
2. Adds test cases

1. Adds more tests
2. Fixes bugs
3. Docs

1. Fixes bugs
2. Adds more tests
3. Docs

Minor fix

1. Refactor ```MixtureBase```
2. Adds various methods into ```BayesianGaussianMixture```
3. Removes ```fit_predict```
4. Fixes ```DensityMixin```

1. Fixes bugs
2. Complete BayesianGaussianMixture for 'full' case

1. Fixes bugs
2. Completes the computation of lower bound

Split gaussianmixture.py

Refactoring

1. snapshot method for debug
2. DirichletProcessGaussianMixture

BayesianGaussianMixture for tied covariance/precision

1 Bug fixes
2 ```BayesianGaussianMixture``` for diag covariance

lower bound for dpgmm

BayesianGaussianMixture for spherical covariance

AIC, BIC, score for GaussianMixture
score for BayesianGaussianMixture

fixes typos
rename parameters

cosmit
simplify _estimate_suffstat

COSMIT

bug fixes and test cases

bug fixes and test cases

bug fixes and test cases

test cases

test cases

test cases

test cases for dpgmm and rename prior parameters

verbose msg

PEP8 and PEP257

fixes score, replace lowerbound with log probabilities in BGM

remove comments

doc and fix bugs

fix some numerical issue

doc

@superbobry

I've looked through the current documentation (and derivation PDF) and I think it's missing the explanation of different covariance types.

We had a similar issue in hmmlearn recently, so feel free to use the explanation in hmmlearn/hmmlearn#60.

@xuewei4d

@superbobry Thanks!

tguillemot added a commit to tguillemot/scikit-learn that referenced this pull request Feb 19, 2016
…he GSoC 2015 (scikit-learn#4802).

Define new class _MixtureBase.

GMM derives from _MixtureBase.

Introduce reg_covar, max_iter to replace min_covar, n_iter.

Add _check_X function.

Add the DensityMixin class into base.py

Add new abstract methods _m_step, _e_step in _MixtureBase

Add new test file for MixtureBase class.
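For context on the reg_covar parameter mentioned above (introduced as the replacement for min_covar): it is a non-negative amount added to the diagonal of each covariance to keep it positive definite. A rough sketch of the idea, not the exact merged code:

```python
import numpy as np


def regularize_covariance(cov, reg_covar=1e-6):
    """Add reg_covar to the diagonal of a square covariance matrix so it
    stays positive definite even for nearly degenerate components (sketch)."""
    cov = np.array(cov, dtype=float, copy=True)
    cov.flat[:: cov.shape[0] + 1] += reg_covar  # strided view of the diagonal
    return cov
```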
@ogrisel commented Feb 22, 2016

@xuewei4d did you get a chance to figure out what is wrong with the gamma_prior in your implementation of DP mixture models? It does not work as expected (no impact on the number of components).

In your notebook, you should set the random_state=0 parameter of the model to make different runs easier to compare. Also, the debug snapshot should call .copy() on all the arrays it snapshots. However, this is not the cause of the problem.

mk_sol[k] = np.sum(np.square(sol))
temp2 = self.mu_beta_prior_ * self.lambda_nu_ * mk_sol
temp_mu = (.5 * np.sum(temp1 - temp2) +
           self._log_gaussian_norm_prior)
Contributor

My previous comment wasn't clear enough, so I give more details here to be sure we agree.
I thought it should be:

temp_mu = (.5 * np.sum(temp1 - temp2) +
           self.n_components * self._log_gaussian_norm_prior)

According to Line532, self._log_gaussian_norm_prior = D ln(beta_0 / (2 pi)).
According to formula 7.9 of your report, you compute sum_{k=1}^{K} D ln(beta_0 / (2 pi)).
Normally, on Line789 it should therefore be self.n_components * self._log_gaussian_norm_prior.
Am I right?
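Restating the reviewer's point in the report's notation (with beta_0 the prior scale and D the dimensionality): the prior normalization term does not depend on k, so

```latex
\sum_{k=1}^{K} D \ln\frac{\beta_0}{2\pi} \;=\; K \, D \ln\frac{\beta_0}{2\pi},
```

which is why the code needs the factor self.n_components in front of self._log_gaussian_norm_prior.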

Member

@xuewei4d if you have some spare time could you please confirm this?

Contributor Author

Oh, yes, @tguillemot . There should be self.n_components * before self._log_gaussian_norm_prior.

Contributor

OK, I just wanted to be sure I hadn't missed something. Thanks @xuewei4d .

@xuewei4d

I just came up with a possible reason why the lower bound of VBGMM goes up and down. In VBGMM and DPGMM, some clusters may vanish during fitting, so the actual number of clusters may not be the parameter self.n_components. Maybe the value should be adjusted when some clusters are gone; I am not quite sure. What I mean by 'the clusters are gone' is that these clusters do not have enough data points to make their covariances positive definite. @tguillemot

@superbobry

@xuewei4d in my experience an oscillating lower bound almost always means a bug in the M-step (or in the posterior update, in the case of VB-type algorithms).

> What I mean by 'the clusters are gone' is that these clusters do not have enough data points to make their covariances positive definite.

Wouldn't the covariance be completely defined by the prior in this case?

@amueller

This might actually provide a good reference: https://github.com/teodor-moldovan/dpcluster

@tguillemot

@amueller Thanks, I will have a look at that.

@ogrisel commented Sep 6, 2016

Closing in favor of #7295. Let's finish the review there.

@ogrisel closed this Sep 6, 2016