Scaling kills DPGMM [was: mixture.DPGMM not fitting to data] #2454

Closed
caofan opened this issue Sep 18, 2013 · 6 comments

@caofan commented Sep 18, 2013

I am trying out the Gaussian mixture models in the package. I tried to model a mixture with two components, G(1000, 500^2) and G(2000, 600^2). The following is the code:

import numpy as np
from sklearn import mixture

# two Gaussian components: N(1000, 500^2) and N(2000, 600^2)
data = np.random.normal(1000, 500, 1000)
data2 = np.random.normal(2000, 600, 1000)
data = np.concatenate([data, data2]).reshape(-1, 1)  # column vector of samples

model = mixture.DPGMM(n_components=10, alpha=10, n_iter=10000)
model.fit(data)
print(model.means_)

And I got the following means of the components.
[[ 0.13436485]
[ 0.13199086]
[ 0.11750537]
[ 0.10560644]
[ 0.12162311]
[ 0.00204134]
[ 0.12058521]
[ 0.11997703]
[ 0.11944384]
[ 0.11890694]]

It seems the model does not fit the data properly. Is this a bug, or have I got something wrong in my application of the model?

Thanks.
Fan

@arjoly added the Bug label May 11, 2014
@amueller added this to the 0.15.1 milestone Jul 18, 2014
@amueller (Member)

This looks pretty bad :-/

@amueller (Member)

My explanation for this is: the model assumes an N(0, 1) prior on the means [and also a fixed prior on the covariance], which is not reasonable for your data. To make this work, the data should be scaled to zero mean and unit variance; the result would then be much more sensible.
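
A minimal sketch of that prescaling workaround (using StandardScaler; the DPGMM call just mirrors the report above, so this assumes the 0.14-era API):

import numpy as np
from sklearn import mixture
from sklearn.preprocessing import StandardScaler

# same data as in the report, as a column vector
X = np.concatenate([np.random.normal(1000, 500, 1000),
                    np.random.normal(2000, 600, 1000)]).reshape(-1, 1)

# scale to zero mean and unit variance so the N(0, 1) prior on the means is plausible
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = mixture.DPGMM(n_components=10, alpha=10, n_iter=10000)
model.fit(X_scaled)

# map the fitted means back to the original units
print(scaler.inverse_transform(model.means_))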

I have too little experience with these kinds of models to say what a good solution would be.
Possible candidates:

  • prescale the data (and adjust the estimated means and precisions accordingly), as sketched above
  • raise a warning?
  • use a hierarchical Bayesian approach?
  • make the priors parameters of the estimator
  • estimate the priors from the data (which is probably the same as just rescaling the data)
  • use a much wider (or non-informative) prior on the means

PS: any Bayesian should feel free to hit me and implement the hierarchical approach.

@amueller changed the title from "mixture.DPGMM not fitting to data" to "Scaling kills DPGMM [was: mixture.DPGMM not fitting to data]" Jan 28, 2015
@amueller (Member)

Thinking about it, I'm not sure — shouldn't 1000 samples be enough to overcome the prior? Hmm...

@amueller (Member)

The derivation of the mean updates at http://scikit-learn.org/dev/modules/dp-derivation.html#the-updates is quite different from the one in Bishop's or Murphy's book. In particular, in the books the variational mean parameters don't depend on the variational precision parameters, whereas they do in the derivation in the docs (which is odd).
I'm a bit tempted to replace the implementation by a close correspondence to Bishop and see how that goes.
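
From memory, Bishop's update for the variational posterior over the component means (PRML §10.2.1, around eqs. 10.60–10.61; treat the exact numbering as approximate) is:

\beta_k = \beta_0 + N_k, \qquad
m_k = \frac{1}{\beta_k}\left(\beta_0 m_0 + N_k \bar{x}_k\right),
\qquad\text{with}\quad
N_k = \sum_n r_{nk}, \quad
\bar{x}_k = \frac{1}{N_k} \sum_n r_{nk} x_n .

So m_k depends only on the responsibilities r_{nk}, not on the Wishart parameters (W_k, \nu_k) of the precision factor — which is the discrepancy noted above.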

@GaelVaroquaux (Member)

> I'm a bit tempted to replace the implementation by a close correspondence to Bishop and see how that goes.

I am not very attached to our implementation. It has given us a lot of problems in the past.

@ogrisel (Member) commented Sep 10, 2016

Closing: the new Dirichlet process GMM rewrite has been merged in master. It is not affected by this bug.
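
For anyone landing here later, a minimal sketch of the same experiment against the rewritten estimator (sklearn.mixture.BayesianGaussianMixture with a Dirichlet-process weight prior; as far as I know its mean and covariance priors default to data-dependent values, so no manual scaling should be needed):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# same unscaled data as in the original report
X = np.concatenate([np.random.normal(1000, 500, 1000),
                    np.random.normal(2000, 600, 1000)]).reshape(-1, 1)

model = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=10.0,  # analogous to DPGMM's alpha
    max_iter=1000,
)
model.fit(X)
print(model.means_)    # component means, now on the original scale
print(model.weights_)  # superfluous components should get ~zero weight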

@ogrisel closed this as completed Sep 10, 2016