
incorrect estimated means lead to non positive definite covariance in GMM #4429


Closed

xuewei4d opened this issue Mar 20, 2015 · 18 comments

@xuewei4d
Contributor

I dug into #3945 and #2640. PR #3945 claims:

Due to round off errors, the formula used to compute the weighted covariance matrix may lead to matrices with large negative eigenvalues.

Actually, the reason why the sample code given by @AlexisMignon in #2640 raises an exception is that the simplified formula C = (\sum_i w_i x_i x_i') - \mu \mu' does not apply when \mu is not the mean of X.

The gmm created in this line

gmm = mixt.GMM(2, params="wc", covariance_type="full", min_covar=1e-3)

does not estimate the means in the M-step: gmm.means_ is initialized by k-means and never changes during EM. When the m_step code from before #3945 was merged runs, the mean of one component stays at the k-means initialization [[ 3.42118979 3.53824541]], while the actual weighted mean of that component is [0.02670257 0.02945094] just before the exception is raised. With C = (\sum_i w_i x_i x_i') - \mu \mu', the estimated covariance is of course not positive definite. If we instead use

gmm = mixt.GMM(2, params="wmc", covariance_type="full", min_covar=1e-3)

we do not get the exception at all.
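To illustrate, here is a minimal numpy sketch (not the scikit-learn code) of the point above: with a fixed vector that is not the weighted mean of the data, the shortcut formula can produce large negative eigenvalues, while the centered formula stays positive semi-definite.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 2)                      # toy data
w = np.full(len(X), 1.0 / len(X))          # uniform responsibilities, sum to 1

mu_wrong = np.array([3.42, 3.54])          # e.g. a stale k-means center, far from the data

# Shortcut: C = (sum_i w_i x_i x_i') - mu mu'  -- only valid if mu is the weighted mean
shortcut = (w[:, None] * X).T @ X - np.outer(mu_wrong, mu_wrong)

# Centered: C = sum_i w_i (x_i - mu)(x_i - mu)'  -- positive semi-definite for any mu
centered = (w[:, None] * (X - mu_wrong)).T @ (X - mu_wrong)

print(np.linalg.eigvalsh(shortcut))        # contains a large negative eigenvalue
print(np.linalg.eigvalsh(centered))        # all eigenvalues >= 0 (up to round-off)
```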

In #2640, without the snippet that produces the error, I cannot figure out why the covariance is not positive definite there. In any case, the method log_multivariate_normal_density in gmm.py can be called from other modules, so it would be better to check the covariance before attempting the Cholesky decomposition.

Besides, when initializing a GMM with params='wc', we always run into non-positive-definite covariance problems with the full and tied covariance types; you can reproduce it with any randomly generated data. I think it is strange to build a GMM without estimating the means. It would be better to deprecate any params setting other than 'wmc'.

Finally, I think the original code using C = (\sum_i w_i x_i x_i') - \mu \mu' before #3945 was merged is fine as long as we can guarantee that \mu is the mean of X. But I found that the Matlab code does it the same way as #3945. I cannot see any reason for that other than preventing round-off errors, which I really doubt is the issue.

Ping @amueller

@xuewei4d xuewei4d changed the title incorrect estimated means lead to non positive definite covariance incorrect estimated means lead to non positive definite covariance in GMM Mar 20, 2015
@amueller
Member

Thanks for digging into this. You are right, not estimating the means is a problem.
I also think it is strange, and we should deprecate the usage.

So you are saying Matlab does it the same way we currently do (after the PR)? I looked at vlfeat and they do the same. I guess it is for numerical reasons. I don't see a strong reason to change it back. Do you?

I don't understand your statement "In #2640, without the snippet resulting the error, I also cannot figure out why the covariance is not positive definite. "
Why would you check the covariance before the Cholesky? You could also just raise an error, right?

@xuewei4d
Contributor Author

Besides the deprecation, I don't know why there are both init_params and params, which both accept the 'wmc' format. It seems redundant.

Yes, I agree with you. I don't think we need to change it back in _covar_mstep_full, but do you think we need to change _covar_mstep_tied? Matlab does what we do now after the PR for the full covariance type.

Sorry, I didn't make that clear. I just don't know why @arendu gets the non-positive-definite covariance error in #2640. I think it would be better to catch numpy.linalg.linalg.LinAlgError in _log_multivariate_normal_density_full.
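For illustration, a hedged sketch (not the actual gmm.py code; the function and variable names below are placeholders) of catching the per-component Cholesky failure and turning it into a more informative error:

```python
import numpy as np
from scipy import linalg

def chol_per_component(covars):
    """Cholesky-factor each full covariance matrix, failing with a clear message."""
    chols = []
    for k, cv in enumerate(covars):
        try:
            chols.append(linalg.cholesky(cv, lower=True))
        except linalg.LinAlgError:
            raise ValueError(
                "covariance of component %d is not positive definite; "
                "check min_covar and whether the means are being estimated" % k
            )
    return chols
```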

@amueller
Member

I see. I agree it would be nice to catch the error and raise something more informative.
I think the intention behind the "duplication" between init_params and params is that you might want to a) give a custom initialization but keep on training (init_params allows that), or b) set some fixed parameters and not train them at all (params allows that).
I think we want to deprecate params because estimating some parameters and not others can lead to trouble.

For init_params it is not something we usually support in scikit-learn, but I can see how it could be useful.
A different way to do this would be to actually pass the initialization as parameters, which is what we do in different places, like means_init=some_array, covars_init=some_array etc.
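(For context, a short example of that pattern; it assumes the rewritten sklearn.mixture.GaussianMixture class that shipped later, not the GMM class discussed here:)

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).randn(200, 2)

gm = GaussianMixture(
    n_components=2,
    covariance_type="full",
    weights_init=np.array([0.5, 0.5]),              # explicit initial weights
    means_init=np.array([[0.0, 0.0], [1.0, 1.0]]),  # explicit initial means
).fit(X)
```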

@xuewei4d
Contributor Author

Yeah. We should provide an explicit way to let users initialize the means, covariances, and weights themselves. And do you think we need to check that users initialize them properly, e.g. that the covariances are positive definite and the component weights sum to 1?
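A minimal validation sketch, assuming user-supplied arrays named weights and covars (one full covariance per component); this is not scikit-learn code, just the kind of check meant above:

```python
import numpy as np

def check_user_init(weights, covars):
    """Validate user-supplied mixture weights and full covariance matrices."""
    if not np.allclose(np.sum(weights), 1.0):
        raise ValueError("component weights must sum to 1")
    for k, cov in enumerate(covars):
        if not np.allclose(cov, cov.T):
            raise ValueError("covariance of component %d is not symmetric" % k)
        try:
            # a symmetric matrix is positive definite iff Cholesky succeeds
            np.linalg.cholesky(cov)
        except np.linalg.LinAlgError:
            raise ValueError("covariance of component %d is not positive definite" % k)
```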

I will work on the params deprecation and on catching numpy.linalg.linalg.LinAlgError.

@amueller
Member

Yes, we should check that. I'm +0.5 on changing the way users can initialize. The current way is different from the standard scikit-learn way, but deprecating things is always painful for users.

@amueller
Member

Thanks for working on this :)

@alexis-mignon
Contributor

Hello,

Thanks @xuewei4d for the correct explanation of the reason why the covariance matrix was not positive definite.

However, I ran into the problem because I was in a situation where I did want to keep the means unchanged after initialization. Basically, the idea was to keep the k-means initialization because it tiles the data cloud, while the plain GMM solution tended to end up with overlapping centers and different covariances catching the non-Gaussian tails. I do not claim it is a recommended way to use GMMs; however, it did the job pretty well.

I think it is convenient to keep 'init_params' and 'params' separate and assume that users know what they are doing when changing those values. At most, I would add a note in the class documentation warning that it is not a recommended trick.

@xuewei4d
Contributor Author

@alexis-mignon I understand, so let users worry about how to use init_params and params correctly.

@amueller I will add to my GSoC proposal how to design the interface for passing initialization of mean, covariance, weights in GMM.

@amueller
Member

thanks :)

@alexis-mignon I think it is not a good idea to keep params, because it is a very special use case, and we need to make sure that all seven possible combinations work correctly. They did (do?) not, and it looks to me like quite a hassle for a very specific use case.
Also, I'm not even sure what the settings mean. What does updating the means without adjusting the prior on the mixture components even mean? Do you just compute the responsibilities and use them to update the means, but then discard them and not update the prior?
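To make the question concrete, a rough sketch of a partial M-step driven by a params string (names are illustrative, not the gmm.py internals): with params="m" the responsibilities are still computed in the E-step and used to update the means, but the weights (priors) and covariances are then left untouched.

```python
import numpy as np

def partial_m_step(X, resp, weights, means, covars, params="wmc"):
    """Update only the parameters whose letter appears in `params`."""
    nk = resp.sum(axis=0) + 10 * np.finfo(float).eps   # soft counts per component
    if "w" in params:
        weights = nk / nk.sum()
    if "m" in params:
        means = resp.T @ X / nk[:, None]
    if "c" in params:
        diff = X[None, :, :] - means[:, None, :]       # (n_components, n_samples, n_features)
        covars = np.einsum("nk,kni,knj->kij", resp, diff, diff) / nk[:, None, None]
    return weights, means, covars
```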

@alexis-mignon
Contributor

I see some interest in being able to keep some parameters fixed:

  • You may know the mean values from a physical model but need to estimate the process noise (using a tied model, for instance)
  • You may know your noise model but need to estimate the mean locations (equivalent to deblurring, but with discrete samples)

However, I am not sure the corresponding estimators in those settings are exactly the same as the ones used in the present EM algorithm. The existence of this paper, Gaussian mixture parameter estimation with known means and unknown class-dependent variances, suggests that there might be some specific treatment needed when some parameters are known. However, I don't have access to ScienceDirect and could not read it.

@amueller
Member

Hurray for closed science!
While I agree that there is an application for that, I would claim it is outside the normal sklearn scope. Sklearn is pretty black-boxy and not really good for plugging known quantities into models.
I think you'd be better served with a probabilistic modelling framework for that.

@alexis-mignon
Contributor

What framework would you suggest?

@amueller
Member

I'm not sure what the best option in Python would be. I'd think of something like factory. I think there are some variational inference frameworks in Python, too. Maybe https://github.com/bayespy/bayespy, which also has a list of related packages?

@xuewei4d
Contributor Author

@amueller Besides the params problem, I am curious about the four covariance types spherical, tied, full, diag, since I had never seen that before; at most the case where the covariance is fully estimated but shared across components. I found a comment in the gmm code pointing to Murphy's 1998 paper, Fitting a Conditional Linear Gaussian Distribution (8 citations), which elaborates the update equations for all four covariance types. However, I didn't find similar material in his MLAPP book. I checked the Matlab implementation: there are only two types, diag and full, and both of them can be shared (tied) or not. I think spherical is rarely used. I am just asking whether it is necessary to refactor the covariance updates to work the way Matlab does.
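For readers unfamiliar with the four options, a small sketch of the constraints they correspond to, written as one way of projecting per-component full covariances onto each constraint (this mirrors the convention, not any particular implementation):

```python
import numpy as np

def constrain(full_covs, weights, covariance_type):
    """full_covs: (n_components, n_features, n_features), weights: (n_components,)."""
    if covariance_type == "full":        # one full matrix per component
        return full_covs
    if covariance_type == "tied":        # a single full matrix shared by all components
        return np.einsum("k,kij->ij", weights, full_covs)
    if covariance_type == "diag":        # per-component, diagonal entries only
        return np.array([np.diag(np.diag(c)) for c in full_covs])
    if covariance_type == "spherical":   # per-component sigma^2 * identity
        d = full_covs.shape[1]
        return np.array([np.trace(c) / d * np.eye(d) for c in full_covs])
    raise ValueError("unknown covariance_type: %r" % covariance_type)
```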

@amueller
Member

spherical is the simplest one, right? I don't see a reason to remove it.
Having a tied diagonal covariance might be nice, but no one has ever asked for it, so maybe it is not so important?

@amueller
Member

@tguillemot this is fixed in the new GMM, right?

@tguillemot
Contributor

Sorry I haven't seen this issue.
This is fixed by #6666.

@thomasjpfan
Member

Closing this issue based on #4429 (comment). Also, the issue is about the original GMM class, which is no longer in scikit-learn.

Repository owner moved this from Todo📬 to Done🚀 in Quansight's scikit-learn Project Board Jul 21, 2022