incorrect estimated means lead to non-positive-definite covariance in GMM #4429
Comments
Thanks for digging into this. You are right, not estimating the means is a problem. So you're saying MATLAB does it the same way we currently do (after the PR)? I looked at vlfeat and they do the same; I guess it is for numerical reasons. I don't see a strong reason to change it back. Do you? I don't understand your statement "In #2640, without the snippet that produces the error, I also cannot figure out why the covariance is not positive definite."
Besides the deprecation, I don't know why there is … Yes, I agree with you. I don't think we need to change it back. Sorry I didn't make that clear. I just don't know why @arendu got the non-positive-definite covariance error in #2640. I think it would be better to catch the error.
I see. I agree it would be nice to catch the error. I thought it was more informative. For …
Yeah. We should provide an explicit way to let users initialize the means, covariances, and weights with their own values. And do you think we need to check that users initialize them properly? For example, the covariance should be positive definite and the component weights should sum to 1. I will work on the …
Yes, we should check that. I'm +0.5 on changing the way users can initialize. The current way is different from the standard scikit-learn way, but deprecating things is always painful for users.
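[Editor's note: for reference, the `sklearn.mixture.GaussianMixture` estimator that later replaced `GMM` (mentioned in the comments below) exposes exactly this kind of explicit initialization and validates the supplied values, weights summing to 1 and precisions being positive definite among them. A minimal sketch; the data and initial values here are made up for illustration:]

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

gm = GaussianMixture(
    n_components=2,
    weights_init=np.array([0.5, 0.5]),          # validated: must sum to 1
    means_init=np.array([[0.0, 0.0], [5.0, 5.0]]),
    precisions_init=np.array([np.eye(2)] * 2),  # validated: must be positive definite
    random_state=0,
).fit(X)
print(gm.weights_)
print(gm.means_)
```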
Thanks for working on this :)
Hello, thanks @xuewei4d for the correct explanation of why the covariance matrix was not positive definite. However, I ran into the problem because I was in a situation where I did want to keep the means unchanged after initialization. Basically, the idea was to keep the k-means initialization because it tiles the data cloud, while the plain GMM solution tended to have confused centers with different covariances catching the non-Gaussian tails. I do not claim it is a recommended way to use GMMs; however, it did the job pretty well. I think it is convenient to keep 'init_params' and 'params' separate and to assume that users know what they are doing when changing those values. At most, I would add a note in the class documentation to warn that it is an unrecommended trick.
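[Editor's note: for concreteness, a sketch of the trick described above, assuming the historical `sklearn.mixture.GMM` signature under discussion (since removed from scikit-learn). With the post-#3945 centered covariance update, fixing the means no longer trips the positive-definiteness problem:]

```python
import numpy as np
from sklearn.mixture import GMM  # the old estimator this thread is about

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2) + c for c in ([0, 0], [4, 0], [0, 4])])

# Initialize weights, means, and covariances from k-means ('wmc'), but only
# re-estimate weights and covariances ('wc') in the M-step: the k-means
# centers stay fixed and keep tiling the data cloud, while the covariances
# stretch to catch the non-Gaussian tails.
gmm = GMM(n_components=3, covariance_type='full',
          init_params='wmc', params='wc', n_iter=100)
gmm.fit(X)
print(gmm.means_)  # unchanged from the k-means initialization
```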
@alexis-mignon I understand, so let users worry about how to use it. @amueller I will add to my GSoC proposal how to design the interface for passing initialization of the means, covariances, and weights in GMM.
thanks :) @alexis-mignon I think it is not a good idea to keep `params` as it is.
Hurray for closed science!
What framework would you suggest?
I'm not sure what the best option in Python would be. I'd think of something like a factory. I think there are some variational inference frameworks in Python, too. Maybe https://github.com/bayespy/bayespy, which also has a list of related packages?
@amueller Besides …
@tguillemot this is fixed in the new GMM, right?
Sorry, I hadn't seen this issue.
Closing this PR based on #4429 (comment). Also, the issue is about the original GMM class, which is no longer in scikit-learn.
I dug into #3945 and #2640. PR #3945 claims that the old M-step could produce non-positive-definite covariance matrices. Actually, the reason the sample code given by @AlexisMignon in #2640 raises an exception is that the simplified formula C = (\sum_i w_i x_i x_i') - \mu \mu' does not apply when \mu is not the mean of X.
The gmm created in this line does not estimate the means in the M-step: `gmm.means_` is initialized by k-means and does not change during EM. When the `m_step` code from before #3945 was merged runs, the mean of one component is initialized by k-means to `[[ 3.42118979  3.53824541]]`, but the actual mean just before the exception is raised is `[ 0.02670257  0.02945094]`. Using C = (\sum_i w_i x_i x_i') - \mu \mu' with such a \mu, of course, the estimated covariance is not positive definite. If the means were re-estimated, we would not see the exception at all.
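[Editor's note: a small numeric sketch of this failure mode, with values loosely mirroring the ones above. With uniform responsibilities, the shortcut formula goes indefinite as soon as \mu is far from the actual weighted mean, while the centered form (the one #3945 switched to) stays positive semi-definite for any \mu:]

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 2)                  # data with true mean near the origin
w = np.full(len(X), 1.0 / len(X))      # uniform responsibilities, sum to 1

mu_true = w @ X                        # the actual weighted mean, ~[0, 0]
mu_fixed = np.array([3.42, 3.54])      # a far-off fixed k-means center, as above

def shortcut_cov(X, w, mu):
    # C = (sum_i w_i x_i x_i') - mu mu'; only valid when mu is the weighted mean
    return (w[:, None] * X).T @ X - np.outer(mu, mu)

def centered_cov(X, w, mu):
    # C = sum_i w_i (x_i - mu)(x_i - mu)'; PSD for any mu
    d = X - mu
    return (w[:, None] * d).T @ d

print(np.linalg.eigvalsh(shortcut_cov(X, w, mu_true)))   # all positive
print(np.linalg.eigvalsh(shortcut_cov(X, w, mu_fixed)))  # one negative eigenvalue
print(np.linalg.eigvalsh(centered_cov(X, w, mu_fixed)))  # all positive again
```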
In #2640, without the snippet that produces the error, I also cannot figure out why the covariance is not positive definite. Anyway, the method `log_multivariate_normal_density` in gmm.py can be called by methods in other modules, so it is better to check the covariance before the Cholesky decomposition. Besides, when using the parameter `params='wc'` to initialize a gmm, we always get non-positive-definite covariance problems with the `full` and `tied` covariance types; you can try it with any randomly generated data. I think it is strange to build a GMM without estimating the means. It would be better to deprecate `params` values other than `'wmc'`.
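[Editor's note: a minimal sketch of such a guard, as a hypothetical helper (not actual scikit-learn code): attempt the Cholesky factorization up front and turn the opaque `LinAlgError` into an explicit message:]

```python
import numpy as np
from scipy import linalg

def check_positive_definite(cov, component):
    """Hypothetical guard to run before log_multivariate_normal_density."""
    try:
        linalg.cholesky(cov, lower=True)
    except linalg.LinAlgError:
        raise ValueError(
            "Covariance of component %d is not positive definite. This can "
            "happen when the means are not re-estimated (a `params` value "
            "without 'm'), so the simplified covariance update is invalid."
            % component)

# usage sketch
check_positive_definite(np.eye(2), 0)                       # passes silently
check_positive_definite(np.array([[1., 2.], [2., 1.]]), 1)  # raises ValueError
```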
Finally, the original code that used the equation C = (\sum_i w_i x_i x_i') - \mu \mu' before #3945 was merged is, I think, fine as long as we can guarantee that \mu is the mean of X. But I found that the MATLAB code does it the way #3945 does. I cannot see a reason for that other than preventing round-off errors, which I really doubt is the issue.
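[Editor's note: a small sketch illustrating the round-off concern, in line with @amueller's guess above that the centered form is used for numerical reasons. When the data sit far from the origin, the two terms of the shortcut formula are both huge and nearly cancel, so the result can lose the actual variance even when \mu is the exact weighted mean:]

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(1000, 1) + 1e8          # variance 1, but mean ~1e8
w = np.full(len(X), 1.0 / len(X))
mu = w @ X                            # the exact weighted mean

# Both terms below are ~1e16 while their difference should be ~1.0, so the
# subtraction cancels almost all significant digits.
shortcut = (w[:, None] * X).T @ X - np.outer(mu, mu)
centered = (w[:, None] * (X - mu)).T @ (X - mu)
print(shortcut)  # typically far from 1.0 (can even come out negative)
print(centered)  # ~1.0
```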
Ping @amueller