[RFC] GSoC2015 Improve GMM API #4802
Conversation
```python
        return logprob

    def responsibility_samples(self, X):
```
What is this method for? Why not just use `predict_proba`? We would rather not add too many functions to the API.
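For reference, `predict_proba` already exposes per-sample responsibilities; a minimal sketch using today's `sklearn.mixture.GaussianMixture` (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: two well-separated blobs.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# predict_proba returns the posterior responsibility of each component
# for each sample, shape (n_samples, n_components); rows sum to 1.
resp = gmm.predict_proba(X)
assert np.allclose(resp.sum(axis=1), 1.0)
```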
Sorry, I didn't realize that we have `predict_proba`. The thing with changing …
Sorry for my boneheadedness. I will build a new class …
Rename the old class to the new name, then implement GMM by inheriting from the new class and overriding `score_samples`. Then deprecate the whole "old" GMM. At least I think that was the consensus.
Thanks for the advice. Did you mean to rename …
Yes, that sounds good.
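A minimal sketch of the renaming/deprecation pattern suggested above (names and bodies are illustrative, not the exact PR code):

```python
import warnings


class GaussianMixture(object):
    """New class carrying the actual implementation."""

    def score_samples(self, X):
        """New semantics: per-sample log-likelihood only."""
        raise NotImplementedError  # the real EM code would live here


class GMM(GaussianMixture):
    """Old name, kept temporarily and deprecated as a whole."""

    def score_samples(self, X):
        # Overridden to warn while delegating to the new implementation.
        warnings.warn("GMM is deprecated; use GaussianMixture instead.",
                      DeprecationWarning)
        return super(GMM, self).score_samples(X)
```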
Force-pushed from 04a1df9 to 20bc698.
```python
assert_warns_message(DeprecationWarning,
                     "The 'score_samples' method in GMM has been "
                     "deprecated in 0.17 and will be removed in 0.19."
                     "Use 'score_samples' method "
```
Say "the 'score_samples' method", "the 'log_probability'", "the 'predict_proba' method", "the 'responsibility'". Same in all deprecation warnings.
Also, I think "of each sample" sounds better than "for each sample" in this context.
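Applying both suggestions, the tested message might read roughly as follows (a hypothetical rewording, not the final PR text; `g` and `X` are placeholder test fixtures):

```python
assert_warns_message(
    DeprecationWarning,
    "The 'score_samples' method in GMM has been deprecated in 0.17 and "
    "will be removed in 0.19. Use the 'score_samples' method of "
    "GaussianMixture, which returns the log probability of each sample, "
    "instead.",
    g.score_samples, X)
```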
Force-pushed from 27557c6 to 5627394.
I am sorry I didn't have a very clear picture at the beginning, and there is a lot of refactoring work. The code is at its earliest stage. I would like to hear comments about the definitions of these classes. Thanks.
Force-pushed from ef19d4d to e88036c.
I may be a bit late to the party, but if these classes get renamed, may I object to having "Model" in the name and suggest simply `GaussianMixture`?
@vene from my perspective the name 'GaussianMixtureModel' is OK, because everybody in the literature uses this name. The word 'model' is kind of a part of it.
I am confused by the recent PR #4593 @clorenz7. How can a zero-filled responsibility be a kind of initialization (Line 571)? The L1-norm of each responsibility must be 1. If we do need a quick initialization, we should initialize the model parameters, weights_, means_, covariances_, not the responsibilities.
I cannot find anywhere that needs responsibilities, except … And Line 442 has a weird parameter do_prediction.
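For illustration, a valid quick initialization keeps each responsibility row a proper distribution instead of all zeros; a sketch (shapes assumed):

```python
import numpy as np

rng = np.random.RandomState(0)
n_samples, n_components = 500, 3

# Invalid: zero-filled responsibilities break the constraint that the
# L1-norm of each row must be 1.
resp_bad = np.zeros((n_samples, n_components))

# Valid: random responsibilities, normalized row-wise, as a cheap
# alternative to a k-means based initialization.
resp = rng.rand(n_samples, n_components)
resp /= resp.sum(axis=1)[:, np.newaxis]
assert np.allclose(resp.sum(axis=1), 1.0)
```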
Yes, do_prediction is not needed; I apologize for my oversight. There was discussion in the PR pre-merge about the removal of HMM, and the … You are right about the responsibilities being returned only for …
+1 for simplifying the code by taking the removal of HMM into account and not dealing with the …
#5344 might be of interest.
```diff
@@ -7,10 +7,18 @@
 from .gmm import _validate_covars
 from .dpgmm import DPGMM, VBGMM

+from .gaussianmixture import GaussianMixture
+from .bayesianmixture import BayesianGaussianMixture
+from .dpgaussianmixture import DirichletProcessGaussianMixture
```
I would prefer the files to be named "gaussian_mixture", "bayesian_mixture" and "dp_gaussian_mixture".
Just renamed the files, fixed a relative import issue, and fixed PEP8 problems in the test files.
Could you rebase? :)
Squashed commits:
- Adds the new classes
- Updates new classes
- Fixes bugs and typos
- 1. completes check functions 2. completes initialization functions 3. corrects the time to compute log-likelihood 4. makes verbose messages and their display clearer 5. renames modules 6. removes the HMM numerical stability fix '+ 10*EPS' 7. simplifies 'fit_predict' 8. PEP8
- 1. Rename 2. Adjusts the verbose messages
- 1. Fixes bugs in checking methods 2. Completes methods for 'diag' cases
- 1. Fixes RNG issue 2. Adds the name of the initialization method to the verbose message 3. Completes the updating functions for tied, spherical; slightly optimized
- 1. Improves the flow of `fit` 2. Doc and test 3. Fixes bugs
- 1. Fixes bugs 2. Adds test cases
- 1. Adds more tests 2. Fixes bugs 3. Docs
- 1. Fixes bugs 2. Adds more tests 3. Docs
- Minor fix
- 1. Refactors `MixtureBase` 2. Adds various methods to `BayesianGaussianMixture` 3. Removes `fit_predict` 4. Fixes `DensityMixin`
- 1. Fixes bugs 2. Completes `BayesianGaussianMixture` for the 'full' case
- 1. Fixes bugs 2. Completes the computation of the lower bound
- Splits gaussianmixture.py
- Refactoring
- 1. Snapshot method for debugging 2. `DirichletProcessGaussianMixture`
- `BayesianGaussianMixture` for tied covariance/precision
- 1. Bug fixes 2. `BayesianGaussianMixture` for diag covariance
- Lower bound for DPGMM
- `BayesianGaussianMixture` for spherical covariance
- AIC, BIC, score for `GaussianMixture`
- Score for `BayesianGaussianMixture`
- Fixes typos
- Renames parameters
- Cosmit
- Simplifies `_estimate_suffstat`
- Cosmit
- Bug fixes and test cases
- Bug fixes and test cases
- Bug fixes and test cases
- Test cases
- Test cases
- Test cases
- Test cases for DPGMM and rename prior parameters
- Verbose msg
- PEP8 and PEP257 fixes
- Score; replaces lower bound with log probabilities in BGM
- Removes comments
- Doc and bug fixes
- Fixes some numerical issues
- Doc
I've looked through the current documentation (and derivation PDF) and I think it's missing an explanation of the different covariance types. We had a similar issue in …
@superbobry Thanks!
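For context, the four covariance types differ only in the shape of the fitted covariance parameters; a sketch using the current `GaussianMixture` API (data illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(200, 2)

# With K components and D features, covariances_ has shape:
#   'full'      -> (K, D, D)  one full matrix per component
#   'tied'      -> (D, D)     one matrix shared by all components
#   'diag'      -> (K, D)     one diagonal per component
#   'spherical' -> (K,)       one variance per component
for cov_type in ('full', 'tied', 'diag', 'spherical'):
    gmm = GaussianMixture(n_components=3, covariance_type=cov_type).fit(X)
    print(cov_type, np.shape(gmm.covariances_))
```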
…he GSoC 2015 (scikit-learn#4802). Define new class `_MixtureBase`. GMM derives from `_MixtureBase`. Introduce `reg_covar`, `max_iter` to replace `min_covar`, `n_iter`. Add `_check_X` function. Add the `DensityMixin` class into base.py. Add new abstract methods `_m_step`, `_e_step` in `_MixtureBase`. Add new test file for the `MixtureBase` class.
@xuewei4d did you get the opportunity to understand what's wrong with the …? In your notebook, you should seed the …
```python
mk_sol[k] = np.sum(np.square(sol))
temp2 = self.mu_beta_prior_ * self.lambda_nu_ * mk_sol
temp_mu = (.5 * np.sum(temp1 - temp2) +
           self._log_gaussian_norm_prior)
```
My previous comment wasn't clear enough, so I give more details here to be sure we are OK. I thought:

```python
temp_mu = (.5 * np.sum(temp1 - temp2) +
           self.n_components * self._log_gaussian_norm_prior)
```

According to Line 532, `self._log_gaussian_norm_prior = D * ln(beta_0 / (2 * pi))`. According to formula 7.9 of your report, you compute `sum_{k=1}^{K} D * ln(beta_0 / (2 * pi))`. Normally, on Line 789 it should be `self.n_components * self._log_gaussian_norm_prior`. Am I right?
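Restating the point in LaTeX (with $K$ the number of components and $D$ the dimension, following the definitions above): the prior norm term is constant in $k$, so the sum over components is just $K$ copies of it,

```latex
\sum_{k=1}^{K} D \ln\!\left(\frac{\beta_0}{2\pi}\right)
  = K \, D \ln\!\left(\frac{\beta_0}{2\pi}\right),
```

hence the `self.n_components *` factor in the code.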
@xuewei4d if you have some spare time, could you please confirm this?
Oh, YES, @tguillemot. There should be `self.n_components *` before `self._log_gaussian_norm_prior`.
OK, I just wanted to be sure I hadn't missed something. Thanks @xuewei4d.
I just came up with a possible reason why the lower bound of VBGMM goes up and down: in VBGMM and DPGMM, some clusters may vanish during fitting, so the actual number of clusters would not be the parameter n_components.
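A quick way to see this effect is to count the components that keep non-negligible weight after fitting; a sketch using the current `BayesianGaussianMixture` (the 1e-2 threshold is arbitrary):

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.RandomState(0)
# Two true clusters, but n_components is deliberately over-specified.
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + 6])

bgmm = BayesianGaussianMixture(n_components=10, random_state=0).fit(X)

# Components whose weight collapsed toward zero have effectively
# "vanished" during fitting.
print(np.sum(bgmm.weights_ > 1e-2))  # typically far fewer than 10
```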
@xuewei4d in my experience an oscillating lower bound almost always means a bug in the M-step (or in the posterior update, in the case of VB-type algorithms).
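One cheap guard against such bugs is asserting that the bound never decreases across iterations; a sketch, assuming a hypothetical per-iteration list `bounds` recorded during fitting:

```python
import numpy as np


def check_monotone_lower_bound(bounds, tol=1e-8):
    """Raise if the recorded lower bound ever decreases.

    EM and VB updates should increase the bound monotonically (up to
    numerical noise), so a drop usually points to a bug in the M-step
    or in the posterior update.
    """
    drops = np.diff(np.asarray(bounds)) < -tol
    if drops.any():
        raise AssertionError("lower bound decreased at iteration(s) %s"
                             % (np.where(drops)[0] + 1))
```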
Wouldn't the covariance be completely defined by the prior in this case?
This might actually provide a good reference: https://github.com/teodor-moldovan/dpcluster
@amueller Thanks, I will have a look at that.
Closing in favor of #7295. Let's finish the review there.
GSoC 2015 project, Improve GMM API

- New base class `_BasesMixture`. Three derived classes: `GaussianMixture`, `BayesianGaussianMixture`, `DirichletProcessGaussianMixture`.
- `DirichletProcessGaussianMixture` and `BayesianGaussianMixture` …
- Removed the `10*EPS` numerical stability fix, which is only for HMM.
- Renamed to `BayesianGaussianMixtureModel` and `DirichletProcessGaussianMixtureModel` according to the report.
- When `verbose>1`, it displays the log-likelihood and the time used on the same line as the message `Iteration x`.
- `fit_predict` …
- `params != 'wmc'` …
- … the number of samples). For instance, extend the BIC score example with a cross-validated likelihood plot.
- `PCA(n_components).fit(X_train).get_covariance()` …
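A hedged sketch of that cross-validated likelihood idea (synthetic data; all values illustrative): `GaussianMixture.score` returns the mean per-sample log-likelihood, so `cross_val_score` yields a held-out likelihood estimate to plot alongside BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(300, 2), rng.randn(300, 2) + 4])

for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, random_state=0)
    cv_loglik = cross_val_score(gmm, X, cv=5).mean()  # held-out likelihood
    bic = gmm.fit(X).bic(X)                           # in-sample BIC
    print(k, cv_loglik, bic)
```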