[MRG] Integration of GSoC2015 Gaussian Mixture (first step) #6407


Closed

Conversation

tguillemot
Contributor

@xuewei4d has done a huge amount of work on the GM classes.
Nevertheless, it seems that #4802 has not been integrated yet. I suspect this is because of the length of the contribution and the difficulty of measuring the efficiency of the new classes with respect to the previous ones.
I propose to integrate this work step by step, in several pull requests (400 lines max each).

This pull request introduces the new abstract class _MixtureBase with some common GMM attributes and methods. Now, GMM derives from _MixtureBase.

@ogrisel @amueller @agramfort

@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch from 4476dc3 to 2f1200e Compare February 19, 2016 15:48
-------
X : array, (n_samples, n_features)
"""
X = check_array(X, dtype=[np.float64, np.float32], ensure_2d=False)
Member

why not having ensure_2d=True, allow_nd=False ?

Member

also if you check X here, you can remove the check in GMM._fit

@xuewei4d
Contributor

I don't understand why you are working on gmm.py, because we decided to use the full name gaussian_mixture as the file name.

@TomDLT
Member

TomDLT commented Feb 19, 2016

I think the goal is to make the diff readable, and then to do a different PR with a simple file renaming

@xuewei4d
Contributor

Well, the new classes in my PR have huge differences. That's not a small diff.

@tguillemot
Contributor Author

@TomDLT Thanks for your comments, I will correct all those things quickly.

I think the goal is to make the diff readable, and then to do a different PR with a simple file renaming

Indeed. The only thing I've done for now is to add the class _MixtureBase and modify the old GMM class to derive from it. I plan next to do another PR to move _MixtureBase into base.py.

@xuewei4d Maybe I didn't explain my goal correctly and I'm sorry for that.

Of course your PR has huge differences, and obviously I have noticed that.
I think you've done a really good job with your PR. Nevertheless, it is still not included in scikit-learn.
Your PR is ~4500 lines and differs greatly from the original GMM class. I imagine it has not been merged because it's too big to be analysed in detail by the reviewers (maybe my supposition is wrong?).

My purpose is to merge all your work. For me, the best solution is to merge it step by step, which is what I've proposed with this PR. Of course, it's just a small part of your work, but it's easier to analyse this way.
If the reviewers want to merge it, I will propose more PRs with more of your code until it is totally merged. If this is not a good way to merge your work, I will just close the PR.

@ogrisel
Member

ogrisel commented Feb 22, 2016

The problem is to deprecate the old GMM class in favor of @xuewei4d's new GaussianMixture without deprecating the old VBGMM and DPGMM subclasses for now.

So I would suggest to:

  • Rename the old GMM class as _GMMBase (without changing any line of the code) in the sklearn/mixture/gmm.py file and make VBGMM and DPGMM derive from _GMMBase instead of GMM,
  • Create a new GMM class that derives from _GMMBase, lives in the same file, but with a deprecation warning in the __init__ (and no other overridden method).

This _GMMBase class should not be used by any of the new gaussian mixture models. It's just a way to deprecate the old classes one at a time to make the pull request small and easier to review for incremental merge to master.

@ogrisel
Member

ogrisel commented Feb 22, 2016

And include @xuewei4d GaussianMixture code in its own file as done in #4802: sklearn/mixture/dp_gaussian_mixture.py.

@tguillemot
Contributor Author

In fact, I thought I would add _GMMBase and move _MixtureBase from gmm.py to base.py in the next PR, but it's easier to do it now.

@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch 2 times, most recently from e7841e7 to a1cea27 Compare February 24, 2016 11:08
@@ -112,7 +114,7 @@ def sample_gaussian(mean, covar, covariance_type='diag', n_samples=1,
return (rand.T + mean).T


class GMM(BaseEstimator):
class _GMMBase(_MixtureBase):
Member

This should not inherit from the new _MixtureBase class. Just BaseEstimator as previously. We do not want to change the implementation and behavior of the old GMM class in any way besides issuing a deprecation warning in the init.

Contributor Author

In that case, I have to send all the code of _MixtureBase and GaussianMixture in the same PR.
To simplify the review, I can propose to split the code into different commits. I don't see a better way to avoid pushing 1000 lines at once.

Member

yes but just the code for GaussianMixture (not the variational Bayes sub classes).

@ogrisel
Member

ogrisel commented Mar 14, 2016

@tguillemot any news on this PR?

@tguillemot
Contributor Author

Sorry for being late. I'll send everything before the end of the day.

@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch 7 times, most recently from a7922ff to a878b60 Compare March 16, 2016 10:01
return logsumexp(self._estimate_weighted_log_prob(X), axis=1)

def score(self, X, y=None):
"""Compute the log-likelihood of the model on the given data X.
Member

The docstring is not correct: this is the per-sample average log-likelihood, as we use p.mean() instead of p.sum().
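The distinction matters for interpreting the score: with `p.mean()` the result is the average log-likelihood per sample rather than the total over the dataset. A small numpy illustration (the log-likelihood values are arbitrary):

```python
import numpy as np

# hypothetical per-sample log-likelihoods log p(x_i) for three samples
per_sample_log_prob = np.array([-1.2, -0.8, -2.0])

total_log_likelihood = per_sample_log_prob.sum()     # sum over samples: -4.0
average_log_likelihood = per_sample_log_prob.mean()  # per-sample average: -4.0 / 3
```

The averaged form is also what makes scores comparable across datasets of different sizes.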

@ogrisel
Member

ogrisel commented Apr 13, 2016

To have a fair comparison between them, you need to use the same tolerance criterion.
Moreover, classically in the EM algorithm, we use a tolerance that is independent of the number of components.

Alright but I don't understand why dividing tol by n_components would make the stopping criterion (based on a change in the total log likelihood of the training set) independent of the number of components.

Also, shouldn't we divide the log-likelihood change by n_samples to get a stopping criterion that is less dependent on the size of the training set?

@tguillemot
Copy link
Contributor Author

My mistake: in the old class it's not divided by n_components but by n_samples, indeed.
In GMM, current_log_likelihood is defined as a mean and not as a sum.
Forget all I've said about n_components; sorry for the noise.
I'll change that.
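In other words, the stopping test compares the change in the *mean* per-sample log-likelihood against tol, so tol does not need rescaling with n_samples. A toy single-Gaussian sketch of that loop (all names and values are hypothetical, not the PR's code):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(200)


def mean_log_likelihood(X, mu, sigma):
    """Average (not summed) Gaussian log-density over the samples."""
    return np.mean(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   - (X - mu) ** 2 / (2.0 * sigma ** 2))


tol, prev_ll = 1e-3, -np.inf
mu, sigma = 0.0, 2.0
for n_iter in range(100):
    mu, sigma = X.mean(), X.std()           # closed-form "M step" for one Gaussian
    ll = mean_log_likelihood(X, mu, sigma)
    if abs(ll - prev_ll) < tol:             # change in *mean* log-likelihood vs tol
        break
    prev_ll = ll
```

Because the criterion is an average, doubling the dataset size does not require doubling tol.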

@ogrisel
Member

ogrisel commented Apr 13, 2016

Ok great that makes sense to me then. It would be good to check how other packages check convergence for GMM trained with EM:

http://www.mixmod.org
https://cran.r-project.org/web/views/Cluster.html (bgmm for instance)

for cov_type in COVARIANCE_TYPE:
gmm = GaussianMixture(n_components=n_samples, covariance_type=cov_type,
reg_covar=0)
assert_raise_message(ValueError,
Member

I wonder if we should not introduce a dedicated exception such as DivergenceError in sklearn.exceptions but maybe we can stick to ValueError for now.

Contributor Author

I will do another PR for that once we have finished with the GMM, ok?
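If such a dedicated exception were ever introduced, it could subclass ValueError so that existing `except ValueError` handlers keep working. This is purely hypothetical: sklearn.exceptions has no DivergenceError, and the helper below is made up for illustration:

```python
class DivergenceError(ValueError):
    """Hypothetical exception for an EM fit whose likelihood diverges.

    Subclassing ValueError keeps code that currently catches ValueError
    working unchanged after the switch.
    """


def check_lower_bound(lower_bound):
    """Made-up helper: raise if the EM lower bound is NaN or infinite."""
    if lower_bound != lower_bound or lower_bound == float('inf'):
        raise DivergenceError("Fitting diverged: lower bound is %r"
                              % lower_bound)
```

Sticking with plain ValueError for now, as suggested, costs nothing since the subclass could be introduced later without breaking callers.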

@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch 2 times, most recently from 2930805 to 13a8ccf Compare April 14, 2016 09:29
@tguillemot
Contributor Author

Ok, I need help to solve the Travis problem.
It seems that there is a problem with the test_base_mixture.py file, which doesn't exist.
More interestingly, the tests are passing for Python 2.6 but not for the others.
On my local branch Travis works without problems.

Do you have any explanation?
At the beginning of this branch I created a file called test_base_mixture.py, but it was removed a long time ago.

@TomDLT
Member

TomDLT commented Apr 14, 2016

At the beginning of this branch I created a file called test_base_mixture.py, but it was removed a long time ago.

This is weird. Maybe try to squash and rebase on top of master.

@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch 2 times, most recently from b0eed8a to 628e6a5 Compare April 14, 2016 11:28
@tguillemot
Contributor Author

Another solution could be to propose a new PR with a new branch.
I'll do that only if you don't see another solution.

@TomDLT
Member

TomDLT commented Apr 14, 2016

or

# delete your branch
git checkout GSoC-integration-step1
git checkout -b GSoC-integration-step1_backup
git branch -D GSoC-integration-step1
# create new branch with same name
git checkout master
git checkout -b GSoC-integration-step1
# cherry-pick and push
git cherry-pick 628e6a56c861aa582cf7f6d4032c9bff24bd07de
git push -f origin GSoC-integration-step1
git branch -D GSoC-integration-step1_backup

Maybe also run make clean; it might help.

@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch 5 times, most recently from 9f7ece7 to 2201ef7 Compare April 14, 2016 14:34
Deprecation of the GMM class.

Modification of the GaussianMixture class.
  Some functions from the original GSoC code have been removed, renamed or simplified.
  Some new functions have been introduced (such as the 'check_parameters' function).
  Some parameter names have been changed:
   - covars_ -> covariances_ : to be coherent with sklearn/covariances
  Addition of the 'warm_start' parameter, allowing fitting to resume from the previous computation.

The old examples have been modified to replace the deprecated GMM class with the new GaussianMixture class.
Every example uses the eigenvector norms to solve the ellipse scaling problem (Issue #6548).

Addressed all review comments from the PR:
- Rename MixtureBase -> BaseMixture
- Remove n_features_
- Fix some problems
- Add some tests

Correction of the bic/aic test.

Fix the test_check_means and test_check_covariances.

Remove all references to the deprecated GMM class.

Remove initialized_.
Add and correct docstrings.

Correct the order of random_state.

Fix small typo.

Some fixes in preparation for the integration of the new BayesianGaussianMixture class.

Modifications in preparation for the integration of the BayesianGaussianMixture class.
Add 'best_n_iter' attribute.
Fix some bugs and tests.

Change the parameter order in the documentation.

Change best_n_iter_ name to n_iter_.

Fix the warm_start problem.
Fix the divergence error message.
Correct the random state init in the test file.
Fix the testing problems.
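The ellipse fix mentioned in the commit messages relies on the eigendecomposition of each component's covariance: the ellipse axes point along the eigenvectors, and their lengths scale with the square roots of the eigenvalues. A numpy-only sketch (the covariance values are arbitrary, and the variable names are illustrative):

```python
import numpy as np

cov = np.array([[2.0, 0.3],
                [0.3, 0.5]])  # a made-up 2x2 component covariance

eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
# Axis lengths proportional to sqrt(eigenvalue); the orientation comes
# from the eigenvector of the largest eigenvalue (last column for eigh).
width, height = 2.0 * np.sqrt(eigvals[::-1])
angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))
```

The `width`, `height` and `angle` values could then be passed to a plotting primitive such as matplotlib's `Ellipse` patch.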
@tguillemot tguillemot force-pushed the GSoC-integration-step1 branch from 2201ef7 to 12d66fc Compare April 14, 2016 14:37
@tguillemot
Contributor Author

Ok, I've tried your solution and also tried to do it manually, and it still doesn't work.
Thanks for your solutions @TomDLT.

@TomDLT
Member

TomDLT commented Apr 14, 2016

Interestingly, you solved the problem by adding the file, but unsolved it by removing the file (only on py35, though).

@tguillemot
Contributor Author

To solve the problem with the strange travis bug, I've created another PR #6666.

@tguillemot tguillemot closed this Apr 16, 2016