[MRG] Fix Random initialisation of GMM should consider data magnitude #10850 #11101
Conversation
Without yet reviewing the approach, could you take a look at the kmeans++ initialisation procedure? It might be appropriate here too.
@jnothman Sure, will take a look at that next.
@jnothman I've taken a look at kmeans++ and it seems to be just a better version of random data selection. It randomly selects an initial centre and then selects subsequent ones with a bias against being close to the existing points. It seems to solve the problems I had with random before. See below; I've added a …
I don't think it would hurt to modify the existing kmeans++ implementation to also keep track of the corresponding indices instead of just the points.
@jnothman Thanks, I've now added tracking of indices to the kmeans++ implementation, and this is now used in the initialisation of the GMM.
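For illustration, here is a minimal sketch of the seeding idea under discussion, returning both the centres and their row indices. It is simplified relative to scikit-learn's actual `_k_init` (which also evaluates several local candidate trials per step); the function and variable names are hypothetical:

```python
import numpy as np

def kmeanspp_sketch(X, n_clusters, rng):
    """Pick seeds biased away from already-chosen centres,
    returning both the centres and their row indices."""
    n_samples = X.shape[0]
    indices = np.empty(n_clusters, dtype=int)
    indices[0] = rng.randint(n_samples)
    # Squared distance from each point to its nearest chosen centre.
    closest_sq = ((X - X[indices[0]]) ** 2).sum(axis=1)
    for c in range(1, n_clusters):
        # Sample the next seed with probability proportional to that
        # squared distance, so far-away points are preferred.
        probs = closest_sq / closest_sq.sum()
        indices[c] = rng.choice(n_samples, p=probs)
        new_sq = ((X - X[indices[c]]) ** 2).sum(axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)
    return X[indices], indices

rng = np.random.RandomState(0)
X = rng.randn(500, 2)
centers, idx = kmeanspp_sketch(X, n_clusters=4, rng=rng)
```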
The implementation looks good! Do you think rand_data is still helpful?
You need to update:
- The tests: add a test of _k_init to ensure the two outputs are consistent (see the sketch after this list). Or change the implementation to only output indices and test that.
- The tests: Otherwise at least check that each approach is different in gmm, and ideally some properties thereof
- The parameter docstring in the gmm classes
- Any examples that would be a better illustration if k-means++ initialisation were used
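A hedged sketch of what the consistency test in the first item might look like, assuming this PR changes `_k_init` to return `(centers, indices)` rather than centres alone (the test name is mine):

```python
import numpy as np
from sklearn.utils.extmath import row_norms
from sklearn.cluster.k_means_ import _k_init

def test_k_init_centers_match_indices():
    # The returned centres should be exactly the rows of X
    # at the returned indices.
    rng = np.random.RandomState(42)
    X = rng.randn(100, 3)
    centers, indices = _k_init(
        X, 5, x_squared_norms=row_norms(X, squared=True), random_state=rng)
    np.testing.assert_array_equal(centers, X[indices])
```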
Thanks! No, I don't think rand_data is useful at all. I only left it in because it was the suggestion of the original issue and PR. I'll remove it. I'm not too experienced with writing tests, but I'll take a look at how you implement them (and the docs) and then have a go.
I think rand_data is useful; the complexity of kmeans++ is terrible for large n_clusters! It often takes longer than running k-means itself.
@jnothman I've just taken a look and I have a couple of questions.
I'll take a look at the documentation and examples after the testing looks good. @amueller I'm happy to leave rand_data in but, as I showed in my example above, it could lead to poor fitting. Should we add a warning to it, perhaps with an example similar to the one above?
Thanks for just giving it a go. You have some PEP 8 issues (at a glance I saw you needed spaces after commas). @amueller knows what he's talking about when it comes to KMeans initialisation ;) Let's leave …
It might be good to illustrate the quality and variation due to initialisation in an example, but perhaps it's not essential. The user guide (i.e. mixture.rst) might be a good place to note the pros and cons of each method.
The main thing this is missing is documentation or an example of when to use which setting.
I've had a first pass at adding to the documentation and have added an example that I feel shows the different initializations. This is my first attempt at this kind of documentation, so feedback would be very helpful for me!
Nice work! I think it'd be helpful to indicate on the example the number of iterations to convergence. It would also be interesting to record the amount of time spent in initialisation, but I assume it's too tiny to be meaningful on this example
Thanks, I've dealt with the camel case and added a bit of evaluation of the methods. It's a bit simple, but I've just used the built-in timing functions to time how long the initialization takes and then added it to the plot as a relative timing, i.e. the time taken to do the initialization is given as a multiple of the time it took to do the …. I also used the …
You can't use the n_iter_ attribute?
You can change this from WIP to MRG when ready.
Thanks! I hadn't seen the n_iter_ attribute.
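For reference, a minimal sketch of the kind of measurement described above, on synthetic data (the example's actual code may differ):

```python
from timeit import default_timer as timer
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(500, 2)

start = timer()
gmm = GaussianMixture(n_components=4, init_params='kmeans',
                      random_state=0).fit(X)
fit_time = timer() - start

# n_iter_ reports how many EM iterations ran before convergence.
print(gmm.n_iter_, fit_time)
```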
Otherwise, this looks great!
examples/mixture/plot_gmm_init.py (outdated diff):
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.utils.extmath import row_norms
from sklearn.cluster.k_means_ import _k_init
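For context, this is roughly how the example would have invoked the private helper (signature as in sklearn/cluster/k_means_.py at the time; before this PR's change it returns only the centres). This usage of a private API is exactly what is questioned below:

```python
import numpy as np
from sklearn.utils.extmath import row_norms
from sklearn.cluster.k_means_ import _k_init

rng = np.random.RandomState(0)
X = rng.randn(500, 2)
# k-means++ seeding; x_squared_norms is precomputed for speed.
centers = _k_init(X, n_clusters=4,
                  x_squared_norms=row_norms(X, squared=True),
                  random_state=rng)
```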
We really shouldn't be using private API here...
It's a pity that this is the only substantial problem I have with the example. @amueller, what do you think of making kmeans++ public?
Alternatively, we could consider:
- storing the initial means on the GMM estimator
- allowing GMM to work with max_iter=0
This is a blocker, unfortunately.
Do you intend to try fixing this?
Yes, I was waiting to see if you guys thought making kmeans++ public was feasible or not. If not then I can certainly look into an alternative method.
@amueller what do you think of making kmeans++ public?
But I don't mind allowing max_iter=0 either as a diagnostic tool
I've allowed max_iter=0 and changed the plot so it no longer uses the private kmeans helper. Do you think I should add a warning about running with max_iter=0, or is the convergence warning enough?
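A sketch of the max_iter=0 diagnostic, assuming this PR's relaxation is in place (released scikit-learn rejects max_iter < 1, so this would only run on the branch):

```python
import warnings
import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = rng.randn(500, 2)

with warnings.catch_warnings():
    # Zero EM iterations never "converge", so silence the warning.
    warnings.simplefilter('ignore', ConvergenceWarning)
    gmm = GaussianMixture(n_components=4, init_params='kmeans',
                          max_iter=0, random_state=0).fit(X)

# With no EM updates performed, means_ are exactly the initial means.
initial_means = gmm.means_
```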
I've fixed all of these except for the private _k_init usage.
Very nice!
Please add an entry to the change log at doc/whats_new/v0.21.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.
@g-walsh thanks for your patience! Do you think you will find some time to fix conflicts with upstream? Then maybe @jeremiedbb could check? Thanks!
@cmarmo yep, I should have some time soon to go over and update this. Thanks for following it up!
@cmarmo Any chance you could help with the errors for these failing checks? It seems to happen during install/build and I'm not sure where to start debugging. My local builds seem to work just fine. The install error that seems most common across the builds is: …
Seen this before?
Merging master will fix the CI.
Reference Issues/PRs
Fixes #10850
See also #10741
What does this implement/fix? Explain your changes.
Added a third option for initialisation, 'rand_data', which samples points from the data set. It does this by assigning zero to all responsibilities except for the sampled points, each of which is assigned a responsibility of 1 for a given component. When the init_params are calculated on calling the gmm, this resp produces initial means exactly at the sampled points (a sketch of this construction appears at the end of this description).
Any other comments?
I think this has added the desired functionality from the original PR #10741, but I'm not sure it is a useful addition, so I did a bit of extra looking. I've added gmm_test2.py (which I would not include in an eventual merge) to show how I produced the following.
Here (seed 1234) the sampling works fine for all three methods. The original data is 4 sets of Gaussian data. Orange crosses are the initial_mean values and the colouring is the labelling of the data by the gmm.
But here (seed 10), I think the fact that two of the sampled points are close together gives a poor fit for rand_data. I've labelled this as WIP to see if you think this is worth investigating further as a feature. I know that things like documentation (and proper testing) would need to be updated in addition to this.
N.B. Also worth pointing out: the current 'random' does take data magnitude into account in some respect, as it will produce initial centres very close to the mean of the data set (along all dimensions), as seen here. The sketch below illustrates both behaviours.
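To make the two behaviours concrete, here is a sketch of both responsibility constructions. The 'random' branch mirrors what BaseMixture._initialize_parameters does (uniform random responsibilities, normalised per sample); the 'rand_data' branch follows the description above; the variable names are mine:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(500, 2) + 5.0          # data centred away from the origin
n_samples, n_components = X.shape[0], 4

# 'rand_data': one-hot responsibilities at sampled points, so the
# M-step means land exactly on those points.
resp = np.zeros((n_samples, n_components))
picks = rng.choice(n_samples, size=n_components, replace=False)
resp[picks, np.arange(n_components)] = 1
means_rand_data = resp.T @ X / resp.sum(axis=0)[:, np.newaxis]

# 'random': uniform responsibilities normalised per sample; each
# component mean is then a near-uniform average over all points, so
# every initial centre sits close to the overall data mean.
resp = rng.rand(n_samples, n_components)
resp /= resp.sum(axis=1)[:, np.newaxis]
means_random = resp.T @ X / resp.sum(axis=0)[:, np.newaxis]

print(means_rand_data)               # rows are actual data points
print(means_random, X.mean(axis=0))  # rows all near the data mean
```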