[MRG] Fix parallelisation of kmeans clustering #12955


Merged
7 commits merged into scikit-learn:master on Jan 16, 2019

Conversation

@nixphix (Contributor) commented Jan 11, 2019

Reference Issues/PRs

Fixes #12949

What does this implement/fix? Explain your changes.

Fixes parallelisation of kmeans clustering.
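For context, a minimal sketch of the dispatch pattern at issue, assuming the same effective_n_jobs truthiness bug that is quoted from dict_learning.py later in this thread. The helper names (run_n_init, single_run) are hypothetical and this is not the actual scikit-learn diff:

```python
# Sketch of the n_init dispatch pattern (hypothetical helper names, not the
# actual scikit-learn diff). effective_n_jobs(n_jobs) always returns an
# integer >= 1, so a bare "if effective_n_jobs(n_jobs):" is always true: the
# sequential branch always runs and the Parallel branch is unreachable.
# Comparing against 1 restores the intended parallel behaviour.
from joblib import Parallel, delayed, effective_n_jobs


def run_n_init(single_run, seeds, n_jobs=None):
    """Run one k-means-style initialisation per seed, serially or in parallel."""
    if effective_n_jobs(n_jobs) == 1:
        # one effective job: run the initialisations sequentially
        return [single_run(seed) for seed in seeds]
    # several effective jobs: one initialisation per joblib worker
    return Parallel(n_jobs=n_jobs)(delayed(single_run)(seed) for seed in seeds)


if __name__ == "__main__":
    # toy usage: each "initialisation" simply echoes its seed
    print(run_n_init(lambda seed: seed, seeds=range(4), n_jobs=2))
```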

@nixphix nixphix changed the title Fix parallelisation of kmeans clustering [MGR] Fix parallelisation of kmeans clustering Jan 11, 2019
@nixphix (Contributor, Author) commented Jan 11, 2019

@jnothman the kmeans doctest fails because the cluster centroids changed; the following is a depiction of the doctest clusters before and after the fix.
[image: doctest clusters before and after the fix]

Should I update the expected cluster result in the doctest to match the new centroids, or try different test data?

@nixphix nixphix changed the title [MGR] Fix parallelisation of kmeans clustering [MRG] Fix parallelisation of kmeans clustering Jan 12, 2019
@jnothman (Member) commented:

Strange that this doctest result changes... the doctest hasn't been changed in 40e6c43 or subsequently :|

@jnothman (Member) commented:

But it's only failing on the latest numpy and scipy... so it's probably due to an upstream change. Can you change the example to make it deterministic? For example, move the top-right point further right?

@jeremiedbb (Member) commented:

I can reproduce the issue and I confirm that this PR fixes it.

@nixphix (Contributor, Author) commented Jan 15, 2019

I have updated the doctest to be more deterministic by moving one set of cluster points far apart from the other.
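As an illustration of the idea (a sketch, not necessarily the exact doctest that was committed): with two groups of points placed far apart, KMeans with a fixed random_state converges to the same centroids on every platform.

```python
# Two well-separated groups, one around x = 1 and one far away around x = 10,
# so the fitted centroids are simply the group means regardless of platform
# or BLAS version.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print(kmeans.cluster_centers_)  # the group means: [1, 2] and [10, 2]
print(kmeans.labels_)           # each group receives one stable label
```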

@jnothman (Member) commented:

Please add an entry to the change log for version 0.20.3 at doc/whats_new/v0.20.rst. Like the other entries there, please reference this pull request with :issue: and credit yourself (and other contributors if applicable) with :user:.

@qinhanmin2014 (Member) left a review comment:


Thanks @nixphix, I agree that we don't need a test here.
Please add a what's new entry.
Btw, why does the current code use all the cores when n_clusters=1 (and not when n_clusters>1)?

@qinhanmin2014 (Member) commented:

And it seems that we have a similar problem:

if effective_n_jobs(n_jobs) or algorithm == 'threshold':

Feel free to fix it in another PR, otherwise we'll open an issue.
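A short demonstration of why a bare effective_n_jobs(n_jobs) check is a bug, in both the kmeans code fixed here and the dict_learning line quoted above: the call returns an integer that is always >= 1, hence always truthy, so the sequential branch is taken no matter what n_jobs is.

```python
# effective_n_jobs() always returns an integer >= 1, so it is always truthy.
from joblib import effective_n_jobs

for n_jobs in (1, 2, 4, -1):
    n = effective_n_jobs(n_jobs)
    # A bare "if effective_n_jobs(n_jobs):" therefore always takes the
    # sequential branch; "== 1" is needed to detect the single-job case.
    print(f"n_jobs={n_jobs:>2} -> effective_n_jobs={n}, truthy={bool(n)}, serial={n == 1}")
```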

@jeremiedbb (Member) commented:

> Btw, why does the current code use all the cores when n_clusters=1 (and not when n_clusters>1)?

I can't reproduce that. However, it might be that the multithreading comes from MKL in that case. Running the snippet with the environment variable MKL_NUM_THREADS=1 makes things easier to check. With that setting, only one core is used in every situation before this PR, and all cores are used in every situation after this PR.
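For reference, a hypothetical check along these lines (the snippet from the original issue is not reproduced here, and the file name check_kmeans_njobs.py is made up): pin MKL/OpenMP to a single thread so that any additional busy cores must come from joblib, then fit with n_jobs > 1 and watch CPU usage.

```python
# Hypothetical check (not the snippet from the original issue). Run e.g. as:
#   MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 python check_kmeans_njobs.py
# so that BLAS/MKL is pinned to one thread and any additional busy cores
# must come from joblib. Note that KMeans accepted an n_jobs parameter at
# the time of this PR (scikit-learn 0.20.x); it was later removed.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.RandomState(0).rand(100_000, 50)

# Before this PR: one core stays busy (the n_init runs execute sequentially).
# After this PR: roughly one busy core per joblib worker.
KMeans(n_clusters=10, init='random', n_init=8, n_jobs=4).fit(X)
```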

@qinhanmin2014 (Member) commented:

> I can't reproduce that. However, it might be that the multithreading comes from MKL in that case. Running the snippet with the environment variable MKL_NUM_THREADS=1 makes things easier to check. With that setting, only one core is used in every situation before this PR, and all cores are used in every situation after this PR.

Thanks, I don't have time to reproduce it, but your version seems more reasonable. According to the code, before this PR there's no parallelism at the scikit-learn level. I guess the reporter made a mistake accidentally.

@nixphix (Contributor, Author) commented Jan 16, 2019

> And it seems that we have a similar problem:
>
> scikit-learn/sklearn/decomposition/dict_learning.py, line 303 in ff46f6e:
>
>     if effective_n_jobs(n_jobs) or algorithm == 'threshold':
>
> Feel free to fix it in another PR, otherwise we'll open an issue.

Sure, I will fix it in another PR.

@jnothman merged commit 8a604f7 into scikit-learn:master on Jan 16, 2019
@jnothman (Member) commented:

Thanks @nixphix

This pull request was later referenced by commits in jnothman/scikit-learn (Feb 19, 2019), xhluca/scikit-learn (Apr 28, 2019), and koenvandevelde/scikit-learn (Jul 12, 2019).
Linked issue closed by this pull request: KMeans not running in parallel when init='random' (#12949)