MNT avoid thread limit for non nested parts of KMeans #16499

Merged

Conversation

jeremiedbb
Member

follow up on #11950. I just realized that the initialization was included in the threadpoolctl context, which can degrade performance if the init is 'k-means++'.

I moved the context manager as close as possible to the nested prange/BLAS part of the code.

ping @ogrisel @NicolasHug it should still be fresh in your minds :)
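For reference, the context manager in question can be demonstrated in isolation like this (a minimal sketch assuming threadpoolctl and NumPy are installed; this is not the PR's code, just the API the PR moves around):

```python
# Minimal sketch of threadpool_limits in isolation (assumes threadpoolctl
# and NumPy are installed); illustrates why its placement matters.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.rand(100, 100)

# Outside the context: BLAS may use all available threads, which is what
# the BLAS-heavy 'k-means++' init wants.
b = a @ a

# Inside the context: BLAS is capped at 1 thread, which is what we want
# around the prange loops to avoid oversubscription.
with threadpool_limits(limits=1, user_api="blas"):
    c = a @ a

print(np.allclose(b, c))
```

The results are identical either way; only the number of BLAS threads used differs.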


@NicolasHug NicolasHug left a comment


Looks good

can degrade performance if the init is 'k-means++'

Is this because k-means++ uses blas? (I haven't checked)

The failure looks like something fun to debug ^^

# Threadpoolctl wrappers to limit the number of threads in second level of
# nested parallelism (i.e. BLAS) to avoid oversubscription.
def elkan_iter_chunked_dense(*args, **kwargs):
    with threadpool_limits(limits=1, user_api="blas"):
Member


Thoughts regarding using the CM that deep in the code, versus doing it sooner e.g. in _kmeans_single_lloyd after _init_centroids is called?

I guess we may miss oversubscription issues if one day we decide to re-write _inertia_*

Only curious what you think

Member Author


I guess we may miss oversubscription issues if one day we decide to re-write _inertia_*

True, but right now it uses no prange and no BLAS, so I don't see that changing soon :)
And more importantly, we should think about the implications each time we want to add a prange somewhere.

Thoughts regarding using the CM that deep in the code, versus doing it sooner e.g. in _kmeans_single_lloyd after _init_centroids is called?

I wanted to put it where we're sure no BLAS call gets caught inside the context manager unintentionally. That way we make sure that we can change the Python code more safely.

Besides, in _kmeans_single_elkan you have:

for i in range(max_iter):
    elkan_iter(...)
    ...
    euclidean_distances(...)  # BLAS

We don't want to include euclidean_distances in the CM, so we have to put the CM around elkan_iter.
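A stdlib-only sketch of that control flow (the function names are placeholders standing in for the real Cython/BLAS routines, and the context manager stands in for threadpool_limits(limits=1, user_api="blas")):

```python
# Sketch of the per-iteration placement described above: only elkan_iter
# runs under the thread limit; euclidean_distances, a plain BLAS call,
# stays outside so it keeps full parallelism. Stand-ins only, not the
# real scikit-learn code.
from contextlib import contextmanager

trace = []

@contextmanager
def blas_limit():  # stand-in for threadpool_limits(limits=1, user_api="blas")
    trace.append("<limit")
    try:
        yield
    finally:
        trace.append("limit>")

def elkan_iter():
    trace.append("elkan_iter")   # prange + BLAS: needs the thread cap

def euclidean_distances():
    trace.append("euclidean")    # pure BLAS: must stay unrestricted

for _ in range(2):               # max_iter = 2 for the sketch
    with blas_limit():
        elkan_iter()
    euclidean_distances()

print(trace)
```

The trace shows the limit engaged only around elkan_iter, never around euclidean_distances.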

@jeremiedbb
Member Author

jeremiedbb commented Feb 21, 2020

The failure is fixed in #16514 #16506
It was not so hard to debug :)

@jeremiedbb
Member Author

Is this because k-means++ uses blas? (I haven't checked)

Yes, to compute some distances.

Member

@ogrisel ogrisel left a comment


LGTM, but can you please do a quick benchmark on a multicore machine to check that we regain approximately the expected scalability during the init? E.g. KMeans with max_iter=1 and large enough numbers of clusters, features and samples to get maximum BLAS parallelism during k-means++.

@jeremiedbb
Member Author

jeremiedbb commented Feb 27, 2020

Here's a benchmark. I used taskset to simulate machines with different numbers of cores.
This PR restores the lost scalability of k-means++ during fit.

(benchmark figure: k-means++ timing vs. number of cores, this PR vs. before #11950)

The difference between now and before #11950 might be due to how I made the benchmark. Since I can't benchmark k-means++ alone (the issue comes from using k-means++ in the fit), the reported timings are actually the difference between a 1-iteration run with 'k-means++' init and a 1-iteration run with 'random' init.
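That subtraction methodology can be sketched like this (scikit-learn and NumPy assumed available; time_fit is a hypothetical helper, and the sizes are kept small so the sketch runs quickly — a real benchmark would use much larger data and pin cores with taskset):

```python
# Hedged sketch of the benchmark methodology described above: k-means++
# cannot be timed in isolation, so time a 1-iteration fit with
# init='k-means++' and subtract a 1-iteration fit with init='random'.
from time import perf_counter

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(2000, 20)

def time_fit(init):
    # One Lloyd iteration only, single restart, fixed seed.
    km = KMeans(n_clusters=20, init=init, n_init=1, max_iter=1,
                random_state=0)
    tic = perf_counter()
    km.fit(X)
    return perf_counter() - tic

t_kpp = time_fit("k-means++")   # init cost + one iteration
t_rand = time_fit("random")     # ~one iteration only
print(f"approx k-means++ init time: {t_kpp - t_rand:.4f}s")
```

Subtracting the 'random' run cancels out the cost of the shared Lloyd iteration, leaving roughly the init cost alone.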

@ogrisel
Member

ogrisel commented Feb 27, 2020

Thanks for the confirmation benchmark. The scalability is not very good beyond 4 threads, but it's already a nice net improvement. Still +1 for merge.

@ogrisel
Member

ogrisel commented Feb 27, 2020

Maybe we should explore a prange on n_samples to get better scalability, but let's do that in a later PR.

@jeremiedbb jeremiedbb added this to the 0.23 milestone Mar 3, 2020
@NicolasHug NicolasHug changed the title [MRG] Move threadpoolctl context in KMeans to avoid limiting non nested parts MNT avoid thread limit for non nested parts of KMeans Mar 31, 2020
@NicolasHug NicolasHug merged commit ab01816 into scikit-learn:master Mar 31, 2020
@NicolasHug
Member

Thanks @jeremiedbb
