[DRAFT] Engine plugin API and engine entry point for Lloyd's KMeans #24497
Conversation
Note: this is still a work in progress and the Engine API is not set in stone. In particular, I broke the MBKMeans tests. But this should be enough to get started.
    max_iter=self.max_iter,
    verbose=self.verbose,
    tol=self._tol,
    n_threads=self._n_threads,
)
Note that currently we delegate the full Lloyd loop to the engine for the sake of simplicity for this first iteration.
However, my plan would rather be to only delegate one Lloyd iteration at a time and to move the loop and convergence check back into the estimator class (probably in a new private method).
Moving this loop back into the estimator class will be required to properly integrate @jeremiedbb's work on callbacks, for instance: #22000.
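A minimal sketch of what that estimator-side loop could look like, assuming a hypothetical per-iteration engine method named kmeans_iteration (the method name and its return values are illustrative, not part of this PR):

    def _lloyd_loop(self, engine, centers):
        # Hypothetical private method: the engine runs a single Lloyd iteration,
        # while the loop and convergence check stay in scikit-learn so that
        # callbacks can hook in between iterations.
        for i in range(self.max_iter):
            centers, labels, inertia, center_shift = engine.kmeans_iteration(centers)
            if self.verbose:
                print(f"Iteration {i}, inertia {inertia}.")
            if center_shift <= self._tol:
                break
        return centers, labels, inertia, i + 1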
While looking at the KMeans implementations in cuml and sklearn I noticed that my test data set that took ~2.5s with sklearn took only 25ms with cuml. I assume the speed-up with numba-dpex will be similar. So the main point is: something that takes long now might be super fast in a plugin.
On the one hand, max_iter is typically(?) only O(100), so the overhead of making a few hundred additional function calls is not that great. On the other hand, what data would the callbacks need to make decisions (beyond just progress-bar updates, which you could argue you don't need for fast stuff), and how much does it cost to transfer that off the device?
My question is: do you (or someone else) know what happens to the performance of alternative implementations of k-means when they have to be implemented as "one iteration at a time"? I don't know enough about how k-means on something like a GPU works, but I hope the answer is "nothing happens, this is totally doable", because that would make life easy.
In general it will be implemented "one iteration at a time" (example with the plugin we're working on: https://github.com/soda-inria/sklearn-numba-dpex/blob/main/sklearn_numba_dpex/kmeans/drivers.py#L342, or likewise with the current Cython implementation in sklearn: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_kmeans.py#L599), and the loop over max_iter is in Python, so its overhead is completely negligible.
So the "simplicity" @ogrisel refers to here is only about the design of the engine API itself.
I expect that for most algorithms in scikit-learn, the actual computation time for one fit iteration would be 10x or more the overhead of a GPU dispatch (assuming we keep the data and temporary data structures in device memory between consecutive iterations).
What is the plan regarding this change? I think it would make sense to go for the "one call per iteration" approach (what you propose) now instead of waiting. If that works for you I'll send a PR.
A reason to tackle it now is that the centroids and labels will be arrays of different types for different engines, so the convergence check needs to deal with this (unless we decide the estimator attributes should always be numpy arrays?).
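For illustration, one way to keep the convergence check independent of the array type would be for the engine to reduce the criterion to a plain Python float; this is only a sketch and the method name center_shift is hypothetical:

    def center_shift(self, old_centers, new_centers):
        # Hypothetical engine method: reduce the convergence criterion to a
        # plain Python float, so the estimator-side check never has to know
        # whether the centers are numpy, cupy or other device arrays.
        shift_tot = ((new_centers - old_centers) ** 2).sum()
        return float(shift_tot)  # a host-side scalar is cheap to transfer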
It will also help sharpen the API for how data is passed in to an engine. My proposal would be to settle on:
- prepare_fit(X, y, sample_weight) will be called every time KMeans.fit() is called. The engine should keep hold of the data as it won't be provided again.
- the single-iteration method becomes kmeans_single(current_centroids); X and friends aren't passed again.

In particular, when we call kmeans_single once per iteration we need a way to provide the data to the engine once, so that it can convert it or decide that it would like to raise NotImplemented, etc. If X and friends were passed to kmeans_single on each iteration, the engine would in principle have to be prepared to convert them at each iteration, because there is no promise that they haven't changed since the last time kmeans_single was called.
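A rough sketch of an engine following this proposal; only the method names prepare_fit and kmeans_single come from the comment above, everything else (class name, return values) is illustrative:

    import numpy as np

    class NumPyKMeansEngineSketch:
        def prepare_fit(self, X, y=None, sample_weight=None):
            # Called once per KMeans.fit(): validate/convert the data here and
            # keep a reference, since it will not be passed again.
            self.X = np.ascontiguousarray(X, dtype=np.float64)
            self.sample_weight = sample_weight  # ignored below for brevity

        def kmeans_single(self, current_centroids):
            # One Lloyd iteration on the data stored by prepare_fit().
            sq_dist = ((self.X[:, None, :] - current_centroids[None, :, :]) ** 2).sum(axis=-1)
            labels = sq_dist.argmin(axis=1)
            new_centroids = np.vstack([
                self.X[labels == k].mean(axis=0) if np.any(labels == k) else current_centroids[k]
                for k in range(current_centroids.shape[0])
            ])
            inertia = sq_dist[np.arange(self.X.shape[0]), labels].sum()
            return labels, inertia, new_centroids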
Something that would be harder to do in that case is to re-use things from the Cython engine. For example, currently I use:

def init_centroids(self, X):
    # Run the default centroid init, then move the result to the GPU (cp is cupy).
    return cp.asarray(super().init_centroids(X))

because there is not much benefit in re-implementing the initial centroid selection. In a super strict world where the above proposal is adopted, init_centroids should also not be passed X again; it should use what was previously provided in prepare_fit().
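Under the stricter reading of the proposal, the same override might look like the sketch below, where _X_prepared is a hypothetical attribute set in prepare_fit() (and assumed to still be something the parent implementation can handle):

    def init_centroids(self):
        # Strict variant: X is not passed again; reuse the data kept by
        # prepare_fit(). The attribute name `_X_prepared` is hypothetical.
        return cp.asarray(super().init_centroids(self._X_prepared))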
The point being, I'd support switching to the per-iteration API so we can work these things out with code (instead of our heads).
I think we agree on the implementation strategy here, so you're welcome to send a PR 👍 and let's iterate on the details from there if needed. I think @ogrisel, @jjerphan and I can be available for review.
Regarding:
because there is not much benefit in re-implementing the initial centroid selection.
Actually I can think of two reasons:
- initial centroid selection includes k-means++, which might benefit from optimizations
- it's not likely that super().init_centroids(X) will work if X is not a numpy or scipy array, and it could be a good thing for fit to accept other types of inputs when used with other engines, e.g. cupy arrays for a CUDA-based engine, to save data loading to and from the targeted device (see the sketch below). So, super().init_centroids(X) might need to be rewritten for cases where X is a different object.
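For instance, a CUDA-based engine's prepare_fit could accept both host and device arrays so that data already on the GPU is never copied back and forth; this is a sketch under that assumption, not code from this PR:

    import cupy as cp  # assumed dependency of this hypothetical CUDA engine

    def prepare_fit(self, X, y=None, sample_weight=None):
        # Accept cupy arrays as-is to avoid a device -> host -> device round
        # trip; copy host (numpy) input to the device otherwise.
        self.X = X if isinstance(X, cp.ndarray) else cp.asarray(X)
        self.sample_weight = sample_weight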
I adapted the plugin in soda-inria/sklearn-numba-dpex#74
Co-authored-by: Franck Charras <29153872+fcharras@users.noreply.github.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Rename cluster counting method
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Last round of merges broke the tests 🔴 apparently some issue with the engine getter raising …
@fcharras suggested that the last merge broke the CI without us realizing it because ogrisel#13 targeted the … branch. I think I should push the … branch. Any opinion or better name suggestion @scikit-learn/core-devs? /cc @betatim
When doing so I will probably squash the existing history as most of the original commits have non-informative wip commit messages. Let me know if you are opposed to this plan. Shared co-authorship of the squashed commit should be preserved.
This looks appropriate to me. 👍 Nit: do you think …
Sounds like a good idea and it will be nice to have the benefit of robots checking our work :) No opinion on the branch name.
Closing in favor of the newly created #25535. Let's stop using the …
This is a draft pull-request to allow third-party packages such as https://github.com/soda-inria/sklearn-numba-dpex to contribute alternative implementations of core computational routines of CPU-intensive scikit-learn estimators.
This would be particularly useful to experiment with GPU-optimized alternatives to our CPU-optimized Cython code.
This PR will serve as a design experiment to tackle #22438.
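For context on the "engine entry point" part of the title, third-party packages would typically expose their engines through Python entry points. The sketch below shows how such engines could be discovered; the group name "sklearn_engines" is an assumption, not necessarily the name used in this PR:

    from importlib.metadata import entry_points

    def list_kmeans_engines():
        # Discover engines registered by installed plugins (Python 3.10+ API).
        # The entry point group name "sklearn_engines" is an assumption.
        engines = {}
        for ep in entry_points(group="sklearn_engines"):
            engines[ep.name] = ep.load()  # e.g. an engine class from a plugin
        return engines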