ENH Scalable MiniBatchKMeans plus cln / fixes / refactoring #17622
Conversation
We indeed need a changelog entry for this PR (thanks @cmarmo for the automated check ;)
Thanks @jeremiedbb.
This is my first pass on this PR: I left a few comments and questions, but I still need to spend more time inspecting some of the logic.
cdef void update_center_sparse(
This is a naive question: would it be possible to factor some parts of update_center_sparse with update_center_dense?
It would really not be trivial. The signatures are not the same (for sparse we pass X.data, X.indices and X.indptr), and looping through a CSR matrix is not the same as looping through a dense array.
Even if we managed to somehow factor some part of the code, I think it would be more complicated and less readable.
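For illustration, here is the difference in plain NumPy/SciPy (the PR's Cython code is different; this only shows the looping pattern):

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(4, 6, density=0.3, format="csr", random_state=0)
X_dense = X.toarray()
i = 2  # accumulate row i into a buffer

buf_dense = np.zeros(X.shape[1])
for j in range(X.shape[1]):                    # dense: visit every column
    buf_dense[j] += X_dense[i, j]

buf_sparse = np.zeros(X.shape[1])
for k in range(X.indptr[i], X.indptr[i + 1]):  # CSR: stored entries only
    buf_sparse[X.indices[k]] += X.data[k]

np.testing.assert_allclose(buf_dense, buf_sparse)
```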
cdef void update_center_dense(
Should update_center_dense and update_center_sparse be defined as private functions even if they are not accessible from Python?
The whole module is private. I think we agreed once that functions from a private module are private (the underscore is just a convention).
Phew. One complete pass. Some nitpicks. Overall very clean. Thank you very much, @jeremiedbb.
def _minibatch_update_dense(
Right now, it is very clean code, dense and sparse are separated. Just as a suggestion to potentially reduce code repetition: have only one function

```
def _minibatch_update(
        X,                            # IN
        floating[::1] sample_weight,  # IN
        ...
)
```

and then make the distinction between dense and sparse X before the two `with nogil` blocks?
That's something I'd like to do (actually I tried to do that for kmeans as well when I re-implemented it), but there's a limitation of Cython that makes it not smooth at all: you can't cdef something inside an if statement.
For sparse we need to define 3 memoryviews, while we need 1 pointer for dense. We could always define the 3 memoryviews and the 1 pointer and then leave the non-corresponding ones unused depending on the type of X, but I find that kind of ugly :)
Also, the nogil part actually contains all the code in the function besides the cdefs. I find it a bit weird to have a function whose whole code is:

```
def f(x):
    if condition:
        do something
    else:
        do something else
```

To me it means that it's actually 2 functions, but that's a personal opinion :)
AFAIK, one can define before the if statement and assign after:

```
def function(X):
    cdef floating* X_dense
    cdef floating[::1] X_data

    if is_sparse(X):
        X_data = X.data
        with nogil:
            ...
    else:
        X_dense = &X[0, 0]
        with nogil:
            ...
```

But let's stay with the current code. It's clean and easy to read.
Thanks @jjerphan and @lorentzenchr for the reviews. I think I addressed your comments.
LGTM, I just noticed a typo to fix.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
LGTM. @jeremiedbb Great PR. Let's merge as soon as the merge conflicts are resolved.
The initial goal of this PR was to improve the scalability of MiniBatchKMeans on multi-core machines. While working on that, I noticed several inconsistencies between KMeans and MiniBatchKMeans, dense and sparse, fit and partial_fit, ... so this PR also refactors some parts for more consistency and to better exploit the inheritance of MiniBatchKMeans from KMeans. I also reworked a lot of tests (parametrized as much as possible and fixed some broken tests).
Here's a list of what this PR actually changes:
Scalability on multi-core machines
The computation-intensive part of MiniBatchKMeans is, like KMeans, the computation of the labels and the update of the centers. Computing the labels is the more costly of the two and has been optimized in [MRG] new K-means implementation for improved performances #11950. The update of the centers was sequential and becomes the bottleneck on multi-core machines (see the profile below, taken on a 16-core machine).

Another thing to notice is that the computation of the inertia takes a significant proportion of _labels_inertia, while it shouldn't, because it's sequential. This does not impact KMeans, because the inertia is only computed once at the end, but in MiniBatchKMeans it is computed for each minibatch.
What changed: the update of the centers and the per-batch computation of the inertia are now parallelized, so they no longer dominate on multi-core machines (a sketch of the update being parallelized is given below).
This gives an overall speed-up of 3x on this machine for this dataset.
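To make the arithmetic concrete, here is a minimal NumPy sketch of the per-batch center update that the PR implements in Cython (the real code parallelizes this loop and uses different private names; everything below is illustrative):

```python
import numpy as np

def minibatch_update(X_batch, sample_weight, labels, centers, counts):
    """Weighted running mean of the samples assigned to each center.

    Sequential, illustrative version: the PR performs the equivalent
    work in Cython, in parallel, for dense and sparse inputs.
    """
    for j in range(centers.shape[0]):
        mask = labels == j
        w = sample_weight[mask].sum()
        if w == 0:
            continue  # no sample of this batch is assigned to center j
        # The old center keeps a weight equal to its accumulated count;
        # the batch contributes the weighted sum of its members.
        centers[j] = (counts[j] * centers[j]
                      + sample_weight[mask] @ X_batch[mask]) / (counts[j] + w)
        counts[j] += w
    return centers, counts
```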
Further improvements? The validation of X (assert_all_finite) and the computation of norms are sequential and begin to take a non-negligible proportion of the runtime; parallelizing them could improve many other parts of sklearn. Extracting minibatches is also sequential and takes a bigger proportion as batch_size increases.
Sparse and sample_weights
Bug fix: sample_weights were ignored in the sparse center update. This was not tested; a test has been added.
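The added test is essentially of this form (a hedged sketch; the actual test name, dataset and tolerances in the PR may differ):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
sample_weight = rng.uniform(1, 5, size=100)
init = X[:3]  # fixed init so the dense and sparse runs are comparable

common = dict(n_clusters=3, init=init, n_init=1, random_state=0)
km_dense = MiniBatchKMeans(**common).fit(X, sample_weight=sample_weight)
km_sparse = MiniBatchKMeans(**common).fit(sp.csr_matrix(X),
                                          sample_weight=sample_weight)

# Before the fix, the sparse path silently ignored sample_weight.
np.testing.assert_allclose(km_dense.cluster_centers_,
                           km_sparse.cluster_centers_)
```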
Parameter validation
Put all parameter validation in a dedicated method (_check_params) and added some missing validations. Grouped all tests for parameter validation into a single parametrized test (sketched below).
Likewise, moved the tests for the warnings into dedicated tests.
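The parametrized style looks roughly like this (a sketch; the real test covers many more parameters, and the exact error messages are illustrative):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans, MiniBatchKMeans

@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
@pytest.mark.parametrize("params, match", [
    ({"n_clusters": 0}, "n_clusters"),
    ({"max_iter": 0}, "max_iter"),
    ({"init": "wrong"}, "init"),
])
def test_wrong_params(Estimator, params, match):
    # Every invalid parameter raises a ValueError naming the parameter,
    # for both estimators, from a single test body.
    with pytest.raises(ValueError, match=match):
        Estimator(**params).fit(np.ones((10, 2)))
```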
Sample weight normalization
Done in MNT Don't normalize sample weights in KMeans #17848
Sample weights were normalized to sum to n_samples. This normalization has absolutely no impact on the clustering or on the public attributes, and it was even causing a bug, so I removed it.
Fixes Documentation is wrong about KMeans.inertia_ #16594
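Why the normalization is a no-op for the clustering itself: centers are weighted means, and a weighted mean is invariant under rescaling all the weights by a constant. A quick numeric check (illustrative):

```python
import numpy as np

X = np.array([[0.0], [1.0], [4.0]])
w = np.array([1.0, 2.0, 3.0])

center = (w[:, None] * X).sum(axis=0) / w.sum()               # weighted mean
center_scaled = ((5 * w)[:, None] * X).sum(axis=0) / (5 * w).sum()

np.testing.assert_allclose(center, center_scaled)  # rescaling changes nothing
```

The inertia, on the other hand, is not invariant under that rescaling, which is how the normalization ended up contradicting the documentation of inertia_ (#16594).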
Centers initialization
Factored out some of the code for the initialization of the centers, to do the validation only once and for better consistency between KMeans and MiniBatchKMeans.
Also, the init in MiniBatchKMeans was actually doing a minibatch step on a batch of size init_size, which is usually bigger than batch_size. I consider that a bug, so I removed this minibatch step.
Fit and partial_fit
Better consistency between fit and partial_fit, in particular for data and parameter validation.
Convergence criterion
There are 2 ways to declare convergence: one when the norm of the difference between the old and new centers becomes smaller than a tolerance, and one when the batch inertia does not decrease for a certain number of iterations. More precisely, both criteria used an exponentially weighted average (EWA) to lower the impact of the minibatch noise.
I think it does not make sense to use an EWA for the norm of the difference of the centers, since it takes an extremely long time to become smaller than tol; in my experiments, convergence on inertia was always declared much, much earlier. I therefore removed the EWA there and only check whether the norm of the difference of the centers is smaller than tol. This way both criteria declare convergence at comparable times.
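Roughly, the two criteria now look like this (an illustrative sketch with made-up names, not the actual private code):

```python
import numpy as np

def minibatch_converged(centers_old, centers_new, batch_inertia, state,
                        tol=1e-4, max_no_improvement=10, alpha=0.1):
    """Sketch of the two stopping rules described above."""
    # Rule 1: plain center shift vs tol (the EWA here was removed).
    if ((centers_new - centers_old) ** 2).sum() <= tol:
        return True

    # Rule 2: EWA of the batch inertia, kept to damp minibatch noise;
    # stop after `max_no_improvement` batches without a new minimum.
    ewa = state.get("ewa_inertia")
    ewa = batch_inertia if ewa is None else alpha * batch_inertia + (1 - alpha) * ewa
    state["ewa_inertia"] = ewa
    if ewa < state.get("ewa_inertia_min", np.inf):
        state["ewa_inertia_min"] = ewa
        state["no_improvement"] = 0
    else:
        state["no_improvement"] = state.get("no_improvement", 0) + 1
    return state["no_improvement"] >= max_no_improvement
```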
Tests
sample weights vs repeated samples
In both KMeans and MiniBatchKMeans, we can't guarantee that the results are the same when the init centers are not provided, due to the rng used to sample the init indices. Moreover, in MiniBatchKMeans it's not guaranteed even with provided centers, because of the rng used for the minibatch extraction. The existing test was passing by luck; I updated it to only cover the guaranteed case.
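The guaranteed case boils down to this (a sketch: KMeans with explicitly provided centers, so no rng is involved in the init):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
init = X[:4]  # provided centers: the only guaranteed-equivalent setting

# Integer weights vs the same samples physically repeated.
km_weighted = KMeans(n_clusters=4, init=init, n_init=1).fit(
    X, sample_weight=np.full(20, 2))
km_repeated = KMeans(n_clusters=4, init=init, n_init=1).fit(
    np.repeat(X, 2, axis=0))

np.testing.assert_allclose(km_weighted.cluster_centers_,
                           km_repeated.cluster_centers_)
```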
init centers not mutated
There was a test checking that centers provided as init are not mutated: test_k_means_init_centers. The test was wrong because a copy was made in between. I fixed it and renamed it test_centers_not_mutated.
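The fixed check is essentially (illustrative, not the exact test):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.random.RandomState(0).rand(50, 2)
init = X[:3].copy()
init_before = init.copy()  # the copy must be taken *before* fit

for Estimator in (KMeans, MiniBatchKMeans):
    Estimator(n_clusters=3, init=init, n_init=1).fit(X)
    # `init` must be left untouched: compare against the saved copy,
    # not against a copy made after fit (the bug in the old test).
    np.testing.assert_array_equal(init, init_before)
```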
assert_almost_equal
Removed all remaining assert_almost_equal, assert_raises, assert_warns, ...
Simplified and parametrized many tests. Also grouped them into tests concerning both KMeans and MiniBatchKMeans, only KMeans, and only MiniBatchKMeans.
Public vs private attributes
Done in DOC document and deprecate missing attributes in MiniBatchKMeans #17864
MiniBatchKMeans had 2 private attributes with a trailing underscore: counts_ and random_state_. I made them explicitly private: _counts and _random_state. I wonder whether this requires a deprecation cycle, since neither attribute was documented in the first place.
k-means++
Since the deprecation cycle for making a lot of files private has ended, I renamed the _k_init function to _kmeansplusplus, which is more explicit.
order of operations in minibatch step
Previously the order of a minibatch step was:

1. compute the labels of the batch
2. randomly reassign the centers seen too few times
3. update the centers

so the update was done based on labels computed before the reassignment. I changed it to:

1. compute the labels of the batch
2. update the centers
3. randomly reassign the centers seen too few times (effective from the next batch)

so that the update always uses labels that are consistent with the centers they were computed against.
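Put together, one minibatch step in the new order looks roughly like this (a self-contained NumPy sketch of my reading of the reordering; the names, the reassignment rule and all details are illustrative, not the actual private implementation):

```python
import numpy as np

def minibatch_step(X_batch, sample_weight, centers, counts, rng,
                   reassignment_ratio=0.01):
    # 1. Labels computed against the current centers.
    dists = ((X_batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)

    # 2. Center update using labels consistent with those centers
    #    (same weighted running mean as sketched earlier).
    for j in range(len(centers)):
        mask = labels == j
        w = sample_weight[mask].sum()
        if w > 0:
            centers[j] = (counts[j] * centers[j]
                          + sample_weight[mask] @ X_batch[mask]) / (counts[j] + w)
            counts[j] += w

    # 3. Reassign rarely used centers to random batch samples; this only
    #    affects the labels of the next step.
    to_reassign = counts < reassignment_ratio * counts.max()
    if to_reassign.any():
        new_idx = rng.choice(len(X_batch), size=int(to_reassign.sum()))
        centers[to_reassign] = X_batch[new_idx]
```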