
ENH Scalable MiniBatchKMeans plus cln / fixes / refactoring #17622


Merged (93 commits) on Apr 15, 2021

Conversation

jeremiedbb (Member) commented on Jun 17, 2020

The initial goal of this PR was to improve the scalability of MiniBatchKMeans on multi-core machines. While working on that, I noticed several inconsistencies between KMeans and MiniBatchKMeans, dense and sparse input, fit and partial_fit, ... so this PR also refactors some parts for more consistency and to better exploit MiniBatchKMeans' inheritance from KMeans. I also reworked a lot of tests (parametrized as much as possible and fixed some broken tests).

Here's a list of what this PR actually changes:

  • Scalability on multi-core machine


The computation-intensive part of MiniBatchKMeans is, like KMeans, the computation of the labels and the update of the centers. Computing the labels is the more costly of the two and has been optimized in [MRG] new K-means implementation for improved performances #11950. The update of the centers was sequential and becomes the bottleneck on multi-core machines (see the profile below, on a 16-core machine).
[profiling result on master]

Another thing to notice is that the computation of inertia takes a significant proportion of the time spent in _labels_inertia, while it shouldn't, because it's sequential. It does not impact KMeans, because we only compute inertia once at the end, but in MiniBatchKMeans we compute it for each minibatch.

    What changed:

    • inertia is now parallel
• moved the centers update (sparse and dense) into Cython in a dedicated file, _k_means_minibatch.pyx, and made it parallel

[profiling result with this PR]
    It gives an overall speed-up of 3x on this machine for this dataset.

Further improvements? The validation of X (assert_all_finite) and the computation of norms are sequential and begin to take a non-negligible proportion of the time; parallelizing them could improve many other parts of sklearn. Extracting minibatches is also sequential and takes a bigger proportion as batch_size increases.
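
As a rough way to observe the effect, a timing sketch like the one below can be used (the dataset shape, batch_size and cluster count are illustrative assumptions, not the benchmark behind the profiles above):

import time

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200_000, n_features=50, centers=100, random_state=0)

mbk = MiniBatchKMeans(n_clusters=100, batch_size=1024, n_init=3, random_state=0)
tic = time.perf_counter()
mbk.fit(X)
print(f"fit: {time.perf_counter() - tic:.2f}s, inertia: {mbk.inertia_:.3e}")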

  • Sparse and sample_weights


Bug fix: sample_weight was ignored in the sparse centers update. This was not tested; a test has been added. The update loop was missing the weight factor:

    for k in range(X_indptr[sample_idx], X_indptr[sample_idx + 1]):
        centers[center_idx, X_indices[k]] += X_data[k]
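
For reference, a minimal NumPy/SciPy sketch (illustrative, not the PR's Cython code) of what the weighted accumulation should compute, including the sample_weight factor that was dropped in the sparse path:

import numpy as np

def weighted_center_sum(X_csr, sample_weight, labels, center_idx):
    # Weighted sum of the CSR rows assigned to center_idx; the bug was that
    # the sample_weight factor was missing from this accumulation.
    rows = np.flatnonzero(labels == center_idx)
    weighted = X_csr[rows].multiply(sample_weight[rows, None])
    return np.asarray(weighted.sum(axis=0)).ravel()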
    
  • Parameter validation


Put all parameter validation in a dedicated method (_check_params) and added some missing validations. Grouped all tests for parameter validation into a single parametrized test.
Likewise, moved the tests for the warnings into dedicated tests.
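
As an illustration of the parametrized style (the parameter values and error-message fragments below are assumptions, not the exact cases added in this PR):

import pytest
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

@pytest.mark.parametrize(
    "params, match",
    [
        ({"n_clusters": 0}, "n_clusters"),
        ({"max_iter": 0}, "max_iter"),
        ({"batch_size": -1}, "batch_size"),
    ],
)
def test_minibatch_kmeans_wrong_params(params, match):
    # Each invalid value should be rejected at fit time with an informative error.
    with pytest.raises(ValueError, match=match):
        MiniBatchKMeans(**params).fit(X)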

  • Sample weight normalization


    Done in MNT Don't normalize sample weights in KMeans #17848
Sample weights were normalized to sum up to n_samples. Doing this normalization has absolutely no impact on the clustering or on the public attributes, and it was even causing a bug, so I removed it.
    Fixes Documentation is wrong about KMeans.inertia_ #16594
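
A small sketch of why the normalization was redundant (illustrative, public API only): uniformly rescaling sample_weight leaves labels and centers unchanged, while inertia_ now scales with the weights, which is what the linked issue was about.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
w = np.random.RandomState(0).uniform(0.5, 2.0, size=100)

km_a = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X, sample_weight=w)
km_b = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X, sample_weight=3 * w)

assert np.array_equal(km_a.labels_, km_b.labels_)            # clustering unchanged
np.testing.assert_allclose(km_a.cluster_centers_, km_b.cluster_centers_)
np.testing.assert_allclose(3 * km_a.inertia_, km_b.inertia_)  # inertia scales with the weights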

  • Centers initialization


Factorized some of the code for the initialization of the centers, to do validation only once and for better consistency between KMeans and MiniBatchKMeans.

Also, init in MiniBatchKMeans was actually doing a minibatch step on a batch of size init_size, which is usually bigger than batch_size. I consider it a bug, so I removed that minibatch step.

  • Fit and partial_fit


    Better consistency between fit and partial_fit, in particular for data and parameter validation.
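
For illustration, the two entry points whose validation is now aligned (the batching scheme below is an assumption, not the estimator's internal one):

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=8, random_state=0)

# One-shot fit: the estimator draws its own minibatches internally.
full = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0).fit(X)

# Streaming: the caller feeds minibatches explicitly through partial_fit.
streamed = MiniBatchKMeans(n_clusters=8, random_state=0)
for batch in np.array_split(X, 10):
    streamed.partial_fit(batch)

print(full.inertia_, streamed.cluster_centers_.shape)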

  • Convergence criterion


There are two ways to declare convergence: one when the norm of the difference between old and new centers becomes smaller than a tolerance, and one when the batch inertia stops decreasing for a certain number of iterations. More precisely, both were computed as exponentially weighted averages to lower the impact of the noise.

I think it does not make sense to use an exponentially weighted average for the norm of the centers' difference, since it would take an extremely long time to become smaller than tol. I ran some tests and convergence on inertia is always declared much, much earlier. Thus I removed the EWA and only check whether the norm of the centers' difference is smaller than tol. This way both criteria declare convergence at comparable times.
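
A minimal sketch of the two checks discussed here (the smoothing factor and the exact form are assumptions; this is not the PR's code):

import numpy as np

def center_shift_converged(old_centers, new_centers, tol):
    # Criterion kept by this PR: squared norm of the centers' diff against tol.
    return np.sum((new_centers - old_centers) ** 2) <= tol

def ewa_center_shift(ewa, shift, alpha=0.01):
    # Previous behaviour for this criterion (removed): an exponentially weighted
    # average of the shift, which decays towards tol much more slowly than the
    # raw shift itself.
    return shift if ewa is None else alpha * shift + (1 - alpha) * ewa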

  • Tests

    • sample weights vs repeated samples
In both KMeans and MiniBatchKMeans, we can't guarantee that results with sample weights match results with repeated samples when init centers are not provided, due to the rng used for sampling indices. Moreover, in MiniBatchKMeans it's not guaranteed even with provided centers, because of the rng used for minibatch extraction. The existing test was passing by luck. I updated it to only cover the guaranteed case.

    • init centers not mutated
There was a test checking that centers provided as init are not mutated: test_k_means_init_centers. The test was wrong because a copy was made in between. I fixed it and renamed it test_centers_not_mutated (a sketch of this kind of check follows at the end of this list).

    • assert_almost_equal
Removed all remaining assert_almost_equal, assert_raises, assert_warns, ...

• Simplified and parametrized many tests. Also grouped them by whether they concern both KMeans and MiniBatchKMeans, only KMeans, or only MiniBatchKMeans.
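
As referenced in the "init centers not mutated" item above, a sketch of that kind of check (the names and parametrization here are illustrative, not the PR's exact test):

import numpy as np
import pytest
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_centers_not_mutated(Estimator):
    init_centers = X[:3].copy()
    expected = init_centers.copy()
    Estimator(n_clusters=3, init=init_centers, n_init=1).fit(X)
    # Fitting must not modify the user-provided init array in place.
    assert np.array_equal(init_centers, expected)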

  • Public vs private attributes


    Done in DOC document and deprecate missing attributes in MiniBatchKMeans #17864
MiniBatchKMeans had two private attributes with a trailing underscore: counts_ and random_state_. I made them explicitly private: _counts and _random_state. I wonder if this requires a deprecation cycle, since both attributes were not documented in the first place.

  • k-means++


Since the deprecation cycle for making a lot of files private has ended, I renamed the _k_init function to _kmeansplusplus, which is more explicit.

  • order of operations in minibatch step


    Previously the order was:

    • compute labels
    • reassign some centers
    • update centers

The update was thus done based on labels computed before the reassignment. I changed it to the following (a schematic sketch of one step follows the list):

    • compute labels
    • update centers
    • reassign
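
A schematic sketch of one minibatch step with the new order (the helper logic and reassignment rule below are simplified illustrations, not the actual private implementation):

import numpy as np

def minibatch_step(X_batch, centers, counts, rng, reassign_ratio=0.01):
    # 1) compute labels: assign each batch sample to its closest center
    dists = ((X_batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)

    # 2) update centers: running mean over all samples seen so far per center
    for j in np.unique(labels):
        members = X_batch[labels == j]
        counts[j] += len(members)
        centers[j] += (members.sum(axis=0) - len(members) * centers[j]) / counts[j]

    # 3) reassign: relocate rarely used centers onto random batch samples
    rare = counts < reassign_ratio * counts.max()
    centers[rare] = X_batch[rng.choice(len(X_batch), size=rare.sum())]
    return labels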

@ogrisel ogrisel added this to the 1.0 milestone Mar 12, 2021
ogrisel (Member) commented on Mar 12, 2021

We indeed need a changelog entry for this PR (thanks @cmarmo for the automated check ;)

jjerphan (Member) left a comment:

Thanks @jeremiedbb.

This is my first pass on this PR: I left a few comments and questions, but I still need to spend more time inspecting some of the logic.

free(indices)


cdef void update_center_sparse(
Member:

This is a naive question: would it be possible to factor some parts of update_center_sparse with update_center_dense?

Member Author:

It would really not be trivial. The signature is not the same (for sparse we pass X.data, X.indices, X.indptr) and looping through a CSR matrix is not the same as looping through a dense array.

Even if we managed to somehow factor some part of the code, I think it would be more complicated and less readable.

free(indices)


cdef void update_center_dense(
Member:

Should update_center_dense and update_center_sparse be defined as private functions even if they are not accessible from Python?

Member Author:

The whole module is private. I think we agreed once that functions from a private module are private (the underscore is just a convention).


lorentzenchr (Member) left a comment:

Phew. One complete pass. Some nitpicks. Overall very clean. Thank you very much, @jeremiedbb.

np.import_array()


def _minibatch_update_dense(
Member:

Right now, it is very clean code: dense and sparse are separated. Just a suggestion to potentially reduce code repetition:
Have only one function

def _minibatch_update(
    X,                            # IN
    floating[::1] sample_weight,  # IN
    ...
)

and then make the distinction between dense and sparse X before the two with nogil blocks?

Member Author:

That's something I'd like to do (actually I tried to do that for kmeans as well when I re-implemented it), but there's a limitation of Cython that makes it not smooth at all: you can't cdef something inside an if statement.

For sparse we need to define 3 memoryviews, while we need 1 pointer for dense. We could always define the 3 memoryviews and the pointer and then leave the non-corresponding ones unused depending on the type of X, but I find it kind of ugly :)

Also, the nogil part actually contains all the code in the function besides the cdefs. I find it a bit weird to have a function for which the whole code is:

def f(x):
    if condition:
        do something
    else:
        do something else

To me it means that it's actually 2 functions, but that's a personal opinion :)

Member:

AFAIK, one can define before the if statement and assign after:

def function(X):
    cdef floating* X_dense
    cdef floating[::1] X_data

    if is_sparse(X):
        X_data = X.data    
        with nogil:
            ...
    else:
        X_dense = &X[0, 0]
        with nogil:
            ....

But let's stay with the current code. It's clean and easy to read.

jeremiedbb (Member Author) commented on Apr 7, 2021

Thanks @jjerphan and @lorentzenchr for the reviews. I think I addressed your comments.

jjerphan (Member) left a comment:

LGTM, I just noticed a typo to fix.

Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
lorentzenchr (Member) left a comment:

LGTM. @jeremiedbb Great PR. Let's merge as soon as the merge conflicts are resolved.

@lorentzenchr lorentzenchr changed the title Scalable MiniBatchKMeans plus cln / fixes / refactoring ENH Scalable MiniBatchKMeans plus cln / fixes / refactoring Apr 15, 2021
@lorentzenchr lorentzenchr merged commit 8c4589b into scikit-learn:main Apr 15, 2021
thomasjpfan pushed a commit to thomasjpfan/scikit-learn that referenced this pull request Apr 19, 2021
@glemaitre glemaitre mentioned this pull request Apr 22, 2021

Successfully merging this pull request may close these issues.

Documentation is wrong about KMeans.inertia_