ENH Scalable MiniBatchKMeans plus cln / fixes / refactoring #17622
Conversation
We indeed need a changelog entry for this PR (thanks @cmarmo for the automated check ;)
Thanks @jeremiedbb.
This is my first pass on this PR: I left a few comments and questions, but I still need to spend more time inspecting some of the logic.
cdef void update_center_sparse(
This is a naive question: would it be possible to factor some parts of update_center_sparse with update_center_dense?
It would really not be trivial. The signatures are not the same (for sparse we pass X.data, X.indices and X.indptr), and looping through a CSR matrix is not the same as looping through a dense array.
Even if we managed to somehow factor some part of the code, I think it would be more complicated and less readable.
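For illustration, here is the difference in plain NumPy/SciPy (the PR's Cython code is different; this only shows the looping pattern):

```python
import numpy as np
import scipy.sparse as sp

X = sp.random(4, 6, density=0.3, format="csr", random_state=0)
X_dense = X.toarray()
i = 2  # accumulate row i into a buffer

buf_dense = np.zeros(X.shape[1])
for j in range(X.shape[1]):                    # dense: visit every column
    buf_dense[j] += X_dense[i, j]

buf_sparse = np.zeros(X.shape[1])
for k in range(X.indptr[i], X.indptr[i + 1]):  # CSR: stored entries only
    buf_sparse[X.indices[k]] += X.data[k]

np.testing.assert_allclose(buf_dense, buf_sparse)
```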
cdef void update_center_dense(
Should update_center_dense and update_center_sparse be defined as private functions even if they are not accessible from Python?
The whole module is private. I think we agreed once that functions from a private module are private (the underscore is just a convention).
Phew. One complete pass. Some nitpicks. Overall very clean. Thank you very much, @jeremiedbb.
def _minibatch_update_dense(
Right now, it is very clean code, dense and sparse are separated. Just as a suggestion to potentially reduce code repetition: have only one function

```
def _minibatch_update(
        X,                            # IN
        floating[::1] sample_weight,  # IN
        ...
)
```

and then make the distinction between dense and sparse X before the two `with nogil` blocks?
That's something I'd like to do (actually I tried to do that for kmeans as well when I re-implemented it), but there's a limitation of Cython that makes it not smooth at all: you can't cdef something inside an if statement.
For sparse we need to define 3 memoryviews, while we need 1 pointer for dense. We could always define the 3 memoryviews and the 1 pointer and then leave the non-corresponding ones unused depending on the type of X, but I find that kind of ugly :)
Also, the nogil part actually contains all the code in the function besides the cdefs. I find it a bit weird to have a function whose whole code is:

```
def f(x):
    if condition:
        do something
    else:
        do something else
```

To me it means that it's actually 2 functions, but that's a personal opinion :)
AFAIK, one can define before the if statement and assign after:

```
def function(X):
    cdef floating* X_dense
    cdef floating[::1] X_data

    if is_sparse(X):
        X_data = X.data
        with nogil:
            ...
    else:
        X_dense = &X[0, 0]
        with nogil:
            ...
```

But let's stay with the current code. It's clean and easy to read.
Thanks @jjerphan and @lorentzenchr for the reviews. I think I addressed your comments.
LGTM, I just noticed a typo to fix.
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
LGTM. @jeremiedbb Great PR. Let's merge as soon as the merge conflicts are resolved.
The initial goal of this PR was to improve the scalability of MiniBatchKMeans on multi-core machines. While working on that, I noticed several inconsistencies between KMeans and MiniBatchKMeans, dense and sparse, fit and partial_fit, ... so this PR also refactors some parts for more consistency and to better exploit the inheritance of MiniBatchKMeans from KMeans. I also reworked a lot of tests (parametrized as much as possible and fixed some broken tests).
Here's a list of what this PR actually changes:
Scalability on multi-core machines
The computation-intensive part of MiniBatchKMeans is, like KMeans, the computation of the labels and the update of the centers. Computing the labels is the more costly of the two and has been optimized in [MRG] new K-means implementation for improved performances #11950. The update of the centers was sequential and becomes the bottleneck on multi-core machines (see the profile below, taken on a 16-core machine).

Another thing to notice is that the computation of the inertia takes a significant proportion of _labels_inertia, while it shouldn't, because it's sequential. This does not impact KMeans, because the inertia is only computed once at the end, but in MiniBatchKMeans it is computed for each minibatch.
What changed: the update of the centers and the per-batch computation of the inertia are now parallelized, so they no longer dominate on multi-core machines (a sketch of the update being parallelized is given below).
This gives an overall speed-up of 3x on this machine for this dataset.
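To make the arithmetic concrete, here is a minimal NumPy sketch of the per-batch center update that the PR implements in Cython (the real code parallelizes this loop and uses different private names; everything below is illustrative):

```python
import numpy as np

def minibatch_update(X_batch, sample_weight, labels, centers, counts):
    """Weighted running mean of the samples assigned to each center.

    Sequential, illustrative version: the PR performs the equivalent
    work in Cython, in parallel, for dense and sparse inputs.
    """
    for j in range(centers.shape[0]):
        mask = labels == j
        w = sample_weight[mask].sum()
        if w == 0:
            continue  # no sample of this batch is assigned to center j
        # The old center keeps a weight equal to its accumulated count;
        # the batch contributes the weighted sum of its members.
        centers[j] = (counts[j] * centers[j]
                      + sample_weight[mask] @ X_batch[mask]) / (counts[j] + w)
        counts[j] += w
    return centers, counts
```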
Further improvements? The validation of X (assert_all_finite) and the computation of norms are sequential and begin to take a non-negligible proportion of the runtime; parallelizing them could improve many other parts of sklearn. Extracting minibatches is also sequential and takes a bigger proportion as batch_size increases.
Sparse and sample_weights
Bug fix: sample_weights were ignored in the sparse center update. This was not tested; a test has been added.
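The added test is essentially of this form (a hedged sketch; the actual test name, dataset and tolerances in the PR may differ):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 5)
sample_weight = rng.uniform(1, 5, size=100)
init = X[:3]  # fixed init so the dense and sparse runs are comparable

common = dict(n_clusters=3, init=init, n_init=1, random_state=0)
km_dense = MiniBatchKMeans(**common).fit(X, sample_weight=sample_weight)
km_sparse = MiniBatchKMeans(**common).fit(sp.csr_matrix(X),
                                          sample_weight=sample_weight)

# Before the fix, the sparse path silently ignored sample_weight.
np.testing.assert_allclose(km_dense.cluster_centers_,
                           km_sparse.cluster_centers_)
```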
Parameter validation
Put all parameter validation in a dedicated method (_check_params) and added some missing validations. Grouped all tests for parameter validation into a single parametrized test (sketched below).
Likewise, moved the tests for the warnings into dedicated tests.
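The parametrized style looks roughly like this (a sketch; the real test covers many more parameters, and the exact error messages are illustrative):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans, MiniBatchKMeans

@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
@pytest.mark.parametrize("params, match", [
    ({"n_clusters": 0}, "n_clusters"),
    ({"max_iter": 0}, "max_iter"),
    ({"init": "wrong"}, "init"),
])
def test_wrong_params(Estimator, params, match):
    # Every invalid parameter raises a ValueError naming the parameter,
    # for both estimators, from a single test body.
    with pytest.raises(ValueError, match=match):
        Estimator(**params).fit(np.ones((10, 2)))
```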
Sample weight normalization
Done in MNT Don't normalize sample weights in KMeans #17848
Sample weights were normalized to sum to n_samples. This normalization has absolutely no impact on the clustering or on the public attributes, and it was even causing a bug, so I removed it.
Fixes Documentation is wrong about KMeans.inertia_ #16594
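Why the normalization is a no-op for the clustering itself: centers are weighted means, and a weighted mean is invariant under rescaling all the weights by a constant. A quick numeric check (illustrative):

```python
import numpy as np

X = np.array([[0.0], [1.0], [4.0]])
w = np.array([1.0, 2.0, 3.0])

center = (w[:, None] * X).sum(axis=0) / w.sum()               # weighted mean
center_scaled = ((5 * w)[:, None] * X).sum(axis=0) / (5 * w).sum()

np.testing.assert_allclose(center, center_scaled)  # rescaling changes nothing
```

The inertia, on the other hand, is not invariant under that rescaling, which is how the normalization ended up contradicting the documentation of inertia_ (#16594).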
Centers initialization
Factored out some of the code for the initialization of the centers, to do the validation only once and for better consistency between KMeans and MiniBatchKMeans.
Also, the init in MiniBatchKMeans was actually doing a minibatch step on a batch of size init_size, which is usually bigger than batch_size. I consider that a bug, so I removed this minibatch step.
Fit and partial_fit
Better consistency between fit and partial_fit, in particular for data and parameter validation.
Convergence criterion
There are 2 ways to declare convergence: one when the norm of the difference between the old and new centers becomes smaller than a tolerance, and one when the batch inertia does not decrease for a certain number of iterations. More precisely, both criteria used an exponentially weighted average (EWA) to lower the impact of the minibatch noise.
I think it does not make sense to use an EWA for the norm of the difference of the centers, since it takes an extremely long time to become smaller than tol; in my experiments, convergence on inertia was always declared much, much earlier. I therefore removed the EWA there and only check whether the norm of the difference of the centers is smaller than tol. This way both criteria declare convergence at comparable times.
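Roughly, the two criteria now look like this (an illustrative sketch with made-up names, not the actual private code):

```python
import numpy as np

def minibatch_converged(centers_old, centers_new, batch_inertia, state,
                        tol=1e-4, max_no_improvement=10, alpha=0.1):
    """Sketch of the two stopping rules described above."""
    # Rule 1: plain center shift vs tol (the EWA here was removed).
    if ((centers_new - centers_old) ** 2).sum() <= tol:
        return True

    # Rule 2: EWA of the batch inertia, kept to damp minibatch noise;
    # stop after `max_no_improvement` batches without a new minimum.
    ewa = state.get("ewa_inertia")
    ewa = batch_inertia if ewa is None else alpha * batch_inertia + (1 - alpha) * ewa
    state["ewa_inertia"] = ewa
    if ewa < state.get("ewa_inertia_min", np.inf):
        state["ewa_inertia_min"] = ewa
        state["no_improvement"] = 0
    else:
        state["no_improvement"] = state.get("no_improvement", 0) + 1
    return state["no_improvement"] >= max_no_improvement
```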
Tests
sample weights vs repeated samples
In both KMeans and MiniBatchKMeans, we can't guarantee that the results are the same when the init centers are not provided, due to the rng used to sample the init indices. Moreover, in MiniBatchKMeans it's not guaranteed even with provided centers, because of the rng used for the minibatch extraction. The existing test was passing by luck; I updated it to only cover the guaranteed case.
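The guaranteed case boils down to this (a sketch: KMeans with explicitly provided centers, so no rng is involved in the init):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
init = X[:4]  # provided centers: the only guaranteed-equivalent setting

# Integer weights vs the same samples physically repeated.
km_weighted = KMeans(n_clusters=4, init=init, n_init=1).fit(
    X, sample_weight=np.full(20, 2))
km_repeated = KMeans(n_clusters=4, init=init, n_init=1).fit(
    np.repeat(X, 2, axis=0))

np.testing.assert_allclose(km_weighted.cluster_centers_,
                           km_repeated.cluster_centers_)
```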
init centers not mutated
There was a test checking that centers provided as init are not mutated: test_k_means_init_centers. The test was wrong because a copy was made in between. I fixed it and renamed it test_centers_not_mutated.
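The fixed check is essentially (illustrative, not the exact test):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

X = np.random.RandomState(0).rand(50, 2)
init = X[:3].copy()
init_before = init.copy()  # the copy must be taken *before* fit

for Estimator in (KMeans, MiniBatchKMeans):
    Estimator(n_clusters=3, init=init, n_init=1).fit(X)
    # `init` must be left untouched: compare against the saved copy,
    # not against a copy made after fit (the bug in the old test).
    np.testing.assert_array_equal(init, init_before)
```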
assert_almost_equal
Removed all remaining assert_almost_equal, assert_raises, assert_warns, ...
Simplified and parametrized many tests. Also grouped them into tests concerning both KMeans and MiniBatchKMeans, only KMeans, and only MiniBatchKMeans.
Public vs private attributes
Done in DOC document and deprecate missing attributes in MiniBatchKMeans #17864
MiniBatchKMeans had 2 private attributes with a trailing underscore: counts_ and random_state_. I made them explicitly private: _counts and _random_state. I wonder whether this requires a deprecation cycle, since neither attribute was documented in the first place.
k-means++
Since the deprecation cycle for making a lot of files private has ended, I renamed the _k_init function to _kmeansplusplus, which is more explicit.
order of operations in minibatch step
Previously the order of a minibatch step was:

1. compute the labels of the batch
2. randomly reassign the centers seen too few times
3. update the centers

so the update was done based on labels computed before the reassignment. I changed it to:

1. compute the labels of the batch
2. update the centers
3. randomly reassign the centers seen too few times (effective from the next batch)

so that the update always uses labels that are consistent with the centers they were computed against.
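Put together, one minibatch step in the new order looks roughly like this (a self-contained NumPy sketch of my reading of the reordering; the names, the reassignment rule and all details are illustrative, not the actual private implementation):

```python
import numpy as np

def minibatch_step(X_batch, sample_weight, centers, counts, rng,
                   reassignment_ratio=0.01):
    # 1. Labels computed against the current centers.
    dists = ((X_batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dists.argmin(axis=1)

    # 2. Center update using labels consistent with those centers
    #    (same weighted running mean as sketched earlier).
    for j in range(len(centers)):
        mask = labels == j
        w = sample_weight[mask].sum()
        if w > 0:
            centers[j] = (counts[j] * centers[j]
                          + sample_weight[mask] @ X_batch[mask]) / (counts[j] + w)
            counts[j] += w

    # 3. Reassign rarely used centers to random batch samples; this only
    #    affects the labels of the next step.
    to_reassign = counts < reassignment_ratio * counts.max()
    if to_reassign.any():
        new_idx = rng.choice(len(X_batch), size=int(to_reassign.sum()))
        centers[to_reassign] = X_batch[new_idx]
```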