[MRG+1] Adding support for sample weights to K-Means #10933


Merged
merged 12 commits into scikit-learn:master on May 22, 2018

Conversation

jnhansen
Contributor

@jnhansen jnhansen commented Apr 7, 2018

Reference Issues/PRs

See also #3998.

What does this implement/fix? Explain your changes.

This branch adds support for sample weights to the K-Means algorithm (as well as Minibatch K-Means). This is done by adding the optional parameter sample_weights to KMeans.fit, KMeans.partial_fit, KMeans.predict, KMeans.fit_predict, KMeans.fit_transform, as well as k_means.

Full backwards compatibility of the public methods of the class KMeans and MiniBatchKMeans is maintained.
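For illustration, a minimal usage sketch of the feature (using the singular keyword `sample_weight`, the name the parameter was renamed to before merge; dataset and weights here are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, n_features=2, centers=3, random_state=42)
# Integer weights, e.g. per-sample multiplicities.
w = np.random.RandomState(0).randint(1, 5, size=X.shape[0]).astype(float)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X, sample_weight=w)
```

The same keyword is accepted by `fit`, `predict`, `fit_transform`, and the `k_means` function.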

Any other comments?

@jnhansen jnhansen changed the title Adding support for sample weights to K-Means [MRG] Adding support for sample weights to K-Means Apr 7, 2018
@jnothman
Member

jnothman commented Apr 9, 2018

This is exciting! Before a proper review, can you please benchmark the effect on runtime when passed no weights?

@jnhansen
Contributor Author

jnhansen commented Apr 9, 2018

Yes, will do.

Member

@TomDLT TomDLT left a comment


Just a minor comment.
I need to do a second pass, but this is already very nice.

sample_weights = _check_sample_weights(X, sample_weights)

# verify that the number of samples is equal to the number of weights
if _num_samples(X) != len(sample_weights):
Member

You can also use sklearn.utils.validation.check_consistent_length.
Also, this check does not seem to be tested.

Contributor Author

I could use sklearn.utils.validation.check_consistent_length, but then the error message will be very generic. Would you still prefer this?

I'll add tests for the check method.
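To make the trade-off concrete, here is an illustrative sketch (hypothetical name, not the PR's actual helper) of the validation being discussed: default to uniform weights, otherwise check the length and raise a specific, rather than generic, error message:

```python
import numpy as np

def check_sample_weights(X, sample_weights):
    """Default to uniform weights; otherwise validate the length."""
    n_samples = X.shape[0]
    if sample_weights is None:
        return np.ones(n_samples, dtype=X.dtype)
    sample_weights = np.asarray(sample_weights, dtype=X.dtype)
    if n_samples != len(sample_weights):
        # More specific than check_consistent_length's generic message.
        raise ValueError("n_samples=%d should be == len(sample_weights)=%d"
                         % (n_samples, len(sample_weights)))
    return sample_weights
```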

Member

Right, I have no strong feeling about it, you can keep it this way.

@jnhansen
Copy link
Contributor Author

jnhansen commented Apr 9, 2018

Here's a first idea of the change in performance.

I have this in benchmark_k_means.py:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42)

On branch weighted_k_means:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.8 msec per loop

On branch master:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 36.4 msec per loop

As expected, adding this feature slightly slows things down even when sample_weights is not used. I may have a look and see if I can still improve the performance in the unweighted case.

centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)

for i in range(n_samples):
Member

@TomDLT TomDLT Apr 9, 2018

Don't know if this is slow since Cython is compiled, yet you may try if np.add.at is faster:

weights_sum_in_cluster = np.zeros((n_clusters, ), dtype=dtype)
np.add.at(weights_sum_in_cluster, labels, sample_weights)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
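For context, a quick runnable check of what the suggested np.add.at line computes (toy labels and weights): per-cluster sums of the sample weights, with correct accumulation at repeated indices, which plain fancy-indexed `+=` would drop.

```python
import numpy as np

labels = np.array([0, 2, 0, 1, 2, 2])
sample_weights = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n_clusters = 3

weights_sum_in_cluster = np.zeros(n_clusters)
# Unbuffered add: index 0 receives 1.0 + 3.0, index 2 receives 2.0 + 5.0 + 6.0.
np.add.at(weights_sum_in_cluster, labels, sample_weights)
# weights_sum_in_cluster is now [4.0, 4.0, 13.0]
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
```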

Contributor Author

Unfortunately, that is considerably slower:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 57.5 msec per loop

@jnhansen
Contributor Author

Given that the performance drop is barely noticeable, I'd probably leave it as is. Any additional improvement for the case where sample_weights is None would make the code rather ugly, because it would require a lot of code duplication. Any thoughts?

@TomDLT
Member

TomDLT commented Apr 10, 2018

I did a bit of line profiling on _kmeans_single_elkan.
It seems that the extra multiplication in the inertia computation is the main cause of the ~10% slowdown in your benchmark.
To remove it, maybe you could do something like:

checked_sample_weights = _check_sample_weights(X, sample_weights)
centers, labels, n_iter = k_means_elkan(X, checked_sample_weights,
                                        n_clusters, centers, tol=tol,
                                        max_iter=max_iter, verbose=verbose)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= np.expand_dims(checked_sample_weights, axis=-1)
inertia = np.sum(sq_distances, dtype=np.float64)

I did not look at _kmeans_single_lloyd.
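A runnable sketch of the idea above (function name and signature are illustrative, not the PR's actual code): the weight multiplication is skipped entirely when no weights were passed, so the unweighted path pays no extra cost.

```python
import numpy as np

def weighted_inertia(X, centers, labels, sample_weights=None):
    """Sum of (optionally weighted) squared distances to assigned centers."""
    sq_distances = (X - centers[labels]) ** 2
    if sample_weights is not None:
        sq_distances = sq_distances * np.asarray(sample_weights)[:, np.newaxis]
    return np.sum(sq_distances, dtype=np.float64)
```

With unit weights the result matches the unweighted inertia exactly.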

@jnhansen
Contributor Author

jnhansen commented Apr 10, 2018

Good catch. After implementing your suggestion I cannot make out any statistically significant difference.

For algorithm="full" a ~10% difference remains. I thought it was due to _labels_inertia, but the bulk of the computational effort (~95%) there traces back to sklearn.metrics.pairwise.pairwise_distances_argmin_min, which doesn't even involve the sample weights.

inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= checked_sample_weights[:, np.newaxis]
Member

Actually, on second thought, the operation is expensive because of a useless broadcasting.
Instead of n_samples * n_features multiplications, we can do only n_samples multiplications with:

if sample_weights is not None:
    sq_distances = np.sum(sq_distances, axis=1, dtype=np.float64)
    sq_distances *= checked_sample_weights
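A numerical check of the equivalence behind this optimization (toy data): reducing over features first needs only n_samples multiplications instead of n_samples * n_features, yet yields the same total.

```python
import numpy as np

rng = np.random.RandomState(0)
sq_distances = rng.rand(1000, 5)
w = rng.rand(1000)

# Broadcasted version: n_samples * n_features multiplications.
broadcast = np.sum(sq_distances * w[:, np.newaxis], dtype=np.float64)
# Reduce-first version: only n_samples multiplications.
reduced = np.sum(np.sum(sq_distances, axis=1, dtype=np.float64) * w)

assert np.isclose(broadcast, reduced)
```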

Contributor Author

You are right that this makes the computation faster if sample_weights is not None. It obviously doesn't make a difference when no sample weights are passed.

I should clarify that when I said above that "I cannot make out any statistically significant difference" I was referring to the difference between the weighted_k_means and master branches, not before and after the change in _kmeans_single_elkan.

Member

Sure, but removing the unnecessary broadcasting should speed up the sample_weights is not None case.

Contributor Author

Definitely. I'm testing it right now.

Contributor Author

Okay, new benchmarks, adding:

w = np.ones(X.shape[0])

On branch weighted_k_means:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop

On branch master:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop

All subject to ~ 1 msec variation.

@jnhansen
Contributor Author

In case it gets lost in the collapsed comments above, here are again the current benchmarks.

In benchmark_k_means.py:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
w = np.ones(X.shape[0])
km = KMeans(n_clusters=3, random_state=42)

On branch weighted_k_means:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop

On branch master:

$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop

All subject to ~ 1 msec variation between benchmarks.

Member

@jnothman jnothman left a comment

Nice work! I hope I'm able to review this soon, but the next couple of weeks are full up!

return centers[sort_index, :], sorted_labels


def test_k_means_weighted_vs_repeated():
Member

can we please parametrize or loop these tests to avoid repeating code for KMeans and MBKMeans?
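One way to realize this suggestion is to loop over the two estimator classes instead of duplicating the test body. This is a sketch (assuming the merged `sample_weight` keyword), checking that unit weights reproduce the unweighted result:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
w = np.ones(X.shape[0])

for Estimator in (KMeans, MiniBatchKMeans):
    est_1 = Estimator(n_clusters=3, random_state=42).fit(X)
    est_2 = Estimator(n_clusters=3, random_state=42).fit(X, sample_weight=w)
    # With the same seed, unit weights must not change the result.
    np.testing.assert_allclose(est_1.cluster_centers_, est_2.cluster_centers_)
```

With pytest, the same loop could instead be expressed via `@pytest.mark.parametrize`.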

        est_weighted.cluster_centers_, np.repeat(est_weighted.labels_,
                                                 sample_weights))
    assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
    if not isinstance(estimator, MiniBatchKMeans):
Contributor Author

In case you are wondering why this test is not done for MiniBatchKMeans, that's because it fails. That is not due to the changes in this PR, however. I did a quick test on the master branch, comparing the cluster_centers_ of KMeans and MiniBatchKMeans, and they are not the same.

@jnhansen
Contributor Author

I have no idea why the latest commit fails the lgtm check. I only updated the tests and they all check out.

@jnothman
Member

jnothman commented Apr 16, 2018 via email

@jhelie
Contributor

jhelie commented Apr 16, 2018

Hi guys, sorry about this - it seems that this particular base commit couldn't be built due to an outdated pip cache. We're working on fixing the "retry analysis" link to provide a way to overcome this in the future.

@jnhansen
Contributor Author

Hi guys, has anyone had a chance to review this yet?

Member

@jnothman jnothman left a comment

Basically cosmetic. Good work!

@@ -34,6 +34,7 @@ np.import_array()
@cython.wraparound(False)
@cython.cdivision(True)
cpdef DOUBLE _assign_labels_array(np.ndarray[floating, ndim=2] X,
np.ndarray[floating, ndim=1] sample_weights,
Member

Please use the singular sample_weight for local and global consistency.

Contributor Author

@jnhansen jnhansen May 14, 2018

So just to confirm, sample_weight instead of sample_weights throughout? I had been consistently using the plural everywhere.

Member

Yes, I think so. That is consistent with the rest of the library.

Contributor Author

Alright, will do. Although personally, the singular makes me think that a scalar is requested.

Member

It's quite common to use singular nomenclature for vectors... though we are very inconsistent!

@@ -167,9 +172,10 @@ cpdef DOUBLE _assign_labels_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
def _mini_batch_update_csr(X, np.ndarray[floating, ndim=1] sample_weights,
Member

(argh! Plurals! Do what you like here, I suppose)


dtype = np.float32 if floating is float else np.float64
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
Member

weight_in_cluster would be a sufficient name

@@ -271,6 +277,7 @@ def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
@cython.wraparound(False)
@cython.cdivision(True)
def _centers_dense(np.ndarray[floating, ndim=2] X,
np.ndarray[floating, ndim=1] sample_weights,
Member

In general, though, we should prefer the singular.

@@ -103,7 +103,9 @@ cdef update_labels_distances_inplace(
upper_bounds[sample] = d_c


def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_, int n_clusters,
def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_,
np.ndarray[floating, ndim=1, mode='c'] sample_weights,
Member

sample_weight

        return np.ones(n_samples, dtype=X.dtype)
    else:
        # verify that the number of samples is equal to the number of weights
        if n_samples != len(sample_weights):
Member

We have a helper called check_consistent_length

Contributor Author

Tom made the same remark earlier, and I replied that the error message will then be very generic. I'd prefer the more specific error message, but happy to change.

precompute_distances=self.precompute_distances,
tol=self.tol, random_state=random_state, copy_x=self.copy_x,
n_jobs=self.n_jobs, algorithm=self.algorithm,
return_n_iter=True)
return self

def fit_predict(self, X, y=None):
def fit_predict(self, X, y=None, sample_weights=None):
Member

sample_weight

        est_1.cluster_centers_, est_1.labels_)
    centers_2, labels_2 = _sort_cluster_centers_and_labels(
        est_2.cluster_centers_, est_2.labels_)
    assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
Member

You should be able to get a perfect v_measure even without sorting. The sorting makes for a red herring when reading the code.

Contributor Author

You're right. The sorting of the labels predates my discovery of v_measure_score and is now redundant.
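For reference, a quick demonstration that v_measure_score is invariant to a relabeling of the clusters, which is why the sorting adds nothing:

```python
from sklearn.metrics import v_measure_score

labels_a = [0, 0, 1, 1, 2, 2]
labels_b = [2, 2, 0, 0, 1, 1]  # same partition, permuted label ids
score = v_measure_score(labels_a, labels_b)
# The two labelings describe the same clustering, so the score is perfect.
```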

sample_weights = None
checked_sample_weights = _check_sample_weights(X, sample_weights)
assert_equal(_num_samples(X), _num_samples(checked_sample_weights))
assert_equal(X.dtype, checked_sample_weights.dtype)
Member

If you're doing these checks, why no check that the output sums to 1?

# repetition of the sample
sample_weights = np.random.randint(1, 5, size=n_samples)
X_repeat = np.repeat(X, sample_weights, axis=0)
for estimator in [KMeans(n_clusters=n_clusters, random_state=42),
Member

Need to test the different init approaches as well
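The equivalence this test targets can be sketched as follows (a standalone illustration on well-separated blobs, using the merged `sample_weight` keyword and comparing partitions via v_measure_score rather than exact center equality):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import v_measure_score

# Integer weights should behave like repeating each sample that many times.
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
rng = np.random.RandomState(42)
w = rng.randint(1, 5, size=X.shape[0])
X_repeat = np.repeat(X, w, axis=0)

km_weighted = KMeans(n_clusters=3, n_init=10, random_state=42).fit(
    X, sample_weight=w)
km_repeated = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_repeat)

# Both fits should induce the same partition of the original points.
labels_w = km_weighted.predict(X)
labels_r = km_repeated.predict(X)
```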

    # verify that the number of samples is equal to the number of weights
    if n_samples != len(sample_weight):
        raise ValueError("n_samples=%d should be == len(sample_weight)=%d"
                         % (n_samples, len(sample_weight)))
Contributor Author

I didn't change this to check_consistent_length, because that would result in a very generic error message. Let me know if you'd still like me to change it.

Member

@jnothman jnothman left a comment

LGTM!

@jnothman jnothman changed the title [MRG] Adding support for sample weights to K-Means [MRG+1] Adding support for sample weights to K-Means May 15, 2018
@jnhansen
Contributor Author

@TomDLT?

@TomDLT
Member

TomDLT commented May 16, 2018

Nice work! I'm a bit busy this week; I'll take a closer look next week.

Member

@TomDLT TomDLT left a comment

LGTM

@TomDLT TomDLT merged commit 4b24fbe into scikit-learn:master May 22, 2018
@TomDLT
Member

TomDLT commented May 22, 2018

Oops, I forgot the whats_new entry.
Could you add an entry in doc/whats_new/v0.20.rst in a new PR?

@jnhansen
Contributor Author

Sure I'll do that later today.
