[MRG+1] Adding support for sample weights to K-Means #10933
Conversation
This is exciting! Before a proper review, can you please benchmark the effect on runtime when passed no weights?
Yes, will do.
Just a minor comment.
I need to do a second pass, but this is already very nice.
sklearn/cluster/k_means_.py (outdated)

```python
sample_weights = _check_sample_weights(X, sample_weights)

# verify that the number of samples is equal to the number of weights
if _num_samples(X) != len(sample_weights):
```
You can also use `sklearn.utils.validation.check_consistent_length`.
Also, this check does not seem to be tested.
I could use `sklearn.utils.validation.check_consistent_length`, but then the error message will be very generic. Would you still prefer this?
I'll add tests for the check method.
Right, I have no strong feeling about it, you can keep it this way.
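To make the trade-off concrete, here is a small standalone sketch (not code from the PR) contrasting the generic message raised by `check_consistent_length` with an explicit length check along the lines of the one in this branch; the exact wording of the generic error depends on the scikit-learn version:

```python
import numpy as np
from sklearn.utils.validation import check_consistent_length

X = np.zeros((10, 3))
sample_weights = np.ones(7)

# Generic helper: complains about inconsistent numbers of samples, but the
# message does not mention sample_weights by name.
try:
    check_consistent_length(X, sample_weights)
except ValueError as exc:
    print(exc)

# Explicit check: allows a more targeted error message.
try:
    if X.shape[0] != len(sample_weights):
        raise ValueError("n_samples=%d should be == len(sample_weights)=%d"
                         % (X.shape[0], len(sample_weights)))
except ValueError as exc:
    print(exc)
```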
Here's a first idea of the change in performance. I have this in `benchmark_k_means.py`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
km = KMeans(n_clusters=3, random_state=42)
```

On branch `weighted_k_means`:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.8 msec per loop
```

On branch `master`:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 36.4 msec per loop
```

As expected, adding this feature will slow down the performance very slightly when not using `sample_weights`.
sklearn/cluster/_k_means.pyx (outdated)

```python
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)

for i in range(n_samples):
```
I don't know whether this is slow, since Cython is compiled, but you could try whether `np.add.at` is faster:

```python
weights_sum_in_cluster = np.zeros((n_clusters, ), dtype=dtype)
np.add.at(weights_sum_in_cluster, labels, sample_weights)
empty_clusters = np.where(weights_sum_in_cluster == 0)[0]
```
Unfortunately, that is considerably slower:
```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 57.5 msec per loop
```
Given that the performance drop isn't altogether that noticeable, I'd probably leave it as is. Any additional improvement for the case where no `sample_weights` are passed is of course welcome.
I did a bit of line profiling on `_kmeans_single_elkan`; the weighting of the squared distances could be guarded so that it is skipped when no sample weights are passed:

```python
checked_sample_weights = _check_sample_weights(X, sample_weights)
centers, labels, n_iter = k_means_elkan(X, checked_sample_weights,
                                        n_clusters, centers, tol=tol,
                                        max_iter=max_iter, verbose=verbose)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= np.expand_dims(checked_sample_weights, axis=-1)
inertia = np.sum(sq_distances, dtype=np.float64)
```

I did not look at the other code paths.
Good catch. After implementing your suggestion I cannot make out any statistically significant difference for the unweighted case.
sklearn/cluster/k_means_.py (outdated)

```python
inertia = np.sum((X - centers[labels]) ** 2, dtype=np.float64)
sq_distances = (X - centers[labels]) ** 2
if sample_weights is not None:
    sq_distances *= checked_sample_weights[:, np.newaxis]
```
Actually, on second thought, the operation is expensive because of useless broadcasting. Instead of `n_samples * n_features` multiplications, we can do only `n_samples` multiplications with:

```python
if sample_weights is not None:
    sq_distances = np.sum(sq_distances, axis=1, dtype=np.float64)
    sq_distances *= checked_sample_weights
```
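As a sanity check (a standalone sketch, not code from the PR), both formulations compute the same weighted inertia, while the second avoids the full (n_samples, n_features) broadcast:

```python
import numpy as np

rng = np.random.RandomState(42)
n_samples, n_features = 10000, 5
X = rng.rand(n_samples, n_features)
nearest_centers = rng.rand(n_samples, n_features)  # stand-in for centers[labels]
weights = rng.rand(n_samples)

sq_distances = (X - nearest_centers) ** 2

# Broadcasting version: n_samples * n_features multiplications.
inertia_broadcast = np.sum(sq_distances * weights[:, np.newaxis],
                           dtype=np.float64)

# Reduced version: sum over features first, then only n_samples multiplications.
per_sample = np.sum(sq_distances, axis=1, dtype=np.float64)
inertia_reduced = np.sum(per_sample * weights, dtype=np.float64)

assert np.isclose(inertia_broadcast, inertia_reduced)
```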
You are right that this makes the computation faster if `sample_weights` is not `None`. It obviously doesn't make a difference when no sample weights are passed.
I should clarify that when I said above that "I cannot make out any statistically significant difference" I was referring to the difference between the `weighted_k_means` and `master` branches, not before and after the change in `_kmeans_single_elkan`.
Sure, but removing the unnecessary broadcasting should speed up the `sample_weights is not None` case.
Definitely, I'm testing it right now
Okay, new benchmarks, adding:

```python
w = np.ones(X.shape[0])
```

On branch `weighted_k_means`:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop
```

On branch `master`:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop
```

All subject to ~1 msec variation.
…weights is not None
In case it gets lost in the collapsed comments above, here are again the current benchmarks. In `benchmark_k_means.py`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)
w = np.ones(X.shape[0])
km = KMeans(n_clusters=3, random_state=42)
```

On branch `weighted_k_means`:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km,w' 'km.fit(X,sample_weights=w)'
100 loops, best of 3: 40.1 msec per loop
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 39.4 msec per loop
```

On branch `master`:

```
$ python -m timeit -n 100 -s 'from benchmark_k_means import X,km' 'km.fit(X)'
100 loops, best of 3: 38 msec per loop
```

All subject to ~1 msec variation between benchmarks.
Nice work! I hope I'm able to review this soon, but the next couple of weeks are full up!
```python
    return centers[sort_index, :], sorted_labels


def test_k_means_weighted_vs_repeated():
```
can we please parametrize or loop these tests to avoid repeating code for KMeans and MBKMeans?
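One possible shape for such a parametrization (a sketch only, not the PR's actual tests; the fixture data is made up, and the keyword follows this PR's `sample_weights`, which was later renamed to the singular `sample_weight`):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, n_features=5, centers=3, random_state=42)


@pytest.mark.parametrize("Estimator", [KMeans, MiniBatchKMeans])
def test_sample_weights_smoke(Estimator):
    # Both estimators should accept sample weights and expose the usual
    # fitted attributes; the real tests additionally check the weighted
    # vs. repeated-samples equivalence.
    sample_weights = np.ones(X.shape[0])
    est = Estimator(n_clusters=3, random_state=42).fit(
        X, sample_weights=sample_weights)
    assert est.cluster_centers_.shape == (3, X.shape[1])
    assert est.labels_.shape == (X.shape[0],)
```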
```python
    est_weighted.cluster_centers_, np.repeat(est_weighted.labels_,
                                             sample_weights))
assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
if not isinstance(estimator, MiniBatchKMeans):
```
In case you are wondering why this test is not done for `MiniBatchKMeans`, that's because it fails. That is not due to the changes in this PR, however. I did a quick test on the master branch, comparing the `cluster_centers_` of `KMeans` and `MiniBatchKMeans`, and they are not the same.
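For reference, a quick standalone way to observe this on master (illustrative only, not part of the test suite):

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, n_features=5, centers=3, random_state=42)

km_centers = KMeans(n_clusters=3, random_state=42).fit(X).cluster_centers_
mbkm_centers = MiniBatchKMeans(n_clusters=3, random_state=42).fit(X).cluster_centers_

# Match each KMeans center to its nearest MiniBatchKMeans center before
# comparing, since the cluster ordering is arbitrary.
pairwise = ((km_centers[:, None, :] - mbkm_centers[None, :, :]) ** 2).sum(axis=-1)
order = pairwise.argmin(axis=1)
print(np.abs(km_centers - mbkm_centers[order]).max())  # small but non-zero
```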
I have no idea why the latest commit fails the lgtm check. I only updated the tests and they all check out.
Looks like a temporary outage to me.
Hi guys, sorry about this - it seems that this particular base commit couldn't be built due to an outdated pip cache. We're working on fixing the "retry analysis" link to provide a way to overcome this in the future.
Hi guys, has anyone had a chance to review this yet?
Basically cosmetic. Good work!
sklearn/cluster/_k_means.pyx (outdated)

```diff
@@ -34,6 +34,7 @@ np.import_array()
 @cython.wraparound(False)
 @cython.cdivision(True)
 cpdef DOUBLE _assign_labels_array(np.ndarray[floating, ndim=2] X,
+                                  np.ndarray[floating, ndim=1] sample_weights,
```
Please use singular `sample_weight` for local and global consistency.
So just to confirm, `sample_weight` instead of `sample_weights` throughout? I had been consistently using the plural everywhere.
Yes, I think so. That is consistent with the rest of the library
Alright, will do. Although personally, the singular makes me think that a scalar is requested.
It's quite common to use singular nomenclature for vectors... though we are very inconsistent!
sklearn/cluster/_k_means.pyx (outdated)

```diff
@@ -167,9 +172,10 @@ cpdef DOUBLE _assign_labels_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
 @cython.boundscheck(False)
 @cython.wraparound(False)
 @cython.cdivision(True)
-def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
+def _mini_batch_update_csr(X, np.ndarray[floating, ndim=1] sample_weights,
```
(argh! Plurals! Do what you like here, I suppose)
sklearn/cluster/_k_means.pyx (outdated)

```python
dtype = np.float32 if floating is float else np.float64
centers = np.zeros((n_clusters, n_features), dtype=dtype)
weights_sum_in_cluster = np.zeros((n_clusters,), dtype=dtype)
```
weight_in_cluster would be a sufficient name
sklearn/cluster/_k_means.pyx (outdated)

```diff
@@ -271,6 +277,7 @@ def _mini_batch_update_csr(X, np.ndarray[DOUBLE, ndim=1] x_squared_norms,
 @cython.wraparound(False)
 @cython.cdivision(True)
 def _centers_dense(np.ndarray[floating, ndim=2] X,
+                   np.ndarray[floating, ndim=1] sample_weights,
```
In general, though, we should prefer singular
sklearn/cluster/_k_means_elkan.pyx (outdated)

```diff
@@ -103,7 +103,9 @@ cdef update_labels_distances_inplace(
         upper_bounds[sample] = d_c


-def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_, int n_clusters,
+def k_means_elkan(np.ndarray[floating, ndim=2, mode='c'] X_,
+                  np.ndarray[floating, ndim=1, mode='c'] sample_weights,
```
sample_weight
sklearn/cluster/k_means_.py (outdated)

```python
    return np.ones(n_samples, dtype=X.dtype)
else:
    # verify that the number of samples is equal to the number of weights
    if n_samples != len(sample_weights):
```
We have a helper called check_consistent_length
Tom made the same remark earlier, and I replied that the error message will then be very generic. I'd prefer the more specific error message, but happy to change.
sklearn/cluster/k_means_.py (outdated)

```diff
                 precompute_distances=self.precompute_distances,
                 tol=self.tol, random_state=random_state, copy_x=self.copy_x,
                 n_jobs=self.n_jobs, algorithm=self.algorithm,
                 return_n_iter=True)
         return self

-    def fit_predict(self, X, y=None):
+    def fit_predict(self, X, y=None, sample_weights=None):
```
sample_weight
```python
    est_1.cluster_centers_, est_1.labels_)
centers_2, labels_2 = _sort_cluster_centers_and_labels(
    est_2.cluster_centers_, est_2.labels_)
assert_almost_equal(v_measure_score(labels_1, labels_2), 1.0)
```
You should be able to get the perfect v_measure even without sorting. This makes for a red herring when reading the code
You're right, the sorting of the labels dates from before I discovered `v_measure_score` and is now redundant.
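Indeed, `v_measure_score` compares the partitions themselves and ignores the actual label values, so the sort adds nothing. A quick illustration:

```python
import numpy as np
from sklearn.metrics import v_measure_score

labels_1 = np.array([0, 0, 1, 1, 2, 2])
labels_2 = np.array([2, 2, 0, 0, 1, 1])  # same partition, permuted label ids

print(v_measure_score(labels_1, labels_2))  # 1.0
```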
```python
sample_weights = None
checked_sample_weights = _check_sample_weights(X, sample_weights)
assert_equal(_num_samples(X), _num_samples(checked_sample_weights))
assert_equal(X.dtype, checked_sample_weights.dtype)
```
If you're doing these checks, why no check that the output sums to 1?
```python
# repetition of the sample
sample_weights = np.random.randint(1, 5, size=n_samples)
X_repeat = np.repeat(X, sample_weights, axis=0)
for estimator in [KMeans(n_clusters=n_clusters, random_state=42),
```
Need to test the different init approaches as well
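One possible way to cover the different init strategies (a sketch only, not the PR's tests; it assumes this PR's `sample_weights` keyword):

```python
import numpy as np
import pytest
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

n_clusters = 3
X, _ = make_blobs(n_samples=100, n_features=5, centers=n_clusters,
                  random_state=42)


@pytest.mark.parametrize("init", ["k-means++", "random", X[:n_clusters]])
def test_weighted_kmeans_inits(init):
    # Smoke-test that sample weights work with every init strategy,
    # including an explicit array of initial centers.
    sample_weights = np.ones(X.shape[0])
    km = KMeans(n_clusters=n_clusters, init=init, n_init=1,
                random_state=42).fit(X, sample_weights=sample_weights)
    assert km.cluster_centers_.shape == (n_clusters, X.shape[1])
```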
…d clutter in tests.
sklearn/cluster/k_means_.py (outdated)

```python
# verify that the number of samples is equal to the number of weights
if n_samples != len(sample_weight):
    raise ValueError("n_samples=%d should be == len(sample_weight)=%d"
                     % (n_samples, len(sample_weight)))
```
I didn't change this to `check_consistent_length`, because that would result in a very generic error message. Let me know if you'd still like me to change it.
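Pieced together from the fragments quoted in this review, the validation helper looks roughly like the following. This is a reconstruction for readability, not the exact code in the PR (in particular, how the weights array is converted is an assumption):

```python
import numpy as np
from sklearn.utils.validation import _num_samples


def _check_sample_weights(X, sample_weight):
    """Set sample_weight to unit weights if None, otherwise validate it."""
    n_samples = _num_samples(X)
    if sample_weight is None:
        # Default: every sample gets unit weight, matching X's dtype.
        return np.ones(n_samples, dtype=X.dtype)
    else:
        sample_weight = np.asarray(sample_weight, dtype=X.dtype)
        # verify that the number of samples is equal to the number of weights
        if n_samples != len(sample_weight):
            raise ValueError("n_samples=%d should be == len(sample_weight)=%d"
                             % (n_samples, len(sample_weight)))
        return sample_weight
```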
LGTM!
Nice work! I am a bit busy this week, I'll take a closer look next week.
LGTM
Oops, I forgot the whats_new.
Sure, I'll do that later today.
Reference Issues/PRs
See also #3998.
What does this implement/fix? Explain your changes.
This branch adds support for sample weights to the K-Means algorithm (as well as Minibatch K-Means). This is done by adding the optional parameter `sample_weights` to `KMeans.fit`, `KMeans.partial_fit`, `KMeans.predict`, `KMeans.fit_predict`, `KMeans.fit_transform`, as well as `k_means`. Full backwards compatibility of the public methods of the classes `KMeans` and `MiniBatchKMeans` is maintained.

Any other comments?
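As a usage sketch (illustrative only; it uses this PR's `sample_weights` keyword, which reviewers asked to rename to the singular `sample_weight`):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, n_features=5, centers=3, random_state=42)
w = np.random.RandomState(0).rand(X.shape[0])  # one non-negative weight per sample

km = KMeans(n_clusters=3, random_state=42)
km.fit(X, sample_weights=w)               # weighted fit
labels = km.predict(X, sample_weights=w)  # weights are accepted here as well

# Omitting the parameter keeps the previous, unweighted behaviour.
km.fit(X)
```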