
added sample_weight support to K-Means and Minibatch K-means #4218


Closed
wants to merge 4 commits into scikit-learn:master from xbhsu:wgtd-kmeans

Conversation


@ghost ghost commented Feb 7, 2015

This addresses issue #3998 raised by @jnothman and adds sample_weight support to two more clustering algorithms.

Centroids in the K-Means algorithm are now given by the center-of-mass coordinates of the (weighted) samples, and the algorithm now optimizes the weighted inertia.

\sum_{i=0}^{n} \min_{\mu_j \in C} w_i (||x_i - \mu_j||^2)

where the centers are given by the center of mass of the clusters

\mu_j = \sum_{x_i \in C_j} w_i x_i / \sum_{x_i \in C_j} w_i
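
As a quick numerical sanity check (toy data, not from the PR), this weighted center of mass can be reproduced with numpy's np.average:

import numpy as np

# Three samples forming one cluster, with weights w
X_cluster = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
w = np.array([2.0, 1.0, 1.0])

# mu = sum_i w_i x_i / sum_i w_i
mu = np.average(X_cluster, axis=0, weights=w)
print(mu)  # [0.5 0.5]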

The signatures of the fit methods have been changed to fit(self, X, y=None, sample_weight=None) for consistency with the rest of the library. sample_weight should be a numpy array of length n_samples containing the weight assigned to each sample. If sample_weight is None (the default), a trivial weight of 1 is assigned to all samples.
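
For illustration, a minimal usage sketch of the proposed signature (toy data; behaviour as described above):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])
# The first sample counts three times as much as the others
sample_weight = np.array([3.0, 1.0, 1.0, 1.0])

km = KMeans(n_clusters=2, random_state=0)
km.fit(X, sample_weight=sample_weight)  # sample_weight=None => uniform weights
print(km.labels_, km.cluster_centers_, km.inertia_)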

In addition, tests were added to:

  1. Check that the weighted inertia scales as expected under constant, non-trivial (i.e. not equal to 1) weights
  2. Check that constant, non-trivial weights yield the same labels and center coordinates as the unweighted run (see the sketch below)
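
A sketch of what such a test could look like (hypothetical values, not the PR's actual test code):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(50, 2)

km_unit = KMeans(n_clusters=3, random_state=0).fit(X, sample_weight=np.ones(50))
km_scaled = KMeans(n_clusters=3, random_state=0).fit(X, sample_weight=np.full(50, 3.0))

# 1. the weighted inertia scales linearly with a constant weight factor
assert np.isclose(km_scaled.inertia_, 3.0 * km_unit.inertia_)
# 2. constant weights leave labels and centers unchanged
assert np.array_equal(km_scaled.labels_, km_unit.labels_)
assert np.allclose(km_scaled.cluster_centers_, km_unit.cluster_centers_)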

@coveralls

Coverage increased (+0.01%) to 95.03% when pulling 3e78842 on xbhsu:wgtd-kmeans into 4629366 on scikit-learn:master.

@ghost
Author

ghost commented Feb 8, 2015

I had to change

np.multiply(sample_weight[center_mask, np.newaxis], X[center_mask])

to

np.multiply(sample_weight[center_mask][:, np.newaxis], X[center_mask])

since the former is only supported in numpy >= 1.8.0 (see the release notes) and was causing Travis failures on tests run against earlier numpy versions.
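
For reference, a toy illustration of the two spellings:

import numpy as np

w = np.arange(5, dtype=float)
mask = w > 1.0

# Portable: apply the boolean mask first, then add the new axis
col = w[mask][:, np.newaxis]  # shape (3, 1)

# One-step fancy indexing with np.newaxis inside the index tuple;
# only supported on numpy >= 1.8:
# col = w[mask, np.newaxis]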

@agramfort
Member

Does this change the speed of the non-weighted code?

Please run the pep8 checker on your files; I saw some long lines.

fixed typos

fix compatibility issues

backward compatible np.newaxis

pep8 compliance
@coveralls

Coverage increased (+0.01%) to 95.03% when pulling bbba285 on xbhsu:wgtd-kmeans into 4629366 on scikit-learn:master.

stop checking sample_weights if all ones

simplified sample weight checks

pep8
@coveralls

Coverage increased (+0.01%) to 95.03% when pulling da0253a on xbhsu:wgtd-kmeans into 4629366 on scikit-learn:master.


for i in range(labels.shape[0]):
    add_row_csr(data, indices, indptr, i, centers[labels[i]])
    for ind in range(indptr[i], indptr[i + 1]):
Author


Implemented data[i, :] * sample_weight[i] directly in Cython to improve performance on sparse X by inlining add_row_csr.
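
A plain-Python sketch of what the inlined loop computes (toy data; the real loop lives in the Cython source):

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
labels = np.array([0, 1, 0])
sample_weight = np.array([1.0, 2.0, 0.5])
centers = np.zeros((2, 2))

data, indices, indptr = X.data, X.indices, X.indptr
for i in range(X.shape[0]):
    c, w = labels[i], sample_weight[i]
    # accumulate the weighted row X[i, :] into its assigned center,
    # walking the CSR structure directly instead of calling add_row_csr
    for ind in range(indptr[i], indptr[i + 1]):
        centers[c, indices[ind]] += w * data[ind]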

@ghost
Author

ghost commented Feb 10, 2015

Thanks for the feedback @agramfort! To get an idea of how the weighted code compares to the unweighted code speed-wise, I ran the fit method for KMeans and MiniBatchKMeans with the weighted and unweighted code, averaging over ten runs. I've shared the gist here. For me, I got the following performance characteristics on dense and sparse inputs (labelled *_csr below):
[figure_1: fit-time benchmark, weighted vs. unweighted code on dense and *_csr inputs]
The weighted code is a little slower, and the difference does grow with sample size. Perhaps it's an issue with creating np.ones(shape=(n_samples,)) or the calls to np.multiply? Something that struck me as odd is that the sparse K-Means method (graph "kmeans_csr") is consistently much slower than K-Means on dense inputs. I have yet to track down the source of this, but maybe you already know the answer?
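
(The gist isn't reproduced here; a rough sketch of the timing methodology described above, with hypothetical sizes, would be:)

import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(10000, 20)

for weights in (None, np.ones(X.shape[0])):
    runs = []
    for _ in range(10):  # average over ten runs
        t0 = time.time()
        KMeans(n_clusters=8, n_init=1, random_state=0).fit(X, sample_weight=weights)
        runs.append(time.time() - t0)
    print("weighted" if weights is not None else "unweighted", np.mean(runs))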

@agramfort
Member

agramfort commented Feb 10, 2015 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 10, 2015 via email

@amueller
Member

If the data is dense, sparse matrix formats are slower than dense formats, so your observation is expected.

@jnothman
Member

I must admit that I don't fully understand the use case of a weighted k-means.

It's very possible I'm mistaken, and hoped for feedback at #3998. The assumption is that sometimes you want to reduce your dataset to a set of representative instances, but not all representatives indicate equal density in the original space. The simplest example is where you have duplicate objects (for instance, I've been working with clustering documents with similar sets of XPaths, and for some such signatures there are many duplicates) and can't afford to represent each of the duplicated instances separately in terms of time efficiency. BIRCH in theory should be performing global clustering with its subcluster centres weighted, but atm does not.


@ghost
Author

ghost commented Feb 10, 2015

My understanding is that they're also used in modeling populations where the samples are not necessarily representative of the entire population (e.g. the proportions of subpopulations might be off). For instance, data from the CDC, the WHO, consumer-spending surveys, etc. often comes with weights. People also use weights if they've collected data on a sample of consumers but are trying to extrapolate to a larger market.

@agramfort
Member

agramfort commented Feb 11, 2015 via email

@amueller
Member

amueller commented Mar 9, 2015

This might be useful here: #4357. I like the addition, but I'd prefer it if it didn't have a performance impact when no weights are present. Could you not make all the calculations (or at least the ones making it slower) conditional?

speed improvements

added tests

pep8 compliance

stylistic changes
@ghost
Author

ghost commented Mar 12, 2015

Sorry for the delay. I got busy and then I broke something :)

Anyway, I've rewritten it so that the speed doesn't drop so dramatically. The weighting is now implemented conditionally.
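
Roughly, the idea is the following (a hypothetical helper for illustration, not the PR's actual code):

import numpy as np

def cluster_center(X, center_mask, sample_weight=None):
    # weighted (or plain) mean of the rows selected by center_mask
    if sample_weight is None:
        # fast path: no ones() allocation, no extra multiply
        return X[center_mask].mean(axis=0)
    w = sample_weight[center_mask]
    return (w[:, np.newaxis] * X[center_mask]).sum(axis=0) / w.sum()

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
mask = np.array([True, True, False])
print(cluster_center(X, mask))                             # [1. 0.]
print(cluster_center(X, mask, np.array([3.0, 1.0, 1.0])))  # [0.5 0. ]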

I've also updated the benchmark gist. I increased n_init for MiniBatchKMeans to get a more stable benchmark. With this rewrite, the implementation performs comparably to the unweighted version.
[figure_1: updated benchmark, weighted vs. unweighted fit times]

@amueller
Member

I'd much prefer this, but I think your branching is not optimal yet; you have a lot of code duplication. Why don't you put the if around the places where the code actually differs?

@ghost
Author

ghost commented Mar 12, 2015

I looked into that earlier, but when it's deep inside the nested for-loops the if gets evaluated a lot. That gave a noticeable slowdown, especially for MiniBatchKMeans.

@amueller
Member

Huh, OK. I would have thought the if wouldn't matter compared to the matrix operations, but I see now that these are actually vector-matrix operations. You do benchmark on pretty small data, though.
Too bad Cython doesn't have real templates ^^

@vote539

vote539 commented Jul 15, 2015

One example of an application for K-Means with sample weights is the offline clustering step in the CluStream data-stream clustering algorithm (see the bottom of section 4). I wrote a Python implementation of CluStream (which I'm planning to share in a few weeks), but I only have a crude temporary solution for performing the offline clustering step until scikit-learn's K-Means supports sample weights properly.

@qinhanmin2014
Member

Resolved in #10933, thanks @xbhsu
