
added sample_weight support to K-Means and Minibatch K-means #4218


Closed
wants to merge 4 commits into scikit-learn:master from xbhsu:wgtd-kmeans

Conversation


@ghost ghost commented Feb 7, 2015

This addresses issue #3998 raised by @jnothman and adds sample_weight support to two more clustering algorithms.

Centroids in the K-Means algorithm are now given by the center-of-mass coordinates of the (weighted) samples, and the algorithm now optimizes the weighted inertia.

\sum_{i=0}^{n} \min_{\mu_j \in C} w_i (||x_i - \mu_j||^2)

where the centers are given by the center of mass of the clusters

\mu_j = \sum_{x_i \in C_j} w_i x_i / \sum_{x_i \in C_j} w_i
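
As a quick numerical sanity check (toy data, not from the PR), this weighted center of mass can be reproduced with numpy's np.average:

import numpy as np

# Three samples forming one cluster, with weights w
X_cluster = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
w = np.array([2.0, 1.0, 1.0])

# mu = sum_i w_i x_i / sum_i w_i
mu = np.average(X_cluster, axis=0, weights=w)
print(mu)  # [0.5 0.5]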

The signatures of the fit methods have been changed to fit(self, X, y=None, sample_weight=None) for consistency with the rest of the library. sample_weight should be a numpy array of length n_samples containing the weight assigned to each sample. If sample_weight is None (the default), a trivial weight of 1 is assigned to all samples.
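
For illustration, a minimal usage sketch of the proposed signature (toy data; behaviour as described above):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])
# The first sample counts three times as much as the others
sample_weight = np.array([3.0, 1.0, 1.0, 1.0])

km = KMeans(n_clusters=2, random_state=0)
km.fit(X, sample_weight=sample_weight)  # sample_weight=None => uniform weights
print(km.labels_, km.cluster_centers_, km.inertia_)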

In addition, tests were added to:

  1. Check that the weighted inertia scales as expected under constant, non-trivial (i.e. not equal to 1) weights
  2. Check that constant, non-trivial weights yield the same labels and center coordinates as the unweighted run (see the sketch below)
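
A sketch of what such a test could look like (hypothetical values, not the PR's actual test code):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(50, 2)

km_unit = KMeans(n_clusters=3, random_state=0).fit(X, sample_weight=np.ones(50))
km_scaled = KMeans(n_clusters=3, random_state=0).fit(X, sample_weight=np.full(50, 3.0))

# 1. the weighted inertia scales linearly with a constant weight factor
assert np.isclose(km_scaled.inertia_, 3.0 * km_unit.inertia_)
# 2. constant weights leave labels and centers unchanged
assert np.array_equal(km_scaled.labels_, km_unit.labels_)
assert np.allclose(km_scaled.cluster_centers_, km_unit.cluster_centers_)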

@coveralls

Coverage increased (+0.01%) to 95.03% when pulling 3e78842 on xbhsu:wgtd-kmeans into 4629366 on scikit-learn:master.

@ghost
Author

ghost commented Feb 8, 2015

I had to change

np.multiply(sample_weight[center_mask, np.newaxis], X[center_mask])

to

np.multiply(sample_weight[center_mask][:, np.newaxis], X[center_mask])

since the former is only supported in numpy >= 1.8.0 (see the release notes) and was causing Travis failures on tests run against earlier numpy versions.
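
For reference, a toy illustration of the two spellings:

import numpy as np

w = np.arange(5, dtype=float)
mask = w > 1.0

# Portable: apply the boolean mask first, then add the new axis
col = w[mask][:, np.newaxis]  # shape (3, 1)

# One-step fancy indexing with np.newaxis inside the index tuple;
# only supported on numpy >= 1.8:
# col = w[mask, np.newaxis]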

@agramfort
Member

Does this change the speed of the non-weighted code?

Please run the pep8 checker on your files; I saw some long lines.

fixed typos

fix compatibility issues

backward compatible np.newaxis

pep8 compliance
@coveralls

Coverage increased (+0.01%) to 95.03% when pulling bbba285 on xbhsu:wgtd-kmeans into 4629366 on scikit-learn:master.

stop checking sample_weights if all ones

simplified sample weight checks

pep8
@coveralls

Coverage increased (+0.01%) to 95.03% when pulling da0253a on xbhsu:wgtd-kmeans into 4629366 on scikit-learn:master.


for i in range(labels.shape[0]):
    add_row_csr(data, indices, indptr, i, centers[labels[i]])
    for ind in range(indptr[i], indptr[i + 1]):
Author


Implemented data[i, :] * sample_weight[i] directly in Cython to improve performance on sparse X by inlining add_row_csr.
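
A plain-Python sketch of what the inlined loop computes (toy data; the real loop lives in the Cython source):

import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
labels = np.array([0, 1, 0])
sample_weight = np.array([1.0, 2.0, 0.5])
centers = np.zeros((2, 2))

data, indices, indptr = X.data, X.indices, X.indptr
for i in range(X.shape[0]):
    c, w = labels[i], sample_weight[i]
    # accumulate the weighted row X[i, :] into its assigned center,
    # walking the CSR structure directly instead of calling add_row_csr
    for ind in range(indptr[i], indptr[i + 1]):
        centers[c, indices[ind]] += w * data[ind]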

@ghost
Author

ghost commented Feb 10, 2015

Thanks for the feedback @agramfort! To get an idea of how the weighted code compares to the unweighted code speed-wise, I ran the fit method for KMeans and MiniBatchKMeans with the weighted and unweighted code, averaging over ten runs. I've shared the gist here. For me, I got the following performance characteristics on dense and sparse inputs (labelled *_csr below):
[figure_1: fit-time benchmark, weighted vs. unweighted code on dense and *_csr inputs]
The weighted code is a little slower, and the difference does grow with sample size. Perhaps it's an issue with creating np.ones(shape=(n_samples,)) or the calls to np.multiply? Something that struck me as odd is that the sparse K-Means method (graph "kmeans_csr") is consistently much slower than K-Means on dense inputs. I have yet to track down the source of this, but maybe you already know the answer?
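
(The gist isn't reproduced here; a rough sketch of the timing methodology described above, with hypothetical sizes, would be:)

import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(10000, 20)

for weights in (None, np.ones(X.shape[0])):
    runs = []
    for _ in range(10):  # average over ten runs
        t0 = time.time()
        KMeans(n_clusters=8, n_init=1, random_state=0).fit(X, sample_weight=weights)
        runs.append(time.time() - t0)
    print("weighted" if weights is not None else "unweighted", np.mean(runs))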

@agramfort
Member

agramfort commented Feb 10, 2015 via email

@GaelVaroquaux
Member

GaelVaroquaux commented Feb 10, 2015 via email

@amueller
Member

If the data is dense, sparse matrix formats are slower than dense formats, so your observation is expected.

@jnothman
Member

I must admit that I don't fully understand the use case of a weighted k-means.

It's very possible I'm mistaken, and hoped for feedback at #3998. The assumption is that sometimes you want to reduce your dataset to a set of representative instances, but not all representatives indicate equal density in the original space. The simplest example is where you have duplicate objects (for instance, I've been working with clustering documents with similar sets of XPaths, and for some such signatures there are many duplicates) and can't afford to represent each of the duplicated instances separately in terms of time efficiency. BIRCH in theory should be performing global clustering with its subcluster centres weighted, but atm does not.


@ghost
Author

ghost commented Feb 10, 2015

My understanding is that they're also used in modeling populations where the samples are not necessarily representative of the entire population (e.g. the proportions of subpopulations might be off). For instance, data from the CDC, the WHO, consumer-spending surveys, etc. often comes with weights. People also use weights if they've collected data on a sample of consumers but are trying to extrapolate to a larger market.

@agramfort
Member

agramfort commented Feb 11, 2015 via email

@amueller
Member

amueller commented Mar 9, 2015

This might be useful here: #4357. I like the addition, but I'd prefer it if it didn't have a performance impact when no weights are present. Could you not make all the calculations (or at least the ones making it slower) conditional?

speed improvements

added tests

pep8 compliance

stylistic changes
@ghost
Author

ghost commented Mar 12, 2015

Sorry for the delay. I got busy and then I broke something :)

Anyway, I've rewritten it so that the speed doesn't drop so dramatically. The weighting is now implemented conditionally.
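
Roughly, the idea is the following (a hypothetical helper for illustration, not the PR's actual code):

import numpy as np

def cluster_center(X, center_mask, sample_weight=None):
    # weighted (or plain) mean of the rows selected by center_mask
    if sample_weight is None:
        # fast path: no ones() allocation, no extra multiply
        return X[center_mask].mean(axis=0)
    w = sample_weight[center_mask]
    return (w[:, np.newaxis] * X[center_mask]).sum(axis=0) / w.sum()

X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
mask = np.array([True, True, False])
print(cluster_center(X, mask))                             # [1. 0.]
print(cluster_center(X, mask, np.array([3.0, 1.0, 1.0])))  # [0.5 0. ]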

I've also updated the benchmark gist. I increased n_init for MiniBatchKMeans to get a more stable benchmark. With this rewrite, the implementation performs comparably to the unweighted version.
[figure_1: updated benchmark, weighted vs. unweighted fit times]

@amueller
Member

I'd much prefer this, but I think your branching is not optimal yet; you have a lot of code duplication. Why don't you put the if around the places where the code actually differs?

@ghost
Author

ghost commented Mar 12, 2015

I looked into that earlier, but when it's deep inside the nested for-loops the if gets evaluated a lot. That gave a noticeable slowdown, especially for MiniBatchKMeans.

@amueller
Member

Huh, OK. I would have thought the if wouldn't matter compared to the matrix operations, but I see now that these are actually vector-matrix operations. You do benchmark on pretty small data, though.
Too bad Cython doesn't have real templates ^^

@vote539

vote539 commented Jul 15, 2015

One example of an application for K-Means with sample weights is the offline clustering step in the CluStream data-stream clustering algorithm (see the bottom of section 4). I wrote a Python implementation of CluStream (which I'm planning to share in a few weeks), but I only have a crude temporary solution for performing the offline clustering step until scikit-learn's K-Means supports sample weights properly.

@qinhanmin2014
Member

Resolved in #10933, thanks @xbhsu
