added sample_weight support to K-Means and Minibatch K-means #4218
Conversation
added c source
I had to change `np.multiply(sample_weight[center_mask, np.newaxis], X[center_mask])` to `np.multiply(sample_weight[center_mask][:, np.newaxis], X[center_mask])`, since the former is supported in
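For anyone following along, here is a small NumPy sketch (the data is made up for illustration) of why the second form works: indexing the 1-D weight array with the boolean mask first, and only then adding an axis, yields a column vector that broadcasts across the masked rows of X. Whether the combined `[center_mask, np.newaxis]` index is accepted has varied across NumPy versions, hence the change.

```python
import numpy as np

# Illustrative data, not from the PR.
sample_weight = np.array([1.0, 2.0, 3.0, 4.0])
X = np.arange(8.0).reshape(4, 2)
center_mask = np.array([True, False, True, True])

# Mask the 1-D weights first, then add an axis: shape (3,) -> (3, 1),
# which broadcasts against the (3, 2) masked rows of X.
weighted = np.multiply(sample_weight[center_mask][:, np.newaxis],
                       X[center_mask])
print(weighted)  # each masked row of X scaled by its weight
```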
Does this change the speed of the non-weighted code? Please run the PEP 8 checker on your files; I saw some long lines.
fixed typos fix compatibility issues backward compatible np.newaxis pep8 compliance
stop checking sample_weights if all ones simplified sample weight checks pep8
sklearn/cluster/_k_means.pyx
    for i in range(labels.shape[0]):
        add_row_csr(data, indices, indptr, i, centers[labels[i]])
        for ind in range(indptr[i], indptr[i + 1]):
Implemented `data[i, :] * sample_weight[i]` directly in Cython to improve performance on sparse X, by inlining `add_row_csr`.
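As a pure-Python sketch (not the actual Cython code) of what the inlined loop computes: each CSR row `X[i, :]`, scaled by `sample_weight[i]`, is accumulated into the center its label points to. The function name `add_weighted_rows_csr` is illustrative.

```python
import numpy as np

def add_weighted_rows_csr(data, indices, indptr, labels, sample_weight,
                          centers):
    """Accumulate sample_weight[i] * X[i, :] into centers[labels[i]],
    where X is a CSR matrix given by (data, indices, indptr)."""
    for i in range(labels.shape[0]):
        for ind in range(indptr[i], indptr[i + 1]):
            centers[labels[i], indices[ind]] += sample_weight[i] * data[ind]
    return centers

# CSR representation of [[1, 0], [0, 2], [3, 4]] (illustrative data).
data = np.array([1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 1, 0, 1])
indptr = np.array([0, 1, 2, 4])
labels = np.array([0, 1, 0])
sample_weight = np.array([1.0, 2.0, 0.5])
centers = np.zeros((2, 2))
add_weighted_rows_csr(data, indices, indptr, labels, sample_weight, centers)
```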
Thanks for the feedback @agramfort! To get an idea of how the weighted code compares to the unweighted code speed-wise, I ran the benchmark
It's a bit of a shame to reduce performance for the 98% use case by adding this feature.
> It's a bit of a shame to reduce performance for the 98% use case by adding this feature.

I must admit that I don't fully understand the use case of a weighted k-means. Therefore I tend to agree with you.
If the data is dense, sparse matrix formats are slower than dense formats, so your observation is expected.
It's very possible I'm mistaken, and hoped for feedback at #3998.

On 11 February 2015 at 04:43, Andreas Mueller notifications@github.com wrote:
My understanding is that they're also used in modeling populations where the samples are not necessarily representative of the entire population (e.g. the proportion of subpopulations might be off). For instance, CDC, WHO, and consumer-spending data often come with weights. People also use weights if they've collected data on a sample of consumers but are trying to extrapolate to a larger market.
How about a convincing example along these lines? I am a bit scared of "this could be useful in the near future".
This might be useful here: #4357. I like the addition, but I'd prefer it if it didn't have a performance impact when no weights are present. Could you not make all the calculations (or at least the ones making it slower) conditional?
speed improvements added tests pep8 compliance stylistic changes
Sorry for the delay. I got busy and then I broke something :) Anyway, I've rewritten it so that the speed doesn't drop so dramatically. The weighting is now implemented conditionally. I've also updated the gist for the benchmark. I increased the `n_init` for MiniBatchKMeans for a more stable benchmark. With this rewrite, the implementation performs comparably with the unweighted version.
I'd much prefer this, but I think your branching is not optimal yet. You have a lot of code duplication. Why don't you put the if around the places where the code is actually different?
I looked into that earlier, but when it's deep in the nested for-loops the if-check adds measurable overhead.
Huh, ok. I would have thought the if wouldn't matter compared to the matrix operations, but I see now that these are actually vector-matrix operations. You do benchmark on pretty small data, though.
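One common way to avoid the per-iteration branch being discussed here is to hoist the `sample_weight` check out of the inner loop entirely, so the unweighted path pays no extra cost. A minimal NumPy sketch (the function name `relocate_centers` is hypothetical, not the PR's actual code):

```python
import numpy as np

def relocate_centers(X, labels, n_clusters, sample_weight=None):
    """Recompute cluster centers; the sample_weight check is hoisted
    outside the per-cluster loop instead of being tested per sample."""
    centers = np.empty((n_clusters, X.shape[1]))
    if sample_weight is None:
        # Unweighted path: plain per-cluster means, no weight arithmetic.
        for j in range(n_clusters):
            centers[j] = X[labels == j].mean(axis=0)
    else:
        # Weighted path: weighted mean of each cluster's samples.
        for j in range(n_clusters):
            mask = labels == j
            w = sample_weight[mask]
            centers[j] = (w[:, np.newaxis] * X[mask]).sum(axis=0) / w.sum()
    return centers

X = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
c_plain = relocate_centers(X, labels, 2)
c_weighted = relocate_centers(X, labels, 2, sample_weight=np.array([1.0, 3.0, 1.0]))
```

The trade-off is exactly the code duplication mentioned above: two loop bodies instead of one branchy one.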
One example of an application for K-Means with sample weights is the offline clustering step in the CluStream data stream clustering algorithm (see the bottom of section 4). I wrote a Python implementation of CluStream (which I'm planning to share in a few weeks), but I only have a crude temporary solution for performing the offline clustering step until scikit-learn's K-Means supports sample weights properly.
Resolved in #10933, thanks @xbhsu |
This addresses issue #3998 raised by @jnothman and adds sample_weight support to two more clustering algorithms.
Centroids in the K-Means algorithm are now given by the center-of-mass coordinates of the (weighted) samples, and the algorithm now optimizes the weighted inertia

    sum_i w_i * ||x_i - c_label(i)||^2

where the centers are given by the center of mass of the clusters:

    c_j = sum_{i in cluster j} w_i * x_i / sum_{i in cluster j} w_i
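To illustrate the intended semantics, here is a NumPy-only sketch (not the scikit-learn implementation; `weighted_center_of_mass` is an illustrative name) showing that an integer weight of k behaves like repeating that sample k times:

```python
import numpy as np

def weighted_center_of_mass(X, sample_weight):
    """Weighted center of mass: sum_i w_i x_i / sum_i w_i."""
    w = sample_weight[:, np.newaxis]
    return (w * X).sum(axis=0) / sample_weight.sum()

X = np.array([[0.0, 0.0], [4.0, 0.0]])
w = np.array([3.0, 1.0])

# A weight of 3 on the first sample is equivalent to it appearing 3 times.
X_rep = np.repeat(X, w.astype(int), axis=0)
c_weighted = weighted_center_of_mass(X, w)
c_repeated = X_rep.mean(axis=0)
```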
The signatures of the fit methods have been modified to

    fit(self, X, y=None, sample_weight=None)

to be consistent with fit signatures elsewhere. `sample_weight` should be a numpy array of length `n_samples` containing the weight assigned to each sample. If `sample_weight` is None (the default), a trivial weight of 1 is assigned to all samples. In addition, tests were added to: