Scalable Kmeans++ also known as k-means|| #4357
Comments
What do you mean? Could you give us a reference?
Please give a reference to a scientific paper.
This looks interesting: it is true that kmeans++ doesn't scale well to large datasets. We have found (in my group) that in the MiniBatchKMeans, work on the initialization was critical to get good results. Maybe this paper can improve things.
As I understand it, scalable kmeans can be implemented by modifying _k_init(). But I am not able to figure out what needs to be changed. Can you please give me a clue if you have any idea?
I had the same issues @GaelVaroquaux. The paper doesn't seem widely cited but it is worth a try.
@jagankr I haven't looked at the paper (yet) but you are more likely to get an answer with a more specific question. What is the issue you are having? Is the strategy not clear from the paper?
I'm not sure about the last step. Does that mean running weighted K-Means?
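For reference, that is how I read the paper: the final step weights each sampled candidate center by the number of input points closest to it, then reclusters the weighted candidates into k centers. A minimal sketch, assuming X is the data and C the candidate centers produced by the oversampling rounds (the helper name is hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

def recluster_candidates(X, C, k, random_state=0):
    """Final step of k-means|| (sketch): weight each candidate center in C
    by the number of points in X closest to it, then recluster the weighted
    candidates into k initial centers with weighted k-means."""
    closest = pairwise_distances_argmin(X, C)         # nearest candidate per point
    weights = np.bincount(closest, minlength=len(C))  # points per candidate
    km = KMeans(n_clusters=k, n_init=1, random_state=random_state)
    km.fit(C, sample_weight=weights)                  # weighted k-means on the candidates
    return km.cluster_centers_
```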
Will you please give an application of k-means clustering over a large dataset that enables us to compare kmeans and kmeans++? By compare I mean it should show a notable time difference; I think that is only possible with an application that runs over a large dataset.
There are some comparisons of minibatch k-means and kmeans, right? You can try on 20 newsgroups, or maybe mnist, or repeated mnist? We had a function to nudge images, which you could apply to mnist, to generate more data....
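For what it's worth, a minimal timing comparison along those lines could look like this sketch (MNIST via fetch_openml; dataset and parameters are only illustrative):

```python
from time import perf_counter
from sklearn.datasets import fetch_openml
from sklearn.cluster import KMeans

# MNIST: 70k samples x 784 features, large enough for init time to matter.
X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)

for init in ("k-means++", "random"):
    t0 = perf_counter()
    KMeans(n_clusters=10, init=init, n_init=1, random_state=0).fit(X)
    # Times include the Lloyd iterations, so differences reflect both
    # the init cost itself and the quality of the resulting init.
    print(f"init={init}: {perf_counter() - t0:.1f}s")
```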
Is anyone working on this one in the end? If not, I'd like to give it a try.
go for it! :D
The right reference is this one, with 1200+ citations to date.
An implementation for this is available in PR #5530.
doesn't seem to work :(
@zermelozf Can I work on it? Also, here is a summary of why this should work (as studied from the authors' work).

What's wrong with kmeans++: it is inherently sequential. Each new center is sampled from a distribution that depends on all previously chosen centers, so choosing k centers takes k passes over the data.

Intuition for a solution: after it chooses the first center, kmeans++ has a distribution over the data points; it samples one point from that distribution as the next center and then updates the distribution. What Bahmani (the author) says is that instead of this we want to oversample the data: in each round, sample many points at once, each with probability proportional to its current cost, repeat for a few rounds, and finally recluster the weighted candidates down to k centers.

The above algorithm is k-means parallel (k-means||). @zermelozf I would appreciate any valuable input from you, since you put some time into reading about and implementing this. Thanks a lot.
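For concreteness, here is a rough numpy sketch of the oversampling rounds described above (based on Algorithm 2 of the Bahmani et al. paper; parameter choices are illustrative, and the returned candidates still need the weighted reclustering step sketched earlier in the thread):

```python
import numpy as np

def kmeans_parallel_init(X, k, l=None, n_rounds=5, random_state=0):
    """Sketch of the k-means|| oversampling scheme (Bahmani et al., VLDB 2012).

    Returns an oversampled set of candidate centers, which must then be
    weighted and reclustered into k final centers.
    """
    rng = np.random.default_rng(random_state)
    l = l or 2 * k  # oversampling factor; the paper suggests O(k)
    # Pick the first center uniformly at random.
    C = X[rng.integers(len(X))][None, :]
    # Cost of each point: squared distance to its nearest chosen center.
    d2 = ((X - C[0]) ** 2).sum(axis=1)
    for _ in range(n_rounds):  # the paper uses O(log(cost)) rounds
        # Sample each point independently with prob. min(1, l * cost / total_cost).
        probs = np.minimum(1.0, l * d2 / d2.sum())
        new = X[rng.random(len(X)) < probs]
        if len(new) == 0:
            continue
        C = np.vstack([C, new])
        # Update distances against the newly added centers only.
        # (A real implementation would chunk this pairwise computation.)
        d2_new = ((X[:, None, :] - new[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        d2 = np.minimum(d2, d2_new)
    return C
```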
Sure. My first attempt at implementing this technique was not successful, but feel free to take over. FYI, the technique I tried to implement was based on parallelism. However, the cores may already be busy running different initialisations in parallel, if I remember correctly.
Thanks a lot @zermelozf
Hi, is anyone working on this at the moment? I would like to take it up.
There have been a few attempts, as you can see in the thread, @shrutim1, but nobody seems to be working on it at the moment. So it's up for grabs.
@adrinjalali seems like a good summer project for GSoC.
I will look at this. Thank you.
On Tue, Jan 15, 2019, 11:52 AM Shubham Bhardwaj wrote:
@shrutim1, there is a repo which you might want to consider looking at. A very nice attempt at this issue. Here is the link: https://github.com/SheliaXin/Scalable-K-means-
@GaelVaroquaux here is the C++ code: https://github.com/flashxio/knor. Is that interesting/useful? We've never tried to make a PR to sklearn, but we're certainly open to it...
is that interesting/useful?
The inclusion criteria for scikit-learn are quite strict:
https://scikit-learn.org/dev/faq.html
I am not sure that it would make it through them.
Hey @GaelVaroquaux, right, we have a more efficient implementation of the k-means++ algorithm, not a new algorithm, so I think based on this:
our stuff gets the exact same answer as k-means++, it just runs much faster. So it seems like it would be OK?
@jovo sorry for the late reply. We are definitely interested in optimizing the implementation of existing methods. However, our code base tries to leverage the Python syntax as much as possible to make it readable by data scientists familiar with Python (our user base). I would be interested in learning more about the main factors that make your implementation faster.

Another very efficient C++ implementation of k-means++ and k-means|| (a.k.a. "scalable kmeans++") is available in Intel DAAL and daal4py: https://intelpython.github.io/daal4py/algorithms.html?highlight=kmeans#daal4py.kmeans_init

Since DAAL implements both kmeans++ and kmeans||, benchmarking it might help us decide whether or not it's worth investing time in trying to implement kmeans|| (either with Python+numpy or Cython+BLAS) in scikit-learn.

@jeremiedbb also experimented with a subsampled variant of kmeans++ that seems to give very good results.
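A rough benchmark along those lines might look like the sketch below. It assumes the daal4py kmeans_init interface from the linked docs, where, as I read them, "plusPlusDense" corresponds to k-means++ and "parallelPlusDense" to k-means||; those method strings should be double-checked against the documentation:

```python
from time import perf_counter
import numpy as np
import daal4py as d4p

# Synthetic data; sizes are only illustrative.
X = np.random.default_rng(0).standard_normal((1_000_000, 50))

# Assumed method names per the linked daal4py docs (worth verifying):
# "plusPlusDense" = k-means++ init, "parallelPlusDense" = k-means|| init.
for method in ("plusPlusDense", "parallelPlusDense"):
    t0 = perf_counter()
    centroids = d4p.kmeans_init(nClusters=100, method=method).compute(X).centroids
    print(f"{method}: {perf_counter() - t0:.1f}s, centroids shape {centroids.shape}")
```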
@ogrisel, regarding the performance @jovo mentioned: the optimizations we implement significantly improve the performance of kmeans++ on multicore NUMA machines running the algorithm in parallel. No performance improvement is obtained when running serially or on single-core machines. We rely on NUMA memory allocation policies, NUMA-sensitive dynamic task dispatching, and NUMA-sensitive thread binding policies, among other optimizations. The optimizations are performed in low-level C, but can be exposed to a Python interface through bindings. We describe the framework we developed to support this here
@disa-mhembere I thought you also had a parallelized implementation of k-means++, which, at the time, was not otherwise implemented anywhere? And also some improvements to k-means algorithms via leveraging the triangle inequality more effectively than prior implementations?
I made a summary of our current understanding of the perf issues of k-means++ in this discussion #10924 (comment). Before considering k-means||, I would like to finalize the implementation improvement presented there and maybe the algorithmic improvement based on subsampling. All of them should be easy to maintain and should deliver >95% of the perf benefit of more complex methods for >95% of our users.
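I don't know the details of that experiment, but the subsampling idea itself can be sketched in a few lines with the public kmeans_plusplus helper available in recent scikit-learn versions (the function name and subsample size below are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus

def subsampled_kmeans_plusplus(X, k, subsample=10_000, random_state=0):
    """Sketch: run k-means++ on a uniform random subsample of X, so the
    init cost scales with the subsample size instead of the full dataset."""
    rng = np.random.default_rng(random_state)
    idx = rng.choice(len(X), size=min(subsample, len(X)), replace=False)
    centers, _ = kmeans_plusplus(X[idx], n_clusters=k, random_state=random_state)
    return centers

# The returned centers can then be passed directly as the init of KMeans:
# KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
```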
After so many years, the papers have accumulated plenty of citations. But I don't know whether they have already been superseded.
I think that one of scikit-learn's goals is to keep its usage simple: to me, there's value in providing only a few relevant and efficient implementations, hence I wonder whether a new initialization method would fulfill this goal. I am not against a new variant, but I would only consider accepting it if an implementation exists and has been shown to be competitive with scikit-learn's implementation of kmeans++ (hence I would label this issue as "Help Wanted"). What do other maintainers think?
I'd label it as "help wanted". cuML supports the "scalable kmeans++" init: https://docs.rapids.ai/api/cuml/stable/api/#k-means-clustering
I am not against a new variant, but I would only consider accepting it if an implementation exists and has been shown to be competitive with scikit-learn's implementation of kmeans++

I agree: if it demonstrates a compelling improvement, I'm in favor.
Before implementing kmeans||, I think we should optimize our existing implementation of k-means++ using Cython and possibly prange/openmp, as we already did a while ago for the subsequent main loop of KMeans.
I can see the kmeans++ implementation here, but do you have a scalable k-means implementation, which is supposed to be better than kmeans++?
EDIT (reference):
Scalable K-Means++, Bahman Bahmani et al.
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
This algorithm is also known as "k-means||"