Scalable Kmeans++ also known as k-means|| #4357

Open
jagankr opened this issue Mar 7, 2015 · 37 comments

Labels: Enhancement, help wanted, Large Scale, Moderate, module:cluster, Performance

Comments

@jagankr commented Mar 7, 2015

I can see the k-means++ implementation here, but do you have a scalable k-means implementation, which is supposed to be better than k-means++?

EDIT (reference):

Scalable K-Means++, Bahman Bahmani et al.
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

This algorithm is also known as "k-means||"

@GaelVaroquaux (Member) commented Mar 7, 2015 via email

@jagankr (Author) commented Mar 7, 2015

@GaelVaroquaux (Member) commented Mar 7, 2015 via email

@jagankr (Author) commented Mar 7, 2015

GaelVaroquaux changed the title from "Scalable Kmeans" to "Scalable Kmeans++" Mar 7, 2015
@GaelVaroquaux (Member)

This looks interesting: it is true that kmeans++ doesn't scale well to large datasets. We have found (in my group) that in MiniBatchKMeans, work on the initialization was critical to getting good results. Maybe this paper can improve things.

@jagankr (Author) commented Mar 8, 2015

As I understand it, scalable k-means can be implemented by modifying _k_init(), but I am not able to figure out what needs to be changed. Can you please give me a clue if you have any idea?
Thanks in advance.

@amueller (Member) commented Mar 9, 2015

I had the same issues, @GaelVaroquaux. The paper doesn't seem widely cited, but it is worth a try.
Btw, I found that the init is often the most expensive part of KMeans, and I could use KMeans instead of MiniBatchKMeans once I switched the init from kmeans++ to random.

@amueller (Member) commented Mar 9, 2015

@jagankr I haven't looked at the paper (yet) but you are more likely to get an answer with a more specific question. What is the issue you are having? Is the strategy not clear from the paper?

@amueller (Member) commented Mar 9, 2015

I'm not sure about the last step:

"Recluster the weighted points in C into k clusters"

Does that mean running weighted K-Means? Then we'd need #4218, I think.
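
For context, scikit-learn's KMeans.fit later gained a sample_weight argument (the feature tracked in #4218), so that final recluster step could look like the following sketch. The function name and signature are illustrative, not from the paper or scikit-learn:

```python
# Sketch of the final k-means|| step: recluster the oversampled candidate
# centers C down to k, weighting each candidate by the number of input
# points closest to it. Assumes a scikit-learn version where KMeans.fit
# accepts sample_weight.
from sklearn.cluster import KMeans

def recluster_candidates(C, weights, k, seed=0):
    # C: (m, d) candidate centers with m >> k; weights: (m,) point counts.
    km = KMeans(n_clusters=k, random_state=seed).fit(C, sample_weight=weights)
    return km.cluster_centers_
```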

@jagankr (Author) commented Mar 11, 2015

Will you please give an application of k-means clustering over a large dataset that enables us to compare k-means and k-means++?

@jagankr (Author) commented Mar 11, 2015

By compare, I mean it should show a notable time difference. I think that is possible only with an application that runs over a large dataset.

jagankr closed this as completed Mar 11, 2015
jagankr reopened this Mar 11, 2015
@amueller (Member)

There are some comparisons of minibatch k-means and k-means, right? You can try on 20 newsgroups, or maybe MNIST, or repeated MNIST? We had a function to nudge images, which you could apply to MNIST to generate more data...
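
A minimal timing comparison along those lines, using synthetic blobs instead of MNIST so the snippet stays self-contained (sizes are illustrative):

```python
# Rough comparison of KMeans fit time with k-means++ vs random init.
import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, n_features=50, centers=100, random_state=0)
for init in ("k-means++", "random"):
    km = KMeans(n_clusters=100, init=init, n_init=1, random_state=0)
    tic = time.perf_counter()
    km.fit(X)
    print(f"{init}: {time.perf_counter() - tic:.1f}s  inertia={km.inertia_:.3e}")
```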

@zermelozf (Contributor)

Is anyone working on this one in the end? If not, I'd like to give it a try.

@raghavrv (Member)

go for it! :D

@giorgiop (Contributor)

The right reference is this one, with 1200+ citations to date.

@fabianp (Member) commented Oct 22, 2015

An implementation for this is available in PR #5530.

@amueller (Member)

doesn't seem to work :(

@shubham0704

@zermelozf, can I work on it?

Also, here is a summary of why this should work, based on the author's talk about the algorithm (https://www.youtube.com/watch?v=cigXAxV3XcY):

What's wrong with k-means++:

  • It needs k passes over the data; in applications with large data, k can be something like 100 or even 1000.

Intuition for a solution:

After k-means++ chooses the first center, it has a distribution over the data points; it draws one point from that distribution and then updates the distribution.

So what Bahman (the author) says is that, instead of this, we want to oversample the data:

  • We want to pick each point independently, rather than drawing points one at a time from a sequentially updated distribution. This is very useful for a distributed implementation, because each point independently gets to decide whether it should be in the set of centers or not. Intuitively, this corresponds to updating your distribution much less frequently, which is a much coarser sampling. So the way it works is: given a distribution, you pick a lot of points from it, then update your distribution, and from the distribution you get after this step you again pick a lot of points.

The above algorithm is k-means parallel (k-means||). It may not be clear from the algorithm alone that this is actually sufficient, but the authors offer a proof of its soundness. A rough sketch of the sampling loop is shown below.
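
Here is a rough NumPy sketch of that sampling loop, following the paper's description as summarized above. The function and parameter names are mine, not scikit-learn API, and the brute-force distance updates are for clarity, not efficiency:

```python
import numpy as np

def kmeans_parallel_init(X, n_clusters, oversampling=2.0, n_rounds=5, seed=0):
    """Candidate-center sampling of k-means|| (Bahmani et al., 2012)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Start from a single uniformly sampled center.
    centers = X[rng.integers(n_samples)][np.newaxis, :]
    # Squared distance of every point to its closest chosen center so far.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    ell = oversampling * n_clusters  # expected number of picks per round
    for _ in range(n_rounds):  # the paper suggests O(log(initial cost)) rounds
        # Each point joins the candidate set *independently*, with
        # probability proportional to its current squared distance.
        keep = rng.random(n_samples) < np.minimum(ell * d2 / d2.sum(), 1.0)
        new = X[keep]
        if len(new) == 0:
            continue
        centers = np.vstack([centers, new])
        # Only the freshly added centers can tighten the distances.
        d2 = np.minimum(d2, ((X[:, None] - new[None]) ** 2).sum(-1).min(axis=1))
    # Weight each candidate by how many points are closest to it; the final
    # step reclusters these weighted candidates down to k (e.g. k-means++).
    labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    weights = np.bincount(labels, minlength=len(centers))
    return centers, weights
```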

@zermelozf, I would appreciate any input from you, since you put some time into reading about and implementing this. Thanks a lot.

@zermelozf (Contributor)

@shubham0704

Sure. My first attempt at implementing this technique was not successful, but feel free to take over. FYI, the technique I tried to implement was based on parallelism. However, if I remember correctly, the cores may already be busy running different initialisations in parallel.

@shubham0704

Thanks a lot @zermelozf

@shrutim1

Hi,
I wanted to know whether this issue is still open.

@adrinjalali (Member)

There have been a few attempts, as you can see in the thread, @shrutim1, but nobody seems to be working on it ATM. So it's up for grabs.

adrinjalali added the help wanted and Moderate labels Jan 12, 2019
@rrkarim commented Jan 12, 2019

@adrinjalali seems like a good summer project for GSoC.

@shubham0704

@shrutim1: there is a repo you might want to consider looking at, a very nice attempt at this issue. Here is the link.

@shrutim1 commented Jan 15, 2019 via email

@jovo commented Mar 6, 2019

@GaelVaroquaux
@shrutim1
We have this paper: https://arxiv.org/abs/1606.08905
(published in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017).

Here is the C++ code: https://github.com/flashxio/knor
and the Python bindings: https://github.com/flashxio/knorPy

Is that interesting/useful?

We've never tried to make a PR to sklearn, but we're certainly open to it...

@GaelVaroquaux (Member) commented Mar 6, 2019 via email

@jovo commented Mar 8, 2019

Hey @GaelVaroquaux, right, we have a more efficient implementation of the k-means++ algorithm, not a new algorithm, so I think, based on this:

"A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion."

our stuff gets the exact same answer as k-means++, just runs much faster.

So it seems like it would be OK?

ogrisel changed the title from "Scalable Kmeans++" to "Scalable Kmeans++ also known as k-means||" May 29, 2020
@ogrisel (Member) commented May 29, 2020

@jovo sorry for the late reply. We are definitely interested in optimizing the implementation of existing methods. However, our code base tries to leverage Python syntax as much as possible to keep it readable by data scientists familiar with Python (our user base).

I would be interested in learning more about the main factors that make your implementation faster.

Another very efficient C++ implementation of k-means++ and k-means|| (a.k.a. "scalable kmeans++") is available in Intel DAAL and daal4py:

https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_cpp_api/group__kmeans__init.htm

https://intelpython.github.io/daal4py/algorithms.html?highlight=kmeans#daal4py.kmeans_init

Since DAAL implements both kmeans++ and kmeans||, benchmarking it might help us decide whether or not it's worth investing time in trying to implement kmeans|| (either with Python+numpy or Cython+BLAS) in scikit-learn.
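
For instance, a quick benchmark of the two DAAL initializations could look like this sketch (assuming daal4py is installed; the method names follow the daal4py docs linked above, and the data sizes are illustrative):

```python
# Timing DAAL's k-means++ ("plusPlus") vs k-means|| ("parallelPlus") inits.
import time
import numpy as np
import daal4py as d4p

X = np.random.rand(1_000_000, 50)
for method in ("plusPlusDense", "parallelPlusDense"):
    algo = d4p.kmeans_init(100, method=method)  # 100 initial centroids
    tic = time.perf_counter()
    centroids = algo.compute(X).centroids
    print(f"{method}: {centroids.shape} in {time.perf_counter() - tic:.2f}s")
```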

@jeremiedbb also experimented with a subsampled variant of kmeans++ that seems to give very good results.

@disa-mhembere

@ogrisel, regarding the performance @jovo mentioned: the optimizations we implement significantly improve the performance of kmeans++ on multicore NUMA machines running the algorithm in parallel. No performance improvement is made when running serially or on single-core machines. We rely on NUMA memory allocation policies, NUMA-sensitive dynamic task dispatching, and NUMA-sensitive thread binding policies, among other optimizations. The optimizations are performed in low-level C, but can be exposed to a Python interface through bindings. We describe the framework we developed to support this here.

@jovo commented Aug 25, 2020

@disa-mhembere I thought you also had a parallelized implementation of k-means++, which, at the time, was not otherwise implemented anywhere? And also some improvements to k-means algorithms via leveraging the triangle inequality more effectively than prior implementations?

@ogrisel (Member) commented Dec 16, 2020

I made a summary of our current understanding of the perf issues of k-means++ in this discussion: #10924 (comment).

Before considering k-means||, I would like to finalize the implementation improvement presented there, and maybe the algorithmic improvement based on subsampling.

All of them should be easy to maintain and should yield >95% of the perf benefit of more complex methods for >95% of our users.
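
For reference, a minimal sketch of the subsampling idea as I understand it (illustrative sizes, not @jeremiedbb's actual experiment): run the standard k-means++ seeding on a uniform subsample, then use the resulting centers to initialize the full fit.

```python
# Subsampled k-means++ init: seed on a subsample, fit on the full data.
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000_000, centers=100, random_state=0)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=50_000, replace=False)  # uniform subsample
centers, _ = kmeans_plusplus(X[idx], n_clusters=100, random_state=0)
km = KMeans(n_clusters=100, init=centers, n_init=1).fit(X)
```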

@lorentzenchr (Member)

After so many years, the papers have enough citations. But I don't know whether they have already been superseded.
@jeremiedbb @jjerphan @ogrisel What is your opinion? Close, or set to "help wanted"?

@jjerphan (Member) commented May 2, 2023

I think that one of scikit-learn's goals is to keep its usage simple: to me, there's value in providing only a few relevant and efficient implementations, so I wonder whether a new initialization method would fulfill this goal.

I am not against a new variant, but I would only consider accepting it if an implementation exists and has been shown to be competitive with scikit-learn's implementation of kmeans++ (hence I would label this issue as "Help Wanted").

What do other maintainers think?

@betatim (Member) commented May 2, 2023

I'd label it as "help wanted". cuML supports the "scalable kmeans++" init: https://docs.rapids.ai/api/cuml/stable/api/#k-means-clustering

@GaelVaroquaux (Member) commented May 10, 2023 via email

@ogrisel (Member) commented Sep 27, 2023

Before implementing kmeans||, I think we should optimize our existing implementation of k-means++ using Cython and possibly prange/OpenMP, as we already did a while ago for the subsequent main loop of KMeans.fit.
