Scalable Kmeans++ also known as k-means|| #4357

Open
jagankr opened this issue Mar 7, 2015 · 37 comments

Labels: Enhancement, help wanted, Large Scale, Moderate, module:cluster, Performance

Comments

@jagankr commented Mar 7, 2015

I can see the k-means++ implementation here, but do you have a scalable k-means implementation, which is supposed to be better than k-means++?

EDIT (reference):

Scalable K-Means++, Bahman Bahmani et al.
http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf

This algorithm is also known as "k-means||"

@GaelVaroquaux (Member) commented Mar 7, 2015 via email

@jagankr (Author) commented Mar 7, 2015

@GaelVaroquaux (Member) commented Mar 7, 2015 via email

@jagankr (Author) commented Mar 7, 2015

GaelVaroquaux changed the title from "Scalable Kmeans" to "Scalable Kmeans++" Mar 7, 2015
@GaelVaroquaux (Member)

This looks interesting: it is true that kmeans++ doesn't scale well to large datasets. We have found (in my group) that in MiniBatchKMeans, work on the initialization was critical to getting good results. Maybe this paper can improve things.

@jagankr (Author) commented Mar 8, 2015

As I understand it, scalable k-means can be implemented by modifying _k_init(), but I am not able to figure out what needs to be changed. Can you please give me a clue if you have any idea?
Thanks in advance.

@amueller (Member) commented Mar 9, 2015

I had the same issues, @GaelVaroquaux. The paper doesn't seem widely cited, but it is worth a try.
Btw, I found that the init is often the most expensive part of KMeans, and I could use KMeans instead of MiniBatchKMeans once I switched the init from kmeans++ to random.

@amueller (Member) commented Mar 9, 2015

@jagankr I haven't looked at the paper (yet) but you are more likely to get an answer with a more specific question. What is the issue you are having? Is the strategy not clear from the paper?

@amueller (Member) commented Mar 9, 2015

I'm not sure about the last step:

"Recluster the weighted points in C into k clusters"

Does that mean running weighted K-Means? Then we'd need #4218, I think.
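
For context, scikit-learn's KMeans.fit later gained a sample_weight argument (the feature tracked in #4218), so that final recluster step could look like the following sketch. The function name and signature are illustrative, not from the paper or scikit-learn:

```python
# Sketch of the final k-means|| step: recluster the oversampled candidate
# centers C down to k, weighting each candidate by the number of input
# points closest to it. Assumes a scikit-learn version where KMeans.fit
# accepts sample_weight.
from sklearn.cluster import KMeans

def recluster_candidates(C, weights, k, seed=0):
    # C: (m, d) candidate centers with m >> k; weights: (m,) point counts.
    km = KMeans(n_clusters=k, random_state=seed).fit(C, sample_weight=weights)
    return km.cluster_centers_
```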

@jagankr (Author) commented Mar 11, 2015

Will you please give an application of k-means clustering over a large dataset that enables us to compare k-means and k-means++?

@jagankr (Author) commented Mar 11, 2015

By compare, I mean it should show a notable time difference. I think that is possible only with an application that runs over a large dataset.

jagankr closed this as completed Mar 11, 2015
jagankr reopened this Mar 11, 2015
@amueller (Member)

There are some comparisons of minibatch k-means and k-means, right? You can try on 20 newsgroups, or maybe MNIST, or repeated MNIST? We had a function to nudge images, which you could apply to MNIST to generate more data...
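
A minimal timing comparison along those lines, using synthetic blobs instead of MNIST so the snippet stays self-contained (sizes are illustrative):

```python
# Rough comparison of KMeans fit time with k-means++ vs random init.
import time
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100_000, n_features=50, centers=100, random_state=0)
for init in ("k-means++", "random"):
    km = KMeans(n_clusters=100, init=init, n_init=1, random_state=0)
    tic = time.perf_counter()
    km.fit(X)
    print(f"{init}: {time.perf_counter() - tic:.1f}s  inertia={km.inertia_:.3e}")
```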

@zermelozf (Contributor)

Is anyone working on this one in the end? If not, I'd like to give it a try.

@raghavrv (Member)

go for it! :D

@giorgiop (Contributor)

The right reference is this one, with 1200+ citations to date.

@fabianp (Member) commented Oct 22, 2015

An implementation for this is available in PR #5530.

@amueller (Member)

doesn't seem to work :(

@shubham0704

@zermelozf, can I work on it?

Also, here is a summary of why this should work, based on the author's talk about the algorithm (https://www.youtube.com/watch?v=cigXAxV3XcY):

What's wrong with k-means++:

  • It needs k passes over the data; in applications with large data, k can be something like 100 or even 1000.

Intuition for a solution:

After k-means++ chooses the first center, it has a distribution over the data points; it draws one point from that distribution and then updates the distribution.

So what Bahman (the author) says is that, instead of this, we want to oversample the data:

  • We want to pick each point independently, rather than drawing points one at a time from a sequentially updated distribution. This is very useful for a distributed implementation, because each point independently gets to decide whether it should be in the set of centers or not. Intuitively, this corresponds to updating your distribution much less frequently, which is a much coarser sampling. So the way it works is: given a distribution, you pick a lot of points from it, then update your distribution, and from the distribution you get after this step you again pick a lot of points.

The above algorithm is k-means parallel (k-means||). It may not be clear from the algorithm alone that this is actually sufficient, but the authors offer a proof of its soundness. A rough sketch of the sampling loop is shown below.
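
Here is a rough NumPy sketch of that sampling loop, following the paper's description as summarized above. The function and parameter names are mine, not scikit-learn API, and the brute-force distance updates are for clarity, not efficiency:

```python
import numpy as np

def kmeans_parallel_init(X, n_clusters, oversampling=2.0, n_rounds=5, seed=0):
    """Candidate-center sampling of k-means|| (Bahmani et al., 2012)."""
    rng = np.random.default_rng(seed)
    n_samples = X.shape[0]
    # Start from a single uniformly sampled center.
    centers = X[rng.integers(n_samples)][np.newaxis, :]
    # Squared distance of every point to its closest chosen center so far.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    ell = oversampling * n_clusters  # expected number of picks per round
    for _ in range(n_rounds):  # the paper suggests O(log(initial cost)) rounds
        # Each point joins the candidate set *independently*, with
        # probability proportional to its current squared distance.
        keep = rng.random(n_samples) < np.minimum(ell * d2 / d2.sum(), 1.0)
        new = X[keep]
        if len(new) == 0:
            continue
        centers = np.vstack([centers, new])
        # Only the freshly added centers can tighten the distances.
        d2 = np.minimum(d2, ((X[:, None] - new[None]) ** 2).sum(-1).min(axis=1))
    # Weight each candidate by how many points are closest to it; the final
    # step reclusters these weighted candidates down to k (e.g. k-means++).
    labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
    weights = np.bincount(labels, minlength=len(centers))
    return centers, weights
```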

@zermelozf, I would appreciate any input from you, since you put some time into reading about and implementing this. Thanks a lot.

@zermelozf (Contributor)

@shubham0704

Sure. My first attempt at implementing this technique was not successful, but feel free to take over. FYI, the technique I tried to implement was based on parallelism. However, if I remember correctly, the cores may already be busy running different initialisations in parallel.

@shubham0704

Thanks a lot @zermelozf

@shrutim1

Hi,
I wanted to know whether this issue is still open.

@adrinjalali (Member)

There have been a few attempts, as you can see in the thread, @shrutim1, but nobody seems to be working on it ATM. So it's up for grabs.

adrinjalali added the help wanted and Moderate labels Jan 12, 2019
@rrkarim commented Jan 12, 2019

@adrinjalali seems like a good summer project for GSoC.

@shubham0704

@shrutim1: there is a repo you might want to consider looking at, a very nice attempt at this issue. Here is the link.

@shrutim1 commented Jan 15, 2019 via email

@jovo commented Mar 6, 2019

@GaelVaroquaux
@shrutim1
We have this paper: https://arxiv.org/abs/1606.08905
(published in the Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017).

Here is the C++ code: https://github.com/flashxio/knor
and the Python bindings: https://github.com/flashxio/knorPy

Is that interesting/useful?

We've never tried to make a PR to sklearn, but we're certainly open to it...

@GaelVaroquaux (Member) commented Mar 6, 2019 via email

@jovo commented Mar 8, 2019

Hey @GaelVaroquaux, right, we have a more efficient implementation of the k-means++ algorithm, not a new algorithm, so I think, based on this:

"A technique that provides a clear-cut improvement (e.g. an enhanced data structure or a more efficient approximation technique) on a widely-used method will also be considered for inclusion."

our stuff gets the exact same answer as k-means++, just runs much faster.

So it seems like it would be OK?

ogrisel changed the title from "Scalable Kmeans++" to "Scalable Kmeans++ also known as k-means||" May 29, 2020
@ogrisel (Member) commented May 29, 2020

@jovo sorry for the late reply. We are definitely interested in optimizing the implementation of existing methods. However, our code base tries to leverage Python syntax as much as possible to keep it readable by data scientists familiar with Python (our user base).

I would be interested in learning more about the main factors that make your implementation faster.

Another very efficient C++ implementation of k-means++ and k-means|| (a.k.a. "scalable kmeans++") is available in Intel DAAL and daal4py:

https://software.intel.com/sites/products/documentation/doclib/daal/daal-user-and-reference-guides/daal_cpp_api/group__kmeans__init.htm

https://intelpython.github.io/daal4py/algorithms.html?highlight=kmeans#daal4py.kmeans_init

Since DAAL implements both kmeans++ and kmeans||, benchmarking it might help us decide whether or not it's worth investing time in trying to implement kmeans|| (either with Python+numpy or Cython+BLAS) in scikit-learn.
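
For instance, a quick benchmark of the two DAAL initializations could look like this sketch (assuming daal4py is installed; the method names follow the daal4py docs linked above, and the data sizes are illustrative):

```python
# Timing DAAL's k-means++ ("plusPlus") vs k-means|| ("parallelPlus") inits.
import time
import numpy as np
import daal4py as d4p

X = np.random.rand(1_000_000, 50)
for method in ("plusPlusDense", "parallelPlusDense"):
    algo = d4p.kmeans_init(100, method=method)  # 100 initial centroids
    tic = time.perf_counter()
    centroids = algo.compute(X).centroids
    print(f"{method}: {centroids.shape} in {time.perf_counter() - tic:.2f}s")
```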

@jeremiedbb also experimented with a subsampled variant of kmeans++ that seems to give very good results.

@disa-mhembere

@ogrisel, regarding the performance @jovo mentioned: the optimizations we implement significantly improve the performance of kmeans++ on multicore NUMA machines running the algorithm in parallel. No performance improvement is made when running serially or on single-core machines. We rely on NUMA memory allocation policies, NUMA-sensitive dynamic task dispatching, and NUMA-sensitive thread binding policies, among other optimizations. The optimizations are performed in low-level C, but can be exposed to a Python interface through bindings. We describe the framework we developed to support this here.

@jovo commented Aug 25, 2020

@disa-mhembere I thought you also had a parallelized implementation of k-means++, which, at the time, was not otherwise implemented anywhere? And also some improvements to k-means algorithms via leveraging the triangle inequality more effectively than prior implementations?

@ogrisel (Member) commented Dec 16, 2020

I made a summary of our current understanding of the perf issues of k-means++ in this discussion: #10924 (comment).

Before considering k-means||, I would like to finalize the implementation improvement presented there, and maybe the algorithmic improvement based on subsampling.

All of them should be easy to maintain and should yield >95% of the perf benefit of more complex methods for >95% of our users.
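
For reference, a minimal sketch of the subsampling idea as I understand it (illustrative sizes, not @jeremiedbb's actual experiment): run the standard k-means++ seeding on a uniform subsample, then use the resulting centers to initialize the full fit.

```python
# Subsampled k-means++ init: seed on a subsample, fit on the full data.
import numpy as np
from sklearn.cluster import KMeans, kmeans_plusplus
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000_000, centers=100, random_state=0)
rng = np.random.default_rng(0)
idx = rng.choice(len(X), size=50_000, replace=False)  # uniform subsample
centers, _ = kmeans_plusplus(X[idx], n_clusters=100, random_state=0)
km = KMeans(n_clusters=100, init=centers, n_init=1).fit(X)
```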

@lorentzenchr (Member)

After so many years, the papers have enough citations. But I don't know whether they have already been superseded.
@jeremiedbb @jjerphan @ogrisel What is your opinion? Close, or set to "help wanted"?

@jjerphan (Member) commented May 2, 2023

I think that one of scikit-learn's goals is to keep its usage simple: to me, there's value in providing only a few relevant and efficient implementations, so I wonder whether a new initialization method would fulfill this goal.

I am not against a new variant, but I would only consider accepting it if an implementation exists and has been shown to be competitive with scikit-learn's implementation of kmeans++ (hence I would label this issue as "Help Wanted").

What do other maintainers think?

@betatim (Member) commented May 2, 2023

I'd label it as "help wanted". cuML supports the "scalable kmeans++" init: https://docs.rapids.ai/api/cuml/stable/api/#k-means-clustering

@GaelVaroquaux (Member) commented May 10, 2023 via email

@ogrisel (Member) commented Sep 27, 2023

Before implementing kmeans||, I think we should optimize our existing implementation of k-means++ using Cython and possibly prange/OpenMP, as we already did a while ago for the subsequent main loop of KMeans.fit.
