
Simplify Elkan k-means? #10924


Open

amueller opened this issue Apr 5, 2018 · 18 comments

@amueller
Member

amueller commented Apr 5, 2018

I found this cool paper that says that removing some of the tests in Elkan's algorithm can speed it up:
http://proceedings.mlr.press/v48/newling16.pdf

The paper also looks at a simplified Yinyang algorithm, which I still think is interesting (apparently mostly for low-dimensional spaces and when Elkan's algorithm needs too much RAM).
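
For context, here is a minimal sketch of the per-point triangle-inequality tests in Elkan's original algorithm; these are the kind of tests the paper proposes removing when they cost more than the distance computations they save. The function name and bound bookkeeping are illustrative only (this omits the bound updates after centers move), not scikit-learn's implementation:

```python
import numpy as np

def elkan_assign_step(X, centers, upper, lower, labels):
    # upper[i]: upper bound on d(x_i, centers[labels[i]])
    # lower[i, j]: lower bound on d(x_i, centers[j])
    center_dist = np.linalg.norm(
        centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(center_dist, np.inf)
    s = 0.5 * center_dist.min(axis=1)  # half-distance to nearest other center
    for i in range(X.shape[0]):
        c = labels[i]
        if upper[i] <= s[c]:
            continue  # no other center can possibly be closer
        tight = False
        for j in range(centers.shape[0]):
            if j == c:
                continue
            # Skip j if either bound proves it cannot beat center c.
            if upper[i] <= lower[i, j] or upper[i] <= 0.5 * center_dist[c, j]:
                continue
            if not tight:  # tighten the upper bound once, lazily
                upper[i] = np.linalg.norm(X[i] - centers[c])
                tight = True
            d = np.linalg.norm(X[i] - centers[j])
            lower[i, j] = d
            if d < upper[i]:
                c, upper[i] = j, d
        labels[i] = c
    return labels
```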

@mohamed-ali
Contributor

Issue #10744 also discusses K-means improvements. So, solving this one might close the other as well.

@jnothman
Member

jnothman commented Apr 9, 2018 via email

@amueller
Member Author

amueller commented Apr 9, 2018

@jnothman haha no worries ;) What do you think about it?

@jnothman
Member

I can't remember my discussions with the author, but the argument for simplification in 4.1.2 and its justification as being BLAS-optimised are compelling.
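
For reference, the BLAS-friendly trick in question rewrites ||x − c||² = ||x||² − 2⟨x, c⟩ + ||c||², so that the cross term becomes a single matrix multiply (GEMM). A minimal NumPy sketch with an illustrative name (this is also the expansion whose accuracy loss kno10 mentions further down the thread):

```python
import numpy as np

def squared_distances_blas(X, C):
    # ||x - c||^2 = ||x||^2 - 2 <x, c> + ||c||^2; the cross term X @ C.T
    # is a single GEMM call, so optimized BLAS does the heavy lifting.
    x_sq = np.einsum('ij,ij->i', X, X)[:, None]
    c_sq = np.einsum('ij,ij->i', C, C)[None, :]
    d2 = x_sq - 2.0 * (X @ C.T) + c_sq
    np.maximum(d2, 0.0, out=d2)  # cancellation can yield tiny negatives
    return d2
```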

@aishgrt1
Contributor

Shall I take this?

@jnothman
Member

It's very technical. I would expect you'd need more practice before trying this one.

@aishgrt1
Contributor

Sure!

@mohamed-ali
Contributor

I think it's worth mentioning that the authors shared an implementation here: https://github.com/idiap/eakmeans.

@virajmavani
Contributor

Shall I take this issue? You can check my GitHub profile and research at: https://scholar.google.co.in/citations?user=WG3xUOQAAAAJ&hl=en

@jnothman
Member

Usually we'd ask new contributors to start with something smaller. But if you feel comfortable with the code, go ahead.

@kno10
Contributor

kno10 commented Oct 8, 2018

If you want to make k-means really fast, you need to go for a k-d-tree based algorithm, because these can assign more than one point at a time:

D. Pelleg and A. Moore. “Accelerating exact k-means algorithms with geometric reasoning”. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, 1999, pp. 277–281.

T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. “An efficient k-means clustering algorithm: Analysis and implementation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24.7 (2002), pp. 881–892.
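
As a point of comparison, even the simpler inverted use of a k-d tree (tree built over the centers, one query per point) avoids touching all n·k distances in low dimensions; the Pelleg-Moore filtering algorithm in the references goes further by building the tree over the points and assigning whole tree nodes at once. A minimal sketch with SciPy, which is not the filtering algorithm itself:

```python
import numpy as np
from scipy.spatial import cKDTree

def assign_with_kdtree(X, centers):
    # Nearest-center assignment via a k-d tree over the k centers.
    # Effective in low dimensions; degrades toward brute force as d grows.
    tree = cKDTree(centers)
    dist, labels = tree.query(X)  # one nearest-neighbor query per point
    return labels, dist
```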

The algorithm by Hartigan and Wong also appears to be very good and underappreciated, and it may give better results even from the same starting conditions. It would be interesting to see how often this happens (it probably makes only a tiny difference; one could combine this with checking the accuracy loss from the binomial expansion). Unfortunately, it is missing from many benchmarks.

J. A. Hartigan and M. A. Wong. “Algorithm AS 136: A k-means clustering algorithm”. In: Applied Statistics 28.1 (1979), pp. 100–108.
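
For reference, a minimal sketch of the reassignment criterion that distinguishes Hartigan-Wong from Lloyd-style k-means (the function name is illustrative): Lloyd compares plain distances, while Hartigan-Wong weights the squared distances by cluster size, which is why it can reach a better local optimum from the same start.

```python
import numpy as np

def hartigan_wong_gain(x, c_src, n_src, c_dst, n_dst):
    # Change in total SSE if point x moves from its current cluster
    # (center c_src, size n_src >= 2) to another (center c_dst, size n_dst).
    # Removing x shrinks its cluster's SSE by n/(n-1) * ||x - c_src||^2;
    # adding it grows the other cluster's SSE by m/(m+1) * ||x - c_dst||^2.
    removal_gain = n_src / (n_src - 1.0) * ((x - c_src) ** 2).sum()
    insertion_cost = n_dst / (n_dst + 1.0) * ((x - c_dst) ** 2).sum()
    return insertion_cost - removal_gain  # move x whenever this is negative
```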

@amueller
Member Author

amueller commented Jun 6, 2020

@kno10 kd-trees only work well in low-dimensional spaces though, right?

@parthsuresh

I would like to work on this issue.

@amueller
Member Author

amueller commented Jun 6, 2020

@kno10 So Pelleg shows that in 6 or more dimensions the naive algorithm is faster than theirs, and Kanungo doesn't report runtimes in more than 4 dimensions. That makes sense: using this kind of data structure pays off in low dimensions, but the work I'm referencing above also works in higher dimensions.

@kno10
Contributor

kno10 commented Jun 9, 2020

Yes, the k-d-tree based approaches do not seem to be competitive on high-dimensional data. But on low-dimensional data they can finish in fewer than N*k distance computations (a literal count, not O notation). K-means++ initialization can take longer than these algorithms on large low-dimensional data.
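
To make that comparison concrete, here is a minimal NumPy sketch of k-means++ seeding (a hypothetical helper, not scikit-learn's implementation). Each of the k rounds computes distances from all n points to the newly chosen center, so seeding alone costs on the order of n·k distance computations:

```python
import numpy as np

def kmeans_pp(X, k, rng):
    # k-means++: pick the next center with probability proportional to
    # the squared distance to the nearest center chosen so far.
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        # Update squared distance to the nearest chosen center: O(n) work
        # per round, hence ~n*k distance computations overall.
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

# Usage: centers = kmeans_pp(X, k=8, rng=np.random.default_rng(0))
```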

@ogrisel
Member

ogrisel commented Dec 16, 2020

> K-means++ initialization can take longer than these algorithms on large low-dimensional data.

Our current k-means++ implementation is far from optimal, for several reasons:

1- There is a problem with repeated input checks when computing distances; these checks are useless, sequential, and memory bound: #19002
2- Even without the input checks, the combo "euclidean distances followed by np.min" is partially memory bound, because it allocates the distances into a large temporary array that does not fit in CPU caches before taking the min. np.min is also sequential. We could fix this by computing the distances by chunks and computing the min on the fly (see the sketch after this list), either in pure Python or in a Cython helper function (which could also process the chunks in parallel using prange). This Cython helper could come from a refactoring of the helper used in the main k-means loop, which already does distances followed by argmin in chunks (see the work done in #11950).

On top of those implementation improvements, there is also a potential algorithmic improvement:

3- We can also sub-sample the data used for k-means++ seeding, which seems to be quite safe and can lead to a significant further scalability improvement for very large datasets, according to #11493.
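
A minimal sketch of the chunked distances + on-the-fly min idea from point 2. `min_dist_chunked` is a hypothetical name, and a real fix would live in a Cython helper with prange rather than this pure-NumPy version:

```python
import numpy as np

def min_dist_chunked(X, centers, chunk=256):
    # Distance to the nearest center, computed chunk by chunk so the
    # (chunk, k) temporary stays in CPU cache, instead of materializing
    # the full (n, k) distance matrix before reducing with np.min.
    n = X.shape[0]
    out = np.empty(n)
    c_sq = (centers ** 2).sum(axis=1)  # ||c||^2, reused for every chunk
    for start in range(0, n, chunk):
        Xc = X[start:start + chunk]
        d2 = (Xc ** 2).sum(axis=1)[:, None] - 2.0 * (Xc @ centers.T) + c_sq
        out[start:start + chunk] = np.sqrt(np.maximum(d2, 0.0).min(axis=1))
    return out
```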

@ogrisel
Member

ogrisel commented Dec 16, 2020

Once those are fixed, we can also explore again the opportunity to use k-d trees for low-dimensional data, but it's better to compare against a strong baseline than against the current code in master.

@ogrisel
Member

ogrisel commented Dec 16, 2020

Also, simplifying Elkan (as referenced in the issue description) sounds like a good idea anyway.
