
Simplify Elkan k-means? #10924


Open

amueller opened this issue Apr 5, 2018 · 18 comments

@amueller
Member

amueller commented Apr 5, 2018

I found this cool paper that says that removing some of the tests in Elkan's algorithm can speed it up:
http://proceedings.mlr.press/v48/newling16.pdf

The paper also looks at a simplified Yinyang algorithm, which I still think is interesting (apparently mostly for low-dimensional spaces and when Elkan's algorithm needs too much RAM).
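
For context, here is a minimal sketch of the per-point triangle-inequality tests in Elkan's original algorithm; these are the kind of tests the paper proposes removing when they cost more than the distance computations they save. The function name and bound bookkeeping are illustrative only (this omits the bound updates after centers move), not scikit-learn's implementation:

```python
import numpy as np

def elkan_assign_step(X, centers, upper, lower, labels):
    # upper[i]: upper bound on d(x_i, centers[labels[i]])
    # lower[i, j]: lower bound on d(x_i, centers[j])
    center_dist = np.linalg.norm(
        centers[:, None, :] - centers[None, :, :], axis=2)
    np.fill_diagonal(center_dist, np.inf)
    s = 0.5 * center_dist.min(axis=1)  # half-distance to nearest other center
    for i in range(X.shape[0]):
        c = labels[i]
        if upper[i] <= s[c]:
            continue  # no other center can possibly be closer
        tight = False
        for j in range(centers.shape[0]):
            if j == c:
                continue
            # Skip j if either bound proves it cannot beat center c.
            if upper[i] <= lower[i, j] or upper[i] <= 0.5 * center_dist[c, j]:
                continue
            if not tight:  # tighten the upper bound once, lazily
                upper[i] = np.linalg.norm(X[i] - centers[c])
                tight = True
            d = np.linalg.norm(X[i] - centers[j])
            lower[i, j] = d
            if d < upper[i]:
                c, upper[i] = j, d
        labels[i] = c
    return labels
```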

@mohamed-ali
Contributor

Issue #10744 also discusses K-means improvements. So, solving this one might close the other as well.

@jnothman
Member

jnothman commented Apr 9, 2018 via email

@amueller
Member Author

amueller commented Apr 9, 2018

@jnothman haha no worries ;) What do you think about it?

@jnothman
Member

I can't remember my discussions with the author, but the argument for simplification in 4.1.2 and its justification as being BLAS-optimised are compelling.
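
For reference, the BLAS-friendly trick in question rewrites ||x − c||² = ||x||² − 2⟨x, c⟩ + ||c||², so that the cross term becomes a single matrix multiply (GEMM). A minimal NumPy sketch with an illustrative name (this is also the expansion whose accuracy loss kno10 mentions further down the thread):

```python
import numpy as np

def squared_distances_blas(X, C):
    # ||x - c||^2 = ||x||^2 - 2 <x, c> + ||c||^2; the cross term X @ C.T
    # is a single GEMM call, so optimized BLAS does the heavy lifting.
    x_sq = np.einsum('ij,ij->i', X, X)[:, None]
    c_sq = np.einsum('ij,ij->i', C, C)[None, :]
    d2 = x_sq - 2.0 * (X @ C.T) + c_sq
    np.maximum(d2, 0.0, out=d2)  # cancellation can yield tiny negatives
    return d2
```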

@aishgrt1
Contributor

Shall I take this?

@jnothman
Member

It's very technical. I would expect you'd need more practice before trying this one.

@aishgrt1
Contributor

Sure!

@mohamed-ali
Contributor

I think it's worth mentioning that the authors shared an implementation here: https://github.com/idiap/eakmeans.

@virajmavani
Contributor

Shall I take this issue? You can check my GitHub profile and research at: https://scholar.google.co.in/citations?user=WG3xUOQAAAAJ&hl=en

@jnothman
Member

Usually we'd ask new contributors to start with something smaller. But if you feel comfortable with the code, go ahead.

@kno10
Contributor

kno10 commented Oct 8, 2018

If you want to make k-means really fast, you need to go for a k-d-tree based algorithm, because these can assign more than one point at a time:

D. Pelleg and A. Moore. “Accelerating exact k-means algorithms with geometric reasoning”. In: Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, 1999, pp. 277–281.

T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu. “An efficient k-means clustering algorithm: Analysis and implementation”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 24.7 (2002), pp. 881–892.
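
As a point of comparison, even the simpler inverted use of a k-d tree (tree built over the centers, one query per point) avoids touching all n·k distances in low dimensions; the Pelleg-Moore filtering algorithm in the references goes further by building the tree over the points and assigning whole tree nodes at once. A minimal sketch with SciPy, which is not the filtering algorithm itself:

```python
import numpy as np
from scipy.spatial import cKDTree

def assign_with_kdtree(X, centers):
    # Nearest-center assignment via a k-d tree over the k centers.
    # Effective in low dimensions; degrades toward brute force as d grows.
    tree = cKDTree(centers)
    dist, labels = tree.query(X)  # one nearest-neighbor query per point
    return labels, dist
```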

The algorithm by Hartigan and Wong also appears to be very good and underappreciated, and it may give better results even from the same starting conditions. It would be interesting to see how often this happens (it probably makes only a tiny difference; one could combine this with checking the accuracy loss from the binomial expansion). Unfortunately, it is missing from many benchmarks.

J. A. Hartigan and M. A. Wong. “Algorithm AS 136: A k-means clustering algorithm”. In: Applied Statistics 28.1 (1979), pp. 100–108.
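
For reference, a minimal sketch of the reassignment criterion that distinguishes Hartigan-Wong from Lloyd-style k-means (the function name is illustrative): Lloyd compares plain distances, while Hartigan-Wong weights the squared distances by cluster size, which is why it can reach a better local optimum from the same start.

```python
import numpy as np

def hartigan_wong_gain(x, c_src, n_src, c_dst, n_dst):
    # Change in total SSE if point x moves from its current cluster
    # (center c_src, size n_src >= 2) to another (center c_dst, size n_dst).
    # Removing x shrinks its cluster's SSE by n/(n-1) * ||x - c_src||^2;
    # adding it grows the other cluster's SSE by m/(m+1) * ||x - c_dst||^2.
    removal_gain = n_src / (n_src - 1.0) * ((x - c_src) ** 2).sum()
    insertion_cost = n_dst / (n_dst + 1.0) * ((x - c_dst) ** 2).sum()
    return insertion_cost - removal_gain  # move x whenever this is negative
```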

@amueller
Member Author

amueller commented Jun 6, 2020

@kno10 kd-trees only work well in low-dimensional spaces though, right?

@parthsuresh

I would like to work on this issue.

@amueller
Member Author

amueller commented Jun 6, 2020

@kno10 So Pelleg shows that in 6 or more dimensions the naive algorithm is faster than theirs, and Kanungo doesn't report runtimes in more than 4 dimensions. That makes sense: using this kind of data structure pays off in low dimensions, but the work I'm referencing above also works in higher dimensions.

@kno10
Contributor

kno10 commented Jun 9, 2020

Yes, the k-d-tree based approaches do not seem to be competitive on high-dimensional data. But on low-dimensional data they can finish in fewer than N*k distance computations (a literal count, not O notation). K-means++ initialization can take longer than these algorithms on large low-dimensional data.
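
To make that comparison concrete, here is a minimal NumPy sketch of k-means++ seeding (a hypothetical helper, not scikit-learn's implementation). Each of the k rounds computes distances from all n points to the newly chosen center, so seeding alone costs on the order of n·k distance computations:

```python
import numpy as np

def kmeans_pp(X, k, rng):
    # k-means++: pick the next center with probability proportional to
    # the squared distance to the nearest center chosen so far.
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        # Update squared distance to the nearest chosen center: O(n) work
        # per round, hence ~n*k distance computations overall.
        d2 = np.minimum(d2, ((X - centers[-1]) ** 2).sum(axis=1))
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

# Usage: centers = kmeans_pp(X, k=8, rng=np.random.default_rng(0))
```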

@ogrisel
Member

ogrisel commented Dec 16, 2020

> K-means++ initialization can take longer than these algorithms on large low-dimensional data.

Our current k-means++ implementation is far from optimal, for several reasons:

1- There is a problem with repeated input checks when computing distances; these checks are useless, sequential, and memory bound: #19002
2- Even without the input checks, the combo "euclidean distances followed by np.min" is partially memory bound, because it allocates the distances into a large temporary array that does not fit in CPU caches before taking the min. np.min is also sequential. We could fix this by computing the distances by chunks and computing the min on the fly (see the sketch after this list), either in pure Python or in a Cython helper function (which could also process the chunks in parallel using prange). This Cython helper could come from a refactoring of the helper used in the main k-means loop, which already does distances followed by argmin in chunks (see the work done in #11950).

On top of those implementation improvements, there is also a potential algorithmic improvement:

3- We can also sub-sample the data used for k-means++ seeding, which seems to be quite safe and can lead to a significant further scalability improvement for very large datasets, according to #11493.
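
A minimal sketch of the chunked distances + on-the-fly min idea from point 2. `min_dist_chunked` is a hypothetical name, and a real fix would live in a Cython helper with prange rather than this pure-NumPy version:

```python
import numpy as np

def min_dist_chunked(X, centers, chunk=256):
    # Distance to the nearest center, computed chunk by chunk so the
    # (chunk, k) temporary stays in CPU cache, instead of materializing
    # the full (n, k) distance matrix before reducing with np.min.
    n = X.shape[0]
    out = np.empty(n)
    c_sq = (centers ** 2).sum(axis=1)  # ||c||^2, reused for every chunk
    for start in range(0, n, chunk):
        Xc = X[start:start + chunk]
        d2 = (Xc ** 2).sum(axis=1)[:, None] - 2.0 * (Xc @ centers.T) + c_sq
        out[start:start + chunk] = np.sqrt(np.maximum(d2, 0.0).min(axis=1))
    return out
```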

@ogrisel
Member

ogrisel commented Dec 16, 2020

Once those are fixed, we can also explore again the opportunity to use k-d trees for low-dimensional data, but it's better to compare against a strong baseline than against the current code in master.

@ogrisel
Member

ogrisel commented Dec 16, 2020

Also, simplifying Elkan (as referenced in the issue description) sounds like a good idea anyway.
