Simplify Elkan k-means? #10924
Issue #10744 also discusses k-means improvements, so solving this one might close the other as well.
Yes, I visited this poster. I think I forgot to ping you about it, @amueller.
@jnothman haha no worries ;) What do you think about it?
I can't remember my discussions with the author, but the argument for simplification in section 4.1.2, and its justification as being BLAS-optimised, are compelling.
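For anyone picking this up, here is a minimal numpy sketch (my own illustrative code, not the paper's) of why the dense formulation is BLAS-friendly: the cross term of the expanded squared distance is a single matrix product.

```python
import numpy as np

def all_pairs_sq_dists(X, centers):
    """Squared distances from every point to every center via the
    expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the cross term
    is one BLAS-backed matrix product (GEMM)."""
    x_sq = np.einsum("ij,ij->i", X, X)[:, None]              # ||x||^2, shape (n, 1)
    c_sq = np.einsum("ij,ij->i", centers, centers)[None, :]  # ||c||^2, shape (1, k)
    cross = X @ centers.T                                    # the BLAS-heavy part
    return np.maximum(x_sq - 2.0 * cross + c_sq, 0.0)        # clip round-off negatives

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))
centers = X[rng.choice(len(X), 8, replace=False)]
labels = all_pairs_sq_dists(X, centers).argmin(axis=1)
```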
Shall I take this?
It's very technical; I would expect you to need more practice before trying this one.
Sure!
I think it's worth mentioning that the authors shared an implementation here: https://github.com/idiap/eakmeans.
Shall I take this issue? You can check my GitHub profile and research at: https://scholar.google.co.in/citations?user=WG3xUOQAAAAJ&hl=en
Usually we'd ask new contributors to start with something smaller, but if you feel comfortable with the code, go ahead.
If you want to make k-means really fast, you need to go for a k-d-tree based algorithm, because these can assign more than one point at a time. The algorithm by Hartigan and Wong also appears to be very good and underappreciated, and may give better results even from the same starting conditions. It would be interesting to see how often that happens (it probably makes only a tiny difference; you could combine this with checking for the accuracy loss from the binomial expansion). Unfortunately, it is missing from many benchmarks.
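To make the Hartigan and Wong point concrete, here is a rough sketch of its move criterion, assuming squared-Euclidean SSE (my own illustrative code, not theirs): a point only moves when the move strictly lowers the total within-cluster sum of squares, which is a stronger condition than Lloyd's nearest-center reassignment.

```python
import numpy as np

def hartigan_wong_pass(X, labels, centers, counts):
    """One sweep of Hartigan-Wong style single-point moves: reassign a
    point only if doing so lowers the overall within-cluster SSE."""
    for i, x in enumerate(X):
        src = labels[i]
        if counts[src] <= 1:
            continue
        d = ((centers - x) ** 2).sum(axis=1)
        # SSE gain from removing x from its cluster vs. cost of adding
        # it to each other cluster (size-weighted, per Hartigan-Wong).
        remove_gain = counts[src] / (counts[src] - 1) * d[src]
        add_cost = counts / (counts + 1) * d
        add_cost[src] = np.inf
        dst = int(add_cost.argmin())
        if add_cost[dst] < remove_gain:
            # Incrementally update both centroids and cluster sizes.
            centers[src] = (centers[src] * counts[src] - x) / (counts[src] - 1)
            centers[dst] = (centers[dst] * counts[dst] + x) / (counts[dst] + 1)
            counts[src] -= 1
            counts[dst] += 1
            labels[i] = dst
    return labels, centers, counts
```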
@kno10 kd-trees only work well in low-dimensional spaces though, right?
I would like to work on this issue.
@kno10 So Pelleg shows that in 6 or more dimensions the naive algorithm is faster than theirs, and Kanungo doesn't report runtimes in more than 4 dimensions. That makes sense: using these kinds of data structures in low dimensions is a natural fit, but the work I'm referencing above also works in higher dimensions.
Yes, the k-d-tree based approaches do not seem to be competitive on high-dimensional data. But on low-dimensional data, they can run in fewer than N*k distance computations (an absolute count, not O notation). K-means++ initialization can take longer than these algorithms on large low-dimensional data.
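As a rough illustration of the low-dimensional case, the assignment step can be done with scipy's cKDTree. Note this indexes the centers and queries each point, the simple variant, not the Pelleg-Moore filtering algorithm that indexes the points.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 3))                  # large, low-dimensional data
centers = X[rng.choice(len(X), 50, replace=False)]

# Assignment step: one nearest-neighbour query per point instead of
# computing all N * k distances explicitly.
dists, labels = cKDTree(centers).query(X, k=1)
```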
Our current k-means++ implementation is far from optimal, for several reasons. 1- There is a problem with repeated input checks when computing distances that is useless, sequential, and memory-bound: #19002. On top of those implementation improvements, there is also a potential algorithmic improvement: 3- we can sub-sample the data used for k-means++, which seems to be quite safe and can lead to significant further scalability improvements for very large datasets, according to #11493.
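For reference, the sub-sampling idea is cheap to prototype with the public `kmeans_plusplus` helper; below is a minimal sketch (the function name, subsample size, and seed are illustrative choices of mine, not a recommendation):

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

def subsampled_kmeanspp(X, n_clusters, subsample=100_000, seed=0):
    """Run k-means++ seeding on a uniform subsample of X, so the
    seeding cost scales with the subsample size instead of len(X)."""
    rng = np.random.default_rng(seed)
    if len(X) > subsample:
        X = X[rng.choice(len(X), size=subsample, replace=False)]
    centers, _ = kmeans_plusplus(X, n_clusters=n_clusters, random_state=seed)
    return centers
```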
Once those are fixed, we can explore the opportunity to use k-d trees for low-dimensional data again, but it's better to benchmark against a strong baseline than against the current code in master.
Also, simplifying Elkan's algorithm (as referenced in the issue description) sounds like a good idea anyway.
I found this cool paper that says that removing some of the tests in Elkan's algorithm can speed it up:
http://proceedings.mlr.press/v48/newling16.pdf
The paper also looks at a simplified Yinyang algorithm, which I still think is interesting (apparently mostly for low-dimensional spaces, and when Elkan's algorithm needs too much RAM).
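For context, these are the kinds of per-center tests being pruned. Below is a plain-Python sketch of Elkan's bound-based assignment step (illustrative only, nothing like scikit-learn's Cython implementation; it assumes the upper/lower bounds are maintained across iterations in the usual way):

```python
import numpy as np

def elkan_assign(X, centers, labels, upper, lower):
    """One Elkan-style assignment pass. upper[i] is an upper bound on
    the distance from X[i] to its current center; lower[i, j] is a
    lower bound on the distance from X[i] to center j."""
    cc = np.sqrt(((centers[:, None] - centers[None]) ** 2).sum(-1))  # center-center dists
    s = 0.5 * np.where(cc > 0, cc, np.inf).min(axis=0)  # half dist to nearest other center
    for i in range(len(X)):
        c = labels[i]
        if upper[i] <= s[c]:          # one global test can skip the point entirely
            continue
        tight = False
        for j in range(len(centers)):
            if j == c:
                continue
            # The per-center triangle-inequality tests that Newling &
            # Fleuret argue can partly be dropped for speed:
            if upper[i] <= lower[i, j] or upper[i] <= 0.5 * cc[c, j]:
                continue
            if not tight:             # tighten the upper bound once per point
                upper[i] = np.linalg.norm(X[i] - centers[c])
                tight = True
                if upper[i] <= lower[i, j] or upper[i] <= 0.5 * cc[c, j]:
                    continue
            d = np.linalg.norm(X[i] - centers[j])
            lower[i, j] = d
            if d < upper[i]:          # genuinely closer: reassign
                labels[i] = c = j
                upper[i] = d
    return labels
```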