-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[WIP] Scalable K-means ++ #8585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
IMHO, there is two important things to be done for you current PR:
You might want to avoid making code outside of scikit-learn since the core developers will have limited time to play with it. More importantly, plots and other results from those scripts can be posted directly inside the PR. It will greatly ease the follow-up of the PR. |
Sure.Thanks a lot @glemaitre . |
@@ -191,7 +301,8 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto', | |||
centroid seeds. The final results will be the best output of | |||
n_init consecutive runs in terms of inertia. | |||
|
|||
init : {'k-means++', 'random', or ndarray, or a callable}, optional | |||
init : {'k-means++', 'k-means||', 'random', or ndarray, or a callable}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add a description below of what kmeans|| is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add.Thanks a lot @jmschrei
For easy follow up the algorithm: This is a small benchmark test for an artificial data set. Basically the same one as the one in previous pr.
OBSERVATIONS:
Code for plot is in this (gist) |
Hi, Could you re-post with clearer labelled plots. Which one is kmeans (master) and which one is the kmeans++? Also a bigger plot please... And did you try kmeans (current implementation) with |
@raghavrv Thanks for taking a look at this. For clarity these are my benchmarks results in a tabular format-
These graphs say the moment I over sample (take extra candidate centres ) time taken shoots up. I will post proper graphs shortly. Thanks. : ) |
So the kmeans|| is quite slow and the benchmarks are as expected from the conclusions of #5530 or am I missing something? If that is the case, I think it would be wise to not expend energy on it then.. (Please feel free to let me know if there is something I'm missing that would make addition of kmeans|| useful) |
I would like to suggest that this is just a sequential implementation. I am sure making it parallel would reduce the time taken. This pr was put to actually show implementation of the algorithm and that it works although slow. I started to work on it because I once wrote a search handler that had some very long code, I made it into functions, made it work async. This made search handler very fast. I am not sure how fast this can go but given a chance I would like to work on it. |
I was just suggesting... Please feel free to pursue further if you believe time and effort spent on it would yield a benefit to you (in terms of your PR getting merged) and to the community (in terms of the PR being significantly faster for significant use cases)... |
Also this PR is not cythonized while the one you are comparing to is... So cythonizing your implementation could also speed up maybe... |
Thanks a lot @raghavrv . |
I would like to add it as my GSOC proposal.Is it possible? |
I guess it would not be worth it . Anyways learnt a lot. Thanks a lot for all the reviews. @glemaitre , @raghavrv , @jmschrei . |
Reference Issue
Fix #4357
What does this implement/fix? Explain your changes.
This implementation fixes the need for a scalable k-means algorithm. Currently it is a sequential version but on approval work will start on making a parallel version.
A to-do list -
Any other comments?
I have done some commenting which may feel unnecessary but this is only for maintainer to easily understand my code. This version need some refinement but it works in general. For implementation details with plotting kindly see : https://github.com/shubham0704/scalable_kmeans In this run:
requirements before running plot:numpy and matplotlib