[WIP] Scalable K-means ++ #8585


Closed
wants to merge 5 commits

Conversation


@shubham0704 shubham0704 commented Mar 14, 2017

Reference Issue

Fix #4357

What does this implement/fix? Explain your changes.

This implementation addresses the need for a scalable k-means algorithm. It is currently a sequential version; on approval, work will start on a parallel version.
A to-do list:

  • Test it in scikit-learn library
  • Make a parallel version and test it
  • Do bench-marking
  • Write tests
  • Documentation work

Any other comments?

I have added some comments that may feel unnecessary, but they are only there to help maintainers understand my code easily. This version needs some refinement, but it works in general. For implementation details with plotting, kindly see: https://github.com/shubham0704/scalable_kmeans In this run:

  • For testing, take test_script.py from my repo and paste it into the scikit-learn folder after generating the binaries.
  • For the plot, it is enough to clone my repo and run sample_plot.py.
    Requirements before running the plot: numpy and matplotlib.

@glemaitre
Member

glemaitre commented Mar 16, 2017

IMHO, there are two important things to be done for your current PR:

  • for just testing take test_script.py from my repo and paste it in sckit-learn folder after generating binaries
  • for plot it is enough to clone my repo and run sample_plot.py (for plot) requirements before running plot:numpy and matplotlib

You might want to avoid making code outside of scikit-learn since the core developers will have limited time to play with it. More importantly, plots and other results from those scripts can be posted directly inside the PR. It will greatly ease the follow-up of the PR.

@shubham0704
Author

Sure. Thanks a lot @glemaitre.

@@ -191,7 +301,8 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.

-    init : {'k-means++', 'random', or ndarray, or a callable}, optional
+    init : {'k-means++', 'k-means||', 'random', or ndarray, or a callable},
Member


Need to add a description below of what kmeans|| is

Author


Sure, will add. Thanks a lot @jmschrei

@shubham0704
Author

For easy follow-up, here is the algorithm:
[image: scalable k-means (k-means||) algorithm]
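Since the algorithm itself is only shown as an image above, here is a minimal NumPy sketch of the k-means|| (scalable k-means++) initialisation from Bahmani et al., with my own function and parameter names (this is not the PR's code; a full implementation would run weighted k-means++ on the candidates in the last step):

```python
import numpy as np

def kmeans_parallel_init(X, n_clusters, oversampling=None, n_rounds=5, seed=None):
    """Sketch of k-means|| (scalable k-means++) initialisation.

    `oversampling` is the expected number of candidates drawn per round
    (the paper's l); Bahmani et al. suggest it be on the order of k.
    """
    rng = np.random.default_rng(seed)
    if oversampling is None:
        oversampling = 2 * n_clusters
    # Step 1: one centre chosen uniformly at random.
    centers = X[rng.integers(len(X))][None, :]
    for _ in range(n_rounds):
        # D(x, C)^2: squared distance of each point to its nearest centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        # Step 3: each point joins the candidate set independently with
        # probability proportional to its distance (capped at 1).
        probs = np.minimum(1.0, oversampling * d2 / d2.sum())
        picked = X[rng.random(len(X)) < probs]
        if len(picked):
            centers = np.vstack([centers, picked])
    # Step 5: weight each candidate by the number of points closest to it,
    # then reduce the (small) candidate set to n_clusters centres. Taking
    # the heaviest candidates is a crude stand-in for weighted k-means++.
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    weights = np.bincount(assign, minlength=len(centers))
    return centers[np.argsort(weights)[-n_clusters:]]
```

Because each point decides independently whether to become a candidate, the per-round sampling is exactly the part that can later be parallelized.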

This is a small benchmark test on an artificial data set, basically the same one as in the previous PR.

from sklearn import datasets

X, y = datasets.make_blobs(n_samples=10000,
                           n_features=20,
                           centers=15,
                           cluster_std=1)

[screenshot: benchmark output, 2017-03-19]

OBSERVATIONS:

  • Time taken to calculate centres increases with the sampling factor. This happens because the set of candidate centres grows each time we calculate D(x, c) in the main for-loop (Step 3). We need to parallelize the computation of each point's distance to the candidate centres, which can easily be done since each point selects its nearest candidate independently.
  • For Step 5, we have to run (# n_clusters) for-loops in parallel.
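The independence noted in the first bullet is what lets the D(x, c) step be vectorised instead of looped over points. A minimal NumPy sketch (my own helper name, not the PR's code) using the expansion ||x - c||² = ||x||² - 2·x·c + ||c||²:

```python
import numpy as np

def closest_dist_sq(X, centers):
    """Squared distance from each point to its nearest candidate centre.

    One matrix product replaces a Python loop over points; the BLAS
    behind `@` already runs multi-threaded on most installs.
    """
    x2 = (X ** 2).sum(1)[:, None]           # (n, 1)
    c2 = (centers ** 2).sum(1)[None, :]     # (1, k)
    d2 = x2 - 2.0 * X @ centers.T + c2      # (n, k) pairwise squared dists
    return np.maximum(d2, 0.0).min(axis=1)  # clip tiny negatives, take min
```

This is the same trick scikit-learn uses internally for its pairwise Euclidean distances.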

The code for the plot is in this (gist).
The inertia values and exact timings are in this HTML document (an image of my ipynb): (doc).
I could post the plot image, but the inertia values already show the accuracy of the initialisation clearly, so I refrained.

@raghavrv
Member

Hi, could you re-post with clearly labelled plots? Which one is k-means (master) and which one is the k-means++? Also, a bigger plot please...

And did you try k-means (the current implementation) with n_jobs? The speed gains seem really weird... Is this PR's implementation 100x better than the current one?!

@shubham0704
Author

@raghavrv Thanks for taking a look at this. For clarity, here are my benchmark results in tabular format.
I have compared the current k-means with k-means++ initialisation vs. k-means with my k-means|| initialisation.
sampling_factor: the number of candidate centres chosen in one pass

| Algorithm | Inertia | Total time taken (s) | sampling_factor |
| --- | --- | --- | --- |
| `k-means++` | 184201.108453 | 1.41764903069 | does not apply |
| `k-means\|\|` | 184949.289745 | | |
| `k-means\|\|` | 184616.469914 | | |
| `k-means\|\|` | 185868.158701 | | |
| `k-means\|\|` | 185753.381721 | | |
| `k-means\|\|` | 185371.57211 | | |

These graphs show that the moment I oversample (take extra candidate centres), the time taken shoots up.
I believe this can be controlled by parallelizing the implementation.

I will post proper graphs shortly. Thanks. : )

@raghavrv
Member

So the kmeans|| is quite slow and the benchmarks are as expected from the conclusions of #5530 or am I missing something?

If that is the case, I think it would be wise not to expend energy on it then... (Please feel free to let me know if there is something I'm missing that would make the addition of kmeans|| useful.)

@shubham0704
Author

shubham0704 commented Mar 25, 2017

I would like to point out that this is just a sequential implementation; I am sure making it parallel would reduce the time taken. This PR was opened to demonstrate the implementation of the algorithm and to show that it works, although slowly.

I started working on it because I once wrote a search handler with some very long code; I broke it into functions and made it work asynchronously, which made the search handler very fast. I am not sure how fast this can go, but given the chance I would like to work on it.
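As a rough illustration of the kind of parallelism meant here (a hypothetical sketch with my own names, not this PR's planned design): split the points into chunks and compute each chunk's nearest-centre distances concurrently.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _chunk_min_dist(chunk, centers):
    # Squared distance from each point in the chunk to its nearest centre.
    return ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)

def parallel_min_dist(X, centers, n_workers=4):
    """Compute nearest-centre squared distances in parallel over row chunks.

    Threads suffice here because NumPy releases the GIL inside the heavy
    array arithmetic; a process pool or Cython could push this further.
    """
    chunks = np.array_split(X, n_workers)
    with ThreadPoolExecutor(n_workers) as ex:
        parts = ex.map(lambda c: _chunk_min_dist(c, centers), chunks)
        return np.concatenate(list(parts))
```

The chunks are independent, so the result is identical to the sequential computation regardless of worker count.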

@raghavrv
Member

I was just suggesting...

Please feel free to pursue further if you believe time and effort spent on it would yield a benefit to you (in terms of your PR getting merged) and to the community (in terms of the PR being significantly faster for significant use cases)...

@raghavrv
Member

I would like to suggest that this is just a sequential implementation. I am sure making it parallel would reduce the time taken

Also, this PR is not cythonized while the implementation you are comparing against is... so cythonizing your implementation could also speed it up.

@shubham0704
Author

Thanks a lot @raghavrv .

@shubham0704
Author

I would like to add it as my GSoC proposal. Is that possible?

@shubham0704
Author

I guess it would not be worth it. Anyway, I learnt a lot. Thanks a lot for all the reviews, @glemaitre, @raghavrv, @jmschrei.

Successfully merging this pull request may close these issues.

Scalable Kmeans++ also known as k-means||
5 participants