[WIP] Scalable K-means ++ #8585


Closed
wants to merge 5 commits

Conversation


@shubham0704 shubham0704 commented Mar 14, 2017

Reference Issue

Fix #4357

What does this implement/fix? Explain your changes.

This implementation addresses the need for a scalable k-means algorithm. It is currently a sequential version; on approval, work will start on a parallel version.
A to-do list:

  • Test it in scikit-learn library
  • Make a parallel version and test it
  • Do bench-marking
  • Write tests
  • Documentation work

Any other comments?

I have added some comments that may feel unnecessary, but they are only there to help maintainers understand my code easily. This version needs some refinement, but it works in general. For implementation details with plotting, kindly see: https://github.com/shubham0704/scalable_kmeans In this run:

  • For testing, take test_script.py from my repo and paste it into the scikit-learn folder after generating the binaries.
  • For the plot, it is enough to clone my repo and run sample_plot.py.
    Requirements before running the plot: numpy and matplotlib.

@glemaitre
Member

glemaitre commented Mar 16, 2017

IMHO, there are two important things to be done for your current PR:

  • for just testing take test_script.py from my repo and paste it in sckit-learn folder after generating binaries
  • for plot it is enough to clone my repo and run sample_plot.py (for plot) requirements before running plot:numpy and matplotlib

You might want to avoid making code outside of scikit-learn since the core developers will have limited time to play with it. More importantly, plots and other results from those scripts can be posted directly inside the PR. It will greatly ease the follow-up of the PR.

@shubham0704
Author

Sure. Thanks a lot @glemaitre.

@@ -191,7 +301,8 @@ def k_means(X, n_clusters, init='k-means++', precompute_distances='auto',
centroid seeds. The final results will be the best output of
n_init consecutive runs in terms of inertia.

-    init : {'k-means++', 'random', or ndarray, or a callable}, optional
+    init : {'k-means++', 'k-means||', 'random', or ndarray, or a callable},
Member


Need to add a description below of what kmeans|| is

Author


Sure, will add. Thanks a lot @jmschrei

@shubham0704
Author

For easy follow-up, here is the algorithm:
[image: scalable k-means (k-means||) algorithm]
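Since the algorithm itself is only shown as an image above, here is a minimal NumPy sketch of the k-means|| (scalable k-means++) initialisation from Bahmani et al., with my own function and parameter names (this is not the PR's code; a full implementation would run weighted k-means++ on the candidates in the last step):

```python
import numpy as np

def kmeans_parallel_init(X, n_clusters, oversampling=None, n_rounds=5, seed=None):
    """Sketch of k-means|| (scalable k-means++) initialisation.

    `oversampling` is the expected number of candidates drawn per round
    (the paper's l); Bahmani et al. suggest it be on the order of k.
    """
    rng = np.random.default_rng(seed)
    if oversampling is None:
        oversampling = 2 * n_clusters
    # Step 1: one centre chosen uniformly at random.
    centers = X[rng.integers(len(X))][None, :]
    for _ in range(n_rounds):
        # D(x, C)^2: squared distance of each point to its nearest centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)
        # Step 3: each point joins the candidate set independently with
        # probability proportional to its distance (capped at 1).
        probs = np.minimum(1.0, oversampling * d2 / d2.sum())
        picked = X[rng.random(len(X)) < probs]
        if len(picked):
            centers = np.vstack([centers, picked])
    # Step 5: weight each candidate by the number of points closest to it,
    # then reduce the (small) candidate set to n_clusters centres. Taking
    # the heaviest candidates is a crude stand-in for weighted k-means++.
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    weights = np.bincount(assign, minlength=len(centers))
    return centers[np.argsort(weights)[-n_clusters:]]
```

Because each point decides independently whether to become a candidate, the per-round sampling is exactly the part that can later be parallelized.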

This is a small benchmark test on an artificial data set, basically the same one as in the previous PR.

from sklearn import datasets

X, y = datasets.make_blobs(n_samples=10000,
                           n_features=20,
                           centers=15,
                           cluster_std=1)

[screenshot: benchmark output, 2017-03-19]

OBSERVATIONS:

  • Time taken to calculate centres increases with the sampling factor. This happens because the set of candidate centres grows each time we calculate D(x, c) in the main for-loop (Step 3). We need to parallelize the computation of each point's distance to the candidate centres, which can easily be done since each point selects its nearest candidate independently.
  • For Step 5, we have to run (# n_clusters) for-loops in parallel.
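The independence noted in the first bullet is what lets the D(x, c) step be vectorised instead of looped over points. A minimal NumPy sketch (my own helper name, not the PR's code) using the expansion ||x - c||² = ||x||² - 2·x·c + ||c||²:

```python
import numpy as np

def closest_dist_sq(X, centers):
    """Squared distance from each point to its nearest candidate centre.

    One matrix product replaces a Python loop over points; the BLAS
    behind `@` already runs multi-threaded on most installs.
    """
    x2 = (X ** 2).sum(1)[:, None]           # (n, 1)
    c2 = (centers ** 2).sum(1)[None, :]     # (1, k)
    d2 = x2 - 2.0 * X @ centers.T + c2      # (n, k) pairwise squared dists
    return np.maximum(d2, 0.0).min(axis=1)  # clip tiny negatives, take min
```

This is the same trick scikit-learn uses internally for its pairwise Euclidean distances.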

The code for the plot is in this (gist).
The inertia values and exact timings are in this HTML document (an image of my ipynb): (doc).
I could post the plot image, but the inertia values already show the accuracy of the initialisation clearly, so I refrained.

@raghavrv
Member

Hi, could you re-post with clearly labelled plots? Which one is k-means (master) and which one is the k-means++? Also, a bigger plot please...

And did you try k-means (the current implementation) with n_jobs? The speed gains seem really weird... Is this PR's implementation 100x better than the current one?!

@shubham0704
Author

@raghavrv Thanks for taking a look at this. For clarity, here are my benchmark results in tabular format.
I have compared the current k-means with k-means++ initialisation vs. k-means with my k-means|| initialisation.
sampling_factor: the number of candidate centres chosen in one pass

| Algorithm | Inertia | Total time taken (s) | sampling_factor |
| --- | --- | --- | --- |
| `k-means++` | 184201.108453 | 1.41764903069 | does not apply |
| `k-means\|\|` | 184949.289745 | | |
| `k-means\|\|` | 184616.469914 | | |
| `k-means\|\|` | 185868.158701 | | |
| `k-means\|\|` | 185753.381721 | | |
| `k-means\|\|` | 185371.57211 | | |

These graphs show that the moment I oversample (take extra candidate centres), the time taken shoots up.
I believe this can be controlled by parallelizing the implementation.

I will post proper graphs shortly. Thanks. : )

@raghavrv
Member

So the kmeans|| is quite slow and the benchmarks are as expected from the conclusions of #5530 or am I missing something?

If that is the case, I think it would be wise not to expend energy on it then... (Please feel free to let me know if there is something I'm missing that would make the addition of kmeans|| useful.)

@shubham0704
Author

shubham0704 commented Mar 25, 2017

I would like to point out that this is just a sequential implementation; I am sure making it parallel would reduce the time taken. This PR was opened to demonstrate the implementation of the algorithm and to show that it works, although slowly.

I started working on it because I once wrote a search handler with some very long code; I broke it into functions and made it work asynchronously, which made the search handler very fast. I am not sure how fast this can go, but given the chance I would like to work on it.
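As a rough illustration of the kind of parallelism meant here (a hypothetical sketch with my own names, not this PR's planned design): split the points into chunks and compute each chunk's nearest-centre distances concurrently.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _chunk_min_dist(chunk, centers):
    # Squared distance from each point in the chunk to its nearest centre.
    return ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(-1).min(1)

def parallel_min_dist(X, centers, n_workers=4):
    """Compute nearest-centre squared distances in parallel over row chunks.

    Threads suffice here because NumPy releases the GIL inside the heavy
    array arithmetic; a process pool or Cython could push this further.
    """
    chunks = np.array_split(X, n_workers)
    with ThreadPoolExecutor(n_workers) as ex:
        parts = ex.map(lambda c: _chunk_min_dist(c, centers), chunks)
        return np.concatenate(list(parts))
```

The chunks are independent, so the result is identical to the sequential computation regardless of worker count.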

@raghavrv
Member

I was just suggesting...

Please feel free to pursue further if you believe time and effort spent on it would yield a benefit to you (in terms of your PR getting merged) and to the community (in terms of the PR being significantly faster for significant use cases)...

@raghavrv
Member

I would like to suggest that this is just a sequential implementation. I am sure making it parallel would reduce the time taken

Also, this PR is not cythonized while the implementation you are comparing against is... so cythonizing your implementation could also speed it up.

@shubham0704
Author

Thanks a lot @raghavrv .

@shubham0704
Author

I would like to add it as my GSoC proposal. Is that possible?

@shubham0704
Author

I guess it would not be worth it. Anyway, I learnt a lot. Thanks a lot for all the reviews, @glemaitre, @raghavrv, @jmschrei.

Successfully merging this pull request may close these issues.

Scalable Kmeans++ also known as k-means||
5 participants