[WIP] k-means|| initialization for k-means #5530
Conversation
(force-pushed from d2d2ae9 to f92e3c0)
n_clusters: integer
    The number of seeds to choose

l: int
This should not be optional, at least judging by your checks below. They say that it can be thought of as Θ(k), but I don't get what Θ is.
The initial benchmarks don't look too pleasing, as you have said. If nothing else, it is because the user now has two more parameters to tune.
Actually, the present algorithm in […]. However, it might be worth trying to do something like this.
closest_dist_sq = np.minimum(distances, closest_dist_sq)
current_pot = closest_dist_sq.sum()

center_ids.extend(candidate_ids)
Should we check for duplicate centers here? (especially since we are not doing it in parallel)
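For what it's worth, assuming `center_ids` and `candidate_ids` are lists of sample indices (my reading of the diff, not something stated here), such a check could be as simple as:

```python
# Keep only candidates that have not already been chosen as centers.
seen = set(center_ids)
center_ids.extend(i for i in candidate_ids if i not in seen)
```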
@zermelozf are you still working on this? Should someone else take over?
@amueller, I am not working on this anymore as I could not reproduce the claims of the paper. If someone would like to investigate this further, please feel free to take over the code.
@zermelozf ok, that is pretty useful information. thanks :) I'll tag it "need contributor" but we might decide to close it in the future.
actually closing it |
Following the discussion on issue #4357, I have implemented a draft of k-means|| (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
In this implementation, no attempt was made at parallelising the k-means|| initialisation! This could in theory be done, though, and we can discuss the utility of adding a truly data-parallel, scalable version of the algorithm.
Parallelising the initialisation at the data level is somewhat incompatible with the current parallelisation strategy, which runs several models (started from different initialisation seeds) in parallel. The provided implementation therefore only offers an alternative way to initialise the cluster centres, and should not be any more scalable than k-means++.
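For context, here is a minimal sequential sketch of the sampling scheme from the paper. This is an illustration of the technique, not the code in this PR: the function name `kmeans_parallel_init`, the default oversampling factor `l = 2 * n_clusters` and the fixed number of rounds are my own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans_parallel_init(X, n_clusters, l=None, n_rounds=5, random_state=None):
    """Sequential sketch of k-means|| seeding (Bahmani et al., VLDB 2012)."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    if l is None:
        l = 2 * n_clusters  # oversampling factor, Theta(k) in the paper

    # Pick the first candidate uniformly at random.
    centers = X[rng.randint(n_samples)][np.newaxis, :]
    closest_dist_sq = cdist(X, centers, 'sqeuclidean').min(axis=1)

    # The paper uses O(log phi) rounds; a small constant is used here.
    # Each point independently joins the candidate set with probability
    # proportional to its squared distance to the current candidates.
    for _ in range(n_rounds):
        current_pot = closest_dist_sq.sum()
        if current_pot == 0:
            break  # every point already coincides with a candidate
        probs = np.minimum(l * closest_dist_sq / current_pot, 1.0)
        sampled = X[rng.uniform(size=n_samples) < probs]
        if sampled.shape[0] == 0:
            continue
        centers = np.vstack([centers, sampled])
        closest_dist_sq = np.minimum(
            closest_dist_sq, cdist(X, sampled, 'sqeuclidean').min(axis=1))

    # Weight each candidate by the number of points closest to it. The
    # (small) weighted candidate set is then reclustered down to
    # n_clusters seeds, e.g. with a weighted k-means++.
    labels = cdist(X, centers, 'sqeuclidean').argmin(axis=1)
    weights = np.bincount(labels, minlength=centers.shape[0])
    return centers, weights
```

The reclustering step is what collapses the roughly `l * n_rounds` candidates back to exactly `n_clusters` seeds; it is also where duplicate candidates, if any, would get merged.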
I have also attempted to reproduce the paper's claim that k-means|| initialisation produces better empirical results, with mixed success. Graphs showing the performance and runtime of k-means with the different initialisation approaches are shown below (cf. this gist).
Artificial dataset generated with make_blobs(n_samples=10000, n_features=20, centers=15, cluster_std=1):
20newsgroups dataset:
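For anyone who wants to re-run the comparison, a harness along these lines should reproduce the artificial-data setup (a sketch under my own assumptions; only the built-in inits are listed, since a `'k-means||'` init exists only on this branch and would be passed the same way):

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same artificial dataset as in the plot above.
X, _ = make_blobs(n_samples=10000, n_features=20, centers=15,
                  cluster_std=1, random_state=0)

for init in ('k-means++', 'random'):  # plus 'k-means||' on this branch
    t0 = time.time()
    km = KMeans(n_clusters=15, init=init, n_init=1, random_state=0).fit(X)
    print('%-10s inertia=%.1f time=%.2fs'
          % (init, km.inertia_, time.time() - t0))
```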
In the end, no initialisation seems to dominate the others in terms of speed or accuracy. I am not entirely convinced this implementation is really a big plus: the name is a bit deceiving, since it is neither parallel nor more scalable here. It may improve runtime or performance marginally in some cases, though.