
[WIP] k-means|| initialization for k-means #5530


Closed

Conversation

zermelozf
Contributor

Following the discussion on issue #4357, I have implemented a draft of k-means|| (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

In this implementation, no attempt was made to parallelise the k-means|| initialisation. This could be done in theory, though, and we can discuss the utility of adding a truly data-parallel, scalable version of the algorithm.

Parallelising the initialisation at the data level is somewhat incompatible with the current parallelisation strategy, which runs several models (starting from different initialisation seeds) in parallel. This implementation therefore only offers an alternative way to initialise the cluster centres, and should not be expected to be more scalable than k-means++.
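For reference, here is a minimal NumPy sketch of the oversampling scheme described in the paper; the function name kmeans_parallel_init, the default l = 2 * n_clusters, and the fixed number of rounds are illustrative choices, not the exact code in this PR:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def kmeans_parallel_init(X, n_clusters, l=None, n_rounds=5, rng=None):
    """Sketch of k-means|| initialisation (Bahmani et al., 2012)."""
    rng = np.random.default_rng(rng)
    if l is None:
        l = 2 * n_clusters  # assumed oversampling factor, l = Theta(k)

    # Step 1: pick one centre uniformly at random.
    centers = X[[rng.integers(len(X))]]
    closest_sq = cdist(X, centers, 'sqeuclidean').ravel()

    # Step 2: for a few rounds, sample each point independently with
    # probability ~ l * d^2(x, C) / phi, growing the candidate set.
    for _ in range(n_rounds):
        phi = closest_sq.sum()
        probs = np.minimum(1.0, l * closest_sq / phi)
        new = X[rng.random(len(X)) < probs]
        if len(new) == 0:
            continue
        centers = np.vstack([centers, new])
        closest_sq = np.minimum(
            closest_sq, cdist(X, new, 'sqeuclidean').min(axis=1))

    # Step 3: weight each candidate by the number of points it attracts,
    # then recluster the small weighted candidate set down to n_clusters.
    labels = cdist(X, centers, 'sqeuclidean').argmin(axis=1)
    weights = np.bincount(labels, minlength=len(centers)).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=1)
    return km.fit(centers, sample_weight=weights).cluster_centers_
```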

I have also attempted to reproduce the paper's claim that k-means|| initialisation produces better empirical results, with mixed success. Graphs showing the performance and runtime of k-means with different initialisation approaches are shown below (cf. this gist).

Artificial dataset generated with make_blobs(n_samples=10000, n_features=20, centers=15, cluster_std=1):

[figure: performance/runtime of the different initialisations on the blobs dataset]

20newsgroups dataset:

[figure: performance/runtime of the different initialisations on the 20newsgroups dataset]

In the end, no initialisation scheme seems to dominate the others in terms of speed or accuracy. I am not entirely convinced this implementation is really a big plus, as the name is a bit deceiving: it is neither parallel nor more scalable. It may improve runtime or performance marginally in some cases, though.
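As a side note, scikit-learn's KMeans accepts a callable init, so an alternative initialisation like the sketch above can be benchmarked without patching the library; parallel_init below is a hypothetical adapter around kmeans_parallel_init:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10000, n_features=20, centers=15,
                  cluster_std=1, random_state=0)

def parallel_init(X, n_clusters, random_state):
    # Adapter: scikit-learn passes a RandomState, the sketch takes a seed.
    return kmeans_parallel_init(X, n_clusters,
                                rng=random_state.randint(2**31 - 1))

for init in ('k-means++', 'random', parallel_init):
    km = KMeans(n_clusters=15, init=init, n_init=1, random_state=0).fit(X)
    name = init if isinstance(init, str) else 'k-means||'
    print(name, km.inertia_)
```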

n_clusters: integer
The number of seeds to choose

l: int
Member


This should not be optional, at least going by your checks below. They say that it can be thought of as Θ(k), but I don't get what Θ is.
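(For reference, the paper samples each point x independently with probability l * d²(x, C) / φ_X(C), where φ_X(C) is the current clustering cost; l = Θ(k) there just means the oversampling factor should be on the order of the number of clusters k, e.g. 0.5k or 2k.)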

@MechCoder
Member

The initial benchmarks don't look too pleasing, as you have said. If nothing else, it is because the user now has two more parameters to tune.

@MechCoder
Member

Actually, the present algorithm in kmeans++ is parallelizable by itself across n_local_trials. But the problem, as you have mentioned already, is that we are already parallelizing across n_init.

However, it might be worth trying something like this:

If n_local_trials >> n_init:
    for i in n_init:
        Parallelize across n_local_trials to obtain initial centroids.
    Parallelize across n_local_seeds using these centroids.
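A rough joblib sketch of that inner parallelization; the helpers _pot_if_added and best_local_trial are illustrative, not scikit-learn code:

```python
import numpy as np
from joblib import Parallel, delayed

def _pot_if_added(X, closest_dist_sq, candidate_id):
    # Potential (sum of squared distances) if this candidate became a center.
    dist = ((X - X[candidate_id]) ** 2).sum(axis=1)
    return np.minimum(closest_dist_sq, dist).sum()

def best_local_trial(X, closest_dist_sq, candidate_ids, n_jobs=-1):
    # Evaluate the n_local_trials candidates in parallel and keep the one
    # that reduces the potential the most.
    pots = Parallel(n_jobs=n_jobs)(
        delayed(_pot_if_added)(X, closest_dist_sq, cid)
        for cid in candidate_ids)
    return candidate_ids[int(np.argmin(pots))]
```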

# Update each point's squared distance to the nearest center chosen so far,
# then recompute the total potential.
closest_dist_sq = np.minimum(distances, closest_dist_sq)
current_pot = closest_dist_sq.sum()

center_ids.extend(candidate_ids)
Member


Should we check for duplicate centers here? (especially since we are not doing it in parallel)
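A minimal guard, assuming center_ids and candidate_ids hold sample indices, could look like:

```python
# Skip candidates that were already selected as centers (hypothetical guard).
seen = set(center_ids)
center_ids.extend(cid for cid in candidate_ids if cid not in seen)
```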

@amueller
Member

amueller commented Oct 7, 2016

@zermelozf are you still working on this? Should someone else take over?

@zermelozf
Contributor Author

@amueller, I am not working on this anymore, as I could not reproduce the claims of the paper. If someone would like to investigate this further, please feel free to take over the code.

@amueller
Member

amueller commented Oct 8, 2016

@zermelozf ok, that is pretty useful information. thanks :) I'll tag it "need contributor" but we might decide to close it in the future.

@amueller
Member

actually closing it

@amueller amueller closed this Oct 31, 2016
@glemaitre glemaitre mentioned this pull request Mar 16, 2017