[WIP] k-means|| initialization for k-means #5530
Conversation
(force-pushed from d2d2ae9 to f92e3c0)
n_clusters: integer
    The number of seeds to choose

l: int
This should not be optional, at least judging by your checks below. They say that it can be thought of as Θ(k), but I don't get what Θ is.
The initial benchmarks don't look too pleasing, as you have said. If nothing else, it is because the user now has two more parameters to tune.
Actually, the present algorithm in […]. However, it might be worth trying to do something like this.
closest_dist_sq = np.minimum(distances, closest_dist_sq)
current_pot = closest_dist_sq.sum()

center_ids.extend(candidate_ids)
Should we check for duplicate centers here? (especially since we are not doing it in parallel)
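For what it's worth, assuming `center_ids` and `candidate_ids` are lists of sample indices (my reading of the diff, not something stated here), such a check could be as simple as:

```python
# Keep only candidates that have not already been chosen as centers.
seen = set(center_ids)
center_ids.extend(i for i in candidate_ids if i not in seen)
```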
@zermelozf are you still working on this? Should someone else take over?
@amueller, I am not working on this anymore as I could not reproduce the claims of the paper. If someone would like to investigate this further, please feel free to take over the code.
@zermelozf ok, that is pretty useful information. thanks :) I'll tag it "need contributor" but we might decide to close it in the future.
actually closing it |
Following the discussion on issue #4357, I have implemented a draft of k-means|| (http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
In this implementation, no attempt was made at parallelising the k-means|| initialisation! This could in theory be done, though, and we can discuss the utility of adding a truly data-parallel, scalable version of the algorithm.
Parallelising the initialisation at the data level is somewhat incompatible with the current parallelisation strategy, which runs several models (started from different initialisation seeds) in parallel. The provided implementation therefore only offers an alternative way to initialise the cluster centres, and should not be any more scalable than k-means++.
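For context, here is a minimal sequential sketch of the sampling scheme from the paper. This is an illustration of the technique, not the code in this PR: the function name `kmeans_parallel_init`, the default oversampling factor `l = 2 * n_clusters` and the fixed number of rounds are my own assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist


def kmeans_parallel_init(X, n_clusters, l=None, n_rounds=5, random_state=None):
    """Sequential sketch of k-means|| seeding (Bahmani et al., VLDB 2012)."""
    rng = np.random.RandomState(random_state)
    n_samples = X.shape[0]
    if l is None:
        l = 2 * n_clusters  # oversampling factor, Theta(k) in the paper

    # Pick the first candidate uniformly at random.
    centers = X[rng.randint(n_samples)][np.newaxis, :]
    closest_dist_sq = cdist(X, centers, 'sqeuclidean').min(axis=1)

    # The paper uses O(log phi) rounds; a small constant is used here.
    # Each point independently joins the candidate set with probability
    # proportional to its squared distance to the current candidates.
    for _ in range(n_rounds):
        current_pot = closest_dist_sq.sum()
        if current_pot == 0:
            break  # every point already coincides with a candidate
        probs = np.minimum(l * closest_dist_sq / current_pot, 1.0)
        sampled = X[rng.uniform(size=n_samples) < probs]
        if sampled.shape[0] == 0:
            continue
        centers = np.vstack([centers, sampled])
        closest_dist_sq = np.minimum(
            closest_dist_sq, cdist(X, sampled, 'sqeuclidean').min(axis=1))

    # Weight each candidate by the number of points closest to it. The
    # (small) weighted candidate set is then reclustered down to
    # n_clusters seeds, e.g. with a weighted k-means++.
    labels = cdist(X, centers, 'sqeuclidean').argmin(axis=1)
    weights = np.bincount(labels, minlength=centers.shape[0])
    return centers, weights
```

The reclustering step is what collapses the roughly `l * n_rounds` candidates back to exactly `n_clusters` seeds; it is also where duplicate candidates, if any, would get merged.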
I have also attempted to reproduce the paper's claim that k-means|| initialisation produces better empirical results, with mixed success. Graphs showing the performance and runtime of k-means with the different initialisation approaches are shown below (cf. this gist).
Artificial dataset generated with make_blobs(n_samples=10000, n_features=20, centers=15, cluster_std=1):
20newsgroups dataset:
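For anyone who wants to re-run the comparison, a harness along these lines should reproduce the artificial-data setup (a sketch under my own assumptions; only the built-in inits are listed, since a `'k-means||'` init exists only on this branch and would be passed the same way):

```python
import time

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same artificial dataset as in the plot above.
X, _ = make_blobs(n_samples=10000, n_features=20, centers=15,
                  cluster_std=1, random_state=0)

for init in ('k-means++', 'random'):  # plus 'k-means||' on this branch
    t0 = time.time()
    km = KMeans(n_clusters=15, init=init, n_init=1, random_state=0).fit(X)
    print('%-10s inertia=%.1f time=%.2fs'
          % (init, km.inertia_, time.time() - t0))
```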
In the end, no initialisation seems to dominate the others in terms of speed or accuracy. I am not entirely convinced this implementation is really a big plus: the name is a bit deceiving, since it is neither parallel nor more scalable here. It may improve runtime or performance marginally in some cases, though.