KMeans: precompute_distances differences across machines #7193
Comments
Could you give an example of a random state that yields different results on different machines? I assume the results are reproducible on a single machine.
When I re-run KMeans on Machine A and on Machine B, the results are repeatable on each. The problem is the difference across machines. I will try to come up with a sample data set to illustrate that.
Thanks. Yes, the lack of a dataset is a problem. First: could you please check whether you can reproduce the difference by installing numpy 1.8.2 on Machine B?
I did something slightly different. On both Machine A and Machine B, I checked whether the clustering differs between precompute_distances set to True and set to False, and it does differ on both versions of numpy (and also across versions when it is True, as reported in the original issue). I will try to prepare an example to illustrate it, but unfortunately this might take some time.
OK, I've created an example script here: To reproduce the problem, it needs to be run twice, once with precompute_distances set to True and once with it set to False. The output is logged to test.log, so it's best to save that file under a temporary name after the first run and use a diff tool to compare the two logs. One of the silhouette scores differs, which is the result of a different clustering. Hope that helps.
Which distances does k-means compute when precompute_distances is set to True?
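For context on this question, here is a minimal sketch of why two mathematically equivalent ways of computing squared Euclidean distances can disagree in floating point. It assumes that the precomputed path relies on the dot-product expansion ||x||^2 - 2 x.c + ||c||^2; that is an assumption about the old code path, not something confirmed in this thread. The function names `sq_dist_direct`, `sq_dist_expanded` and `nearest` are hypothetical and are not part of scikit-learn.

```python
import math


def sq_dist_direct(x, c):
    """Squared Euclidean distance as a sum of squared differences."""
    return math.fsum((xi - ci) ** 2 for xi, ci in zip(x, c))


def sq_dist_expanded(x, c):
    """Same quantity via ||x||^2 - 2 x.c + ||c||^2 (dot-product form).

    Assumption: this mirrors how a precomputed distance matrix might be
    built; it is an illustration, not scikit-learn's implementation.
    """
    xx = sum(xi * xi for xi in x)
    cc = sum(ci * ci for ci in c)
    xc = sum(xi * ci for xi, ci in zip(x, c))
    return xx - 2.0 * xc + cc


def nearest(x, centers, dist):
    """Index of the closest center under the given distance function."""
    return min(range(len(centers)), key=lambda j: dist(x, centers[j]))


if __name__ == "__main__":
    # Large coordinates make the expanded form lose low-order digits,
    # so the two formulas may report different values for the same pair.
    x = [1e8, 1.0]
    c = [1e8, 0.0]
    print("direct:  ", sq_dist_direct(x, c))
    print("expanded:", sq_dist_expanded(x, c))
```

When a point is almost equidistant from two centers, a rounding difference of this kind can be enough to change which center `nearest` picks, and the order of floating-point operations can vary with the linear algebra library a machine uses.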
precompute_distances is deprecated and no longer used (#11950). Closing.
Description
When using KMeans, setting precompute_distances to True gives different cluster assignments for the same data set on different machines.
Steps/Code to Reproduce
Example:
Expected Results
The same cluster assignment on different machines.
Actual Results
Different cluster assignments on different machines. I realized that by comparing the silhouette score. It's worth noting that if I repeat the clustering on the same machine the results are identical (clustering is the same).
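When comparing assignments from two machines directly rather than through the silhouette score, it helps to ignore differences that are only a renumbering of the clusters. The following is a small sketch of such a check; `canonical_labels` and `same_partition` are hypothetical helper names, not scikit-learn functions.

```python
def canonical_labels(labels):
    """Relabel clusters by order of first appearance.

    Two label lists that describe the same grouping with different
    cluster ids map to the same canonical list.
    """
    mapping = {}
    return [mapping.setdefault(lab, len(mapping)) for lab in labels]


def same_partition(a, b):
    """True if both label lists group the samples identically."""
    return len(a) == len(b) and canonical_labels(a) == canonical_labels(b)
```

If `same_partition` returns False for the labels produced on Machine A and Machine B, the clusterings genuinely differ, which is consistent with a differing silhouette score.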
Versions
Machine A:
Machine B:
The NumPy version and the kernel version differ.