ENH: Add support for cosine distance in k-means #12192
Conversation
@@ -186,7 +213,8 @@ def _check_sample_weight(X, sample_weight):
def k_means(X, n_clusters, sample_weight=None, init='k-means++',
How would these changes interact with #11950, since both are modifying k_means()?
I'm going for minimal changes to the current implementation. However, I looked into the cited pull request, and I believe computing the cosine distance with chunked BLAS calls would work as well.
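For concreteness, a minimal sketch of what I mean by the chunked approach (the function name is illustrative, not an actual helper from #11950): after L2-normalizing the rows and the centers, each chunk of cosine distances reduces to a single dense matrix product, i.e. one BLAS GEMM call, so peak memory stays bounded by the chunk size.

```python
import numpy as np

def cosine_distances_chunked(X, centers, chunk_size=1024):
    """Illustrative sketch: cosine distances computed chunk by chunk."""
    # Normalize once; cosine distance then equals 1 minus the dot
    # product of unit vectors (rows are assumed to be nonzero).
    X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)
    C_norm = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    out = np.empty((X.shape[0], centers.shape[0]))
    for start in range(0, X.shape[0], chunk_size):
        stop = min(start + chunk_size, X.shape[0])
        # One GEMM call per chunk; memory is chunk_size * n_clusters.
        out[start:stop] = 1.0 - X_norm[start:stop] @ C_norm.T
    return out
```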
Defining clusters by their mean coordinates simply does not make sense for most metrics: the arithmetic mean only minimizes within-cluster distance under squared Euclidean distance. I think it is bad to assume the user knows what they're doing here, because many users naively assume that k-means should allow an arbitrary metric. It shouldn't.
Rather, we should encourage hierarchical clustering for non-Euclidean metrics, or k-medoids (#11099), although we are waiting there on some evidence that k-medoids has benefits over hierarchical clustering (can you help there?)
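For reference, a minimal sketch of the suggested alternative using AgglomerativeClustering; in releases of that era the metric parameter was called affinity, and linkage must not be 'ward' for non-Euclidean metrics:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.random.RandomState(0).rand(60, 20)  # stand-in feature matrix

# Hierarchical clustering accepts non-Euclidean metrics directly;
# 'ward' linkage is Euclidean-only, so use e.g. average linkage.
model = AgglomerativeClustering(n_clusters=3, affinity='cosine',
                                linkage='average')
labels = model.fit_predict(X)
```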
@jnothman Using the mean coordinates for cluster definition is effective for clustering text documents. I made some changes to allow only the euclidean and cosine metrics. The cosine distance path might be slower because I didn't implement a cythonized E-step; should I go ahead and do that? K-means showed better results in terms of cluster quality metrics than the KMedoids implementation cited above.
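For context, the reason the mean is defensible here: on L2-normalized vectors, squared Euclidean and cosine distance are monotonically related via ||a - b||^2 = 2(1 - cos(a, b)), so Euclidean k-means on normalized TF-IDF rows already ranks assignments the way cosine would. A sketch of that workaround with the existing estimator (illustrative, not code from this PR):

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(100, 50)            # stand-in for TF-IDF features

# On the unit sphere, ||a - b||^2 = 2 * (1 - cos(a, b)), so Euclidean
# k-means on normalized rows approximates spherical (cosine) k-means.
X_unit = normalize(X)            # L2-normalize each row
labels = KMeans(n_clusters=5, random_state=0).fit_predict(X_unit)
```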
Closing this one as stated in the original issue.
Reference Issues/PRs
See also #1188.
What does this implement/fix? Explain your changes.
Adds support for cosine distance in k-means, including the minibatch implementation. We trust that users know what they are doing and allow the metric to be a callable with metric_kwargs.
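To illustrate the intended interface (hypothetical: this PR was closed unmerged, so the metric and metric_kwargs parameters below are what this change proposes, not part of released scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_distances

X = np.random.RandomState(0).rand(100, 10)  # stand-in feature matrix

# Built-in string metric as proposed by this PR (hypothetical API):
km = KMeans(n_clusters=5, metric='cosine').fit(X)

# Or any callable, with extra keyword arguments forwarded via
# metric_kwargs (also hypothetical):
km = KMeans(n_clusters=5, metric=cosine_distances).fit(X)
```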
Any other comments?
I enabled stripping of headers, footers, and quotes from the 20newsgroups documents in the example plot_document_clustering.py. The example now clearly shows improvements in V-measure and ARI when using cosine distance for text document clustering.
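For context, a condensed sketch of the dataset change (these are existing scikit-learn APIs; this is not the full example script, and it uses the stock Euclidean KMeans since the cosine option is what this PR would add):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score, adjusted_rand_score

# Strip headers, footers, and quoted replies so clustering must rely
# on the message bodies alone, not on metadata that leaks the label.
data = fetch_20newsgroups(subset='all',
                          categories=['alt.atheism', 'sci.space'],
                          remove=('headers', 'footers', 'quotes'))
X = TfidfVectorizer(max_features=10000,
                    stop_words='english').fit_transform(data.data)

labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print('V-measure:', v_measure_score(data.target, labels))
print('ARI:      ', adjusted_rand_score(data.target, labels))
```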