ENH: Add support for cosine distance in k-means #12192


Closed

Conversation

flaviomartins

Reference Issues/PRs

See also #1188.

What does this implement/fix? Explain your changes.

Adds support for cosine distance in k-means, including the minibatch implementation. We trust that users know what they are doing and allow the metric to be a callable with metric_kwargs.

Any other comments?

I enabled the stripping of headers, footers, and quotes from the 20newsgroups dataset documents in the example program plot_document_clustering.py. It now clearly shows improvements in V-measure and ARI when using cosine distance for text document clustering.
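As background for the approach, cosine-based k-means can be approximated with the existing Euclidean implementation by L2-normalizing the rows first, since for unit-norm vectors the squared Euclidean distance is a monotone function of cosine distance. The sketch below shows that workaround; it is illustrative only and is not the code from this PR:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# For unit-norm rows, ||a - b||^2 = 2 * (1 - cos(a, b)), so Euclidean
# k-means on normalized data approximates cosine ("spherical") k-means.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
X_normed = normalize(X)  # each row now has unit L2 norm
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_normed)
```

Note that this only approximates cosine k-means: the centroids are means of normalized points and are not themselves re-normalized between iterations, which a true spherical k-means would do.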

@@ -186,7 +213,8 @@ def _check_sample_weight(X, sample_weight):
def k_means(X, n_clusters, sample_weight=None, init='k-means++',
Contributor


How would these changes interact with #11950, since both are modifying k_means()?

Author


I'm going for minimal changes to the current implementation. However, I looked into the cited pull request, and I believe computing the cosine distance using chunked BLAS calls would work as well.

Member

@jnothman jnothman left a comment


Defining clusters by the mean of member coordinates simply does not make sense for most metrics. I think it is bad to assume the user knows what they're doing here, because many users naively assume that k-means should allow an arbitrary metric. It shouldn't.

Rather, we should encourage the use of hierarchical clustering for non-Euclidean metrics, or k-medoids (#11099), although we are waiting there on some evidence that k-medoids has benefits over hierarchical clustering (can you help there?)

@flaviomartins
Author

@jnothman Using the mean coordinates to define clusters is effective for clustering text documents. I made some changes to allow only the euclidean and cosine metrics. The cosine distance path might be slower because I didn't implement a cythonized E-step. Should I go ahead and do that?

K-means showed better results in terms of cluster quality metrics compared to the KMedoids implementation cited above.
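The non-cythonized E-step mentioned above can be expressed in a few lines of NumPy. This is an illustrative sketch, not the PR's code; the function name and signature here are hypothetical:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def cosine_e_step(X, centers):
    """Assign each sample to its nearest center under cosine distance.

    A plain NumPy sketch of the (non-cythonized) E-step discussed above.
    Returns the label array and the summed within-cluster cosine distance.
    """
    # (n_samples, n_clusters) matrix of cosine distances
    d = pairwise_distances(X, centers, metric="cosine")
    labels = d.argmin(axis=1)
    inertia = d[np.arange(X.shape[0]), labels].sum()
    return labels, inertia
```

A cythonized version would mainly avoid materializing the full distance matrix by processing samples in chunks, which is where the speed gap against the Euclidean path comes from.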

Base automatically changed from master to main January 22, 2021 10:50
@flaviomartins flaviomartins requested a review from jnothman June 1, 2023 17:22
@glemaitre
Member

Closing this one as stated in the original issue.

@glemaitre glemaitre closed this Mar 13, 2024

6 participants