ENH: Add support for cosine distance in k-means #12192


Closed

Conversation

flaviomartins

Reference Issues/PRs

See also #1188.

What does this implement/fix? Explain your changes.

Adds support for cosine distance in k-means, including the minibatch implementation. We trust that users know what they are doing and allow the metric to be a callable with metric_kwargs.

Any other comments?

I enabled the stripping of headers, footers, and quotes from the 20newsgroups dataset documents in the example program plot_document_clustering.py. It now clearly shows improvements in V-measure and ARI when using cosine distance for text document clustering.
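As background for the approach, cosine-based k-means can be approximated with the existing Euclidean implementation by L2-normalizing the rows first, since for unit-norm vectors the squared Euclidean distance is a monotone function of cosine distance. The sketch below shows that workaround; it is illustrative only and is not the code from this PR:

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# For unit-norm rows, ||a - b||^2 = 2 * (1 - cos(a, b)), so Euclidean
# k-means on normalized data approximates cosine ("spherical") k-means.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
X_normed = normalize(X)  # each row now has unit L2 norm
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_normed)
```

Note that this only approximates cosine k-means: the centroids are means of normalized points and are not themselves re-normalized between iterations, which a true spherical k-means would do.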

@@ -186,7 +213,8 @@ def _check_sample_weight(X, sample_weight):
def k_means(X, n_clusters, sample_weight=None, init='k-means++',
Contributor


How would these changes interact with #11950, since both are modifying k_means()?

Author


I'm going for minimal changes to the current implementation. However, I looked into the cited pull request, and I believe computing the cosine distance using chunked BLAS calls would work as well.

Member

@jnothman jnothman left a comment


Defining clusters by the mean of member coordinates simply does not make sense for most metrics. I think it is bad to assume the user knows what they're doing here, because many users naively assume that k-means should allow an arbitrary metric. It shouldn't.

Rather, we should encourage the use of hierarchical clustering for non-Euclidean metrics, or k-medoids (#11099), although we are waiting there on some evidence that k-medoids has benefits over hierarchical clustering (can you help there?)

@flaviomartins
Author

@jnothman Using the mean coordinates to define clusters is effective for clustering text documents. I made some changes to allow only the euclidean and cosine metrics. The cosine distance path might be slower because I didn't implement a cythonized E-step. Should I go ahead and do that?

K-means showed better results in terms of cluster quality metrics compared to the KMedoids implementation cited above.
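The non-cythonized E-step mentioned above can be expressed in a few lines of NumPy. This is an illustrative sketch, not the PR's code; the function name and signature here are hypothetical:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

def cosine_e_step(X, centers):
    """Assign each sample to its nearest center under cosine distance.

    A plain NumPy sketch of the (non-cythonized) E-step discussed above.
    Returns the label array and the summed within-cluster cosine distance.
    """
    # (n_samples, n_clusters) matrix of cosine distances
    d = pairwise_distances(X, centers, metric="cosine")
    labels = d.argmin(axis=1)
    inertia = d[np.arange(X.shape[0]), labels].sum()
    return labels, inertia
```

A cythonized version would mainly avoid materializing the full distance matrix by processing samples in chunks, which is where the speed gap against the Euclidean path comes from.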

Base automatically changed from master to main January 22, 2021 10:50
@flaviomartins flaviomartins requested a review from jnothman June 1, 2023 17:22
@glemaitre
Member

Closing this one as stated in the original issue.

@glemaitre glemaitre closed this Mar 13, 2024

6 participants