KMedoids implementation Part Three #11099


Closed
wants to merge 104 commits into from

Conversation

znd4

@znd4 znd4 commented May 16, 2018

Helping @Kornel to finish up/close #7694.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This is an attempt to finish the implementation of KMedoids clustering.

Any other comments?

Thanks to @Kornel and @terkkila for their work on this, and to everyone who's reviewed things so far. Also, thanks and apologies in advance to whoever reviews my code.

terkkila and others added 30 commits October 25, 2016 21:02
 - …ded tests for KMedoids::fit() and KMedoids::fit_predict()
 - …inside fit() instead. Also, a random_state constructor argument was added to KMedoids to ensure identical behavior between multiple fit() calls. Made the examples and tests comply with the PEP 8 convention.
 - …nd transform() -- added unit tests to verify. Simplified the heuristic based on the recommendation. Using just one underscore to denote the private method.
 - …ns have been used by fit(). Moved array checks to their own private method.
 - changed positionless formatting to percent notation
 - fixed documentation plot
@domenico-somma

> It's basically extracted from the "Dataset SOFT file" link there as far as I can tell...

Yes, it's just that...

@amueller amueller modified the milestones: 0.20, 0.21 Jul 16, 2018
@ybdesire

I've waited a long time for this k-medoids algorithm.

@jnothman
Member

@ybdesire, could you please help us get this released by giving us a dataset where k-medoids is a much better choice than hierarchical agglomerative clustering with the same metric?

@znd4
Author

znd4 commented Feb 27, 2019

One of the assumptions of k-means (and k-medoids) clustering is that the data is spherically distributed. I tried to create a "spherically distributed" dataset in a space with a non-Euclidean metric, but everything I came up with felt pretty contrived. @ybdesire, that might be an avenue worth pursuing.

@znd4
Author

znd4 commented Feb 27, 2019

@jnothman, does a generated dataset like that qualify? I started with k random strings as seeds, then generated points by selecting one of the seeds and randomly changing N characters, where N is drawn from a "gaussian-ish" distribution.
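For concreteness, the generator described above can be sketched in plain Python. All names and default parameter values here are illustrative assumptions, not code from the actual PR:

```python
# Hypothetical sketch of the generator described above: k random string
# seeds, with each point produced by mutating one seed in N positions,
# where N is drawn from a clipped "gaussian-ish" distribution.
import random
import string

def make_string_clusters(k=3, n_points=300, length=20, mu=2.0, sigma=1.0, seed=0):
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    seeds = ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(k)]
    points, labels = [], []
    for _ in range(n_points):
        j = rng.randrange(k)
        chars = list(seeds[j])
        # Number of mutated positions: gaussian, clipped to [0, length].
        n_mut = max(0, min(length, round(rng.gauss(mu, sigma))))
        for pos in rng.sample(range(length), n_mut):
            chars[pos] = rng.choice(alphabet)
        points.append("".join(chars))
        labels.append(j)
    return seeds, points, labels
```

Under the Hamming metric, these points form rough "spheres" around the seed strings, which is exactly the engineered structure at issue.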

@jnothman
Member

It's not extremely persuasive if you have to intentionally engineer spheres in non-Euclidean space...?

@znd4
Author

znd4 commented Feb 27, 2019

Haha I began to worry that that was the case. I'd be personally sad to see this PR thrown out (for purely biased selfish reasons), but it's possible that this should be closed until someone can find a persuasive use-case for KMedoids.

@jnothman
Member

jnothman commented Feb 28, 2019 via email

@EggsBenedict

EggsBenedict commented Mar 20, 2019

I'd personally really like to see this move forward, too (and a k-means implementation that allows you to specify a distance metric). For K-Medoids specifically, I've been applying it to datasets where the cluster center should always be tied to a specific observation, since that observation is later evaluated (by a human) for conceptual similarity.

> Textual data doesn't seem appropriate for a clusterer where you need to set the number of clusters.

Which is a great point! In this case, approximating K accurately is very important. Despite its issues with scaling, the algorithm does well with noise and seems to converge quickly and relatively consistently.

@jnothman
Member

jnothman commented Mar 20, 2019 via email

@EggsBenedict

In this particular case, the number of identified groups is still a derived quantity, based on a separate model acting on a vector representation of the text data, so I think it would still be considered clustering. But I do think there's truth to what you're saying; it's an edge case, for sure...

@jnothman
Member

jnothman commented Mar 21, 2019 via email

@rth
Member

rth commented Mar 22, 2019

> would this be a good candidate for extras if we are happy with the implementation but holding out on clear justification?

I think it would be.

We recently created https://github.com/scikit-learn-contrib/scikit-learn-extra as the place to put good-quality scikit-learn contributions that don't get merged for one reason or another. This would allow users to start using such implementations; then, if they gain enough popularity on practical problems, we can consider merging them back into scikit-learn. In any case, that would be better than a pending PR.

@zdog234 Would you be interested in making a contribution with this PR there?

BTW, has anyone looked at "A simple and fast algorithm for K-medoids clustering" by Park & Jun (2009)? After a superficial read it looks like they improve upon the original Kaufman & Rousseeuw (1987) paper in particular with respect to scalability (and it has lots of citations).

@adrinjalali
Member

Opened scikit-learn-contrib/scikit-learn-extra#8, closing this one.

@znd4
Author

znd4 commented Mar 30, 2019

I'd definitely be interested in porting this over to scikit-learn-extra! I also wouldn't mind trying to implement an updated algorithm.

@AvantiShri

One use-case for k-medoids is finding "exemplars" (i.e. finding a set of datapoints that can be used as representatives of the full set). Using it for clustering itself may be less valuable than simply obtaining exemplars.
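As a sketch of what exemplar selection looks like, here is a greedy seeding in the style of PAM's BUILD step, working on a precomputed distance matrix. This is a minimal illustration under assumed names (a full PAM would refine it with SWAP passes), not code from the PR:

```python
def build_exemplars(dist, k):
    """Greedy BUILD-style selection of k exemplars from an n x n
    distance matrix (list of lists). Illustrative sketch only."""
    n = len(dist)
    # First exemplar: the point with minimal total distance to all others.
    exemplars = [min(range(n), key=lambda m: sum(dist[m]))]
    # Distance from each point to its nearest exemplar so far.
    nearest = list(dist[exemplars[0]])
    while len(exemplars) < k:
        # Add the point that most reduces the total nearest-exemplar distance.
        best = min((c for c in range(n) if c not in exemplars),
                   key=lambda c: sum(min(nearest[i], dist[c][i]) for i in range(n)))
        exemplars.append(best)
        nearest = [min(nearest[i], dist[best][i]) for i in range(n)]
    return exemplars
```

Full PAM would follow this with SWAP passes that try exchanging each exemplar for each non-exemplar as long as the total distance keeps decreasing.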

currently only supports Partitioning Around Medoids (PAM). The PAM algorithm
uses a greedy search, which may fail to find the global optimum. It consists of
two alternating steps commonly called the
Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987).
Contributor

BUILD and SWAP are different. These are not assignment and update steps.


.. topic:: References:

* "Clustering by Means of Medoids"
Contributor

Currently this code does not implement this reference!

@kno10
Contributor

kno10 commented Jan 26, 2020

This implementation uses the rather poor "alternating" algorithm (which was also later reinvented by Park). It is known in the literature to produce substantially worse results than the BUILD+SWAP approach used by Kaufman and Rousseeuw.
Yes, it's faster -- but only because it gets stuck in worse solutions... try it on some non-trivial data and compare the result quality.
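For contrast, the "alternating" scheme being criticized can be sketched in a few lines. This is an illustrative sketch under assumed names, not the PR's actual code: it alternates a nearest-medoid assignment step with a per-cluster medoid update, and unlike PAM's SWAP it only ever considers candidates inside the current cluster, which is why it can stall in poorer local optima:

```python
def alternating_kmedoids(dist, k, max_iter=100):
    """Alternating (Park-style) k-medoids on a precomputed n x n
    distance matrix; illustrative sketch with naive initialization."""
    n = len(dist)
    medoids = list(range(k))  # naive init: the first k points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
        # Update step: within each cluster, pick the member minimizing
        # the total distance to the other members.
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j] or [medoids[j]]
            new_medoids.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:  # converged to a local optimum
            break
        medoids = new_medoids
    return medoids, labels
```

On easy, well-separated toy data this converges to a sensible answer; the criticism above is that on non-trivial data it converges to worse local optima than BUILD+SWAP.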

@jnothman
Member

jnothman commented Jan 26, 2020 via email

@kno10
Contributor

kno10 commented Jan 26, 2020

I have; this comment is just in case this gets revived here once more.

@kno10
Contributor

kno10 commented Nov 22, 2023

In case someone stumbles across this old issue:
There is a scikit-learn compatible kmedoids package implementing fast algorithms installable via pip install kmedoids:
https://python-kmedoids.readthedocs.io/en/latest/
