KMedoids implementation Part Three #11099


Closed
wants to merge 104 commits into from

Conversation

znd4

@znd4 znd4 commented May 16, 2018

Helping @Kornel to finish up/close #7694.

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This is an attempt to finish the implementation of KMedoids clustering.

Any other comments?

Thanks to @Kornel and @terkkila for their work on this, and to everyone who's reviewed things so far. Also, thanks and apologies in advance to whoever reviews my code.

terkkila and others added 30 commits October 25, 2016 21:02
 - …ded tests for KMedoids::fit() and KMedoids::fit_predict()
 - …inside fit() instead. Also, a random_state constructor argument was added to KMedoids to ensure identical behavior between multiple fit() calls. Made the examples and tests comply with the PEP 8 convention.
 - …nd transform() -- added unit tests to verify. Simplified the heuristic based on the recommendation. Using just one underscore to denote the private method.
 - …ns have been used by fit(). Moved array checks to their own private method.
 - changed positionless formatting to percent notation
 - fixed documentation plot
@domenico-somma

> It's basically extracted from the "Dataset SOFT file" link there as far as I can tell...

Yes, it's just that...

@amueller amueller modified the milestones: 0.20, 0.21 Jul 16, 2018
@ybdesire

I've waited a long time for this k-medoids algorithm.

@jnothman
Member

@ybdesire, could you please help us get this released by giving us a dataset where k-medoids is a much better choice than hierarchical agglomerative clustering with the same metric?

@znd4
Author

znd4 commented Feb 27, 2019

One of the assumptions of k-means (and k-medoids) clustering is that the data is spherically distributed. I tried to create a "spherically distributed" dataset in a space with a non-Euclidean metric, but everything I came up with felt pretty contrived. @ybdesire, that might be an avenue worth pursuing.

@znd4
Author

znd4 commented Feb 27, 2019

@jnothman, does a generated dataset like that qualify? I started with k random strings as seeds, then generated points by selecting one of the seeds and randomly changing N characters, where N is drawn from a "gaussian-ish" distribution.
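For concreteness, the generator described above can be sketched in plain Python. All names and default parameter values here are illustrative assumptions, not code from the actual PR:

```python
# Hypothetical sketch of the generator described above: k random string
# seeds, with each point produced by mutating one seed in N positions,
# where N is drawn from a clipped "gaussian-ish" distribution.
import random
import string

def make_string_clusters(k=3, n_points=300, length=20, mu=2.0, sigma=1.0, seed=0):
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase
    seeds = ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(k)]
    points, labels = [], []
    for _ in range(n_points):
        j = rng.randrange(k)
        chars = list(seeds[j])
        # Number of mutated positions: gaussian, clipped to [0, length].
        n_mut = max(0, min(length, round(rng.gauss(mu, sigma))))
        for pos in rng.sample(range(length), n_mut):
            chars[pos] = rng.choice(alphabet)
        points.append("".join(chars))
        labels.append(j)
    return seeds, points, labels
```

Under the Hamming metric, these points form rough "spheres" around the seed strings, which is exactly the engineered structure at issue.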

@jnothman
Member

It's not extremely persuasive if you have to intentionally engineer spheres in non-Euclidean space...?

@znd4
Author

znd4 commented Feb 27, 2019

Haha I began to worry that that was the case. I'd be personally sad to see this PR thrown out (for purely biased selfish reasons), but it's possible that this should be closed until someone can find a persuasive use-case for KMedoids.

@jnothman
Member

jnothman commented Feb 28, 2019 via email

@EggsBenedict

EggsBenedict commented Mar 20, 2019

I'd personally really like to see this move forward, too (and a k-means implementation that allows you to specify a distance metric). For K-Medoids specifically, I've been applying it to datasets where the cluster center should always be tied to a specific observation, since that observation is later evaluated (by a human) for conceptual similarity.

> Textual data doesn't seem appropriate for a clusterer where you need to set the number of clusters.

Which is a great point! In this case, approximating K accurately is very important. Despite its issues with scaling, the algorithm does well with noise and seems to converge quickly and relatively consistently.

@jnothman
Member

jnothman commented Mar 20, 2019 via email

@EggsBenedict

In this particular case, the number of identified groups is still a derived quantity, based on a separate model acting on a vector representation of the text data, so I think it would still be considered clustering. But I do think there's truth to what you're saying; it's an edge case, for sure...

@jnothman
Member

jnothman commented Mar 21, 2019 via email

@rth
Member

rth commented Mar 22, 2019

> would this be a good candidate for extras if we are happy with the implementation but holding out on clear justification?

I think it would be.

We recently created https://github.com/scikit-learn-contrib/scikit-learn-extra as the place to put good-quality scikit-learn contributions that don't get merged for one reason or another. This would allow users to start using such implementations; then, if they gain enough popularity on practical problems, we can consider merging them back into scikit-learn. In any case, that would be better than a pending PR.

@zdog234 Would you be interested in making a contribution with this PR there?

BTW, has anyone looked at "A simple and fast algorithm for K-medoids clustering" by Park & Jun (2009)? After a superficial read it looks like they improve upon the original Kaufman & Rousseeuw (1987) paper in particular with respect to scalability (and it has lots of citations).

@adrinjalali
Member

Opened scikit-learn-contrib/scikit-learn-extra#8, closing this one.

@znd4
Author

znd4 commented Mar 30, 2019

I'd definitely be interested in porting this over to scikit-learn-extra! I also wouldn't mind trying to implement an updated algorithm.

@AvantiShri

One use-case for k-medoids is finding "exemplars" (i.e. finding a set of datapoints that can be used as representatives of the full set). Using it for clustering itself may be less valuable than simply obtaining exemplars.
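As a sketch of what exemplar selection looks like, here is a greedy seeding in the style of PAM's BUILD step, working on a precomputed distance matrix. This is a minimal illustration under assumed names (a full PAM would refine it with SWAP passes), not code from the PR:

```python
def build_exemplars(dist, k):
    """Greedy BUILD-style selection of k exemplars from an n x n
    distance matrix (list of lists). Illustrative sketch only."""
    n = len(dist)
    # First exemplar: the point with minimal total distance to all others.
    exemplars = [min(range(n), key=lambda m: sum(dist[m]))]
    # Distance from each point to its nearest exemplar so far.
    nearest = list(dist[exemplars[0]])
    while len(exemplars) < k:
        # Add the point that most reduces the total nearest-exemplar distance.
        best = min((c for c in range(n) if c not in exemplars),
                   key=lambda c: sum(min(nearest[i], dist[c][i]) for i in range(n)))
        exemplars.append(best)
        nearest = [min(nearest[i], dist[best][i]) for i in range(n)]
    return exemplars
```

Full PAM would follow this with SWAP passes that try exchanging each exemplar for each non-exemplar as long as the total distance keeps decreasing.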

currently only supports Partitioning Around Medoids (PAM). The PAM algorithm
uses a greedy search, which may fail to find the global optimum. It consists of
two alternating steps commonly called the
Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987).
Contributor

BUILD and SWAP are different. These are not assignment and update steps.


.. topic:: References:

* "Clustering by Means of Medoids"
Contributor

Currently this code does not implement this reference!

@kno10
Contributor

kno10 commented Jan 26, 2020

This implementation uses the rather poor "alternating" algorithm (which was also later reinvented by Park). It is known in the literature to produce substantially worse results than the BUILD+SWAP approach used by Kaufman and Rousseeuw.
Yes, it's faster -- but only because it gets stuck in worse solutions... try it on some non-trivial data and compare the result quality.
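For contrast, the "alternating" scheme being criticized can be sketched in a few lines. This is an illustrative sketch under assumed names, not the PR's actual code: it alternates a nearest-medoid assignment step with a per-cluster medoid update, and unlike PAM's SWAP it only ever considers candidates inside the current cluster, which is why it can stall in poorer local optima:

```python
def alternating_kmedoids(dist, k, max_iter=100):
    """Alternating (Park-style) k-medoids on a precomputed n x n
    distance matrix; illustrative sketch with naive initialization."""
    n = len(dist)
    medoids = list(range(k))  # naive init: the first k points
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        labels = [min(range(k), key=lambda j: dist[i][medoids[j]]) for i in range(n)]
        # Update step: within each cluster, pick the member minimizing
        # the total distance to the other members.
        new_medoids = []
        for j in range(k):
            members = [i for i in range(n) if labels[i] == j] or [medoids[j]]
            new_medoids.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:  # converged to a local optimum
            break
        medoids = new_medoids
    return medoids, labels
```

On easy, well-separated toy data this converges to a sensible answer; the criticism above is that on non-trivial data it converges to worse local optima than BUILD+SWAP.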

@jnothman
Member

jnothman commented Jan 26, 2020 via email

@kno10
Contributor

kno10 commented Jan 26, 2020

I have; this comment is just in case this gets revived here once more.

@kno10
Contributor

kno10 commented Nov 22, 2023

In case someone stumbles across this old issue:
There is a scikit-learn compatible kmedoids package implementing fast algorithms installable via pip install kmedoids:
https://python-kmedoids.readthedocs.io/en/latest/
