KMedoids implementation Part Three #11099
Conversation
…ded tests for KMedoids::fit() and KMedoids::fit_predict()
…Euclidean distance.
…etrics and plots the results.
…inside fit() instead. Also, a random_state constructor argument was added to KMedoids to ensure identical behavior between multiple fit() calls. Made the examples and tests comply with the PEP8 convention.
…nd transform() -- added unit tests to verify. Simplified the heuristic based on recommendation. Using just one underscore to denote private method.
…ns have been used by fit(). Moved array checks to their own private method.
…k to an example script.
- changed positionless formatting to percent notation
- fixed documentation plot
Yes, it's just that...
waited a lot of time for this k-medoids algorithm
@ybdesire, could you please help us get this released by giving us a dataset where k-medoids is a much better choice than hierarchical agglomerative clustering with the same metric?
One of the assumptions of kmeans (or kmedoids) clustering is that the data is spherically distributed. I tried to create a "spherically distributed" dataset in a space with a non-euclidean metric, but everything I came up with felt pretty contrived. @ybdesire, that might be an avenue worth pursuing.
@jnothman, does a generated dataset like that qualify? I started with k random strings as seeds, then generated points by selecting one of the seeds and randomly changing N characters, where N is drawn from a "gaussian-ish" distribution.
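For illustration, here is a minimal sketch of the generator described above: k random string seeds, with each sample produced by mutating N characters of one seed, where N is drawn from a clipped ("gaussian-ish") normal distribution. The name make_string_blobs and all parameter choices are hypothetical, not part of this PR.

```python
import string
import numpy as np

def make_string_blobs(n_seeds=3, n_samples=300, length=20, sigma=2.0, random_state=0):
    """Hypothetical generator sketching the dataset described in the comment above."""
    rng = np.random.RandomState(random_state)
    alphabet = np.array(list(string.ascii_lowercase))
    seeds = [rng.choice(alphabet, size=length) for _ in range(n_seeds)]
    samples, labels = [], []
    for _ in range(n_samples):
        label = rng.randint(n_seeds)
        chars = seeds[label].copy()
        # "gaussian-ish" number of mutated positions, clipped to [0, length]
        n_mut = int(np.clip(np.rint(abs(rng.normal(0.0, sigma))), 0, length))
        positions = rng.choice(length, size=n_mut, replace=False)
        chars[positions] = rng.choice(alphabet, size=n_mut)
        samples.append("".join(chars))
        labels.append(label)
    return samples, labels
```

Clustering such strings would then need a string metric (e.g. Hamming or Levenshtein), which is exactly the setting where a metric-agnostic k-medoids is attractive.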
It's not extremely persuasive if you have to intentionally engineer spheres in non-euclidean space...?
Haha I began to worry that that was the case. I'd be personally sad to see this PR thrown out (for purely biased selfish reasons), but it's possible that this should be closed until someone can find a persuasive use-case for KMedoids.
Well there's certainly nothing wrong with releasing kmedoids as a separate package and seeing how it gets used. That's probably better than hiding it here in a pull request. It's easy to appreciate the theoretical motivation for developing a K medoids algorithm, it's just hard to be convinced that it's practically more useful than hierarchical or density based clustering.
I'd personally really like to see this move forward, too (and a kmeans implementation that allows you to specify a distance metric). For K-Medoids specifically, I've been applying it to datasets where the cluster centroid should always be tied to a specific observation, because that observation is later used as a point of comparison to be evaluated (by a human) for conceptual similarity.
Which is a great point! In this case, approximating K accurately is very important. Despite its issues with scaling, the algorithm does well with noise and seems to converge quickly and relatively consistently.
So you want to use kmedoids less to cluster and more to identify a prototype of identified groups?
In this particular case, the quantity of identified groups is still a derived number based on a separate model acting on a vector representation of the text data so I think it still would be considered clustering but also I do think there's truth to what you're saying. It's an edge case, for sure... |
@rth would this be a good candidate for extras if we are happy with the implementation but holding out on clear justification?
I think it would be. We recently created https://github.com/scikit-learn-contrib/scikit-learn-extra as the place to put good quality scikit-learn contributions that don't get merged for one reason or another. This would allow users to start using such implementations, and if they gain enough popularity on practical problems we can consider merging them back into scikit-learn. In any case that would be better than a pending PR. @zdog234, would you be interested in making a contribution with this PR there?

BTW, has anyone looked at "A simple and fast algorithm for K-medoids clustering" by Park & Jun (2009)? After a superficial read it looks like they improve upon the original Kaufman & Rousseeuw (1987) paper in particular with respect to scalability (and it has lots of citations).
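For readers comparing the two approaches, here is a rough sketch of the alternating scheme in the Park & Jun style, operating on a precomputed distance matrix: assign every point to its nearest medoid, then move each medoid to the in-cluster point with the smallest total distance. This is an illustration under those assumptions, not the code in this PR.

```python
import numpy as np

def alternating_kmedoids(D, n_clusters, max_iter=100, random_state=0):
    """Sketch of the alternating ("Voronoi iteration") k-medoids scheme."""
    rng = np.random.RandomState(random_state)
    n = D.shape[0]
    medoids = rng.choice(n, size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assignment step: nearest medoid for every point.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: best medoid within each cluster.
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.where(labels == k)[0]
            if len(members) == 0:
                continue
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels
```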
Opened scikit-learn-contrib/scikit-learn-extra#8, closing this one.
I'd definitely be interested in porting this over to scikit-learn-extra! I also wouldn't mind trying to implement an updated algorithm.
One use-case for k-medoids is finding "exemplars" (i.e. finding a set of datapoints that can be used as representatives of the full set). Using it for clustering itself may be less valuable than simply obtaining exemplars.
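As a usage sketch of that exemplar idea, and assuming the KMedoids estimator from this PR (as later published in scikit-learn-extra) exposes cluster_centers_ and medoid_indices_, the exemplars can be pulled straight out of the training data; treat the exact import path and attribute names as assumptions to verify against scikit-learn-extra.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn_extra.cluster import KMedoids  # assumed API; pip install scikit-learn-extra

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)
km = KMedoids(n_clusters=4, metric="euclidean", random_state=0).fit(X)

# Each "cluster center" is an actual observation, directly usable as an exemplar.
exemplars = X[km.medoid_indices_]
assert np.allclose(exemplars, km.cluster_centers_)
```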
currently only supports Partitioning Around Medoids (PAM). The PAM algorithm
uses a greedy search, which may fail to find the global optimum. It consists of
two alternating steps commonly called the
Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987). |
BUILD and SWAP are different. These are not assignment and update steps.
.. topic:: References:
* "Clustering by Means of Medoids'" |
Currently this code does not implement this reference!
Instead, it implements the rather poor "Alternating" algorithm (which was also later reinvented by Park). That algorithm is known in the literature to produce substantially worse results than the BUILD+SWAP approach used by Kaufman and Rousseeuw.
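To make the distinction concrete, here is a rough sketch of the SWAP idea from Kaufman & Rousseeuw: repeatedly evaluate every medoid/non-medoid exchange and keep the single best improving swap, as opposed to the per-cluster update of the alternating scheme. This is only an illustration of the concept, not this PR's code and not a faithful reproduction of the original PAM.

```python
import numpy as np

def pam_swap(D, medoids, max_iter=100):
    """Sketch of PAM's SWAP phase on a precomputed distance matrix D."""
    medoids = list(medoids)
    for _ in range(max_iter):
        current_cost = D[:, medoids].min(axis=1).sum()
        best_cost, best_swap = current_cost, None
        # Try every (medoid, non-medoid) exchange and remember the best one.
        for i in range(len(medoids)):
            for h in range(D.shape[0]):
                if h in medoids:
                    continue
                candidate = medoids[:i] + [h] + medoids[i + 1:]
                cost = D[:, candidate].min(axis=1).sum()
                if cost < best_cost:
                    best_cost, best_swap = cost, (i, h)
        if best_swap is None:  # no improving swap: converged
            break
        medoids[best_swap[0]] = best_swap[1]
    return medoids
```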
@kno10, this has been published in scikit-learn-extra (via scikit-learn-contrib/scikit-learn-extra#12). Perhaps you should comment over there, where the same reference was included.
I have; this comment is just in case this gets revived here once more.
In case someone stumbles across this old issue:
Reference Issues/PRs
Helping @Kornel to finish up/close #7694.
What does this implement/fix? Explain your changes.
This is an attempt to finish the implementation of KMedoids clustering.
Any other comments?
Thanks to @Kornel and @terkkila for their work on this, and to everyone who's reviewed things so far. Also, thanks and apologies in advance to whoever reviews my code.