-
-
Notifications
You must be signed in to change notification settings - Fork 26k
KMedoids implementation Part Three #11099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Changes from all commits
Commits
Show all changes
104 commits
Select commit
Hold shift + click to select a range
d4537f5
Added new input arguments: clustering and distance_metric.
terkkila 5c916f8
Removed deprecated mlpy import.
terkkila 228ef33
Allowed usage of any pairwise distance metric defined in Scikit-Learn.
terkkila 19b22d7
Added unit tests. Renamed k -> n_clusters. Made some member variables…
terkkila a854442
KMedoids is now derived from BaseEstimator, and has proper mixins. Ad…
terkkila c4745c5
Added KMedoids::transform(). Update transform and predict methods and…
terkkila 8119cec
Updated initialization. Updated unit tests.
terkkila 840b713
Added naive test for KMedoids::fit(). Updated KMedoids interface.
terkkila e941bfb
Modifed according to PEP8 requirements.
terkkila 58e8e8b
Added authors and license. Changed some variable names to more descri…
terkkila abb9b64
Testing all available pairwise distance functions for fitting in a loop.
terkkila 9eff466
Added testing with iris data set.
terkkila ab97949
In unit test, comparing K-Medoids to K-Means when K-Medoids is using …
terkkila cb308f6
Refactored. Added comments.
terkkila 15a7428
Small tweaks.
terkkila ccbc756
Testing the initialization method for K-Medoids. Unit test coverage 91%.
terkkila ab4ae4e
Adding example script, which runs K-Medoids with different distance m…
terkkila 61df1f6
Added checks for input data, which also convert the data if needed.
terkkila c7b1164
Added try/catch around importing exceptions.
terkkila af5abb0
Added try/catch around importing exceptions in testing script of K-Me…
terkkila 935f9aa
Made parts compatible with python 3.4 while retaining compatibility w…
terkkila 216601d
Moved init args checcking to separate function, which is called from …
terkkila 135d3cf
Using 4 spaces to replace tab. Using check_is_fitted() in predict() a…
terkkila 65ae85f
Added max_iter input argument for KMedoids.
terkkila 0c92427
Removed importing of ValueError.
terkkila b308a35
Added self.n_iter_ member to KMedoids to keep track how many iteratio…
terkkila 47844d5
Inlined the function to check if KMedoids is fitted.
terkkila 2b3c697
First version of KMedoids documentation. Documentation includes a lin…
terkkila 63a899d
ENH: K-Medoids fixes
d9d92d2
ENH: KMedoids python 2.6 compatibility
5cfb565
ENH: K-Medoids review
ffa7015
ENH: K-medoids review
a5634ce
ENH: K-medoids flake8 fix
c1a6dbb
ENH: K-Medoids - deterministic random init
f06f87d
ENH: KMedoids extended test
03a90eb
ENH: K-Medoids review
fd342e0
ENH: K-medoids: flake fix
5dfcda3
ENH: K-Medoids review
cddf289
ENH: K-Medoids cluster distances calculation fix
7492c50
ENH: KMedoids flake8 fix
7dcab06
ENH: KMedoids - doc fix
65e09e0
ENH: KMedoids docstring fix
28553d1
ENH: KMedoids doc fix
dee489a
Merge remote-tracking branch 'upstream/master' into tk-kmedoids
5f2dbaa
ENH: KMedoids - review
3c453c3
ENH: KMedoids: test fix
4a609c8
ENH: KMedoids - review
069e7dc
ENH: KMedoids - review
8395ef3
ENH: K-Medoids sphinx gallery fix
edba303
Merge branch 'origin/master' into tk-kmedoids
3b5b055
ENH: KMedoids: utf header
0f6148e
ENH: KMedoids - review
bf05a4a
ENH: KMedoids - review
ddd619d
Merge remote-tracking branch 'upstream/master' into tk-kmedoids
Kornel a6c61b9
DOC KMedoids author email change
Kornel 3bcca5f
Changed random_state to local variable instead of class atribute.
znd4 bef5466
Moved RandomSeed initialization to be inside each test
znd4 e214a6b
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
znd4 58733f4
Implementd KMedoids++ initialization and precomputed kmedoids and mad…
znd4 4886c2a
Made documentation changes
znd4 8587a89
changed None to np.newaxis
znd4 9cb3b63
removed unnecessary line
znd4 8b94438
removed redundant example
znd4 e34abb2
Dedent notes section of kmedoids
znd4 1d9141c
Add test for empty cluster warning
znd4 91964c0
Added test for _update_medoid_idxs with empty cluster
znd4 f7af0a9
Made small change to iris test
znd4 55b9b83
Made slight corrections to test_iris comment
znd4 8e069b2
Made small tweaks to test_kmedoids_fit_naive_with_all_pairwise_distan…
znd4 2b1e590
Swapped range(0,n_clusters) for n_clusters-1
znd4 cb35af2
Updated empty cluster warning to tell user that the cluster medoid mi…
znd4 756accc
added to init_methods in _check_init_args
znd4 cb6b8f9
Added k-medoids++ to init_args test
znd4 be3dcf0
Changed test_kmedoids_fit_ functions to only test Euclidean
znd4 5cea0b8
Merge branch 'master' of git://github.com/scikit-learn/scikit-learn i…
znd4 d5109c7
Added tests for 'random' init and 'heuristic' init
znd4 6294566
Changed self.n_iter_ to n_iter_ (not referenced anywhere but in this …
znd4 4b1738a
Added some instance vars to __init__
znd4 b0df68b
Cleaned up some code
znd4 df25a76
Cleaned up my tests/added docstrings.
znd4 8f71d0a
Changed import order to match pep
znd4 731f624
assert_greater import removed (no longer needed)
znd4 11eb14e
Removed unused alias "as w"
znd4 7193bfc
Removed unused rng definition in one test
znd4 2108cdd
Added spaces after commas and aligned arrays
znd4 33b9f17
Added docstring to test_kmedoids_iris
znd4 70053cc
Revert "Merge branch 'master' of git://github.com/scikit-learn/scikit…
znd4 e09f3af
Revert "Merge branch 'master' of https://github.com/scikit-learn/scik…
znd4 9b30231
Added simple (and weak) convergence test
znd4 ba721d6
Oops, forgot to commit the changes to k_medoids_.py
znd4 9fe12ab
Created test_max_iter and added assertion about convergence for each …
znd4 d5b2606
Changed _iter_step back to n_iter_
znd4 1b79d02
Removed the initialization of non-parameter arguments in KMedoids.__i…
znd4 04bdfef
Add is_fitted check to predict
znd4 bb03e02
Added random_state=rng to test_max_iter
znd4 6cdbf85
Pep8-ified test_k_medoids.py
znd4 50e58de
Fixed PEP8 errors
znd4 40ae80e
Revert "Revert "Merge branch 'master' of https://github.com/scikit-le…
znd4 75bb60f
Revert "Revert "Merge branch 'master' of git://github.com/scikit-lear…
znd4 0998613
Made some requested changes to kmedoids documentation
znd4 088d6ba
Added KMeans to clustering comparison
znd4 194215b
Added KMedoids to clustering comparison
znd4 3762b4e
Remove/ignore .vscode
znd4 7abf5b0
Fixed Linter errors.
znd4 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -54,6 +54,13 @@ Overview of clustering methods | |
- General-purpose, even cluster size, flat geometry, not too many clusters | ||
- Distances between points | ||
|
||
* - :ref:`K-Medoids <k_medoids>` | ||
- number of clusters | ||
- Moderate ``n_samples``, medium ``n_clusters`` | ||
- General-purpose, uneven cluster size, flat geometry, | ||
not too many clusters | ||
- Any pairwise distance | ||
|
||
* - :ref:`Affinity propagation <affinity_propagation>` | ||
- damping, sample preference | ||
- Not scalable with n_samples | ||
|
@@ -278,6 +285,64 @@ small, as shown in the example and cited reference. | |
D. Sculley, *Proceedings of the 19th international conference on World | ||
wide web* (2010) | ||
|
||
.. _k_medoids: | ||
|
||
K-Medoids | ||
========= | ||
|
||
:class:`KMedoids` is related to the :class:`KMeans` algorithm. While | ||
:class:`KMeans` tries to minimize the within cluster sum-of-squares, | ||
:class:`KMedoids` tries to minimize the sum of distances between each point and | ||
the medoid of its cluster. The medoid is a data point (unlike the centroid) | ||
which has least total distance to the other members of its cluster. The use of | ||
a data point to represent each cluster's center allows the use of any distance | ||
metric for clustering. | ||
|
||
:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans` | ||
as it will choose one of the cluster members as the medoid while | ||
:class:`KMeans` will move the center of the cluster towards the outlier which | ||
might in turn move other points away from the cluster centre. | ||
|
||
:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans` | ||
except that the Manhattan Median is used for each cluster center instead of | ||
the centroid. K-Medians is robust to outliers, but it is limited to the | ||
Manhattan Distance metric and, similar to :class:`KMeans`, it does not guarantee | ||
that the center of each cluster will be a member of the original dataset. | ||
|
||
The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number | ||
of samples, :math:`T` is the number of iterations and :math:`K` is the number of | ||
clusters. This makes it more suitable for smaller datasets in comparison to | ||
:class:`KMeans` which is :math:`O(N K T)`. | ||
|
||
.. topic:: Examples: | ||
|
||
* :ref:`sphx_glr_auto_examples_cluster_plot_kmedoids_digits.py`: Applying K-Medoids on digits | ||
with various distance metrics. | ||
|
||
|
||
**Algorithm description:** | ||
There are several algorithms to compute K-Medoids, though :class:`KMedoids` | ||
currently only supports Partitioning Around Medoids (PAM). The PAM algorithm | ||
uses a greedy search, which may fail to find the global optimum. It consists of | ||
two alternating steps commonly called the | ||
Assignment and Update steps (BUILD and SWAP in Kaufmann and Rousseeuw, 1987). | ||
|
||
PAM works as follows: | ||
|
||
* Initialize: Select ``n_clusters`` from the dataset as the medoids using | ||
a heuristic, random, or k-medoids++ approach (configurable using the ``init`` parameter). | ||
* Assignment step: assign each element from the dataset to the closest medoid. | ||
* Update step: Identify the new medoid of each cluster. | ||
* Repeat the assignment and update step while the medoids keep changing or | ||
maximum number of iterations ``max_iter`` is reached. | ||
|
||
.. topic:: References: | ||
|
||
* "Clustering by Means of Medoids'" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently this code does not implement this reference! |
||
Kaufman, L. and Rousseeuw, P.J., | ||
Statistical Data Analysis Based on the L1–Norm and Related Methods, edited | ||
by Y. Dodge, North-Holland, 405–416. 1987 | ||
|
||
.. _affinity_propagation: | ||
|
||
Affinity Propagation | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# -*- coding: utf-8 -*- | ||
""" | ||
============================================================= | ||
A demo of K-Medoids clustering on the handwritten digits data | ||
============================================================= | ||
|
||
In this example we compare different pairwise distance | ||
metrics for K-Medoids. | ||
|
||
""" | ||
import numpy as np | ||
import matplotlib.pyplot as plt | ||
|
||
from collections import namedtuple | ||
from sklearn.cluster import KMedoids, KMeans | ||
from sklearn.datasets import load_digits | ||
from sklearn.decomposition import PCA | ||
from sklearn.preprocessing import scale | ||
|
||
print(__doc__) | ||
|
||
# Authors: Timo Erkkilä <timo.erkkila@gmail.com> | ||
# Antti Lehmussola <antti.lehmussola@gmail.com> | ||
# Kornel Kiełczewski <kornel.mail@gmail.com> | ||
# License: BSD 3 clause | ||
|
||
np.random.seed(42) | ||
|
||
digits = load_digits() | ||
data = scale(digits.data) | ||
n_digits = len(np.unique(digits.target)) | ||
|
||
reduced_data = PCA(n_components=2).fit_transform(data) | ||
|
||
# Step size of the mesh. Decrease to increase the quality of the VQ. | ||
h = .02 # point in the mesh [x_min, m_max]x[y_min, y_max]. | ||
|
||
# Plot the decision boundary. For that, we will assign a color to each | ||
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1 | ||
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1 | ||
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) | ||
|
||
plt.figure() | ||
plt.clf() | ||
|
||
plt.suptitle("Comparing multiple K-Medoids metrics to K-Means and each other", | ||
fontsize=14) | ||
|
||
Algorithm = namedtuple('ClusterAlgorithm', ['model', 'description']) | ||
|
||
selected_models = [ | ||
Algorithm(KMedoids(metric='manhattan', | ||
n_clusters=n_digits), | ||
'KMedoids (manhattan)'), | ||
Algorithm(KMedoids(metric='euclidean', | ||
n_clusters=n_digits), | ||
'KMedoids (euclidean)'), | ||
Algorithm(KMedoids(metric='cosine', | ||
n_clusters=n_digits), | ||
'KMedoids (cosine)'), | ||
Algorithm(KMeans(n_clusters=n_digits), | ||
'KMeans') | ||
] | ||
|
||
plot_rows = int(np.ceil(len(selected_models) / 2.0)) | ||
plot_cols = 2 | ||
|
||
for i, (model, description) in enumerate(selected_models): | ||
|
||
# Obtain labels for each point in mesh. Use last trained model. | ||
model.fit(reduced_data) | ||
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]) | ||
|
||
# Put the result into a color plot | ||
Z = Z.reshape(xx.shape) | ||
plt.subplot(plot_cols, plot_rows, i + 1) | ||
plt.imshow(Z, interpolation='nearest', | ||
extent=(xx.min(), xx.max(), yy.min(), yy.max()), | ||
cmap=plt.cm.Paired, | ||
aspect='auto', origin='lower') | ||
|
||
plt.plot(reduced_data[:, 0], | ||
reduced_data[:, 1], | ||
'k.', markersize=2, | ||
alpha=0.3, | ||
) | ||
# Plot the centroids as a white X | ||
centroids = model.cluster_centers_ | ||
plt.scatter(centroids[:, 0], centroids[:, 1], | ||
marker='x', s=169, linewidths=3, | ||
color='w', zorder=10) | ||
plt.title(description) | ||
plt.xlim(x_min, x_max) | ||
plt.ylim(y_min, y_max) | ||
plt.xticks(()) | ||
plt.yticks(()) | ||
|
||
plt.show() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BUILD and SWAP are different. These are not assignment and update steps.