
KMedoids implementation Part Three #11099

Closed
wants to merge 104 commits into from
Changes from all commits

Commits (104)
d4537f5
Added new input arguments: clustering and distance_metric.
terkkila Jul 31, 2015
5c916f8
Removed deprecated mlpy import.
terkkila Jul 31, 2015
228ef33
Allowed usage of any pairwise distance metric defined in Scikit-Learn.
terkkila Aug 1, 2015
19b22d7
Added unit tests. Renamed k -> n_clusters. Made some member variables…
terkkila Aug 2, 2015
a854442
KMedoids is now derived from BaseEstimator, and has proper mixins. Ad…
terkkila Aug 2, 2015
c4745c5
Added KMedoids::transform(). Update transform and predict methods and…
terkkila Aug 2, 2015
8119cec
Updated initialization. Updated unit tests.
terkkila Aug 2, 2015
840b713
Added naive test for KMedoids::fit(). Updated KMedoids interface.
terkkila Aug 2, 2015
e941bfb
Modified according to PEP8 requirements.
terkkila Aug 2, 2015
58e8e8b
Added authors and license. Changed some variable names to more descri…
terkkila Aug 2, 2015
abb9b64
Testing all available pairwise distance functions for fitting in a loop.
terkkila Aug 2, 2015
9eff466
Added testing with iris data set.
terkkila Aug 3, 2015
ab97949
In unit test, comparing K-Medoids to K-Means when K-Medoids is using …
terkkila Aug 3, 2015
cb308f6
Refactored. Added comments.
terkkila Aug 3, 2015
15a7428
Small tweaks.
terkkila Aug 3, 2015
ccbc756
Testing the initialization method for K-Medoids. Unit test coverage 91%.
terkkila Aug 3, 2015
ab4ae4e
Adding example script, which runs K-Medoids with different distance m…
terkkila Aug 4, 2015
61df1f6
Added checks for input data, which also convert the data if needed.
terkkila Aug 4, 2015
c7b1164
Added try/catch around importing exceptions.
terkkila Aug 5, 2015
af5abb0
Added try/catch around importing exceptions in testing script of K-Me…
terkkila Aug 5, 2015
935f9aa
Made parts compatible with python 3.4 while retaining compatibility w…
terkkila Aug 6, 2015
216601d
Moved init args checking to separate function, which is called from …
terkkila Aug 6, 2015
135d3cf
Using 4 spaces to replace tab. Using check_is_fitted() in predict() a…
terkkila Aug 6, 2015
65ae85f
Added max_iter input argument for KMedoids.
terkkila Aug 6, 2015
0c92427
Removed importing of ValueError.
terkkila Aug 6, 2015
b308a35
Added self.n_iter_ member to KMedoids to keep track how many iteratio…
terkkila Aug 7, 2015
47844d5
Inlined the function to check if KMedoids is fitted.
terkkila Aug 10, 2015
2b3c697
First version of KMedoids documentation. Documentation includes a lin…
terkkila Mar 13, 2016
63a899d
ENH: K-Medoids fixes
Oct 18, 2016
d9d92d2
ENH: KMedoids python 2.6 compatibility
Oct 25, 2016
5cfb565
ENH: K-Medoids review
Oct 31, 2016
ffa7015
ENH: K-medoids review
Oct 31, 2016
a5634ce
ENH: K-medoids flake8 fix
Oct 31, 2016
c1a6dbb
ENH: K-Medoids - deterministic random init
Oct 31, 2016
f06f87d
ENH: KMedoids extended test
Nov 2, 2016
03a90eb
ENH: K-Medoids review
Nov 17, 2016
fd342e0
ENH: K-medoids: flake fix
Nov 18, 2016
5dfcda3
ENH: K-Medoids review
Jan 3, 2017
cddf289
ENH: K-Medoids cluster distances calculation fix
Jan 4, 2017
7492c50
ENH: KMedoids flake8 fix
Jan 4, 2017
7dcab06
ENH: KMedoids - doc fix
Jan 4, 2017
65e09e0
ENH: KMedoids docstring fix
Jan 4, 2017
28553d1
ENH: KMedoids doc fix
Jan 5, 2017
dee489a
Merge remote-tracking branch 'upstream/master' into tk-kmedoids
Jan 5, 2017
5f2dbaa
ENH: KMedoids - review
Jan 9, 2017
3c453c3
ENH: KMedoids: test fix
Jan 9, 2017
4a609c8
ENH: KMedoids - review
Jan 9, 2017
069e7dc
ENH: KMedoids - review
Nov 25, 2017
8395ef3
ENH: K-Medoids sphinx gallery fix
Nov 29, 2017
edba303
Merge branch 'origin/master' into tk-kmedoids
Nov 29, 2017
3b5b055
ENH: KMedoids: utf header
Nov 29, 2017
0f6148e
ENH: KMedoids - review
Dec 14, 2017
bf05a4a
ENH: KMedoids - review
Dec 14, 2017
ddd619d
Merge remote-tracking branch 'upstream/master' into tk-kmedoids
Kornel Mar 19, 2018
a6c61b9
DOC KMedoids author email change
Kornel Apr 15, 2018
3bcca5f
Changed random_state to local variable instead of class attribute.
znd4 Apr 17, 2018
bef5466
Moved RandomSeed initialization to be inside each test
znd4 Apr 17, 2018
e214a6b
Merge branch 'master' of https://github.com/scikit-learn/scikit-learn…
znd4 Apr 18, 2018
58733f4
Implemented KMedoids++ initialization and precomputed kmedoids and mad…
znd4 Apr 27, 2018
4886c2a
Made documentation changes
znd4 Apr 28, 2018
8587a89
changed None to np.newaxis
znd4 Apr 29, 2018
9cb3b63
removed unnecessary line
znd4 Apr 29, 2018
8b94438
removed redundant example
znd4 Apr 29, 2018
e34abb2
Dedent notes section of kmedoids
znd4 Apr 29, 2018
1d9141c
Add test for empty cluster warning
znd4 May 3, 2018
91964c0
Added test for _update_medoid_idxs with empty cluster
znd4 May 5, 2018
f7af0a9
Made small change to iris test
znd4 May 5, 2018
55b9b83
Made slight corrections to test_iris comment
znd4 May 5, 2018
8e069b2
Made small tweaks to test_kmedoids_fit_naive_with_all_pairwise_distan…
znd4 May 5, 2018
2b1e590
Swapped range(0,n_clusters) for n_clusters-1
znd4 May 5, 2018
cb35af2
Updated empty cluster warning to tell user that the cluster medoid mi…
znd4 May 7, 2018
756accc
added to init_methods in _check_init_args
znd4 May 8, 2018
cb6b8f9
Added k-medoids++ to init_args test
znd4 May 8, 2018
be3dcf0
Changed test_kmedoids_fit_ functions to only test Euclidean
znd4 May 8, 2018
5cea0b8
Merge branch 'master' of git://github.com/scikit-learn/scikit-learn i…
znd4 May 12, 2018
d5109c7
Added tests for 'random' init and 'heuristic' init
znd4 May 12, 2018
6294566
Changed self.n_iter_ to n_iter_ (not referenced anywhere but in this …
znd4 May 13, 2018
4b1738a
Added some instance vars to __init__
znd4 May 13, 2018
b0df68b
Cleaned up some code
znd4 May 13, 2018
df25a76
Cleaned up my tests/added docstrings.
znd4 May 13, 2018
8f71d0a
Changed import order to match pep
znd4 May 14, 2018
731f624
assert_greater import removed (no longer needed)
znd4 May 14, 2018
11eb14e
Removed unused alias "as w"
znd4 May 14, 2018
7193bfc
Removed unused rng definition in one test
znd4 May 14, 2018
2108cdd
Added spaces after commas and aligned arrays
znd4 May 14, 2018
33b9f17
Added docstring to test_kmedoids_iris
znd4 May 14, 2018
70053cc
Revert "Merge branch 'master' of git://github.com/scikit-learn/scikit…
znd4 May 14, 2018
e09f3af
Revert "Merge branch 'master' of https://github.com/scikit-learn/scik…
znd4 May 14, 2018
9b30231
Added simple (and weak) convergence test
znd4 May 15, 2018
ba721d6
Oops, forgot to commit the changes to k_medoids_.py
znd4 May 15, 2018
9fe12ab
Created test_max_iter and added assertion about convergence for each …
znd4 May 16, 2018
d5b2606
Changed _iter_step back to n_iter_
znd4 May 16, 2018
1b79d02
Removed the initialization of non-parameter arguments in KMedoids.__i…
znd4 May 16, 2018
04bdfef
Add is_fitted check to predict
znd4 May 16, 2018
bb03e02
Added random_state=rng to test_max_iter
znd4 May 16, 2018
6cdbf85
Pep8-ified test_k_medoids.py
znd4 May 16, 2018
50e58de
Fixed PEP8 errors
znd4 May 16, 2018
40ae80e
Revert "Revert "Merge branch 'master' of https://github.com/scikit-le…
znd4 May 17, 2018
75bb60f
Revert "Revert "Merge branch 'master' of git://github.com/scikit-lear…
znd4 May 17, 2018
0998613
Made some requested changes to kmedoids documentation
znd4 May 24, 2018
088d6ba
Added KMeans to clustering comparison
znd4 May 31, 2018
194215b
Added KMedoids to clustering comparison
znd4 May 31, 2018
3762b4e
Remove/ignore .vscode
znd4 Jun 1, 2018
7abf5b0
Fixed Linter errors.
znd4 Jun 1, 2018
2 changes: 2 additions & 0 deletions .gitignore
@@ -39,6 +39,7 @@ doc/samples
*.prof
.tox/
.coverage
.vscode

lfw_preprocessed/
nips2010_pdf/
@@ -54,6 +55,7 @@ benchmarks/bench_covertype_data/
*.prefs
.pydevproject
.idea
*.iml

*.c
*.cpp
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -100,6 +100,7 @@ Classes
cluster.DBSCAN
cluster.FeatureAgglomeration
cluster.KMeans
cluster.KMedoids
cluster.MiniBatchKMeans
cluster.MeanShift
cluster.SpectralClustering
65 changes: 65 additions & 0 deletions doc/modules/clustering.rst
@@ -54,6 +54,13 @@ Overview of clustering methods
- General-purpose, even cluster size, flat geometry, not too many clusters
- Distances between points

* - :ref:`K-Medoids <k_medoids>`
- number of clusters
- Moderate ``n_samples``, medium ``n_clusters``
- General-purpose, uneven cluster size, flat geometry,
not too many clusters
- Any pairwise distance

* - :ref:`Affinity propagation <affinity_propagation>`
- damping, sample preference
- Not scalable with n_samples
@@ -278,6 +285,64 @@ small, as shown in the example and cited reference.
D. Sculley, *Proceedings of the 19th international conference on World
wide web* (2010)

.. _k_medoids:

K-Medoids
=========

:class:`KMedoids` is related to the :class:`KMeans` algorithm. While
:class:`KMeans` tries to minimize the within-cluster sum-of-squares,
:class:`KMedoids` tries to minimize the sum of distances between each point and
the medoid of its cluster. Unlike the centroid, the medoid is an actual data
point: the cluster member with the least total distance to the other members
of its cluster. Using a data point to represent each cluster's center allows
any distance metric to be used for clustering.
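The centroid/medoid distinction can be sketched in a few lines of NumPy (illustrative toy data, not code from this PR):

```python
import numpy as np

# Toy 1-D cluster with an outlier at 100 (hypothetical data).
points = np.array([[0.0], [1.0], [2.0], [100.0]])

# Centroid (K-Means): the mean, which is pulled toward the outlier
# and need not coincide with any data point.
centroid = points.mean(axis=0)

# Medoid (K-Medoids): the cluster member with the least total distance
# to the other members -- always an actual data point.
pairwise = np.abs(points - points.T)          # (4, 4) Manhattan distances
medoid = points[pairwise.sum(axis=1).argmin()]

print(centroid)  # [25.75] -- dragged toward the outlier
print(medoid)    # [1.] -- stays inside the dense part of the cluster
```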

:class:`KMedoids` can be more robust to noise and outliers than :class:`KMeans`
as it will choose one of the cluster members as the medoid while
:class:`KMeans` will move the center of the cluster towards the outlier which
might in turn move other points away from the cluster center.

:class:`KMedoids` is also different from K-Medians, which is analogous to :class:`KMeans`
except that the Manhattan median is used for each cluster center instead of
the centroid. K-Medians is robust to outliers, but it is limited to the
Manhattan distance metric and, like :class:`KMeans`, it does not guarantee
that the center of each cluster will be a member of the original dataset.

The complexity of K-Medoids is :math:`O(N^2 K T)` where :math:`N` is the number
of samples, :math:`T` is the number of iterations and :math:`K` is the number of
clusters. This makes it better suited to smaller datasets than
:class:`KMeans`, which is :math:`O(N K T)`.

.. topic:: Examples:

* :ref:`sphx_glr_auto_examples_cluster_plot_kmedoids_digits.py`: Applying K-Medoids on digits
with various distance metrics.


**Algorithm description:**
There are several algorithms to compute K-Medoids, though :class:`KMedoids`
currently only supports Partitioning Around Medoids (PAM). The PAM algorithm
uses a greedy search, which may fail to find the global optimum. It consists of
two alternating steps commonly called the
Assignment and Update steps (BUILD and SWAP in Kaufman and Rousseeuw, 1987).
[Review comment from a contributor: BUILD and SWAP are different. These are not assignment and update steps.]


PAM works as follows:

* Initialize: Select ``n_clusters`` points from the dataset as the initial
medoids using a heuristic, random, or k-medoids++ approach (configurable
using the ``init`` parameter).
* Assignment step: assign each element from the dataset to the closest medoid.
* Update step: Identify the new medoid of each cluster.
* Repeat the assignment and update steps until the medoids stop changing or
the maximum number of iterations ``max_iter`` is reached.
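The alternation above can be sketched in plain NumPy. Note this is an illustrative sketch with a hypothetical name (`pam_sketch`), random init, and Manhattan distances only; the PR's estimator adds input validation, k-medoids++ init, and empty-cluster handling that this sketch omits:

```python
import numpy as np

def pam_sketch(X, n_clusters, max_iter=300, seed=None):
    """Bare-bones PAM alternation: assignment step + update step.

    Illustrative only -- no empty-cluster handling or input checks.
    """
    rng = np.random.default_rng(seed)
    # Precompute all pairwise Manhattan distances (any metric would do).
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=-1)
    # Initialize: pick n_clusters points at random as the medoids.
    medoids = rng.choice(len(X), size=n_clusters, replace=False)
    for _ in range(max_iter):
        # Assignment step: each point joins its closest medoid's cluster.
        labels = D[:, medoids].argmin(axis=1)
        # Update step: in each cluster, the new medoid is the member with
        # the smallest total distance to the rest of the cluster.
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.where(labels == k)[0]
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[k] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):  # medoids stopped changing
            break
        medoids = new_medoids
    return medoids, labels

# Two well-separated toy blobs end up in different clusters.
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
medoids, labels = pam_sketch(X, n_clusters=2, seed=0)
```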

.. topic:: References:

* "Clustering by Means of Medoids"
Kaufman, L. and Rousseeuw, P.J.,
Statistical Data Analysis Based on the L1–Norm and Related Methods, edited
by Y. Dodge, North-Holland, 405–416, 1987

[Review comment from a contributor: Currently this code does not implement this reference!]

.. _affinity_propagation:

Affinity Propagation
5 changes: 5 additions & 0 deletions examples/cluster/plot_cluster_comparison.py
@@ -109,6 +109,10 @@
# ============
ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
two_means = cluster.MiniBatchKMeans(n_clusters=params['n_clusters'])
kmedoids = cluster.KMedoids(
n_clusters=params['n_clusters'],
init='k-medoids++',
metric='euclidean')
ward = cluster.AgglomerativeClustering(
n_clusters=params['n_clusters'], linkage='ward',
connectivity=connectivity)
@@ -127,6 +131,7 @@

clustering_algorithms = (
('MiniBatchKMeans', two_means),
('KMedoids', kmedoids),
('AffinityPropagation', affinity_propagation),
('MeanShift', ms),
('SpectralClustering', spectral),
98 changes: 98 additions & 0 deletions examples/cluster/plot_kmedoids_digits.py
@@ -0,0 +1,98 @@
# -*- coding: utf-8 -*-
"""
=============================================================
A demo of K-Medoids clustering on the handwritten digits data
=============================================================

In this example we compare different pairwise distance
metrics for K-Medoids.

"""
import numpy as np
import matplotlib.pyplot as plt

from collections import namedtuple
from sklearn.cluster import KMedoids, KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

print(__doc__)

# Authors: Timo Erkkilä <timo.erkkila@gmail.com>
# Antti Lehmussola <antti.lehmussola@gmail.com>
# Kornel Kiełczewski <kornel.mail@gmail.com>
# License: BSD 3 clause

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)
n_digits = len(np.unique(digits.target))

reduced_data = PCA(n_components=2).fit_transform(data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02  # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

plt.figure()
plt.clf()

plt.suptitle("Comparing multiple K-Medoids metrics to K-Means and each other",
fontsize=14)

Algorithm = namedtuple('ClusterAlgorithm', ['model', 'description'])

selected_models = [
Algorithm(KMedoids(metric='manhattan',
n_clusters=n_digits),
'KMedoids (manhattan)'),
Algorithm(KMedoids(metric='euclidean',
n_clusters=n_digits),
'KMedoids (euclidean)'),
Algorithm(KMedoids(metric='cosine',
n_clusters=n_digits),
'KMedoids (cosine)'),
Algorithm(KMeans(n_clusters=n_digits),
'KMeans')
]

plot_rows = int(np.ceil(len(selected_models) / 2.0))
plot_cols = 2

for i, (model, description) in enumerate(selected_models):

# Obtain labels for each point in mesh. Use last trained model.
model.fit(reduced_data)
Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.subplot(plot_cols, plot_rows, i + 1)
plt.imshow(Z, interpolation='nearest',
extent=(xx.min(), xx.max(), yy.min(), yy.max()),
cmap=plt.cm.Paired,
aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0],
reduced_data[:, 1],
'k.', markersize=2,
alpha=0.3,
)
# Plot the centroids as a white X
centroids = model.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
marker='x', s=169, linewidths=3,
color='w', zorder=10)
plt.title(description)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

plt.show()
2 changes: 2 additions & 0 deletions sklearn/cluster/__init__.py
@@ -10,6 +10,7 @@
from .hierarchical import (ward_tree, AgglomerativeClustering, linkage_tree,
FeatureAgglomeration)
from .k_means_ import k_means, KMeans, MiniBatchKMeans
from .k_medoids_ import KMedoids
from .dbscan_ import dbscan, DBSCAN
from .bicluster import SpectralBiclustering, SpectralCoclustering
from .birch import Birch
@@ -19,6 +20,7 @@
'Birch',
'DBSCAN',
'KMeans',
'KMedoids',
'FeatureAgglomeration',
'MeanShift',
'MiniBatchKMeans',
7 changes: 7 additions & 0 deletions sklearn/cluster/k_means_.py
@@ -837,6 +837,13 @@ class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
For large scale learning (say n_samples > 10k) MiniBatchKMeans is
probably much faster than the default batch implementation.


KMedoids
KMedoids tries to minimize the sum of distances between each point and
the medoid of its cluster. Unlike in KMeans, the medoid is a data point.
The use of a data point to represent each cluster's center allows the
use of any distance metric for clustering.

Notes
------
The k-means problem is solved using Lloyd's algorithm.