
[MRG]Clusters-Class auto-match #10604

Closed
wants to merge 43 commits into from
Changes from all commits
43 commits
2213018
adding functionality to allow more clustering metrics
LPugens Feb 7, 2018
ef2ec2b
formatting compliant
LPugens Feb 7, 2018
5fab323
formatting and adding an functionality example
LPugens Feb 7, 2018
8402cf2
adding comma to comply with formatting
LPugens Feb 7, 2018
e97ef2b
Fixed doc generator
LPugens Feb 7, 2018
625ced6
more modifications to be compliant with coding guidelines
LPugens Feb 7, 2018
46fd79f
fixing doc bug
LPugens Feb 8, 2018
d1bdae1
fixing doctest and adopting max_assignment_score name as proposed by …
LPugens Feb 8, 2018
07be65c
fixing examle again
LPugens Feb 8, 2018
2b01392
Doc fixing to pass Travis verification
LPugens Feb 8, 2018
b5116b7
Yet another doc fix
LPugens Feb 8, 2018
81804bf
fixing travis version of scipy
LPugens Feb 8, 2018
4ba0cac
allowing for any number of clusters and classes
LPugens Feb 8, 2018
d4aab23
Merge branch 'clustering_match' of https://github.com/LPugens/scikit-…
LPugens Feb 8, 2018
29b7154
allowing for any number of clusters and classes and undoing travis sc…
LPugens Feb 8, 2018
2588538
better implementation
LPugens Feb 10, 2018
6e436d8
fixing test result
LPugens Feb 11, 2018
c3d1ea5
fixing pep8 formatting error
LPugens Feb 11, 2018
10892dc
fixing test error
LPugens Feb 11, 2018
8e12b2a
sorting result set for maintaining interoperability between python 2 …
LPugens Feb 11, 2018
63dd108
adding documentation
LPugens Feb 11, 2018
c50284f
fixing doc
LPugens Feb 11, 2018
5c8c3f3
fixing doc
LPugens Feb 11, 2018
fefe91c
fixing nomenclature
LPugens Jul 1, 2018
c10f69c
Merge branch 'master' into clustering_match
LPugens Jul 1, 2018
0225c32
Merge branch 'master' into clustering_match
LPugens Jul 1, 2018
bd31fb9
fixing nomenclature
LPugens Jul 1, 2018
cd3a09d
Merge remote-tracking branch 'origin/clustering_match' into clusterin…
LPugens Jul 1, 2018
1539a0b
fixing commit bug
LPugens Jul 1, 2018
9d39163
fixing doc title
LPugens Jul 1, 2018
1e08a8b
avoinding messing with imports
LPugens Jul 1, 2018
51a68cb
changing default label nomenclature
LPugens Jul 1, 2018
01e9253
adding negative indices to the test
LPugens Jul 1, 2018
e11cd91
nomenclature fix
LPugens Jul 1, 2018
320858d
comment fix
LPugens Jul 1, 2018
0a65cc2
doc fix
LPugens Jul 1, 2018
4847df3
comment fix
LPugens Jul 1, 2018
42a7a1d
simplifying test
LPugens Jul 1, 2018
cdbe4a3
fixing name style on labels
LPugens Jul 1, 2018
1474d53
fixing name style on labels
LPugens Jul 1, 2018
f582884
fixing line length
LPugens Jul 1, 2018
7227ab5
fixing line length
LPugens Jul 2, 2018
e1686d6
fixing comment code
LPugens Jul 2, 2018
1 change: 1 addition & 0 deletions doc/modules/classes.rst
@@ -904,6 +904,7 @@ details.
metrics.davies_bouldin_score
metrics.completeness_score
metrics.cluster.contingency_matrix
metrics.cluster.map_cluster_labels
metrics.fowlkes_mallows_score
metrics.homogeneity_completeness_v_measure
metrics.homogeneity_score
71 changes: 71 additions & 0 deletions doc/modules/clustering.rst
@@ -1733,3 +1733,74 @@ Drawbacks

* `Wikipedia entry for contingency matrix
<https://en.wikipedia.org/wiki/Contingency_table>`_

.. _map_cluster_labels:

Map cluster labels
-------------------

Map cluster labels
(:func:`sklearn.metrics.cluster.map_cluster_labels`) provides a
convenient way to compute classical classification metrics, such as
:func:`sklearn.metrics.accuracy_score` and
:func:`sklearn.metrics.f1_score`, on the output of a clustering algorithm.

Here is an example::

>>> from sklearn.metrics.cluster import map_cluster_labels, adjusted_rand_score
>>> from sklearn.metrics import confusion_matrix, accuracy_score
>>> y_true = ['a'] * 1 + ['b'] * 2 + ['c'] * 20 + ['d'] * 6 + ['e'] * \
... 13 + ['f'] * 2 + ['g'] * 3 + ['h'] * 3 + ['i'] * 2 + ['j'] * 1
>>> y_pred = [6] * 1 + [2] * 2 + [0] * 6 + [2] * 10 + [8] * 4 + [1] *\
... 4 + [5] * 2 + [0] * 4 + [3] * 5 + [6] * 2 + [9] * 2 + [7] *\
... 2 + [0] * 2 + [8] * 1 + [4] * 3 + [3] * 2 + [8] * 1
>>> y_pred = map_cluster_labels(y_true, y_pred)
>>> confusion_matrix(y_true, y_pred)
array([[ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 2, 0, 0, 0, 0, 0, 0, 0],
[ 0, 0, 10, 0, 0, 0, 6, 0, 0, 4],
[ 0, 0, 0, 4, 0, 0, 0, 0, 2, 0],
[ 2, 2, 0, 0, 5, 0, 4, 0, 0, 0],
[ 0, 0, 0, 0, 0, 2, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 2, 0, 0, 1],
[ 0, 0, 0, 0, 0, 0, 0, 3, 0, 0],
[ 0, 0, 0, 0, 2, 0, 0, 0, 0, 0],
[ 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]])
>>> accuracy_score(y_true, y_pred) # doctest: +ELLIPSIS
0.52...
>>> adjusted_rand_score(y_true, y_pred) # doctest: +ELLIPSIS
0.29...

Notice that the confusion matrix above has the sum of its main diagonal
maximized: this matching of true classes to clusters yields the highest
accuracy score achievable by any relabelling of the clusters.
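
As a quick sanity check (reusing ``y_true`` and ``y_pred`` from the example
above), the accuracy equals the normalized trace of that confusion matrix::

>>> import numpy as np
>>> C = confusion_matrix(y_true, y_pred)
>>> int(np.trace(C))
28
>>> np.trace(C) / len(y_true)  # doctest: +ELLIPSIS
0.52...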

Relabelling the clusters in this way is also compatible with the usual
clustering metrics, since renaming cluster labels does not change their
results, as the ARI value above shows.

Another example::

>>> y_true = ['a', 'a', 'a', 'b', 'b', 'b']
>>> y_pred = [3, 0, 1, 1, 2, 2]
>>> map_cluster_labels(y_true, y_pred)
['DEFAULT_LABEL_1', 'a', 'DEFAULT_LABEL_0', 'DEFAULT_LABEL_0', 'b', 'b']

The above example shows what happens when the clustering method identifies
more clusters than there are true classes. *Such results must be treated
carefully*, since not all metrics derived from this mapping remain meaningful.
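
For instance, accuracy is still well defined on this output, but per-class
metrics for the ``DEFAULT_LABEL_*`` placeholders are degenerate, since those
labels never occur in ``y_true`` (a small sketch reusing the example above)::

>>> accuracy_score(y_true, map_cluster_labels(y_true, y_pred))
0.5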

Advantages
~~~~~~~~~~

- Enables calculation of classical classification metrics, such as
accuracy and f1_score.

- Allows for a meaningful and easy-to-read clustering output when classes
are known.

Drawbacks
~~~~~~~~~

- One should use this tool carefully, since metrics computed on the mapped
  labels are not meaningful for every clustering task.
5 changes: 5 additions & 0 deletions examples/cluster/plot_affinity_propagation.py
@@ -27,6 +27,7 @@
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)
translated_labels = metrics.cluster.map_cluster_labels(labels_true, labels)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
@@ -38,6 +39,10 @@
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
print("Accuracy: %0.3f"
% metrics.accuracy_score(labels_true, translated_labels))
print("Confusion Matrix:\n%s"
% str(metrics.confusion_matrix(labels_true, translated_labels)))

# #############################################################################
# Plot result
5 changes: 3 additions & 2 deletions sklearn/metrics/cluster/__init__.py
@@ -17,6 +17,7 @@
from .supervised import v_measure_score
from .supervised import fowlkes_mallows_score
from .supervised import entropy
from .supervised import map_cluster_labels
from .unsupervised import silhouette_samples
from .unsupervised import silhouette_score
from .unsupervised import calinski_harabaz_score
@@ -27,6 +28,6 @@
"adjusted_rand_score", "completeness_score", "contingency_matrix",
"expected_mutual_information", "homogeneity_completeness_v_measure",
"homogeneity_score", "mutual_info_score", "v_measure_score",
"fowlkes_mallows_score", "entropy", "silhouette_samples",
"silhouette_score", "calinski_harabaz_score",
"fowlkes_mallows_score", "entropy", "map_cluster_labels",
"silhouette_samples", "silhouette_score", "calinski_harabaz_score",
"davies_bouldin_score", "consensus_score"]
71 changes: 71 additions & 0 deletions sklearn/metrics/cluster/supervised.py
@@ -11,6 +11,7 @@
# Thierry Guillemot <thierry.guillemot.work@gmail.com>
# Gregory Stupp <stuppie@gmail.com>
# Joel Nothman <joel.nothman@gmail.com>
# Lucas Pugens Fernandes <lpfernandes@gmail.com>
# License: BSD 3 clause

from __future__ import division
@@ -22,7 +23,9 @@

from .expected_mutual_info_fast import expected_mutual_information
from ...utils.validation import check_array
from ...utils.multiclass import unique_labels
from ...utils.fixes import comb
from ...utils.linear_assignment_ import linear_assignment


def comb2(n):
@@ -871,3 +874,71 @@ def entropy(labels):
# log(a / b) should be calculated as log(a) - log(b) for
# possible loss of precision
return -np.sum((pi / pi_sum) * (np.log(pi) - log(pi_sum)))


def map_cluster_labels(labels_true, labels_pred):
"""Translate prediction labels to maximize the accuracy.
Review comment (Member): "Find the best mapping between true and predicted cluster labels"


Translate the prediction labels of a clustering output to those in the
ground truth, to enable the calculation of external metrics (e.g. accuracy,
f1_score, ...). The translation maximizes the sum of the main diagonal of
the confusion matrix :math:`C`, :math:`\sum_{i=0}^{K} C_{i, i}`.

Parameters
Review comment (Member): blank line before this

----------
labels_true : array, shape = [n_samples]
Ground truth (correct) target values.
labels_pred : array, shape = [n_samples]
Estimated clusters as returned by a clustering algorithm.

Returns
Review comment (Member): blank line before this

-------
trans : array, shape = [n_classes, n_classes]
Review comment (Member): This is not what you return

Review comment (Member): I'd be tempted to return three arrays:

  • true_clusters
  • pred_clusters
  • score

Each is of length min(n_true_clusters, n_pred_clusters). This is more than what the user functionally needs, but seeing as we don't have any similar functions in the library, and don't expect to, I think this is most useful to help with subsequent computations.
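
A hypothetical sketch of what that three-array interface could look like (not
part of this PR; the name matched_clusters is a placeholder, and it reuses
contingency_matrix and linear_assignment exactly as the PR code below does):

def matched_clusters(labels_true, labels_pred):
    # Rows of the contingency matrix are true classes, columns are clusters.
    C = contingency_matrix(labels_true, labels_pred)
    # linear_assignment minimizes cost, so negate C to maximize total overlap.
    true_clusters, pred_clusters = linear_assignment(-C).T
    # Samples shared by each matched (class, cluster) pair; all three arrays
    # have length min(n_true_clusters, n_pred_clusters).
    score = C[true_clusters, pred_clusters]
    return true_clusters, pred_clusters, score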

Review comment (Member): Hmm.. I've now realised you're producing something slightly different to what I was thinking. And fair enough. Perhaps you're right that we should produce your more usable/understandable interface, but it is not sufficient to implement something like CEAF :(

Mapping of ``labels_pred`` clusters, such that
:math:`trans \subseteq labels\_true`

References
Review comment (Member): blank line before this

----------

Examples
--------
>>> from sklearn.metrics import confusion_matrix
>>> from sklearn.metrics.cluster import map_cluster_labels
>>> labels_true = ["class1", "class2", "class3", "class1", "class1",
... "class3"]
>>> labels_pred = [0, 0, 2, 2, 0, 2]
>>> y_pred_translated = map_cluster_labels(labels_true, labels_pred)
>>> y_pred_translated
['class1', 'class1', 'class3', 'class3', 'class1', 'class3']
>>> confusion_matrix(labels_true, y_pred_translated)
array([[2, 0, 1],
[1, 0, 0],
[0, 0, 2]])
"""

classes = unique_labels(labels_true).tolist()
n_classes = len(classes)
clusters = unique_labels(labels_pred).tolist()
n_clusters = len(clusters)

if n_clusters > n_classes:
classes += ['DEFAULT_LABEL_'+str(i) for i in
range(n_clusters-n_classes)]
elif n_classes > n_clusters:
clusters += ['DEFAULT_CLUSTER_'+str(i) for i in
range(n_classes-n_clusters)]

C = contingency_matrix(labels_true, labels_pred)
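# linear_assignment solves a minimum-cost matching, so the contingency
# matrix is negated to maximize the total overlap between matched classes
# and clusters (the main-diagonal sum of the confusion matrix).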
true_idx, pred_idx = linear_assignment(-C).T

true_idx = true_idx.tolist()
pred_idx = pred_idx.tolist()

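# Map the matched indices back to label names; unmatched classes and
# clusters are appended in sorted order so both lists keep the same length
# and every cluster ends up paired with some label.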
true_idx = [classes[idx] for idx in true_idx]
true_idx = true_idx + sorted(set(classes) - set(true_idx))
pred_idx = [clusters[idx] for idx in pred_idx]
pred_idx = pred_idx + sorted(set(clusters) - set(pred_idx))

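# Relabel each prediction with the class name paired with its cluster.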
return_list = [true_idx[pred_idx.index(y)] for y in labels_pred]

return return_list
33 changes: 32 additions & 1 deletion sklearn/metrics/cluster/tests/test_supervised.py
@@ -12,10 +12,11 @@
from sklearn.metrics.cluster import mutual_info_score
from sklearn.metrics.cluster import normalized_mutual_info_score
from sklearn.metrics.cluster import v_measure_score
from sklearn.metrics.cluster import map_cluster_labels

from sklearn.utils import assert_all_finite
from sklearn.utils.testing import (
assert_equal, assert_almost_equal, assert_raise_message,
assert_equal, assert_almost_equal, assert_raise_message,
)
from numpy.testing import assert_array_almost_equal

@@ -275,3 +276,33 @@ def test_fowlkes_mallows_score_properties():
# symmetric and permutation(both together)
score_both = fowlkes_mallows_score(labels_b, (labels_a + 2) % 3)
assert_almost_equal(score_both, expected)


def test_map_cluster_labels():
# handcrafted example - same number of clusters and classes
Review comment (Member): Unless there is a reason to use a large example, it is best to use something in tests that a reader can easily reason about, i.e. a small example. Besides, linear_assignment is tested: your job here is to check you've used it correctly.

y_true = ['a', 'b', 'b', 'c', 'c', 'a']
y_pred = [1, 0, 0, 1, 2, 1]

expected = ['a', 'b', 'b', 'a', 'c', 'a']

y_pred_translated = map_cluster_labels(y_true, y_pred)
assert_equal(y_pred_translated, expected)

# handcrafted example - more clusters than classes
y_true = ['a', 'a', 'a', 'b', 'b', 'b']
y_pred = [4, 0, 1, 1, 2, 2]

expected = ['DEFAULT_LABEL_1', 'a', 'DEFAULT_LABEL_0', 'DEFAULT_LABEL_0',
'b', 'b']

y_pred_translated = map_cluster_labels(y_true, y_pred)
assert_equal(y_pred_translated, expected)

# handcrafted example - more classes than clusters
y_true = ['a', 'd', 'e', 'b', 'b', 'b']
y_pred = [0, 0, -1, -1, 2, 2]

expected = ['a', 'a', 'e', 'e', 'b', 'b']

y_pred_translated = map_cluster_labels(y_true, y_pred)
assert_equal(y_pred_translated, expected)