[MRG+2] ENH Add new metrics for clustering #6823

tguillemot · 2016-05-25T14:18:49Z

Reference Issue

I propose to integrate, step by step, the work of @afouchet proposed in #4301.

What does this implement/fix? Explain your changes.

This PR proposes some metrics:

Fowlkes-Mallows (supervised)
~~Distortion (unsupervised)~~
~~Inertia (unsupervised)~~
Calinski-Harabasz (unsupervised)

Any other comments?

This is the first step toward adding a new module for choosing the optimal number of clusters (stability, gap statistic, distortion jump, silhouette, Calinski-Harabasz index, and 2 "elbow" methods).
@afouchet @jnothman @raghavrv @agramfort

afouchet · 2016-05-25T17:54:18Z

Good for me

tguillemot · 2016-05-25T20:13:16Z

sklearn/metrics/cluster/unsupervised.py

+
+
+def inertia_score(X, labels, metric="euclidean", **kwds):
+    """Compute the inertia of a given dataset and their cluster assignment.


I've also added inertia because I will need it later to implement the gap statistic.
To be honest, I don't see what the difference is between distortion and inertia (apart from the normalization).

I know inertia is the function minimized by k-means, but is it really useful to have it in a dedicated function like this? Tell me whether you want to keep it or not.
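
To make the inertia/distortion question concrete, here is a minimal pure-Python sketch. It assumes that inertia means the sum of squared Euclidean distances from each sample to its cluster centroid, and that distortion is the same quantity divided by the number of samples. Both definitions are assumptions for illustration; the helper names `cluster_centroids`, `inertia`, and `distortion` are hypothetical and are not the `inertia_score` proposed in this PR.

```python
def cluster_centroids(X, labels):
    """Return {label: centroid} where each centroid is the mean of its samples."""
    sums, counts = {}, {}
    for x, k in zip(X, labels):
        acc = sums.setdefault(k, [0.0] * len(x))
        for d, v in enumerate(x):
            acc[d] += v
        counts[k] = counts.get(k, 0) + 1
    return {k: [s / counts[k] for s in acc] for k, acc in sums.items()}


def inertia(X, labels):
    """Assumed definition: sum of squared Euclidean distances to the centroid."""
    centers = cluster_centroids(X, labels)
    return sum(
        sum((v - c) ** 2 for v, c in zip(x, centers[k]))
        for x, k in zip(X, labels)
    )


def distortion(X, labels):
    """Assumed definition: inertia normalized by the number of samples."""
    return inertia(X, labels) / len(X)


X = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]]
labels = [0, 0, 1, 1]
print(inertia(X, labels), distortion(X, labels))
```

Under these assumptions the two quantities differ only by a constant factor for a fixed dataset, which is consistent with the remark above that normalization is the only difference.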

TomDLT · 2016-06-02T13:24:18Z

doc/modules/clustering.rst

+Perfect labeling is scored 1.0::
+
+  >>> labels_pred = labels_true[:]
+  >>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)  # doctest: +ELLIPSIS


metrics.fowlkes_mallows_score

TomDLT · 2016-06-02T13:59:13Z

Apart from nitpicks, LGTM.
You will need to add a whats_new entry and rebase your commits.

jnothman · 2016-06-04T13:35:07Z

sklearn/metrics/cluster/supervised.py

-    if (classes.shape[0] == clusters.shape[0] == 1
-            or classes.shape[0] == clusters.shape[0] == 0
-            or classes.shape[0] == clusters.shape[0] == len(labels_true)):
+    if (classes.shape[0] == clusters.shape[0] == 1 or


Please don't go making arbitrary cosmetic changes to other code. It makes your work much harder to review.

Sorry, you're right.
I will do the pep8 corrections in a different PR next time.

jnothman · 2016-06-04T13:55:14Z

Sorry for unfamiliarity with the subject matter... Am I right in thinking FMI is just the geometric mean of precision and recall when calculated over pairs? If so, this should be noted. Any reason this is particularly suited to clustering, or why we might prefer harmonic mean (as in F-score) over geometric or vice-versa, or does it come down to convention?

jnothman · 2016-06-04T13:57:10Z

Also, I think it would still be good to have @afouchet's example in this PR.

tguillemot · 2016-06-08T14:13:49Z

Am I right in thinking FMI is just the geometric mean of precision and recall when calculated over pairs ?

It is indeed. I will add a comment.

Any reason this is particularly suited to clustering, or why we might prefer harmonic mean (as in F-score) over geometric or vice-versa, or does it come down to convention?

I'm not an expert on this, but it seems that FMI was originally introduced as a measure for comparing hierarchical clusterings and was later used for flat clusterings. For me it's just a classical clustering measure.

If you want, I've found an overview of these measures:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.6189&rep=rep1&type=pdf
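
As an illustration of the point confirmed above (FMI as the geometric mean of precision and recall computed over pairs of samples), here is a minimal pure-Python sketch. The function names `pair_counts`, `pairwise_fmi`, and `pairwise_f1` are hypothetical and this is not the implementation from the PR; it enumerates all pairs, so it is only suitable for small inputs.

```python
from itertools import combinations
from math import sqrt


def pair_counts(labels_true, labels_pred):
    """Count sample pairs as (TP, FP, FN) with respect to co-clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the prediction
        elif same_true:
            fn += 1  # grouped together only in the ground truth
    return tp, fp, fn


def pairwise_fmi(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    tp, fp, fn = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    return sqrt((tp / (tp + fp)) * (tp / (tp + fn)))


def pairwise_f1(labels_true, labels_pred):
    """Harmonic mean of the same two quantities, for comparison."""
    tp, fp, fn = pair_counts(labels_true, labels_pred)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(pairwise_fmi(labels_true, labels_pred))  # about 0.471
print(pairwise_f1(labels_true, labels_pred))   # about 0.444
```

For this example the pairwise precision is 2/3 and the recall is 1/3, so the geometric mean (about 0.471) sits slightly above the harmonic mean (about 0.444); the two only coincide when precision equals recall.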

tguillemot · 2016-06-08T14:36:03Z

Also, I think it would still be good to have @afouchet's example in this PR.

@jnothman I agree with you, but it's not related to this PR.
The plan is to push the demo at the same time as the code that computes the optimal number of clusters.
That is not the purpose of this PR; I will open a separate PR for it soon.

agramfort · 2016-06-08T14:37:30Z

I agree with you but it's not related to that PR.
The plan is to push the demo at the same time than the code which computes
the optimal number of clusters.
This is not the purpose of this PR and I will do another PR especially for
that soon.

+1 I would say one thing at a time.

jnothman · 2016-06-08T14:58:57Z

doc/modules/clustering.rst

+----------------------
+
+The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
+the precision and recal::


tguillemot · 2016-06-10T08:45:30Z

@jnothman I've addressed your comments. Is it OK for you?

TomDLT · 2016-06-10T08:50:24Z

doc/whats_new.rst

@@ -58,6 +58,14 @@ New features
     One can pass method names such as `predict_proba` to be used in the cross
     validation framework instead of the default `predict`. By `Ori Ziv`_ and `Sears Merritt`_.

+   - Added :func:`metrics.cluster.fowlkes_mallows_score`, the Fowlkes Mallows
+     Index which measures the similarity of two clusterings of a set of points
+     By `Thierry Guillemot`_.


You should add Arnaud Fouchet.

Of course. Sorry, Arnaud, for forgetting you.

jnothman · 2016-06-13T11:57:37Z

doc/modules/clustering.rst

+----------------------
+
+The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
+the precision and recall::


should be "pairwise precision and recall"

tguillemot · 2016-06-14T11:55:56Z

@lesteve @jnothman Thx for your reviews !!!
Ok to merge ?

jnothman · 2016-06-14T12:05:04Z

sklearn/metrics/cluster/supervised.py

+        A clustering of the data into disjoint subsets.
+
+    max_n_classes : int, optional (default=5000)
+        Maximal number of classes handled by the adjusted_rand_score


I don't think the reference to adjusted_rand_score is correct.

(Though I think this max_n_classes business is yuck, and should be avoided by using a sparse contingency matrix as in #4788)
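
To illustrate why a sparse contingency table removes the need for a cap on the number of classes, here is a minimal sketch that stores only the non-empty (class, cluster) cells in a `collections.Counter` and derives the Fowlkes-Mallows score from them. This is a dictionary-backed illustration under my own assumptions, not the approach implemented in #4788; the names `sparse_contingency` and `fmi_from_contingency` are hypothetical.

```python
from collections import Counter
from math import sqrt


def sparse_contingency(labels_true, labels_pred):
    """Map each observed (class, cluster) pair to its sample count."""
    return Counter(zip(labels_true, labels_pred))


def fmi_from_contingency(labels_true, labels_pred):
    """Fowlkes-Mallows score computed from non-empty contingency cells only."""
    cells = sparse_contingency(labels_true, labels_pred)
    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    # Ordered pairs that share a cell, a cluster, or a class respectively.
    tk = sum(n * (n - 1) for n in cells.values())
    pk = sum(n * (n - 1) for n in cluster_sizes.values())
    qk = sum(n * (n - 1) for n in class_sizes.values())
    if tk == 0:
        return 0.0  # also guards against pk == 0 or qk == 0
    return tk / sqrt(pk * qk)


print(fmi_from_contingency([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # about 0.471
```

Memory use here grows with the number of non-empty cells, which is at most the number of samples, rather than with the product of the number of classes and clusters.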

tguillemot · 2016-06-14T12:26:38Z

(Though I think this max_n_classes business is yuck, and should be avoided by using a sparse contingency matrix as in #4788)

@jnothman It's a good idea indeed. If you want, I will open another PR to update the measures in those files once #4788 is merged.

jnothman · 2016-06-14T12:32:12Z

A few documentation issues:

You've not added these functions to classes.rst
Currently there is no link from the user guide to the function documentation for either metric.
I suspect the mathematical notation in docstrings should be marked with .. math:: to be rendered correctly.

Otherwise this can be merged.

tguillemot · 2016-06-16T13:45:17Z

@jnothman Sorry for being slow, I've corrected the doc.

@jnothman @lesteve Thanks for your reviews.

This work is based on the code A. Fouchet proposed in PR #4301.
- Speed up the computation.
- Simplify the doc.
- Correct some bugs.
- Add tests.

jnothman · 2016-06-16T13:54:52Z

Sorry for being slow, I've corrected the doc.

No worries, and thank you!

jnothman · 2016-06-16T14:02:03Z

Merged! Thanks and congratulations!

jnothman · 2016-06-16T14:02:39Z

(to @tguillemot but also to @afouchet of course...)

tguillemot · 2016-06-16T14:05:42Z

Thanks @jnothman and @afouchet

raghavrv · 2016-06-16T14:21:00Z

🍻

afouchet · 2016-06-16T14:23:38Z

Thanks @jnothman and @tguillemot !

amueller · 2016-10-11T21:14:28Z

uhh nice, I didn't realize this was merged! (and yes, I'm slow)

tguillemot reviewed May 25, 2016

tguillemot force-pushed the new_metric_cluster_1 branch 3 times, most recently from c709026 to 265fc71 Compare May 30, 2016 21:19

TomDLT reviewed Jun 2, 2016

TomDLT changed the title ~~[MRG] ENH Add new metrics for clustering~~ [MRG+1] ENH Add new metrics for clustering Jun 2, 2016

jnothman reviewed Jun 4, 2016

tguillemot force-pushed the new_metric_cluster_1 branch 2 times, most recently from 38cc660 to 0dc2a60 Compare June 8, 2016 14:50

jnothman reviewed Jun 8, 2016

tguillemot force-pushed the new_metric_cluster_1 branch from 0dc2a60 to 8750c9c Compare June 9, 2016 08:31

TomDLT reviewed Jun 10, 2016

tguillemot force-pushed the new_metric_cluster_1 branch from 8750c9c to d7c0e84 Compare June 10, 2016 09:23

jnothman reviewed Jun 13, 2016

tguillemot force-pushed the new_metric_cluster_1 branch 3 times, most recently from dab633d to 1bd1623 Compare June 14, 2016 09:15

jnothman reviewed Jun 14, 2016

tguillemot force-pushed the new_metric_cluster_1 branch from 1bd1623 to 229618d Compare June 14, 2016 12:18

tguillemot force-pushed the new_metric_cluster_1 branch 3 times, most recently from 6f1d7e8 to f850111 Compare June 16, 2016 13:42

afouchet and others added 3 commits June 16, 2016 15:46

Merge work of @afouchet.

63cc496

[MRG] Introduction of some new cluster metrics.

f3c377c

This work is based on the code A. Fouchet proposed in PR #4301.
- Speed up the computation.
- Simplify the doc.
- Correct some bugs.
- Add tests.

Docstring fix.

e1be8db

tguillemot force-pushed the new_metric_cluster_1 branch from f850111 to e1be8db Compare June 16, 2016 13:46

jnothman merged commit 1f86b1d into scikit-learn:master Jun 16, 2016

tguillemot mentioned this pull request Jun 29, 2016

[WIP] ENH Optimal n_clusters values #6948

Closed

9 tasks

olologin pushed a commit to olologin/scikit-learn that referenced this pull request Aug 24, 2016

[MRG] ENH Add Calinsky-Harabaz and Fowkes-Mallows clustering metrics (s…

3ea7396

…cikit-learn#6823) Based on the code of A. Fouchet in PR#4301.

TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016

[MRG] ENH Add Calinsky-Harabaz and Fowkes-Mallows clustering metrics (s…

1fa018c

…cikit-learn#6823) Based on the code of A. Fouchet in PR#4301.

tguillemot mentioned this pull request Jan 9, 2017

[MRG] Choose number of clusters #4301

Closed
