[MRG+2] ENH Add new metrics for clustering #6823

Merged: jnothman merged 3 commits into scikit-learn:master on Jun 16, 2016

@tguillemot tguillemot commented May 25, 2016

Reference Issue

I propose to integrate, step by step, the work that @afouchet proposed in #4301.

What does this implement/fix? Explain your changes.

This PR proposes several metrics:

  • Fowlkes-Mallows (supervised)
  • Distortion (unsupervised)
  • Inertia (unsupervised)
  • Calinski (unsupervised)

Any other comments?

This is the first step toward adding a new module for choosing the optimal number of clusters (stability, gap statistic, distortion jump, silhouette, Calinski and Harabasz index, and 2 "elbow" methods).
@afouchet @jnothman @raghavrv @agramfort

@afouchet

Good for me



def inertia_score(X, labels, metric="euclidean", **kwds):
    """Compute the inertia of a given dataset and their cluster assignment.
Contributor Author

I've also added inertia because I will need it later to implement the gap statistic.
To be honest, I don't see what the differences between distortion and inertia are (apart from the normalization).

I know inertia is the function that k-means minimizes, but is it really useful to have it as a separate function like this? Tell me whether you want to keep it or not.
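
To make the comparison concrete, the following is a minimal sketch, not the code from this PR. It assumes Euclidean distance and takes inertia to be the sum of squared distances from each point to its cluster mean. The name normalized_inertia and its division by the number of samples are assumptions made only for illustration; the normalization actually used by the distortion metric is not shown in this thread.

```python
from collections import defaultdict


def inertia(X, labels):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    clusters = defaultdict(list)
    for point, label in zip(X, labels):
        clusters[label].append(point)

    total = 0.0
    for points in clusters.values():
        n_features = len(points[0])
        # Cluster mean, computed feature by feature.
        centroid = [sum(p[j] for p in points) / len(points)
                    for j in range(n_features)]
        total += sum((p[j] - centroid[j]) ** 2
                     for p in points for j in range(n_features))
    return total


def normalized_inertia(X, labels):
    """Hypothetical normalized variant: inertia divided by the sample count.

    Assumption for illustration only; the PR's distortion may normalize
    differently.
    """
    return inertia(X, labels) / len(X)


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]
labels = [0, 0, 1, 1]
print(inertia(X, labels))             # 10.0
print(normalized_inertia(X, labels))  # 2.5
```

Under these assumptions the two quantities differ only by the constant factor len(X), so they rank different clusterings of the same dataset identically.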

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch 3 times, most recently from c709026 to 265fc71 Compare May 30, 2016 21:19
Perfect labeling is scored 1.0::

>>> labels_pred = labels_true[:]
>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred) # doctest: +ELLIPSIS
Member

This should be metrics.fowlkes_mallows_score.

@TomDLT
Member

TomDLT commented Jun 2, 2016

Apart from nitpicks, LGTM.
You will need to add a whats_new entry and rebase your commits.

@TomDLT TomDLT changed the title [MRG] ENH Add new metrics for clustering [MRG+1] ENH Add new metrics for clustering Jun 2, 2016
if (classes.shape[0] == clusters.shape[0] == 1
or classes.shape[0] == clusters.shape[0] == 0
or classes.shape[0] == clusters.shape[0] == len(labels_true)):
if (classes.shape[0] == clusters.shape[0] == 1 or
Member

Please don't go making arbitrary cosmetic changes to other code. It makes your work much harder to review.

Contributor Author

Sorry, you're right.
I will make the PEP 8 corrections in a separate PR next time.

@jnothman
Member

jnothman commented Jun 4, 2016

Sorry for unfamiliarity with the subject matter... Am I right in thinking FMI is just the geometric mean of precision and recall when calculated over pairs? If so, this should be noted. Any reason this is particularly suited to clustering, or why we might prefer harmonic mean (as in F-score) over geometric or vice-versa, or does it come down to convention?

@jnothman
Member

jnothman commented Jun 4, 2016

Also, I think it would still be good to have @afouchet's example in this PR.

@tguillemot
Contributor Author

Am I right in thinking FMI is just the geometric mean of precision and recall when calculated over pairs?

That's right; I will add a comment.

Any reason this is particularly suited to clustering, or why we might prefer harmonic mean (as in F-score) over geometric or vice-versa, or does it come down to convention?

I'm not an expert on this, but it seems that FMI was originally introduced as a measure for comparing hierarchical clusterings and was later used for flat clusterings. For me it's just a classical clustering measure.

If you are interested, I found an overview of these measures:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.164.6189&rep=rep1&type=pdf
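
The relationship discussed above can be sketched in plain Python. This is an illustrative brute-force version that enumerates all pairs of samples, not the implementation from this PR; the helper names pairwise_precision_recall, fmi and pairwise_f1 are hypothetical, and returning 0.0 when a denominator is empty is a choice made for this sketch.

```python
from itertools import combinations
from math import sqrt


def pairwise_precision_recall(labels_true, labels_pred):
    """Precision and recall over pairs of samples placed in the same cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1          # pair grouped together in both labelings
        elif same_pred:
            fp += 1          # grouped together only in the prediction
        elif same_true:
            fn += 1          # grouped together only in the ground truth
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def fmi(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    p, r = pairwise_precision_recall(labels_true, labels_pred)
    return sqrt(p * r)


def pairwise_f1(labels_true, labels_pred):
    """Harmonic mean of the same two quantities, shown for comparison."""
    p, r = pairwise_precision_recall(labels_true, labels_pred)
    return 2 * p * r / (p + r) if p + r else 0.0


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(round(fmi(labels_true, labels_pred), 4))          # 0.4714
print(round(pairwise_f1(labels_true, labels_pred), 4))  # 0.4444
```

Here the pairwise precision is 2/3 and the recall is 1/3. The harmonic mean is never larger than the geometric mean, so it penalizes the gap between the two slightly more.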

@tguillemot
Contributor Author

tguillemot commented Jun 8, 2016

Also, I think it would still be good to have @afouchet's example in this PR.

@jnothman I agree with you, but it's not related to this PR.
The plan is to push the demo at the same time as the code that computes the optimal number of clusters.
That is not the purpose of this PR; I will open a separate PR for it soon.

@agramfort
Member

I agree with you but it's not related to that PR.
The plan is to push the demo at the same time than the code which computes
the optimal number of clusters.
This is not the purpose of this PR and I will do another PR especially for
that soon.

+1 I would say one thing at a time.

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch 2 times, most recently from 38cc660 to 0dc2a60 Compare June 8, 2016 14:50
----------------------

The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
the precision and recal::
Member

*recall

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch from 0dc2a60 to 8750c9c Compare June 9, 2016 08:31
@tguillemot
Contributor Author

@jnothman I've addressed your comments. Is it OK for you?

@@ -58,6 +58,14 @@ New features
One can pass method names such as `predict_proba` to be used in the cross
validation framework instead of the default `predict`. By `Ori Ziv`_ and `Sears Merritt`_.

- Added :func:`metrics.cluster.fowlkes_mallows_score`, the Fowlkes Mallows
Index which measures the similarity of two clusterings of a set of points
By `Thierry Guillemot`_.
Member

You should add Arnaud Fouchet.

Contributor Author

Of course. Sorry, Arnaud, for forgetting you.

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch from 8750c9c to d7c0e84 Compare June 10, 2016 09:23
----------------------

The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
the precision and recall::
Member

@jnothman jnothman Jun 13, 2016

should be "pairwise precision and recall"

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch 3 times, most recently from dab633d to 1bd1623 Compare June 14, 2016 09:15
@tguillemot
Contributor Author

@lesteve @jnothman Thanks for your reviews!
OK to merge?

A clustering of the data into disjoint subsets.

max_n_classes : int, optional (default=5000)
Maximal number of classes handled by the adjusted_rand_score
Member

I don't think the reference to adjusted_rand_score is correct.

Member

(Though I think this max_n_classes business is yuck, and should be avoided by using a sparse contingency matrix as in #4788)
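
As a rough illustration of why a sparse contingency table removes the need for a cap on the number of classes, the sketch below stores only the non-empty (class, cluster) cells in a Counter and computes FMI from their counts. It is a plain-Python stand-in chosen for illustration, not the approach taken in #4788, whose details do not appear in this thread; the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def sparse_contingency(labels_true, labels_pred):
    """Map each non-empty (class, cluster) cell to its sample count."""
    return Counter(zip(labels_true, labels_pred))


def fmi_from_contingency(labels_true, labels_pred):
    """FMI computed from cell, cluster, and class counts (no pair enumeration)."""
    n = len(labels_true)
    cells = sparse_contingency(labels_true, labels_pred)
    cluster_sizes = Counter(labels_pred)
    class_sizes = Counter(labels_true)

    # Each term is twice a pair count: sum(c * (c - 1)) == sum(c * c) - n.
    tk = sum(c * c for c in cells.values()) - n          # 2 * TP
    pk = sum(c * c for c in cluster_sizes.values()) - n  # 2 * (TP + FP)
    qk = sum(c * c for c in class_sizes.values()) - n    # 2 * (TP + FN)
    # tk > 0 implies pk > 0 and qk > 0, so the division is safe.
    return tk / sqrt(pk * qk) if tk else 0.0


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(len(sparse_contingency(labels_true, labels_pred)))          # 4
print(round(fmi_from_contingency(labels_true, labels_pred), 4))   # 0.4714
```

Memory here grows with the number of non-empty cells, which is at most the number of samples, rather than with the product of the number of classes and the number of clusters.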

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch from 1bd1623 to 229618d Compare June 14, 2016 12:18
@tguillemot
Contributor Author

(Though I think this max_n_classes business is yuck, and should be avoided by using a sparse contingency matrix as in #4788)

@jnothman That is a good idea indeed. If you want, I will open another PR to update the measures in those files once #4788 is merged.

@jnothman
Member

A few documentation issues:

  • You've not added these functions to classes.rst
  • Currently there is no link from the user guide to the function documentation for either metric.
  • I suspect the mathematical notation in docstrings should be marked with .. math:: to be rendered correctly.

Otherwise this can be merged.

@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch 3 times, most recently from 6f1d7e8 to f850111 Compare June 16, 2016 13:42
@tguillemot
Contributor Author

@jnothman Sorry for being slow, I've corrected the doc.

@jnothman @lesteve Thanks for your reviews.

afouchet and others added 3 commits June 16, 2016 15:46
This work is based on the code that A. Fouchet proposed in PR #4301.
- Speed up the computation.
- Simplify the doc.
- Correct some bugs.
- Add test.
@tguillemot tguillemot force-pushed the new_metric_cluster_1 branch from f850111 to e1be8db Compare June 16, 2016 13:46
@jnothman
Member

Sorry for being slow, I've corrected the doc.

No worries, and thank you!

@jnothman jnothman merged commit 1f86b1d into scikit-learn:master Jun 16, 2016
@jnothman
Member

Merged! Thanks and congratulations!

@jnothman
Member

(to @tguillemot but also to @afouchet of course...)

@tguillemot
Contributor Author

Thanks @jnothman and @afouchet

@raghavrv
Member

🍻

@afouchet

afouchet commented Jun 16, 2016

Thanks @jnothman and @tguillemot !

olologin pushed a commit to olologin/scikit-learn that referenced this pull request Aug 24, 2016
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
@amueller
Member

uhh nice, I didn't realize this was merged! (and yes, I'm slow)

8 participants