[MRG+2] ENH Add new metrics for clustering #6823
Good for me
def inertia_score(X, labels, metric="euclidean", **kwds):
    """Compute the inertia of a given dataset and their cluster assignment.
I've also added inertia because I will need it later to implement the gap statistic.
To be honest, I don't see what the difference is between distortion and inertia (apart from the normalization).
I know inertia is the function minimized by k-means, but is it really useful to have it in a dedicated function like this? Tell me whether you want to keep it or not.
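To make the question concrete, here is a minimal sketch, not the code from this PR. It assumes inertia is the within-cluster sum of squared Euclidean distances to each cluster's centroid (the quantity k-means minimizes) and that distortion is that same sum divided by the number of samples. The normalization is an assumption, and the names `inertia_sketch` and `distortion_sketch` are hypothetical.

```python
from collections import defaultdict


def inertia_sketch(X, labels):
    """Within-cluster sum of squared Euclidean distances to the centroid."""
    # Group the points by their cluster label.
    clusters = defaultdict(list)
    for point, label in zip(X, labels):
        clusters[label].append(point)

    total = 0.0
    for points in clusters.values():
        n_features = len(points[0])
        # Centroid = per-feature mean of the cluster's points.
        centroid = [
            sum(p[d] for p in points) / len(points) for d in range(n_features)
        ]
        for p in points:
            total += sum((p[d] - centroid[d]) ** 2 for d in range(n_features))
    return total


def distortion_sketch(X, labels):
    """Inertia divided by the number of samples (assumed normalization)."""
    return inertia_sketch(X, labels) / len(X)


X = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 4.0]]
labels = [0, 0, 1, 1]
inertia = inertia_sketch(X, labels)        # 2.0 + 8.0 = 10.0
distortion = distortion_sketch(X, labels)  # 10.0 / 4 = 2.5
```

Under these assumptions the two quantities differ only by a constant factor for a fixed dataset, which is the point raised in the comment.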
c709026 to 265fc71
Perfect labeling is scored 1.0::

  >>> labels_pred = labels_true[:]
  >>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)  # doctest: +ELLIPSIS
metrics.fowlkes_mallows_score
Except nitpicks, LGTM
if (classes.shape[0] == clusters.shape[0] == 1
        or classes.shape[0] == clusters.shape[0] == 0
        or classes.shape[0] == clusters.shape[0] == len(labels_true)):
if (classes.shape[0] == clusters.shape[0] == 1 or
Please don't go making arbitrary cosmetic changes to other code. It makes your work much harder to review.
Sorry, you're right.
I will do the PEP 8 corrections in a separate PR next time.
Sorry for unfamiliarity with the subject matter... Am I right in thinking FMI is just the geometric mean of precision and recall when calculated over pairs? If so, this should be noted. Any reason this is particularly suited to clustering, or why we might prefer harmonic mean (as in F-score) over geometric or vice-versa, or does it come down to convention?
Also, I think it would still be good to have @afouchet's example in this PR.
It is indeed; I will add a comment.
I'm not an expert on this, but it seems that FMI was originally introduced as a measure to compare hierarchical clusterings and was then used for flat clustering. For me it's just a classical clustering measure. If you want, I've found an overview of those measures:
@jnothman I agree with you, but it's not related to this PR.
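The pairwise reading of FMI discussed in this thread can be sketched in a few lines. This is an illustrative sketch, not the implementation under review: the name `pairwise_fmi` is hypothetical, and returning 0.0 when no pair is grouped together in both labelings is a convention chosen here. It enumerates every pair of samples, so it is quadratic in the number of samples.

```python
from itertools import combinations
from math import sqrt


def pairwise_fmi(labels_true, labels_pred):
    """Geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_pred:
            fp += 1  # grouped together only in the predicted labeling
        elif same_true:
            fn += 1  # grouped together only in the true labeling
    if tp == 0:
        return 0.0  # convention chosen for this sketch
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return sqrt(precision * recall)


labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
score = pairwise_fmi(labels_true, labels_pred)  # sqrt((2/3) * (2/6))
```

A pairwise F-score would combine the same `precision` and `recall` with the harmonic mean, `2 * precision * recall / (precision + recall)`, instead of the geometric mean.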
38cc660 to 0dc2a60
----------------------

The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
the precision and recal::
*recall
0dc2a60 to 8750c9c
@jnothman I've addressed your comments. Is it OK for you?
@@ -58,6 +58,14 @@ New features
  One can pass method names such as `predict_proba` to be used in the cross
  validation framework instead of the default `predict`. By `Ori Ziv`_ and `Sears Merritt`_.

- Added :func:`metrics.cluster.fowlkes_mallows_score`, the Fowlkes Mallows
  Index which measures the similarity of two clusterings of a set of points
  By `Thierry Guillemot`_.
You should add Arnaud Fouchet.
Of course. Sorry, Arnaud, for forgetting you.
8750c9c to d7c0e84
----------------------

The Fowlkes-Mallows index (FMI) is defined as the geometric mean between of
the precision and recall::
should be "pairwise precision and recall"
dab633d to 1bd1623
A clustering of the data into disjoint subsets.

max_n_classes : int, optional (default=5000)
    Maximal number of classes handled by the adjusted_rand_score
I don't think the reference to adjusted_rand_score is correct.
(Though I think this max_n_classes business is yuck, and should be avoided by using a sparse contingency matrix as in #4788)
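The sparse-contingency idea can be illustrated with a small sketch. This is not the implementation from #4788: it uses a plain `collections.Counter` keyed by `(class, cluster)` pairs as a stand-in for a sparse matrix, so only non-empty cells are stored and no cap on the number of classes is needed. The name `fmi_from_sparse_contingency` is hypothetical, and returning 0.0 when no pair is co-clustered is a convention chosen here.

```python
from collections import Counter
from math import sqrt


def fmi_from_sparse_contingency(labels_true, labels_pred):
    """Fowlkes-Mallows index computed from non-empty contingency cells only."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # non-empty cells only
    rows = Counter(labels_true)                     # class sizes
    cols = Counter(labels_pred)                     # cluster sizes

    # sum(c * c) - n equals sum(c * (c - 1)): the number of ordered pairs
    # of distinct samples sharing a cell, a class, or a cluster.
    tk = sum(c * c for c in cells.values()) - n
    pk = sum(c * c for c in rows.values()) - n
    qk = sum(c * c for c in cols.values()) - n
    if tk == 0:
        return 0.0  # convention chosen for this sketch
    return tk / sqrt(pk * qk)


result = fmi_from_sparse_contingency([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

Memory use grows with the number of non-empty cells, which is at most the number of samples, rather than with the product of the number of classes and clusters.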
1bd1623 to 229618d
A few documentation issues:
Otherwise this can be merged.
6f1d7e8 to f850111
This work is based on the code proposed by A. Fouchet in PR #4301.
- Speed up the computation.
- Simplify the doc.
- Correct some bugs.
- Add tests.
f850111 to e1be8db
No worries, and thank you!
Merged! Thanks and congratulations!
(to @tguillemot but also to @afouchet of course...)
🍻
Thanks @jnothman and @tguillemot!
…cikit-learn#6823) Based on the code of A. Fouchet in PR#4301.
uhh nice, I didn't realize this was merged! (and yes, I'm slow)
Reference Issue
I propose to integrate, step by step, the work proposed by @afouchet in #4301.
What does this implement/fix? Explain your changes.
This PR proposes some metrics:
- Distortion (unsupervised)
- Inertia (unsupervised)
Any other comments?
This is the first step towards adding a new module for choosing the optimal number of clusters (stability, gap statistic, distortion jump, silhouette, Calinski and Harabasz index, and 2 "elbow" methods).
@afouchet @jnothman @raghavrv @agramfort