
Averaging of precision/recall/F1 inconsistent #3122


Closed
mbatchkarov opened this issue Apr 29, 2014 · 2 comments

@mbatchkarov
Contributor

Hello everyone,

I've come across what I believe is inconsistent behaviour in several functions in the sklearn.metrics module. I think the problem is with the documentation, but I prefer not to change that until I've run it past the community and confirmed my understanding is correct (also, a fix seems to be in the works: #2610).

In the example below I'm using recall_score, but the same applies to precision_score and f1_score.

from sklearn.metrics import recall_score

# two classes, string or int labels: per-class recall works as expected
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average=None))
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average=None))

# three classes: per-class recall as expected
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average=None))
# macro-averaging works as I expect it to
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average='macro'))

The output of the above is what I expect it to be; in particular, the macro-averaged recall is just the mean of the three per-class recalls, (0.5 + 1.0 + 1.0) / 3 ≈ 0.833.

[ 0.5  1. ]
[ 0.5  1. ]
[ 0.5  1.   1. ]
0.833333333333

When averaging in the binary case, if the labels contain the magic value 1, you get the recall for class 1 only; this is mentioned in the documentation, even though I do not understand why it is done. However, if the labels do not contain the integer 1, an error is raised.

# averaging silently returns class 1's recall
print('silent', recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro'))  # outputs "silent 1.0"
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro'))  # this raises

In #2094 @jnothman suggests binary classification should be handled differently. I am not sure why this is the case: my understanding is that macro averaging works the same way for any number of classes, the binary case included. To get the behaviour I expect out of recall_score in the binary case, I had to call it with pos_label=None:

# macro averaging as expected if pos_label=None (undocumented!)
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro', pos_label=None))  # outputs 0.75
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro', pos_label=None))  # outputs 0.75
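
As a quick sanity check (my own sketch, not part of the original report; it only assumes numpy, which scikit-learn already requires), the macro figure is just the unweighted mean of the per-class recalls returned by average=None:

import numpy as np
from sklearn.metrics import recall_score

y_true = ['b', 'a', 'a', 'b', 'c']
y_pred = ['b', 'b', 'a', 'b', 'c']

per_class = recall_score(y_true, y_pred, average=None)  # array([0.5, 1., 1.])
macro = recall_score(y_true, y_pred, average='macro')   # 0.8333...

# the macro average is simply the unweighted mean of the per-class scores
assert np.isclose(macro, per_class.mean())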

I think two points need to be made more explicit in the documentation:

  • why the metrics work differently in the binary case
  • the semantics of pos_label and labels
@jnothman
Member

In #2094 @jnothman suggests binary classification should be handled differently.

It is more that binary classification has historically been handled differently. On the contrary, I would like this to cease, and for users to select binary P/R/F explicitly (see #2610, #2679).

I think two points need to be made more explicit in the documentation:

The documentation is already much clearer than it used to be about the interaction between y, pos_label and average. But it's a bad design, which is why I propose #2610, in which (after a deprecation cycle) one would require labels=[1] (or perhaps labels=1 or labels='binary') to get the current binary functionality; similarly, one might use labels=[1,2,3] to exclude a majority class 0 from averaged multiclass P/R/F. In short, binary would no longer be special-cased and surprising.
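
A rough sketch of what that explicit selection looks like (this uses the average='binary' and labels parameters as they appear in later scikit-learn releases, so treat it as an illustration of the idea rather than the API available when this issue was filed; the toy label lists are made up for the example):

from sklearn.metrics import recall_score

# binary P/R/F is requested explicitly rather than inferred from the label values
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='binary', pos_label=1))  # 1.0, class 1 only

# macro averaging treats the binary problem like any other multiclass problem
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro'))  # 0.75

# labels restricts which classes enter the average, e.g. dropping a majority class 0
print(recall_score([0, 0, 1, 2, 2], [0, 1, 1, 2, 0], average='macro', labels=[1, 2]))  # 0.75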

The major reason #2610 is still a work in progress, however, is that there is a usage that needs deprecation for a version before it can be implemented (see #2952).

In short, this is a feature, not a bug; but it is a bad feature ("explicit is better than implicit") and it is gradually being fixed. I'm therefore closing this issue, but if you can word a documentation clarification, please go ahead and submit a PR.

@jnothman
Member

PS: this problem has bitten us even in internal tests, so it's not being taken lightly; it's just something that needs gradual change in order to maintain backwards compatibility.
