Description
Hello everyone,
I've come across what I believe is inconsistent behaviour in several functions in the `sklearn.metrics` module. I think the problem is with the documentation, but I prefer not to change that until I've run it past the community and confirmed my understanding is correct (also, a fix seems to be in the works in #2610). In the example below I'm using `recall_score`, but the same applies to `precision_score` and `f1_score`.
```python
from sklearn.metrics import recall_score

# two classes, string or int labels. Per-class recall works as expected
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average=None))
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average=None))

# three classes - per-class recall as expected
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average=None))

# macro-averaging works as I expect it to
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average='macro'))
```
The output of the above is what I expect it to be; in particular, the macro-averaged recall is just the mean of the three per-class recalls.
```
[ 0.5  1. ]
[ 0.5  1. ]
[ 0.5  1.  1. ]
0.833333333333
```
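As a sanity check (my own addition, not part of the original snippet), the macro figure can be reproduced by averaging the per-class recalls with numpy:

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = ['b', 'a', 'a', 'b', 'c']
y_pred = ['b', 'b', 'a', 'b', 'c']

per_class = recall_score(y_true, y_pred, average=None)  # [0.5, 1.0, 1.0]
print(np.mean(per_class))                               # 0.8333...
print(recall_score(y_true, y_pred, average='macro'))    # same value
```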
When averaging in the binary case, if the labels contain the magic value `1`, you get the recall for class `1`. This is mentioned in the documentation (even though I do not understand why this is done). However, if the labels do not contain the integer `1`, an error is raised.
```python
# averaging silently returns class 1's recall
print('silent', recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro'))  # outputs "silent 1.0"
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro'))  # this raises
```
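As a cross-check (again a sketch of my own, not in the original snippet), computing class 1's recall by hand from the confusion counts gives exactly the value the "silent" call returns:

```python
import numpy as np

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1])

# recall of class 1 = true positives / (true positives + false negatives)
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred != 1))
print(tp / float(tp + fn))  # 1.0, matching the "silent 1.0" above
```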
In #2094, @jnothman suggests that binary classification should be handled differently. I am not sure why this is the case; my understanding is that macro averaging works the same way for any number of classes, including two. To get the behaviour I expect out of `recall_score` in the binary case, I had to call it with `pos_label=None`:
```python
# macro averaging as expected if pos_label=None (undocumented!)
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro', pos_label=None))  # outputs 0.75
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro', pos_label=None))  # outputs 0.75
```
I think two points need to be made more explicit in the documentation:

- why the metrics work differently in the binary case
- the semantics of `pos_label` and `labels` (see the sketch below)
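To illustrate what I mean by the second point, here is a minimal sketch of how I understand `labels` and `pos_label`. It assumes the current scikit-learn API, where `average='binary'` makes the positive-class behaviour explicit, so it may not match older releases:

```python
from sklearn.metrics import recall_score

y_true = ['b', 'a', 'a', 'b']
y_pred = ['b', 'b', 'a', 'b']

# `labels` fixes which classes are scored, and in what order, when
# average=None returns one recall per class
print(recall_score(y_true, y_pred, labels=['a', 'b'], average=None))  # [0.5  1. ]
print(recall_score(y_true, y_pred, labels=['b', 'a'], average=None))  # [1.   0.5]

# `pos_label` picks the class treated as "positive" when a single
# binary score is reported
print(recall_score(y_true, y_pred, pos_label='a', average='binary'))  # 0.5
print(recall_score(y_true, y_pred, pos_label='b', average='binary'))  # 1.0
```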