
Averaging of precision/recall/F1 inconsistent #3122

Closed
@mbatchkarov

Description

Hello everyone,

I've come across what I believe is inconsistent behaviour in several functions in the sklearn.metrics module. I think the problem is with the documentation, but I'd rather not change it until I've run it past the community and confirmed my understanding is correct (a fix also seems to be in the works in #2610).

In the example below I'm using recall_score, but the same applies to precision_score and f1_score.

from sklearn.metrics import recall_score

# two classes, string or int labels: per-class recall works as expected
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average=None))
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average=None))

# three classes: per-class recall as expected
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average=None))
# macro-averaging works as I expect it to
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average='macro'))

The output of the above is what I expect: in particular, the macro-averaged recall is just the unweighted mean of the three per-class recalls.

[ 0.5  1. ]
[ 0.5  1. ]
[ 0.5  1.   1. ]
0.833333333333
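
For what it's worth, here is a minimal sketch that reproduces these numbers by hand (it only uses confusion_matrix from sklearn.metrics and numpy, nothing specific to this issue) and confirms that 'macro' is just the unweighted mean of the per-class recalls:

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ['b', 'a', 'a', 'b', 'c']
y_pred = ['b', 'b', 'a', 'b', 'c']

# Rows of the confusion matrix are true classes, columns are predictions,
# so per-class recall is the diagonal divided by the row sums.
cm = confusion_matrix(y_true, y_pred)
per_class_recall = np.diag(cm) / cm.sum(axis=1).astype(float)
print(per_class_recall)         # [ 0.5  1.   1. ]
print(per_class_recall.mean())  # 0.833333333333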

When averaging in the binary case, if the labels contain the magic value 1 (the default pos_label), you get the recall for class 1 only. This is mentioned in the documentation (even though I do not understand why it is done this way). However, if the labels do not contain the integer 1, an error is raised.

# averaging silently returns class 1's recall
print('silent', recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro'))  # outputs "silent 1.0"
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro'))  # this raises
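
For what it's worth, that silent 1.0 appears to be just the second entry of the per-class result, i.e. the recall of the positive class only (a quick check, assuming average=None keeps behaving as above):

# The averaged score in the binary case silently reduces to the entry
# for pos_label, which defaults to 1.
per_class = recall_score([1, 0, 0, 1], [1, 1, 0, 1], average=None)
print(per_class[1])  # 1.0, the same as the 'silent' result above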

In #2094 @jnothman suggests that binary classification should be handled differently. I am not sure why this is the case: my understanding is that macro averaging works the same way for any number of classes, including two. To get the behaviour I expect out of recall_score in the binary case, I had to call it with pos_label=None:

# macro averaging works as expected if pos_label=None (undocumented!)
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro', pos_label=None))  # outputs 0.75
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro', pos_label=None))  # outputs 0.75
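
As a sanity check, the 0.75 above is indeed the unweighted mean of the two per-class recalls (again using nothing beyond numpy and the calls already shown):

import numpy as np
# With pos_label=None, 'macro' behaves like the multiclass case:
# the mean of the per-class recalls [0.5, 1.0].
print(np.mean(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average=None)))  # 0.75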

I think two points need to be made more explicit in the documentation:

  • why the metrics work differently in the binary case
  • the semantics of pos_label and labels
