Description
Hello everyone,
I've come across what I believe is inconsistent behaviour in several functions in the `sklearn.metrics` module. I think the problem is with the documentation, but I prefer not to change that until I've run it past the community and confirmed my understanding is correct (also, a fix seems to be in the works in #2610). In the example below I'm using `recall_score`, but the same applies to `precision_score` and `f1_score`.
```python
from sklearn.metrics import recall_score

# two classes, string or int labels. Per-class recall works as expected
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average=None))
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average=None))

# three classes - per-class recall as expected
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average=None))

# macro-averaging works as I expect it to
print(recall_score(['b', 'a', 'a', 'b', 'c'], ['b', 'b', 'a', 'b', 'c'], average='macro'))
```
The output of the above is what I expect it to be; in particular, the macro-averaged recall is just the mean of the three per-class recalls.
```
[ 0.5  1. ]
[ 0.5  1. ]
[ 0.5  1.  1. ]
0.833333333333
```
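As a sanity check (my own addition, not part of the original snippet), the macro figure can be reproduced by averaging the per-class recalls with numpy:

```python
import numpy as np
from sklearn.metrics import recall_score

y_true = ['b', 'a', 'a', 'b', 'c']
y_pred = ['b', 'b', 'a', 'b', 'c']

per_class = recall_score(y_true, y_pred, average=None)  # [0.5, 1.0, 1.0]
print(np.mean(per_class))                               # 0.8333...
print(recall_score(y_true, y_pred, average='macro'))    # same value
```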
When averaging in the binary case, if the labels contain the magic value `1`, you get the recall for class `1`. This is mentioned in the documentation (even though I do not understand why this is done). However, if the labels do not contain the integer `1`, an error is raised.
```python
# averaging silently returns class 1's recall
print('silent', recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro'))  # outputs "silent 1.0"
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro'))  # this raises
```
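As a cross-check (again a sketch of my own, not in the original snippet), computing class 1's recall by hand from the confusion counts gives exactly the value the "silent" call returns:

```python
import numpy as np

y_true = np.array([1, 0, 0, 1])
y_pred = np.array([1, 1, 0, 1])

# recall of class 1 = true positives / (true positives + false negatives)
tp = np.sum((y_true == 1) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred != 1))
print(tp / float(tp + fn))  # 1.0, matching the "silent 1.0" above
```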
In #2094, @jnothman suggests that binary classification should be handled differently. I am not sure why this is the case; my understanding is that macro averaging works the same way for any number of classes, including two. To get the behaviour I expect out of `recall_score` in the binary case, I had to call it with `pos_label=None`:
```python
# macro averaging as expected if pos_label=None (undocumented!)
print(recall_score([1, 0, 0, 1], [1, 1, 0, 1], average='macro', pos_label=None))  # outputs 0.75
print(recall_score(['b', 'a', 'a', 'b'], ['b', 'b', 'a', 'b'], average='macro', pos_label=None))  # outputs 0.75
```
I think two points need to be made more explicit in the documentation:

- why the metrics work differently in the binary case
- the semantics of `pos_label` and `labels` (see the sketch below)
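To illustrate what I mean by the second point, here is a minimal sketch of how I understand `labels` and `pos_label`. It assumes the current scikit-learn API, where `average='binary'` makes the positive-class behaviour explicit, so it may not match older releases:

```python
from sklearn.metrics import recall_score

y_true = ['b', 'a', 'a', 'b']
y_pred = ['b', 'b', 'a', 'b']

# `labels` fixes which classes are scored, and in what order, when
# average=None returns one recall per class
print(recall_score(y_true, y_pred, labels=['a', 'b'], average=None))  # [0.5  1. ]
print(recall_score(y_true, y_pred, labels=['b', 'a'], average=None))  # [1.   0.5]

# `pos_label` picks the class treated as "positive" when a single
# binary score is reported
print(recall_score(y_true, y_pred, pos_label='a', average='binary'))  # 0.5
print(recall_score(y_true, y_pred, pos_label='b', average='binary'))  # 1.0
```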