Describe the workflow you want to enable
The API makes it possible to choose the behavior when a zero division occurs because no positive label is present in the dataset, when computing precision_score, recall_score or f1_score, via the keyword argument:
zero_division : {"warn", 0.0, 1.0, np.nan}, default="warn"
    Sets the value to return when there is a zero division.
    Notes:
    - If set to "warn", this acts like 0, but a warning is also raised.
    - If set to np.nan, such values will be excluded from the average.
    New in version 1.3: np.nan option was added.
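To make this concrete, here is a short sketch of the existing behavior of the score functions (recall is a 0/0 division because y_true contains no positive label):

import numpy as np
from sklearn.metrics import recall_score

# Binary case: no positive label in y_true, so recall = tp / (tp + fn) is 0/0.
y_true = np.zeros(10, dtype=int)
y_pred = np.random.randint(0, 2, size=10)

recall_score(y_true, y_pred, zero_division="warn")    # 0.0 + UndefinedMetricWarning
recall_score(y_true, y_pred, zero_division=0.0)       # 0.0, silent
recall_score(y_true, y_pred, zero_division=np.nan)    # nan

# Multilabel/macro case: the second label has no positive example, so with
# zero_division=np.nan it is simply excluded from the macro average.
Y_true = np.array([[1, 0], [1, 0], [0, 0]])
Y_pred = np.array([[1, 0], [0, 1], [0, 0]])
recall_score(Y_true, Y_pred, average="macro", zero_division=np.nan)  # 0.5 (only label 0 counted)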
precision_recall_curve, roc_curve, roc_auc_score, average_precision_score and label_ranking_average_precision_score, despite having to compute precision or recall under the hood, do not offer the same possibility.
While this is unlikely to pose a problem in the micro averaging setting, it becomes more likely in the samples (AP and LRAP, possibly roc_auc_score) or macro averaging settings.
For instance, precision_recall_curve does not use the precision_score or recall_score functions:
scikit-learn/sklearn/metrics/_ranking.py
Lines 970 to 985 in e4efd8b
ps = tps + fps
# Initialize the result array with zeros to make sure that precision[ps == 0]
# does not contain uninitialized values.
precision = np.zeros_like(tps)
np.divide(tps, ps, out=precision, where=(ps != 0))
# When no positive label in y_true, recall is set to 1 for all thresholds
# tps[-1] == 0 <=> y_true == all negative labels
if tps[-1] == 0:
    warnings.warn(
        "No positive class found in y_true, "
        "recall is set to one for all thresholds."
    )
    recall = np.ones_like(tps)
else:
    recall = tps / tps[-1]
In this implementation, recall is set to 1 and precision to 0 when there is no positive example.
Describe your proposed solution
Use the existing precision_score and recall_score functions in all precision-recall and ROC curve functions. Add the same zero_division kwarg to those curve functions and forward it to precision_score and recall_score.
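As a rough sketch of what the forwarding could look like inside the curve computation (the zero_division argument on the curve functions and the _recall_from_counts helper below are hypothetical, not part of the current API):

import warnings
import numpy as np

def _recall_from_counts(tps, zero_division="warn"):
    # Hypothetical helper mirroring the zero_division semantics of
    # precision_score / recall_score; a sketch, not scikit-learn code.
    if tps[-1] == 0:  # no positive label in y_true -> 0/0 at every threshold
        if zero_division == "warn":
            warnings.warn("Recall is ill-defined because y_true contains no "
                          "positive label; setting recall to 0.0.")
            value = 0.0
        else:
            value = zero_division  # 0.0, 1.0 or np.nan
        return np.full(tps.shape, value, dtype=float)
    return tps / tps[-1]

# precision_recall_curve(..., zero_division=...) would forward its kwarg here,
# so the curve functions and the score functions share the same semantics.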
Describe alternatives you've considered, if relevant
No response
Additional context
I think this discussion is linked to:
- Inconsistency in AUC ROC and AUPR API #24381, which is associated with an unmerged PR. My proposed solution would also make it possible to fix that issue in a more consistent way (by returning 0.0 and raising a zero division warning).
- FIX Fix recall in multilabel classification when true labels are all negative #19085
Also, to add more context (though changing this could break some existing code): roc_curve and precision_recall_curve do not handle this problem consistently:
import numpy as np
import sklearn.metrics as skmet

y_true = np.zeros(10)
y_pred = np.random.uniform(size=y_true.shape)

skmet.roc_curve(y_true, y_pred, pos_label=1)
# UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
# Out:
# (array([0. , 0.1, 1. ]),
#  array([nan, nan, nan]),  # Recall or True Positive Rate
#  array([1.82341255, 0.82341255, 0.0795866 ]))

skmet.precision_recall_curve(y_true, y_pred, pos_label=1)
# UserWarning: No positive class found in y_true, recall is set to one for all thresholds.
# Out:
# (array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]),  # Precision
#  array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.]),  # Recall or True Positive Rate
#  array([0.0795866 , 0.3813231 , 0.41105316, 0.56378517, 0.56951648,
#         0.60346455, 0.61754398, 0.61861517, 0.70285933, 0.82341255]))

# Compute AUC for the PR curve
prec, recall, thresh = skmet.precision_recall_curve(y_true, y_pred, pos_label=1)
# UserWarning: No positive class found in y_true, recall is set to one for all thresholds.
skmet.auc(recall, prec)
# Out: 0.5
This translates into an undefined AUC for the ROC curve and a surprising 0.5 value (neither 0 nor 1) for the AUC of the PR curve. I also find the warning confusing: it states that recall is set to 1 for all thresholds, yet the last value of the recall vector is 0.
...
On the same theme, I think ndcg_score is also somewhat inconsistent (the ideal DCG is 0.0, hence a zero division somewhere, in the binary relevance case with no positive example):
y_true = np.zeros((10, 5))
y_pred = np.random.uniform(size=y_true.shape)
skmet.ndcg_score(y_true, y_pred)
# Out: 0.0
If one wanted to perform sample averaging of the three different metrics (though for the PR curve, AP and LRAP are the way to go for sample averaging), they would all behave differently upon encountering a sample with zero positive examples...
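For illustration, a small multilabel example where the second sample has no positive label at all (the exact outputs may differ across scikit-learn versions; the point is only that the three metrics do not treat that sample the same way):

import numpy as np
import sklearn.metrics as skmet

# Two samples; the second one has no positive label.
y_true = np.array([[1, 0, 1], [0, 0, 0]])
y_score = np.random.uniform(size=y_true.shape)

# AP: currently warns for the all-negative sample and includes a 0.0 in the average (see #24381).
skmet.average_precision_score(y_true, y_score, average="samples")

# LRAP: scores the all-negative sample as 1.0 by convention, without warning.
skmet.label_ranking_average_precision_score(y_true, y_score)

# ROC AUC: refuses to compute the per-sample score and raises.
try:
    skmet.roc_auc_score(y_true, y_score, average="samples")
except ValueError as exc:
    print(exc)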