
Inconsistency in zero_division handling between precision/recall/f1 and precision_recall_curve/roc_curve related metrics #27047

@qmarcou

Description


Describe the workflow you want to enable

For precision_score, recall_score, and f1_score, the API lets the user set the behavior when a zero division occurs (for instance when no positive label is present in the dataset) via the keyword argument:

zero_division : {"warn", 0.0, 1.0, np.nan}, default="warn"

Sets the value to return when there is a zero division.

Notes:
- If set to "warn", this acts like 0, but a warning is also raised.
- If set to np.nan, such values will be excluded from the average.

New in version 1.3: np.nan option was added.
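
For reference, a minimal example of how this keyword already behaves in precision_score and recall_score when the denominator is zero (requires scikit-learn >= 1.3 for the np.nan option):

import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.zeros(4, dtype=int)  # no positive labels at all
y_pred = np.zeros(4, dtype=int)  # no positive predictions either

# Both denominators are zero, so the zero_division value is returned as-is:
precision_score(y_true, y_pred, zero_division=np.nan)  # nan, no warning
recall_score(y_true, y_pred, zero_division=1.0)        # 1.0, no warning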

precision_recall_curve, roc_curve, roc_auc_score, average_precision_score and label_ranking_average_precision_score, despite having to compute precision or recall under the hood, do not offer the same possibility.
While this is unlikely to pose a problem in the micro-averaging setting, it becomes more likely in the sample-averaging (AP and LRAP, possibly roc_auc_score) or macro-averaging setting.

For instance, precision_recall_curve does not use the precision_score or recall_score functions:

ps = tps + fps
# Initialize the result array with zeros to make sure that precision[ps == 0]
# does not contain uninitialized values.
precision = np.zeros_like(tps)
np.divide(tps, ps, out=precision, where=(ps != 0))
# When no positive label in y_true, recall is set to 1 for all thresholds
# tps[-1] == 0 <=> y_true == all negative labels
if tps[-1] == 0:
    warnings.warn(
        "No positive class found in y_true, "
        "recall is set to one for all thresholds."
    )
    recall = np.ones_like(tps)
else:
    recall = tps / tps[-1]

In this implementation, when there is no positive example, recall is hard-coded to 1 and precision evaluates to 0 for all thresholds.
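
The consequence is easy to see side by side: recall_score honours zero_division while precision_recall_curve always falls back to recall = 1 (small illustration, outputs derived from the snippet above):

import numpy as np
from sklearn.metrics import recall_score, precision_recall_curve

y_true = np.zeros(4, dtype=int)           # all-negative ground truth
y_score = np.array([0.1, 0.4, 0.6, 0.8])

# recall_score lets the caller choose the zero-division value...
recall_score(y_true, (y_score > 0.5).astype(int), zero_division=0.0)  # 0.0
# ...while precision_recall_curve hard-codes recall = 1 for every threshold
# (and emits the UserWarning quoted above):
_, recall, _ = precision_recall_curve(y_true, y_score, pos_label=1)
recall  # array([1., 1., 1., 1., 0.])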

Describe your proposed solution

Use the existing precision_score and recall_score functions in all precision-recall and ROC curve related functions, add the same zero_division keyword argument, and forward it to precision_score and recall_score.
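
As a rough sketch of the intent (the zero_division keyword on precision_recall_curve and the _zero_division_value helper below are hypothetical, not current scikit-learn API), the degenerate branch quoted above could honour the caller's choice instead of hard-coding recall = 1:

import warnings

def _zero_division_value(zero_division):
    # Hypothetical helper: map the keyword to the value used when the
    # denominator is zero ("warn" behaves like 0.0 plus a warning).
    if zero_division == "warn":
        warnings.warn("No positive class found in y_true, recall is ill-defined.")
        return 0.0
    return float(zero_division)  # 0.0, 1.0 or np.nan

# Sketch of the corresponding branch inside precision_recall_curve:
# if tps[-1] == 0:
#     recall = np.full(tps.shape, _zero_division_value(zero_division))
# else:
#     recall = tps / tps[-1]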

Describe alternatives you've considered, if relevant

No response

Additional context

I think this discussion is linked to:

Also, to add more context (though changing this could break some existing code): roc_curve and precision_recall_curve do not handle this problem consistently:

import numpy as np
import sklearn.metrics as skmet

y_true = np.zeros(10)
y_pred = np.random.uniform(size=y_true.shape)

skmet.roc_curve(y_true,y_pred,pos_label=1)
# UndefinedMetricWarning: No positive samples in y_true, true positive value should be meaningless
#  warnings.warn(
# Out: 
# (array([0. , 0.1, 1. ]),
# array([nan, nan, nan]),  # Recall or True positive Rate
# array([1.82341255, 0.82341255, 0.0795866 ]))

skmet.precision_recall_curve(y_true,y_pred,pos_label=1)
#UserWarning: No positive class found in y_true, recall is set to one for all thresholds.
#  warnings.warn(
#Out: 
#(array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]),  # Precision
# array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0.]),  # Recall or True positive Rate
# array([0.0795866 , 0.3813231 , 0.41105316, 0.56378517, 0.56951648,
#        0.60346455, 0.61754398, 0.61861517, 0.70285933, 0.82341255]))

# Compute AUC for the PR curve
prec, recall, thresh = skmet.precision_recall_curve(y_true, y_pred, pos_label=1)
# UserWarning: No positive class found in y_true, recall is set to one for all thresholds.
skmet.auc(recall, prec)
# Out: 0.5

This translates into an undefined (NaN) AUC for the ROC curve and a strange 0.5 value (neither 0 nor 1) for the AUC of the PR curve. I also find the warning confusing: it states that recall is set to 1 for all thresholds, yet the last value of the recall vector is 0.
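
For completeness, roc_auc_score treats the same degenerate input differently again: instead of returning NaN or a fallback value, it raises outright (behavior of current scikit-learn releases, continuing the snippet above):

skmet.roc_auc_score(y_true, y_pred)
# ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.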

On the same theme, I think ndcg_score is also somewhat inconsistent (in the binary relevance case without positive examples the ideal DCG is 0.0, hence a zero division somewhere):

y_true = np.zeros((10, 5))
y_pred = np.random.uniform(size=y_true.shape)

skmet.ndcg_score(y_true, y_pred)
# Out: 0.0

If one wanted to perform sample averaging of these three different metrics (though for the PR curve, AP and LRAP are the way to go for sample averaging), they would all behave differently upon encountering a sample with zero positive examples.
