average_precision_score() overestimates AUC value #13074

Closed
barrydebruin opened this issue Jan 31, 2019 · 5 comments

barrydebruin commented Jan 31, 2019

Description

The average_precision_score() function in sklearn doesn't return a correct AUC value.

Steps/Code to Reproduce

Example:

import numpy as np
"""
    Desc: average_precision_score returns overestimated AUC of precision-recall curve
"""
# pathological example
p = [0.833, 0.800] # precision
r = [0.294, 0.235] # recall

# computation of average_precision_score()
print("AUC       = {:3f}".format(-np.sum(np.diff(r) * np.array(p)[:-1]))) # _binary_uninterpolated_average_precision()

# computation of auc() with trapezoid interpolation
print("AUC TRAP. = {:3f}".format(-np.trapz(p, r)))

# possible fix in _binary_uninterpolated_average_precision() **(edited)**
print("AUC FIX   = {:3f}".format(-np.sum(np.diff(r) * np.minimum(p[:-1], p[1:])))

#>> AUC       = 0.049147
#>> AUC TRAP. = 0.048174
#>> AUC FIX   = 0.047200

Expected Results

AUC without interpolation = (0.294 - 0.235) * 0.800 = 0.0472
AUC with trapezoidal interpolation = 0.0472 + (0.294 - 0.235) * (0.833 - 0.800) / 2 ≈ 0.0482

Actual Results

This is what sklearn implements for AUC without interpolation (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html):

sum((r[i] - r[i+1]) * p[i] for i in range(len(p)-1))
>> 0.049147

This is what I initially thought was correct (but see the edit below):

sum((r[i] - r[i+1]) * p[i+1] for i in range(len(p)-1))
>> 0.047200

EDIT: I found that the above 'correct' implementation doesn't always underestimate. It depends on the input. Therefore I have revised the uninterpolated AUC calculation to this:

sum((r[i] - r[i+1]) * min(p[i], p[i+1]) for i in range(len(p)-1))
>> 0.047200

This has the advantage that the AUC calculation is more consistent: compared to the current uninterpolated AUC function, it is either equal or lower, but never higher. Below I show some examples of what it does (a self-contained sketch reproducing all three follows the examples):

  • Example 1: all work fine
p = [0.3, 1.0]
r = [1.0, 0.0]

#Results:
>> 0.30    # sklearn's _binary_uninterpolated_average_precision()
>> 0.30    # my consistent _binary_uninterpolated_average_precision()
>> 0.65    # np.trapz() (trapezoidal interpolation)

[Figure: precision-recall curve for Example 1]

  • Example 2: sklearn's _binary_uninterpolated_average_precision returns an inaccurate number
p = [1.0, 0.3]
r = [1.0, 0.0]

#Results:
>> 1.00    # sklearn's _binary_uninterpolated_average_precision()
>> 0.30    # my consistent _binary_uninterpolated_average_precision()
>> 0.65    # np.trapz() (trapezoidal interpolation)

[Figure: precision-recall curve for Example 2]

  • Example 3: extra example
p = [0.4, 0.1, 1.0]
r = [1.0, 0.9, 0.0]

#Results:
>> 0.13      # sklearn's _binary_uninterpolated_average_precision()
>> 0.10      # my consistent _binary_uninterpolated_average_precision()
>> 0.52      # np.trapz() (trapezoidal interpolation)

[Figure: precision-recall curve for Example 3]
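
For completeness, here is a self-contained sketch (the function names are mine, not sklearn's) that reproduces the three numbers for each example above:

import numpy as np

def ap_current(p, r):
    # sklearn's current rule: precision taken at the higher-recall point
    return sum((r[i] - r[i+1]) * p[i] for i in range(len(p) - 1))

def ap_consistent(p, r):
    # proposed rule: the worse of the two adjacent precisions
    return sum((r[i] - r[i+1]) * min(p[i], p[i+1]) for i in range(len(p) - 1))

def ap_trapz(p, r):
    # trapezoidal interpolation; recall is in decreasing order, hence the minus
    return -np.trapz(p, r)

examples = [([0.3, 1.0], [1.0, 0.0]),
            ([1.0, 0.3], [1.0, 0.0]),
            ([0.4, 0.1, 1.0], [1.0, 0.9, 0.0])]
for p, r in examples:
    print("{:.2f}  {:.2f}  {:.2f}".format(
        ap_current(p, r), ap_consistent(p, r), ap_trapz(p, r)))

#>> 0.30  0.30  0.65
#>> 1.00  0.30  0.65
#>> 0.13  0.10  0.52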

Versions

Windows-10-10.0.17134-SP0
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 10:22:32) [MSC v.1900 64 bit (AMD64)]
NumPy 1.14.0
SciPy 1.0.0
Scikit-Learn 0.19.1

jnothman (Member) commented Jan 31, 2019 via email

barrydebruin (Author) commented

Thank you for your comment. I did see some of these older issues, but not all of them. I did actually find some cases where the AUC value is underestimated as well, which makes the problem a bit more complex than I initially thought.

For datasets with a small number of precision and recall thresholds, it seems better for now to use the interpolated area under the curve (i.e. sklearn.metrics.auc() or np.trapz()), or am I mistaken?
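
For reference, a minimal sketch of that interpolated alternative (the toy labels and scores are made up for illustration):

import numpy as np
from sklearn.metrics import auc, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# recall comes back in decreasing order; auc() detects the direction itself
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))  # trapezoidal area under the PR curve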

jnothman (Member) commented Jan 31, 2019 via email

ndingwall (Contributor) commented

I don't think the current implementation overestimates. Consider your second example:

p = [1.0, 0.3]
r = [1.0, 0.0]

If I told you I wanted a recall of 0.5, you couldn't give me the second operating point (p=0.3, r=0.0) because the recall is too low. You could give me the first one (p=1.0, r=1.0) because r=1.0 implies that (more than) half of the positive datapoints have been recalled. Therefore, the most reasonable precision value we can choose to correspond to r=0.5 is p=1.0. That's the motivation for using the precision value at the next operating point (i.e. the one with the next-largest recall value).
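
A quick sketch of that argument (precision_at_recall is a hypothetical helper, not sklearn API):

def precision_at_recall(p, r, target):
    # among operating points that reach at least `target` recall, take the
    # one with the next-largest recall, i.e. the smallest feasible recall
    feasible = [(ri, pi) for pi, ri in zip(p, r) if ri >= target]
    return min(feasible)[1] if feasible else 0.0

p = [1.0, 0.3]
r = [1.0, 0.0]
print(precision_at_recall(p, r, 0.5))  # 1.0 -- only the first point qualifies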

Your implementation uses min(p[i], p[i+1]), which inconsistently chooses either the previous or the subsequent operating point, depending on which one does worse. That systematically underestimates the true average precision score.

thomasjpfan (Member) commented

As part of scikit-learn's triaging guidelines, I am closing this issue because it is a duplicate of #4577.
