[MRG+1] Add average precision definitions and cross references #9583
Conversation
Related to #9557 but there should not be any conflicts.
Otherwise LGTM, thanks.
@amueller, this is also worth having in 0.19.1.
I support #9586 for the sake of consistency. @rhiever I'd appreciate your thoughts on phrasing here, especially:
I also suggest that if we say AP is not equivalent to area under the curve, somewhere in the documentation we define what that area is and which scikit-learn function computes it. AUCPR is a much more commonly reported metric than AP.
I think this would be good for 0.19.1. But I'm not entirely satisfied ;) It took me too long to figure out what's happening, and I don't wish that on anyone else. Right now the docs still say we're computing the area under the PR curve, which is misleading. Also, we have fewer references now?
doc/modules/model_evaluation.rst
Outdated
@@ -636,7 +636,14 @@ by varying a decision threshold.
The :func:`average_precision_score` function computes the average precision
(AP) from prediction scores. This score corresponds to the area under the
precision-recall curve. The value is between 0 and 1 and higher is better.
With random predictions, the AP is the fraction of positive samples.
With random predictions, the AP is the fraction of positive samples. AP is
I would move the "With random predictions" sentence later. Otherwise it seems like it relates to the definition. I would mention here that there are interpolated variants and that we're not doing that. Actually, I would rephrase the sentence above that says this score corresponds to the area under the precision-recall curve, because that is not entirely true / depends a lot on how you define that integral. Plotting the precision-recall curve, it's natural to do linear interpolation for visualization purposes (even though that might be misleading). But that's not how we calculate it.
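For illustration, a minimal sketch of the distinction being discussed (standard scikit-learn functions; the toy labels and scores are made up): average_precision_score is a step-wise sum over the operating points, while a trapezoidal area over the same points implicitly applies linear interpolation, so the two summaries generally differ.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy data, chosen only to produce several distinct operating points.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9])

# AP: non-interpolated, step-wise sum over operating points.
ap = average_precision_score(y_true, y_score)

# Trapezoidal area under the same operating points: linear interpolation.
precision, recall, _ = precision_recall_curve(y_true, y_score)
trapezoid = auc(recall, precision)

print(ap, trapezoid)  # the two values generally differ
```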
Also add references to the "relationship of PR curve and ROC curve" paper, the IR book, and the Pascal VOC paper / code (saying it's different). And point out that linear interpolation leads to realizable classifiers for the ROC curve but not the PR curve.
Maybe also add a reference to http://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf because it also argues against linear interpolation.
I support these changes and will make the revisions.
If we remove the AUCPR statement here, I recommend that somewhere we state what the canonical AUCPR calculation is in scikit-learn. A user who wants to report AUCPR should be able to quickly find the appropriate function. @amueller do you suggest that the auc function on the precision-recall operating points is the best option?
@@ -69,7 +69,8 @@

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.
*operating point*. When summarizing a precision-recall curve, AP is preferable
to computing the trapezoidal area under the operating points.
Because linear interpolation does not correspond to a realizable classifier?
My main gripe is the corner case where a classifier predicts only a single score, or scores that are close enough to be treated as equivalent. Computing the area with the trapezoid rule in that case gives area > 0.5 to a poor classifier. I've seen it cause problems for students trying to do automated hyperparameter searches based on AUCPR.
But whatever is going to be the reference AUCPR value should be recommended here. If that's not AP, I'll change it.
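To make the corner case concrete, here is a small sketch (toy data; standard scikit-learn functions) of a degenerate classifier that assigns the same score to every sample: the trapezoidal area rewards it, while AP collapses to the positive fraction.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Degenerate classifier: a single constant score for every sample.
y_true = np.array([0, 0, 0, 1])   # 25% positives
y_score = np.full(4, 0.5)

precision, recall, _ = precision_recall_curve(y_true, y_score)
# Only two operating points survive: (recall=1, precision=0.25) and (recall=0, precision=1).

print(auc(recall, precision))                    # 0.625 -- trapezoidal area looks deceptively good
print(average_precision_score(y_true, y_score))  # 0.25  -- AP equals the fraction of positives
```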
<http://citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_
<http://en.wikipedia.org/w/index.php?title=Information_retrieval&
oldid=793358396#Average_precision>`_
Why did you remove the references? I think these references should go into the user guide, and we should explain exactly the relation between the different approaches.
I originally found it confusing to reference alternative implementations without explicitly stating that they differ from the AP implementation here. I'll add them to the user guide. I could also add them back here and present them as alternative interpolated approaches.
I agree. References in the implementation should be specific to the implementation. The user guide can give more commentary.
@amueller The latest commits add the references you suggested with some discussion, remove the area under the curve equivalence, and move the random performance comment. You can preview the changes here. I think the docs still need some work before merging. The new version points out problems with linear interpolation but doesn't suggest a better approach. I would like to add clear guidance about the recommended way to summarize a precision-recall curve; average_precision_score and auc seem to be the options.
just leave auc out for now?
That would be fine to leave AUC out for now and open a new issue to discuss AUCPR recommendations. Should I remove the tentative recommendations to use AP from this pull request then?
What's wrong with recommending average precision? I hope we now believe that what we've got is something worth using!
I agree that the average precision is worth using. I was getting confused about whether we should recommend AP as the preferred way to summarize a PR curve if many users will be accustomed to seeing AUCPR as the more common summarization. If you think it reads well as-is, I'm happy with it too.
I think that depends entirely on which field you're coming from.
@jnothman Good point. Commits f7e7cc1 and 169a2a6 are small changes to make it clear that both AP and AUC are reasonable ways to summarize a PR curve and AP != AUCPR. I changed the title of the example precision-recall curves to call average precision AP instead of AUC. The user guide can discuss the merits of the different options, including those not implemented in scikit-learn. I have no more revisions planned.
Thanks. I'll try to review shortly.
For review convenience, here are the previews of the docs I revised in the last two commits:
LGTM!
Some nitpicks. Looks pretty good overall!
:math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.

AP and the trapezoidal area under the operating points
(:func:`sklearn.metrics.auc`) are common ways to summarize a precision-recall
maybe emphasize that they lead to different results?
@@ -40,7 +40,9 @@ def auc(x, y, reorder=False):
"""Compute Area Under the Curve (AUC) using the trapezoidal rule

This is a general function, given points on a curve. For computing the
area under the ROC-curve, see :func:`roc_auc_score`.
area under the ROC-curve, see :func:`roc_auc_score`. For an alternative
way to summarize a precision-recall curve, see
I don't understand. AUC is area under the ROC curve, not PR curve.
@amueller Per "This is a general function, given points on a curve" and the Davis and Goadrich paper, my understanding was that AUC can refer to the trapezoidal area under any curve. In the Davis and Goadrich terminology we can have AUC-ROC or AUC-PR.
What change do you recommend?
never mind, I think your version is fine.
I'm updating the roc_auc_score description to say it computes ROC AUC (the acronym used in _binary_roc_auc_score) instead of the more general AUC.
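As a small illustration of the point (standard scikit-learn functions; the four-sample toy data is made up), auc is the generic trapezoidal rule over arbitrary (x, y) points, and applying it to ROC operating points reproduces roc_auc_score:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Generic trapezoidal area under the ROC operating points...
fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr))                   # 0.75

# ...matches the ROC-specific convenience function.
print(roc_auc_score(y_true, y_score))  # 0.75
```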
sklearn/metrics/ranking.py
Outdated
\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n

where :math:`P_n` and :math:`R_n` are the precision and recall at the nth
threshold [1]_.
Maybe explicitly say that this is not interpolated and that this is different from computing an interpolated area under the precision-recall curve, which can be too optimistic.
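For reference, a minimal sketch (toy data; the documented formula is reimplemented by hand here for illustration, not taken from library internals) showing that the non-interpolated sum over operating points matches average_precision_score:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75])

precision, recall, _ = precision_recall_curve(y_true, y_score)

# AP = sum_n (R_n - R_{n-1}) * P_n, with no interpolation between points.
# recall is returned in decreasing order, hence the sign flip.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap_manual, average_precision_score(y_true, y_score))  # same value
```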
@@ -396,6 +404,10 @@ def precision_recall_curve(y_true, probas_pred, pos_label=None,
Increasing thresholds on the decision function used to compute
precision and recall.

See also
--------
average_precision_score : Compute average precision from prediction scores
maybe also add roc_curve? but also fine as-is.
doc/modules/model_evaluation.rst
Outdated
.. topic:: References:

* C.D. Manning, P. Raghavan, H. Schütze, `Introduction to Information Retrieval
Can you use the RST citation syntax and use references in the text, so that someone reading it knows which citation belongs to which statement? The wiki and IR book should go with the definition. Then you can say "some areas such as computer vision use the interpolated area under the precision-recall curve instead [cite Pascal VOC]. This can lead to an overly optimistic measure of classifier performance and has been criticized in the literature [Davis, Flach]. Right now, average_precision_score does not implement any interpolated variant".
@amueller Yes, I can use RST citation syntax. I haven't used it before, but the docs make it look like I only need to add tags like [Everingham2010].
I can also modify the text as suggested. However, my understanding is that the Pascal VOC paper and the Davis / Flach papers discuss two different concepts. The VOC paper presents the 11-point interpolated average precision. The Davis / Flach papers do not discuss that but instead criticize linear interpolation of a PR curve.
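To illustrate the distinction, a rough sketch only: eleven_point_interpolated_ap is a hypothetical helper written here for illustration, not a scikit-learn function, and it follows one common reading of the 11-point rule.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

def eleven_point_interpolated_ap(y_true, y_score):
    """11-point interpolated AP: at each recall level r in {0.0, 0.1, ..., 1.0},
    take the best precision achieved at any recall >= r, then average the 11 values."""
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    return np.mean([precision[recall >= r].max() for r in np.linspace(0.0, 1.0, 11)])

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.75])

# The interpolated and non-interpolated definitions generally give different values.
print(average_precision_score(y_true, y_score))
print(eleven_point_interpolated_ap(y_true, y_score))
```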
VOC doesn't use 11 point in their code
@amueller Commit 0c0f4bf includes all of the changes you recommended during your last review. I added citation links in model_evaluation.rst but used different phrasing. My reading is that [Manning2008] and [Everingham2010] discuss interpolated AP whereas [Davis2006] and [Flach2015] criticize linear interpolation for AUC of precision-recall curves.
I'll confirm the citation links work once the docs finish building.
This is ready for re-review. The last commit fixed the RST citation links, which you can preview here.
This is great, thanks!
Reference Issue
Adds documentation requested in #4577 regarding pull request #9017. Improved documentation may also address the concerns in #5379.
What does this implement/fix? Explain your changes.
The changes include:
- … the average_precision_score docstring
- … the average_precision_score docstring
- … plot_precision_recall.py to the average_precision_score docstring and model_evaluation.rst
- average_precision_score references to docstrings of related functions
Any other comments?
There has been some confusion about area under the curve for precision-recall curves in scikit-learn. Currently, some documents refer to the AP computed by average_precision_score as the area. Others refer to it as a summarization. It would improve consistency to call AP the official area under the precision-recall curve in all documentation, but is that generally accepted?