
[MRG+1] Add average precision definitions and cross references #9583


Merged
merged 9 commits into scikit-learn:master on Sep 25, 2017

Conversation

@agitter (Contributor) commented Aug 18, 2017

Reference Issue

Adds documentation requested in #4577 regarding pull request #9017. Improved documentation may also address the concerns in #5379.

What does this implement/fix? Explain your changes.

The changes include:

  • Remove both references to interpolated precision in the average_precision_score docstring
  • Use a Wikipedia permalink for the AP reference in the average_precision_score docstring
  • Copy the AP definition from plot_precision_recall.py to the average_precision_score docstring and model_evaluation.rst
  • Recommend AP instead of trapezoidal AUC on PR curve operating points
  • Add average_precision_score references to docstrings of related functions

Any other comments?

There has been some confusion in scikit-learn about the area under precision-recall curves. Some documents currently refer to the AP computed by average_precision_score as that area; others refer to it as a summary of the curve. It would improve consistency to call AP the official area under the precision-recall curve throughout the documentation, but is that generally accepted?

@agitter (Contributor Author) commented Aug 18, 2017

Related to #9557 but there should not be any conflicts.

@jnothman (Member) left a comment


Otherwise LGTM, thanks.

@amueller, this is also worth having in 0.19.1.

@jnothman jnothman added this to the 0.19.1 milestone Aug 20, 2017
@agitter (Contributor Author) commented Aug 21, 2017

I support #9586 for the sake of consistency.

@rhiever I'd appreciate your thoughts on phrasing here, especially:

When summarizing a precision-recall curve, AP is preferable to computing the trapezoidal area under the operating points.

I also suggest that if we say AP is not equivalent to the area under the curve, the documentation should define somewhere what that area is and which scikit-learn function computes it. AUCPR is a much more commonly reported metric than AP.

@amueller (Member) left a comment


I think this would be good for 0.19.1, but I'm not entirely satisfied ;) It took me too long to figure out what's happening, and I don't wish that on anyone else. Right now the docs still say we're computing the area under the PR curve, which is misleading. Also, we have fewer references now?

@@ -636,7 +636,14 @@ by varying a decision threshold.
 The :func:`average_precision_score` function computes the average precision
 (AP) from prediction scores. This score corresponds to the area under the
 precision-recall curve. The value is between 0 and 1 and higher is better.
-With random predictions, the AP is the fraction of positive samples.
+With random predictions, the AP is the fraction of positive samples. AP is
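For illustration, a minimal sketch (mine, not part of the diff; synthetic labels and scores) of the sentence about random predictions:

```python
# Hedged sketch: with uninformative random scores, average precision is
# approximately the fraction of positive samples (the prevalence).
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.3, size=10000)   # ~30% positive samples
y_score = rng.uniform(size=10000)           # random, uninformative scores

print(average_precision_score(y_true, y_score))  # approximately 0.3
print(y_true.mean())                             # the positive fraction
```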
Member

I would move the "With random predictions" sentence later; otherwise it seems like it relates to the definition. I would mention here that there are interpolated variants and that we're not computing one of those. Actually, I would rephrase the sentence above saying that this score corresponds to the area under the precision-recall curve, because that is not entirely true / depends a lot on how you define that integral. When plotting the precision-recall curve, it's natural to do linear interpolation for visualization purposes (even though that might be misleading), but that's not how we calculate the score.

Member

Also add references to the "relationship between PR curves and ROC curves" paper, the IR book, and the Pascal VOC paper / code (saying it's different), and point out that linear interpolation leads to realizable classifiers for ROC curves but not for PR curves.

Member

Maybe also add a reference to http://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf because it also argues against linear interpolation.

Contributor Author

I support these changes and will make the revisions.

If we remove the AUCPR statement here, I recommend that somewhere we state what the canonical AUCPR calculation is in scikit-learn. A user who wants to report AUCPR should be able to quickly find the appropriate function. @amueller, do you suggest that the auc function applied to the precision-recall operating points is the best option?

@@ -69,7 +69,8 @@

 where :math:`P_n` and :math:`R_n` are the precision and recall at the
 nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
-*operating point*.
+*operating point*. When summarizing a precision-recall curve, AP is preferable
+to computing the trapezoidal area under the operating points.
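To make the contrast concrete, a hedged sketch (not from this PR; synthetic scores) comparing the two summaries:

```python
# AP (step-wise sum over operating points) vs. the trapezoidal area under the
# same operating points; the two summaries generally differ.
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.RandomState(42)
y_true = rng.binomial(1, 0.5, size=200)
y_score = rng.uniform(size=200) + 0.5 * y_true   # weakly informative scores

precision, recall, _ = precision_recall_curve(y_true, y_score)

ap = average_precision_score(y_true, y_score)    # no interpolation
trapezoid = auc(recall, precision)               # linear interpolation between points

print(ap, trapezoid)  # close, but not identical in general
```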
Member

Because linear interpolation does not correspond to a realizable classifier?

Contributor Author

My main gripe is the corner case where a classifier predicts only a single score, or scores close enough to be treated as equivalent. Computing the area with the trapezoid rule in that case gives an area > 0.5 to a poor classifier. I've seen this cause problems for students running automated hyperparameter searches based on AUCPR.

But whatever is going to be the reference AUCPR value should be recommended here. If that's not AP, I'll change it.
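A hedged sketch of that corner case (synthetic data; assumes the non-interpolated AP of scikit-learn 0.19 and later):

```python
# A classifier that outputs a single constant score: the trapezoid rule
# rewards it with an area above 0.5, while AP reduces to the prevalence.
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([0] * 90 + [1] * 10)   # 10% positives
y_score = np.full(100, 0.5)              # uninformative constant score

precision, recall, _ = precision_recall_curve(y_true, y_score)

print(auc(recall, precision))                    # 0.55, i.e. (1 + 0.1) / 2
print(average_precision_score(y_true, y_score))  # 0.1, the positive fraction
```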

<http://citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_
<http://en.wikipedia.org/w/index.php?title=Information_retrieval&
oldid=793358396#Average_precision>`_
Member

Why did you remove the references? I think these references should go into the user guide, and we should explain exactly the relationship between the different approaches.

Contributor Author

I originally found it confusing to reference alternative implementations without explicitly stating that they differ from the AP implementation here. I'll add them to the user guide. I could also add them back here and present them as alternative interpolated approaches.

Member

I agree. References in the implementation should be specific to the implementation. The user guide can give more commentary.

@agitter (Contributor Author) commented Aug 22, 2017

@amueller The latest commits add the references you suggested with some discussion, remove the area under the curve equivalence, and move the random performance comment. You can preview the changes here.

I think the docs still need some work before merging. The new version points out problems with linear interpolation but doesn't suggest a better approach. I would like to add clear guidance about the recommended way to summarize a precision-recall curve. average_precision_score and auc seem to be the options.

@jnothman (Member) commented Aug 22, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

It would be fine to leave AUC out for now and open a new issue to discuss AUCPR recommendations. Should I then remove the tentative recommendations to use AP from this pull request?

  • In plot_precision_recall.py I added "When summarizing a precision-recall curve, AP is preferable to computing the trapezoidal area under the operating points."
  • In auc I added "For summarizing a precision-recall curve, see :func:`average_precision_score`."

@jnothman (Member) commented Aug 23, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

I agree that average precision is worth using. I was getting confused about whether we should recommend AP as the preferred way to summarize a PR curve when many users will be accustomed to seeing AUCPR as the more common summary.

If you think it reads well as-is, I'm happy with it too.

@jnothman (Member) commented Aug 23, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

I think that depends entirely on which field you're coming from.

@jnothman Good point. Commits f7e7cc1 and 169a2a6 are small changes to make it clear that both AP and AUC are reasonable ways to summarize a PR curve and AP != AUCPR. I changed the title of the example precision-recall curves to call average precision AP instead of AUC. The user guide can discuss the merits of the different options, including those not implemented in scikit-learn.

I have no more revisions planned.

@jnothman (Member) commented Aug 23, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

For review convenience, here are the previews of the docs I revised in the last two commits:

@jnothman (Member)

LGTM!

@jnothman jnothman changed the title from "[MRG] Add average precision definitions and cross references" to "[MRG+1] Add average precision definitions and cross references" Aug 23, 2017
@amueller (Member) left a comment


Some nitpicks. Looks pretty good overall!


:math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.

AP and the trapezoidal area under the operating points
(:func:`sklearn.metrics.auc`) are common ways to summarize a precision-recall
Member

maybe emphasize that they lead to different results?

@@ -40,7 +40,9 @@ def auc(x, y, reorder=False):
"""Compute Area Under the Curve (AUC) using the trapezoidal rule

This is a general function, given points on a curve. For computing the
area under the ROC-curve, see :func:`roc_auc_score`.
area under the ROC-curve, see :func:`roc_auc_score`. For an alternative
way to summarize a precision-recall curve, see
Member

I don't understand. AUC is area under the ROC curve, not PR curve.

Contributor Author

@amueller Per "This is a general function, given points on a curve" and the Davis and Goadrich paper, my understanding was that AUC can refer to the trapezoidal area under any curve. In the Davis and Goadrich terminology we can have AUC-ROC or AUC-PR.

What change do you recommend?
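A small sketch (mine, not from the thread; synthetic data) of that reading of auc as a general trapezoidal rule, which applied to ROC points reproduces roc_auc_score:

```python
# auc is a general trapezoidal rule over (x, y) points; on the ROC curve it
# matches roc_auc_score, while on a PR curve it gives one possible AUC-PR.
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.5, size=100)
y_score = rng.uniform(size=100) + y_true   # informative synthetic scores

fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr), roc_auc_score(y_true, y_score))  # the same value
```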

Member

never mind, I think your version is fine.

Contributor Author

I'm updating the roc_auc_score description to say it computes ROC AUC (the acronym used in _binary_roc_auc_score) instead of the more general AUC.

\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n

where :math:`P_n` and :math:`R_n` are the precision and recall at the nth
threshold [1]_.
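As a sanity check, a hedged sketch (not part of the docstring; synthetic data) showing that the step-wise sum above matches average_precision_score in scikit-learn 0.19 and later:

```python
# Manually evaluate AP = sum_n (R_n - R_{n-1}) * P_n from the operating points
# returned by precision_recall_curve and compare with average_precision_score.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.4, size=50)
y_score = rng.uniform(size=50) + y_true

precision, recall, _ = precision_recall_curve(y_true, y_score)

# recall is returned in decreasing order, hence the leading minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap_manual, average_precision_score(y_true, y_score))  # identical values
```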
Member

Maybe explicitly say that this is not interpolated and that this is different from computing an interpolated area under the precision-recall curve, which can be too optimistic.

@@ -396,6 +404,10 @@ def precision_recall_curve(y_true, probas_pred, pos_label=None,
 Increasing thresholds on the decision function used to compute
 precision and recall.

+See also
+--------
+average_precision_score : Compute average precision from prediction scores
Member

maybe also add roc_curve? but also fine as-is.
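For context, a brief usage sketch (not from the diff; synthetic scores) of the two curve functions being cross-referenced, side by side:

```python
# precision_recall_curve and roc_curve on the same predictions.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.3, size=100)
y_score = rng.uniform(size=100) + 0.5 * y_true

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# precision/recall have one more entry than pr_thresholds (the appended
# recall=0, precision=1 endpoint); fpr, tpr and roc_thresholds match in length.
print(precision.shape, recall.shape, pr_thresholds.shape)
print(fpr.shape, tpr.shape, roc_thresholds.shape)
```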


.. topic:: References:

* C.D. Manning, P. Raghavan, H. Schütze, `Introduction to Information Retrieval
Member

Can you use the RST citation syntax and use references in the text, so that someone reading it knows which citation belongs to which statement? The wiki and IR book references should go with the definition. Then you can say "some areas such as computer vision use the interpolated area under the precision-recall curve instead [cite Pascal VOC]. This can lead to an overly optimistic measure of classifier performance and has been criticized in the literature [Davis, Flach]. Right now, average_precision_score does not implement any interpolated variant".

Contributor Author

@amueller Yes, I can use RST citation syntax. I haven't used it before, but the docs make it look like I only need to add tags like [Everingham2010].

I can also modify the text as suggested. However, my understanding is that the Pascal VOC paper and Davis / Flach papers discuss two different concepts. The VOC paper presents the 11-point interpolated average precision. The Davis / Flach papers do not discuss that but instead criticize linear interpolation of a PR curve.
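For concreteness, a hedged sketch (mine, not part of the PR; synthetic data) contrasting scikit-learn's non-interpolated AP with an 11-point interpolated AP of the kind described in the VOC paper:

```python
# Non-interpolated AP (scikit-learn 0.19+) vs. an 11-point interpolated AP:
# at each recall level r in {0.0, 0.1, ..., 1.0}, take the best precision
# achievable at recall >= r, then average over the 11 levels.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.RandomState(1)
y_true = rng.binomial(1, 0.3, size=200)
y_score = rng.uniform(size=200) + 0.7 * y_true

precision, recall, _ = precision_recall_curve(y_true, y_score)

levels = np.linspace(0.0, 1.0, 11)
ap_11pt = np.mean([precision[recall >= r].max() for r in levels])

# The interpolated variant is typically the more optimistic of the two.
print(average_precision_score(y_true, y_score), ap_11pt)
```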

Member

VOC doesn't use the 11-point interpolation in their code.

Contributor Author

@amueller Commit 0c0f4bf includes all of the changes you recommended during your last review. I added citation links in model_evaluation.rst but used different phrasing. My reading is that [Manning2008] and [Everingham2010] discuss interpolated AP whereas [Davis2006] and [Flach2015] criticize linear interpolation for AUC of precision-recall curves.

I'll confirm the citation links work once the docs finish building.

@agitter (Contributor Author) commented Sep 24, 2017

This is ready for re-review. The last commit fixed the RST citation links, which you can preview here.

@jnothman (Member) left a comment


This is great, thanks!

@jnothman jnothman merged commit 59bb32f into scikit-learn:master Sep 25, 2017
@agitter agitter deleted the pr-docs branch September 25, 2017 10:31
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017