
[MRG+1] Add average precision definitions and cross references #9583


Merged
merged 9 commits into scikit-learn:master on Sep 25, 2017

Conversation

@agitter (Contributor) commented Aug 18, 2017

Reference Issue

Adds documentation requested in #4577 regarding pull request #9017. Improved documentation may also address the concerns in #5379.

What does this implement/fix? Explain your changes.

The changes include:

  • Remove both references to interpolated precision in the average_precision_score docstring
  • Use a Wikipedia permalink for the AP reference in the average_precision_score docstring
  • Copy the AP definition from plot_precision_recall.py to the average_precision_score docstring and model_evaluation.rst
  • Recommend AP instead of trapezoidal AUC on PR curve operating points
  • Add average_precision_score references to docstrings of related functions

Any other comments?

There has been some confusion in scikit-learn about the area under precision-recall curves. Some documents currently refer to the AP computed by average_precision_score as that area; others refer to it as a summary of the curve. It would improve consistency to call AP the official area under the precision-recall curve throughout the documentation, but is that generally accepted?

@agitter (Contributor Author) commented Aug 18, 2017

Related to #9557 but there should not be any conflicts.

@jnothman (Member) left a comment


Otherwise LGTM, thanks.

@amueller, this is also worth having in 0.19.1.

@jnothman jnothman added this to the 0.19.1 milestone Aug 20, 2017
@agitter (Contributor Author) commented Aug 21, 2017

I support #9586 for the sake of consistency.

@rhiever I'd appreciate your thoughts on phrasing here, especially:

When summarizing a precision-recall curve, AP is preferable to computing the trapezoidal area under the operating points.

I also suggest that if we say AP is not equivalent to the area under the curve, the documentation should define somewhere what that area is and which scikit-learn function computes it. AUCPR is a much more commonly reported metric than AP.

@amueller (Member) left a comment


I think this would be good for 0.19.1, but I'm not entirely satisfied ;) It took me too long to figure out what's happening, and I don't wish that on anyone else. Right now the docs still say we're computing the area under the PR curve, which is misleading. Also, we have fewer references now?

@@ -636,7 +636,14 @@ by varying a decision threshold.
 The :func:`average_precision_score` function computes the average precision
 (AP) from prediction scores. This score corresponds to the area under the
 precision-recall curve. The value is between 0 and 1 and higher is better.
-With random predictions, the AP is the fraction of positive samples.
+With random predictions, the AP is the fraction of positive samples. AP is
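For illustration, a minimal sketch (mine, not part of the diff; synthetic labels and scores) of the sentence about random predictions:

```python
# Hedged sketch: with uninformative random scores, average precision is
# approximately the fraction of positive samples (the prevalence).
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.3, size=10000)   # ~30% positive samples
y_score = rng.uniform(size=10000)           # random, uninformative scores

print(average_precision_score(y_true, y_score))  # approximately 0.3
print(y_true.mean())                             # the positive fraction
```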
Member

I would move the "With random predictions" sentence later; otherwise it seems like it relates to the definition. I would mention here that there are interpolated variants and that we're not computing one of those. Actually, I would rephrase the sentence above saying that this score corresponds to the area under the precision-recall curve, because that is not entirely true / depends a lot on how you define that integral. When plotting the precision-recall curve, it's natural to do linear interpolation for visualization purposes (even though that might be misleading), but that's not how we calculate the score.

Member

Also add references to the "relationship between PR curves and ROC curves" paper, the IR book, and the Pascal VOC paper / code (saying it's different), and point out that linear interpolation leads to realizable classifiers for ROC curves but not for PR curves.

Member

Maybe also add a reference to http://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf because it also argues against linear interpolation.

Contributor Author

I support these changes and will make the revisions.

If we remove the AUCPR statement here, I recommend that somewhere we state what the canonical AUCPR calculation is in scikit-learn. A user who wants to report AUCPR should be able to quickly find the appropriate function. @amueller, do you suggest that the auc function applied to the precision-recall operating points is the best option?

@@ -69,7 +69,8 @@

 where :math:`P_n` and :math:`R_n` are the precision and recall at the
 nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
-*operating point*.
+*operating point*. When summarizing a precision-recall curve, AP is preferable
+to computing the trapezoidal area under the operating points.
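To make the contrast concrete, a hedged sketch (not from this PR; synthetic scores) comparing the two summaries:

```python
# AP (step-wise sum over operating points) vs. the trapezoidal area under the
# same operating points; the two summaries generally differ.
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

rng = np.random.RandomState(42)
y_true = rng.binomial(1, 0.5, size=200)
y_score = rng.uniform(size=200) + 0.5 * y_true   # weakly informative scores

precision, recall, _ = precision_recall_curve(y_true, y_score)

ap = average_precision_score(y_true, y_score)    # no interpolation
trapezoid = auc(recall, precision)               # linear interpolation between points

print(ap, trapezoid)  # close, but not identical in general
```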
Member

Because linear interpolation does not correspond to a realizable classifier?

Contributor Author

My main gripe is the corner case where a classifier predicts only a single score, or scores close enough to be treated as equivalent. Computing the area with the trapezoid rule in that case gives an area > 0.5 to a poor classifier. I've seen this cause problems for students running automated hyperparameter searches based on AUCPR.

But whatever is going to be the reference AUCPR value should be recommended here. If that's not AP, I'll change it.
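A hedged sketch of that corner case (synthetic data; assumes the non-interpolated AP of scikit-learn 0.19 and later):

```python
# A classifier that outputs a single constant score: the trapezoid rule
# rewards it with an area above 0.5, while AP reduces to the prevalence.
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

y_true = np.array([0] * 90 + [1] * 10)   # 10% positives
y_score = np.full(100, 0.5)              # uninformative constant score

precision, recall, _ = precision_recall_curve(y_true, y_score)

print(auc(recall, precision))                    # 0.55, i.e. (1 + 0.1) / 2
print(average_precision_score(y_true, y_score))  # 0.1, the positive fraction
```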

<http://citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_
<http://en.wikipedia.org/w/index.php?title=Information_retrieval&
oldid=793358396#Average_precision>`_
Member

Why did you remove the references? I think these references should go into the user guide, and we should explain exactly the relationship between the different approaches.

Contributor Author

I originally found it confusing to reference alternative implementations without explicitly stating that they differ from the AP implementation here. I'll add them to the user guide. I could also add them back here and present them as alternative interpolated approaches.

Member

I agree. References in the implementation should be specific to the implementation. The user guide can give more commentary.

@agitter (Contributor Author) commented Aug 22, 2017

@amueller The latest commits add the references you suggested with some discussion, remove the area under the curve equivalence, and move the random performance comment. You can preview the changes here.

I think the docs still need some work before merging. The new version points out problems with linear interpolation but doesn't suggest a better approach. I would like to add clear guidance about the recommended way to summarize a precision-recall curve. average_precision_score and auc seem to be the options.

@jnothman (Member) commented Aug 22, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

It would be fine to leave AUC out for now and open a new issue to discuss AUCPR recommendations. Should I then remove the tentative recommendations to use AP from this pull request?

  • In plot_precision_recall.py I added "When summarizing a precision-recall curve, AP is preferable to computing the trapezoidal area under the operating points."
  • In auc I added "For summarizing a precision-recall curve, see :func:`average_precision_score`."

@jnothman (Member) commented Aug 23, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

I agree that average precision is worth using. I was getting confused about whether we should recommend AP as the preferred way to summarize a PR curve when many users will be accustomed to seeing AUCPR as the more common summary.

If you think it reads well as-is, I'm happy with it too.

@jnothman (Member) commented Aug 23, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

I think that depends entirely on which field you're coming from.

@jnothman Good point. Commits f7e7cc1 and 169a2a6 are small changes to make it clear that both AP and AUC are reasonable ways to summarize a PR curve and AP != AUCPR. I changed the title of the example precision-recall curves to call average precision AP instead of AUC. The user guide can discuss the merits of the different options, including those not implemented in scikit-learn.

I have no more revisions planned.

@jnothman (Member) commented Aug 23, 2017 via email

@agitter (Contributor Author) commented Aug 23, 2017

For review convenience, here are the previews of the docs I revised in the last two commits:

@jnothman (Member)

LGTM!

@jnothman jnothman changed the title from "[MRG] Add average precision definitions and cross references" to "[MRG+1] Add average precision definitions and cross references" Aug 23, 2017
@amueller (Member) left a comment


Some nitpicks. Looks pretty good overall!


:math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.

AP and the trapezoidal area under the operating points
(:func:`sklearn.metrics.auc`) are common ways to summarize a precision-recall
Member

maybe emphasize that they lead to different results?

@@ -40,7 +40,9 @@ def auc(x, y, reorder=False):
"""Compute Area Under the Curve (AUC) using the trapezoidal rule

This is a general function, given points on a curve. For computing the
area under the ROC-curve, see :func:`roc_auc_score`.
area under the ROC-curve, see :func:`roc_auc_score`. For an alternative
way to summarize a precision-recall curve, see
Member

I don't understand. AUC is area under the ROC curve, not PR curve.

Contributor Author

@amueller Per "This is a general function, given points on a curve" and the Davis and Goadrich paper, my understanding was that AUC can refer to the trapezoidal area under any curve. In the Davis and Goadrich terminology we can have AUC-ROC or AUC-PR.

What change do you recommend?
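A small sketch (mine, not from the thread; synthetic data) of that reading of auc as a general trapezoidal rule, which applied to ROC points reproduces roc_auc_score:

```python
# auc is a general trapezoidal rule over (x, y) points; on the ROC curve it
# matches roc_auc_score, while on a PR curve it gives one possible AUC-PR.
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.5, size=100)
y_score = rng.uniform(size=100) + y_true   # informative synthetic scores

fpr, tpr, _ = roc_curve(y_true, y_score)
print(auc(fpr, tpr), roc_auc_score(y_true, y_score))  # the same value
```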

Member

never mind, I think your version is fine.

Contributor Author

I'm updating the roc_auc_score description to say it computes ROC AUC (the acronym used in _binary_roc_auc_score) instead of the more general AUC.

\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n

where :math:`P_n` and :math:`R_n` are the precision and recall at the nth
threshold [1]_.
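As a sanity check, a hedged sketch (not part of the docstring; synthetic data) showing that the step-wise sum above matches average_precision_score in scikit-learn 0.19 and later:

```python
# Manually evaluate AP = sum_n (R_n - R_{n-1}) * P_n from the operating points
# returned by precision_recall_curve and compare with average_precision_score.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.4, size=50)
y_score = rng.uniform(size=50) + y_true

precision, recall, _ = precision_recall_curve(y_true, y_score)

# recall is returned in decreasing order, hence the leading minus sign.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])

print(ap_manual, average_precision_score(y_true, y_score))  # identical values
```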
Member

Maybe explicitly say that this is not interpolated and that this is different from computing an interpolated area under the precision-recall curve, which can be too optimistic.

@@ -396,6 +404,10 @@ def precision_recall_curve(y_true, probas_pred, pos_label=None,
 Increasing thresholds on the decision function used to compute
 precision and recall.

+See also
+--------
+average_precision_score : Compute average precision from prediction scores
Member

maybe also add roc_curve? but also fine as-is.
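For context, a brief usage sketch (not from the diff; synthetic scores) of the two curve functions being cross-referenced, side by side:

```python
# precision_recall_curve and roc_curve on the same predictions.
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

rng = np.random.RandomState(0)
y_true = rng.binomial(1, 0.3, size=100)
y_score = rng.uniform(size=100) + 0.5 * y_true

precision, recall, pr_thresholds = precision_recall_curve(y_true, y_score)
fpr, tpr, roc_thresholds = roc_curve(y_true, y_score)

# precision/recall have one more entry than pr_thresholds (the appended
# recall=0, precision=1 endpoint); fpr, tpr and roc_thresholds match in length.
print(precision.shape, recall.shape, pr_thresholds.shape)
print(fpr.shape, tpr.shape, roc_thresholds.shape)
```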


.. topic:: References:

* C.D. Manning, P. Raghavan, H. Schütze, `Introduction to Information Retrieval
Member

Can you use the RST citation syntax and use references in the text, so that someone reading it knows which citation belongs to which statement? The wiki and IR book references should go with the definition. Then you can say "some areas such as computer vision use the interpolated area under the precision-recall curve instead [cite Pascal VOC]. This can lead to an overly optimistic measure of classifier performance and has been criticized in the literature [Davis, Flach]. Right now, average_precision_score does not implement any interpolated variant".

Contributor Author

@amueller Yes, I can use RST citation syntax. I haven't used it before, but the docs make it look like I only need to add tags like [Everingham2010].

I can also modify the text as suggested. However, my understanding is that the Pascal VOC paper and Davis / Flach papers discuss two different concepts. The VOC paper presents the 11-point interpolated average precision. The Davis / Flach papers do not discuss that but instead criticize linear interpolation of a PR curve.
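For concreteness, a hedged sketch (mine, not part of the PR; synthetic data) contrasting scikit-learn's non-interpolated AP with an 11-point interpolated AP of the kind described in the VOC paper:

```python
# Non-interpolated AP (scikit-learn 0.19+) vs. an 11-point interpolated AP:
# at each recall level r in {0.0, 0.1, ..., 1.0}, take the best precision
# achievable at recall >= r, then average over the 11 levels.
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.RandomState(1)
y_true = rng.binomial(1, 0.3, size=200)
y_score = rng.uniform(size=200) + 0.7 * y_true

precision, recall, _ = precision_recall_curve(y_true, y_score)

levels = np.linspace(0.0, 1.0, 11)
ap_11pt = np.mean([precision[recall >= r].max() for r in levels])

# The interpolated variant is typically the more optimistic of the two.
print(average_precision_score(y_true, y_score), ap_11pt)
```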

Member

VOC doesn't use the 11-point interpolation in their code.

Contributor Author

@amueller Commit 0c0f4bf includes all of the changes you recommended during your last review. I added citation links in model_evaluation.rst but used different phrasing. My reading is that [Manning2008] and [Everingham2010] discuss interpolated AP whereas [Davis2006] and [Flach2015] criticize linear interpolation for AUC of precision-recall curves.

I'll confirm the citation links work once the docs finish building.

@agitter (Contributor Author) commented Sep 24, 2017

This is ready for re-review. The last commit fixed the RST citation links, which you can preview here.

@jnothman (Member) left a comment


This is great, thanks!

@jnothman jnothman merged commit 59bb32f into scikit-learn:master Sep 25, 2017
@agitter agitter deleted the pr-docs branch September 25, 2017 10:31
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Oct 3, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
jwjohnson314 pushed a commit to jwjohnson314/scikit-learn that referenced this pull request Dec 18, 2017