
[MRG+1] Add average precision definitions and cross references #9583


Merged · 9 commits · Sep 25, 2017
41 changes: 37 additions & 4 deletions doc/modules/model_evaluation.rst
@@ -633,10 +633,25 @@ The :func:`precision_recall_curve` computes a precision-recall curve
from the ground truth label and a score given by the classifier
by varying a decision threshold.

The :func:`average_precision_score` function computes the average precision
(AP) from prediction scores. This score corresponds to the area under the
precision-recall curve. The value is between 0 and 1 and higher is better.
With random predictions, the AP is the fraction of positive samples.
The :func:`average_precision_score` function computes the
`average precision <http://en.wikipedia.org/w/index.php?title=Information_retrieval&oldid=793358396#Average_precision>`_
(AP) from prediction scores. The value is between 0 and 1 and higher is better.
AP is defined as

.. math::
\text{AP} = \sum_n (R_n - R_{n-1}) P_n

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. With random predictions, the AP is the fraction of positive
samples.

References [Manning2008]_ and [Everingham2010]_ present alternative variants of
AP that interpolate the precision-recall curve. Currently,
:func:`average_precision_score` does not implement any interpolated variant.
References [Davis2006]_ and [Flach2015]_ describe why a linear interpolation of
points on the precision-recall curve provides an overly-optimistic measure of
classifier performance. This linear interpolation is used when computing area
under the curve with the trapezoidal rule in :func:`auc`.
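
As a minimal illustration of this definition (the labels and scores below are a
made-up toy example, not part of the library), AP can be reproduced from the
operating points returned by :func:`precision_recall_curve`::

    import numpy as np
    from sklearn.metrics import average_precision_score, precision_recall_curve

    y_true = np.array([0, 0, 1, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8])

    precision, recall, _ = precision_recall_curve(y_true, y_score)

    # AP as defined above: sum over thresholds of (R_n - R_{n-1}) * P_n.
    # recall is returned in decreasing order, hence the minus sign.
    ap_manual = -np.sum(np.diff(recall) * precision[:-1])

    print(ap_manual, average_precision_score(y_true, y_score))
    # both values should agree (roughly 0.83 for this toy data)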

Several functions allow you to analyze the precision, recall and F-measures
score:
@@ -671,6 +686,24 @@ binary classification and multilabel indicator format.
for an example of :func:`precision_recall_curve` usage to evaluate
classifier output quality.


.. topic:: References:

.. [Manning2008] C.D. Manning, P. Raghavan, H. Schütze, `Introduction to Information Retrieval
<http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-ranked-retrieval-results-1.html>`_,
2008.
.. [Everingham2010] M. Everingham, L. Van Gool, C.K.I. Williams, J. Winn, A. Zisserman,
`The Pascal Visual Object Classes (VOC) Challenge
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_,
IJCV 2010.
.. [Davis2006] J. Davis, M. Goadrich, `The Relationship Between Precision-Recall and ROC Curves
<http://www.machinelearning.org/proceedings/icml2006/030_The_Relationship_Bet.pdf>`_,
ICML 2006.
.. [Flach2015] P.A. Flach, M. Kull, `Precision-Recall-Gain Curves: PR Analysis Done Right
<http://papers.nips.cc/paper/5867-precision-recall-gain-curves-pr-analysis-done-right.pdf>`_,
NIPS 2015.


Binary classification
^^^^^^^^^^^^^^^^^^^^^

15 changes: 10 additions & 5 deletions examples/model_selection/plot_precision_recall.py
@@ -61,16 +61,21 @@
in the threshold considerably reduces precision, with only a minor gain in
recall.

**Average precision** summarizes such a plot as the weighted mean of precisions
achieved at each threshold, with the increase in recall from the previous
threshold used as the weight:
**Average precision** (AP) summarizes such a plot as the weighted mean of
precisions achieved at each threshold, with the increase in recall from the
previous threshold used as the weight:

:math:`\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n`

where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.

AP and the trapezoidal area under the operating points
(:func:`sklearn.metrics.auc`) are common ways to summarize a precision-recall
Member: maybe emphasize that they lead to different results?

curve that lead to different results. Read more in the
:ref:`User Guide <precision_recall_f_measure_metrics>`.
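
As a quick sketch of that difference (the toy labels and scores are invented
purely for illustration), both summaries can be computed for the same set of
operating points::

    import numpy as np
    from sklearn.metrics import auc, average_precision_score, precision_recall_curve

    y_true = np.array([0, 0, 1, 1])
    y_score = np.array([0.1, 0.4, 0.35, 0.8])

    precision, recall, _ = precision_recall_curve(y_true, y_score)

    ap = average_precision_score(y_true, y_score)  # step-wise sum, no interpolation
    pr_auc = auc(recall, precision)                # trapezoidal rule, linear interpolation

    print(ap, pr_auc)  # roughly 0.83 vs 0.79 on this toy data: the two summaries differ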

Precision-recall curves are typically used in binary classification to study
the output of a classifier. In order to extend the precision-recall curve and
average precision to multi-class or multi-label classification, it is necessary
@@ -144,7 +149,7 @@
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('2-class Precision-Recall curve: AUC={0:0.2f}'.format(
plt.title('2-class Precision-Recall curve: AP={0:0.2f}'.format(
average_precision))

###############################################################################
@@ -215,7 +220,7 @@
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(
'Average precision score, micro-averaged over all classes: AUC={0:0.2f}'
'Average precision score, micro-averaged over all classes: AP={0:0.2f}'
.format(average_precision["micro"]))

###############################################################################
44 changes: 31 additions & 13 deletions sklearn/metrics/ranking.py
@@ -40,7 +40,9 @@ def auc(x, y, reorder=False):
"""Compute Area Under the Curve (AUC) using the trapezoidal rule

This is a general function, given points on a curve. For computing the
area under the ROC-curve, see :func:`roc_auc_score`.
area under the ROC-curve, see :func:`roc_auc_score`. For an alternative
way to summarize a precision-recall curve, see
Member: I don't understand. AUC is area under the ROC curve, not PR curve.

Contributor Author: @amueller Per "This is a general function, given points on a curve" and the Davis and Goadrich paper, my understanding was that AUC can refer to the trapezoidal area under any curve. In the Davis and Goadrich terminology we can have AUC-ROC or AUC-PR. What change do you recommend?

Member: Never mind, I think your version is fine.

Contributor Author: I'm updating the roc_auc_score description to say it computes ROC AUC (the acronym used in _binary_roc_auc_score) instead of the more general AUC.

:func:`average_precision_score`.

Parameters
----------
@@ -68,7 +70,8 @@ def auc(x, y, reorder=False):

See also
--------
roc_auc_score : Computes the area under the ROC curve
roc_auc_score : Compute the area under the ROC curve
average_precision_score : Compute average precision from prediction scores
precision_recall_curve :
Compute precision-recall pairs for different probability thresholds
"""
@@ -108,6 +111,19 @@ def average_precision_score(y_true, y_score, average="macro",
sample_weight=None):
"""Compute average precision (AP) from prediction scores

AP summarizes a precision-recall curve as the weighted mean of precisions
achieved at each threshold, with the increase in recall from the previous
threshold used as the weight:

.. math::
\\text{AP} = \\sum_n (R_n - R_{n-1}) P_n

where :math:`P_n` and :math:`R_n` are the precision and recall at the nth
threshold [1]_. This implementation is not interpolated and is different
from computing the area under the precision-recall curve with the
trapezoidal rule, which uses linear interpolation and can be too
optimistic.

Note: this implementation is restricted to the binary classification task
or multilabel classification task.
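
A minimal, doctest-style usage sketch (toy labels and scores, for illustration
only; the printed value is approximate):

    >>> import numpy as np
    >>> from sklearn.metrics import average_precision_score
    >>> y_true = np.array([0, 0, 1, 1])
    >>> y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    >>> average_precision_score(y_true, y_scores)  # doctest: +ELLIPSIS
    0.83...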

@@ -149,17 +165,12 @@ def average_precision_score(y_true, y_score, average="macro",
References
----------
.. [1] `Wikipedia entry for the Average precision
<http://en.wikipedia.org/wiki/Average_precision>`_
.. [2] `Stanford Information Retrieval book
<http://nlp.stanford.edu/IR-book/html/htmledition/
evaluation-of-ranked-retrieval-results-1.html>`_
.. [3] `The PASCAL Visual Object Classes (VOC) Challenge
<http://citeseerx.ist.psu.edu/viewdoc/
download?doi=10.1.1.157.5766&rep=rep1&type=pdf>`_
<http://en.wikipedia.org/w/index.php?title=Information_retrieval&
oldid=793358396#Average_precision>`_
Member: why did you remove the references? I think these references should go into the user guide and we should explain exactly the relation between different approaches.

Contributor Author: I originally found it confusing to reference alternative implementations without explicitly stating that they differ from the AP implementation here. I'll add them to the user guide. I could also add them back here and present them as alternative interpolated approaches.

Member: I agree. References in the implementation should be specific to the implementation. The user guide can give more commentary.


See also
--------
roc_auc_score : Area under the ROC curve
roc_auc_score : Compute the area under the ROC curve

precision_recall_curve :
Compute precision-recall pairs for different probability thresholds
@@ -190,7 +201,8 @@ def _binary_uninterpolated_average_precision(


def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
"""Compute Area Under the Curve (AUC) from prediction scores
"""Compute Area Under the Receiver Operating Characteristic Curve (ROC AUC)
from prediction scores.

Note: this implementation is restricted to the binary classification task
or multilabel classification task in label indicator format.
Expand Down Expand Up @@ -239,7 +251,7 @@ def roc_auc_score(y_true, y_score, average="macro", sample_weight=None):
--------
average_precision_score : Area under the precision-recall curve

roc_curve : Compute Receiver operating characteristic (ROC)
roc_curve : Compute Receiver operating characteristic (ROC) curve

Examples
--------
@@ -396,6 +408,12 @@ def precision_recall_curve(y_true, probas_pred, pos_label=None,
Increasing thresholds on the decision function used to compute
precision and recall.

See also
--------
average_precision_score : Compute average precision from prediction scores
Member: maybe also add roc_curve? but also fine as-is.


roc_curve : Compute Receiver operating characteristic (ROC) curve

Examples
--------
>>> import numpy as np
@@ -477,7 +495,7 @@ def roc_curve(y_true, y_score, pos_label=None, sample_weight=None,

See also
--------
roc_auc_score : Compute Area Under the Curve (AUC) from prediction scores
roc_auc_score : Compute the area under the ROC curve

Notes
-----