[MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option (7356 rebased) #9017
Conversation
I need to address the pending comments of #7356 and make the example pretty again, as it was rendered a bit horrible by the merge to master |
I reordered the example a bit (no major change though). I find it more readable in the HTML output. |
After a lot of back and forth, I have finally convinced myself that I trust this implementation of average_precision_score. I was worried about dealing with duplicates, so I added a stringent test. This is ready for merge. Can I have a review, @agramfort, @vene, @amueller? |
doc/whats_new.rst
Outdated
@@ -6,6 +6,35 @@ Release history | |||
=============== | |||
|
|||
Version 0.19 | |||
Version 0.18.2 |
Rebase issue: you need to move this to the stuff below.
possible with a recall value at least equal to the target value. | ||
The most common choice is 'eleven point' interpolated precision, where the | ||
desired recall values are [0, 0.1, 0.2, ..., 1.0]. This is the metric used in | ||
`The PASCAL Visual Object Classes (VOC) Challenge <http://citeseerx.ist.psu.edu |
Maybe this metric was? Sometimes? Or just remove that sentence?
sklearn/metrics/ranking.py
Outdated
For each of the recall values, r, in {0, 0.1, 0.2, ..., 1.0}, | ||
compute the arithmetic mean of the first precision value with a | ||
corresponding recall >= r. This is the metric used in the Pascal | ||
Visual Objects Classes (VOC) Challenge and is as described in the |
Maybe not mention Pascal VOC?
Well, they describe it in the docs, though they don't use it in the code. Maybe we should mention the docs.
yeah I don't know... but the current text could be misleading.
|
||
def _interpolated_average_precision_slow(y_true, y_score): | ||
"""A second implementation for the eleven-point interpolated average | ||
precision used by Pascal VOC. This should produce identical results to |
described in ir-book?
LGTM, gonna check out the example now
precision_recall_auc = _average_precision_slow(y_true, probas_pred) | ||
interpolated_average_precision = _interpolated_average_precision_slow( | ||
y_true, probas_pred) | ||
assert_array_almost_equal(precision_recall_auc, 0.859, 3) |
where does that number come from?
Agreed. Ping @ndingwall: we don't like to have hard-coded numbers in our test suites without being able to rederive them.
assert_equal(p.size, r.size) | ||
assert_equal(p.size, thresholds.size + 1) | ||
# Smoke test in the case of proba having only one value | ||
p, r, thresholds = precision_recall_curve(y_true, | ||
np.zeros_like(probas_pred)) | ||
precision_recall_auc = auc(r, p) | ||
assert_array_almost_equal(precision_recall_auc, 0.75, 3) | ||
assert_array_almost_equal(precision_recall_auc, 0.75) |
where does that number come from?
This number is actually meaningless, as the way above to compute the AUC of the PR curve is not the right way: it does not cater for edge effects. On a constant prediction, the average precision score is the TPR, here .5.
I am removing these lines. Anyhow, I added a correct test (which passes) at a different location.
@@ -510,15 +552,15 @@ def test_precision_recall_curve_toydata(): | |||
auc_prc = average_precision_score(y_true, y_score) | |||
assert_array_almost_equal(p, [0.5, 0., 1.]) | |||
assert_array_almost_equal(r, [1., 0., 0.]) | |||
assert_almost_equal(auc_prc, 0.25) | |||
assert_almost_equal(auc_prc, 0.5) |
maybe add a short explanation? Why is this one right and the other one wasn't?
explanation still missing ;)
|
||
def test_average_precision_constant_values(): | ||
# Check the average_precision_score of a constant predictor is | ||
# the tps |
tps? tpr?
not addressed
y_true[::4] = 1 | ||
# And a constant score | ||
y_score = np.ones(100) | ||
# The precision is then the fraction of positive whatever the recall |
"for all thresholds and recall values"?
No, because there is only one threshold.
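A short sketch of the test under discussion (sample sizes assumed from the quoted diff: 100 samples, every fourth one positive, constant scores); with the corrected average_precision_score, the score equals the fraction of positives:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# A quarter of the labels are positive, the scores are all identical.
y_true = np.zeros(100, dtype=int)
y_true[::4] = 1
y_score = np.ones(100)

# Only one threshold exists, so the curve has a single real operating point
# with precision = fraction of positives (0.25) and recall = 1.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision, recall, thresholds)   # [0.25, 1.] [1., 0.] [1.]

# The corrected average precision therefore equals that fraction of positives.
print(average_precision_score(y_true, y_score))   # 0.25
```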
doc/whats_new.rst
Outdated
............ | ||
|
||
- Added a `'eleven-point'` interpolated average precision option to | ||
:func:`metrics.ranking.average_precision_score` as used in the `PASCAL |
is it?
plt.xlabel('Recall') | ||
plt.ylabel('Precision') | ||
plt.ylim([0.0, 1.05]) | ||
plt.xlim([0.0, 1.0]) | ||
plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0])) | ||
plt.title('Precision-Recall example: AUC={0:0.2f}'.format( | ||
average_precision["micro"])) | ||
plt.legend(loc="lower left") | ||
plt.show() |
I'd remove the show so both pop up at once.
y_score.ravel()) | ||
average_precision["micro"] = average_precision_score(y_test, y_score, | ||
average_precision["micro"] = average_precision_score(Y_test, y_score, |
Is this standard? I'm not sure I understand what that means. Maybe do binary for this example?
So I think multi-label is not that common, and I'm not sure this is a common strategy for multi-class. as a matter of fact, we don't have multi-class average precision (there is a PR for multi-class AUC and there is a bunch of literature on that!).
So I vote going binary here.
I'll do a binary first, and then illustrate micro
I have certainly seen this before in NLP even for multiclass.
sklearn/metrics/ranking.py
Outdated
"""Compute average precision (AP) from prediction scores | ||
|
||
This score corresponds to the area under the precision-recall curve. | ||
Optionally, this will compute an eleven-point interpolated average |
trailing whitespaces
Force-pushed from eecbc86 to bf2830e.
I've addressed the comments and rewritten the example to start with a simple case. The example can be seen here: |
Force-pushed from a554f3f to fec605f.
All green. Can I haz reviews / mrg ? |
doc/whats_new.rst
Outdated
@@ -6,6 +6,35 @@ Release history | |||
=============== | |||
|
|||
Version 0.19 | |||
============== | |||
|
Still rebase issue
Mostly nitpicks. I think it would be good to have an explanation in the tests and the whatsnew needs to be fixed, but I don't want to delay this because of minor issues in the example.
doc/whats_new.rst
Outdated
by the change in recall since the last operating point, as per the | ||
`Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_. | ||
(`#7356 <https://github.com/scikit-learn/scikit-learn/pull/7356>`_). By | ||
`Nick Dingwall`_. |
And @GaelVaroquaux ?
low false negative rate. High scores for both show that the classifier is | ||
returning accurate results (high precision), as well as returning a majority of | ||
all positive results (high recall). | ||
Precision-Recall is a useful measure of success of prediction when the |
Precision and recall? Or the precision-recall-curve?
yeah, and I would change the example title too: "Using precision and recall for classifier evaluation" or something. It's very awkward now.
And maybe replace below "very imbalanced" by "imbalanced"?
relevant results are returned. | ||
|
||
The precision-recall curve shows the tradeoff between precision and | ||
recall for different threshold. A high area under the curve represents |
decision thresholds?
high area -> large area?
random_state=random_state) | ||
|
||
# Create a simple classifier | ||
classifier = svm.SVC(kernel='linear', probability=True, |
Why????? How about `LogisticRegression`? Or not using `probability=True`? We don't need probabilities for the precision-recall curve and many people are not aware of that. So maybe using `SVC(linear)` is actually good, but we shouldn't use `probability=True`. What's the reason not to use `LinearSVC()`?
We're not even using `predict_proba`, right?
Aaaaaaaaaargh!
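A rough sketch of the reviewers' point: precision_recall_curve only needs a ranking score, so decision_function is enough and probability=True is unnecessary. The binary iris subset and LinearSVC below are assumptions for illustration, not the example's exact setup.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score

# Binary problem: keep only two of the three iris classes.
X, y = datasets.load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No probability=True needed: decision_function gives a perfectly good
# ranking score for the precision-recall curve.
classifier = svm.LinearSVC().fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print(average_precision_score(y_test, y_score))
```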
random_state=random_state) | ||
|
||
# We use OneVsRestClassifier for multi-label prediction | ||
from sklearn.multiclass import OneVsRestClassifier | ||
# Run classifier | ||
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, |
again, why `probability=True`, why `SVC(linear)` instead of `LinearSVC`?
Yup!
|
||
print("Target recall Selected recall Precision") | ||
for i in range(11): | ||
print(" >= {} {:3.3f} {:3.3f}".format(i / 10, |
Nitpick: if we're using format strings, maybe we should, you know, actually use them for the padding?
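A sketch of that nitpick, letting the format spec itself handle alignment; `targets`, `selected_recalls` and `precisions` are placeholder values standing in for the numbers computed in the example.

```python
# Placeholder data; in the example these come from the precision-recall curve.
targets = [i / 10 for i in range(11)]
selected_recalls = [min(1.0, t + 0.03) for t in targets]
precisions = [1.0 - 0.04 * i for i in range(11)]

print("{:>13} {:>16} {:>10}".format("Target recall", "Selected recall", "Precision"))
for target, rec, prec in zip(targets, selected_recalls, precisions):
    # Width and float precision live in the format spec, so no manual spaces.
    print("{:>13} {:>16.3f} {:>10.3f}".format(">= {:.1f}".format(target), rec, prec))
```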
plt.fill_between(recall[iris_cls], precision[iris_cls], step='post', alpha=0.1, | ||
color='g') | ||
for i in range(11): | ||
plt.annotate('', |
If you put the '' on the next line you have more space ;)
Yeah, but I don't feel it improves things.
@@ -510,15 +552,15 @@ def test_precision_recall_curve_toydata(): | |||
auc_prc = average_precision_score(y_true, y_score) | |||
assert_array_almost_equal(p, [0.5, 0., 1.]) | |||
assert_array_almost_equal(r, [1., 0., 0.]) | |||
assert_almost_equal(auc_prc, 0.25) | |||
assert_almost_equal(auc_prc, 0.5) |
explanation still missing ;)
|
||
def test_average_precision_constant_values(): | ||
# Check the average_precision_score of a constant predictor is | ||
# the tps |
not addressed
|
||
where :math:`P_n` and :math:`R_n` are the precision and recall at the | ||
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an | ||
*operating point*. |
Maybe mention that this is related to the integral under the curve but there are subtle differences?
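For reference, the definition being discussed (the one cited from the Wikipedia entry in the whats_new text above) weights each precision by the increase in recall from the previous threshold, which is a step-function integral under the curve rather than, say, a trapezoidal one:

```latex
\text{AP} = \sum_n (R_n - R_{n-1})\, P_n
```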
Addressed all of @amueller's comments |
doc/whats_new.rst
Outdated
- Added a `'eleven-point'` interpolated average precision option to | ||
:func:`metrics.ranking.average_precision_score` as described in the | ||
`PASCAL | ||
Visual Object Classes (VOC) Challenge <http://citeseerx.ist.psu.edu/viewdoc/ |
I guess line wrapping here is weird?
doc/whats_new.rst
Outdated
@@ -193,6 +201,13 @@ Enhancements | |||
Bug fixes | |||
......... | |||
|
|||
- :func:`metrics.ranking.average_precision_score` no longer linearly | |||
interpolates between operating points, and instead weights precisions |
weighs?
|
||
Precision-recall curves are typically used in binary classification to study | ||
the output of a classifier. In order to extend Precision-recall curve and | ||
the output of a classifier. In order to extend the Precision-recall curve and |
lowercase p
# In multi-label settings | ||
# ------------------------ | ||
# | ||
# Create multli-label data, fit, and predict |
multi
|
||
from sklearn.preprocessing import label_binarize | ||
|
||
# Use label_binarize to be multi-label like settings |
wat 😛
yeah :/. I don't like this part of the example, but that's what we have.
I just mean I don't understand what the comment is saying.
random_state=random_state) | ||
|
||
# We use OneVsRestClassifier for multi-label prediction | ||
from sklearn.multiclass import OneVsRestClassifier |
newline after
# Compute Precision-Recall and plot curve | ||
|
||
############################################################################### | ||
# The precision-Recall score in multi-label settings |
inconsistent capitalization. I'd suggest `precision-recall` everywhere.
I'm very confused now about what "the precision-recall score" is. Is it the same thing as average precision? I've never heard the former before. (just as two distinct scores.) The narrative at the top doesn't make it clear
plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0])) | ||
plt.legend(loc="lower left") | ||
plt.show() | ||
plt.title('Precision-Recall micro-averaged over all classes: AUC={0:0.2f}' |
same confusion. Maybe "precision and recall micro-averaged..."?
Or "micro average precision score"?
# ------------------------------ | ||
# | ||
# In *interpolated* average precision, a set of desired recall values is | ||
# specified and for each desired value, we average the best precision |
I'd move the comma to after the "and" on this line.
sklearn/metrics/ranking.py
Outdated
"""Compute average precision (AP) from prediction scores | ||
|
||
This score corresponds to the area under the precision-recall curve. | ||
Optionally, this will compute an eleven-point interpolated average | ||
precision score: for each of the 11 evenly-spaced target recall values |
maybe say eleven in words to make it easier to spot the connection to the interpolation arg below
sklearn/metrics/ranking.py
Outdated
precision, recall, thresholds = precision_recall_curve( | ||
y_true, y_score, sample_weight=sample_weight) | ||
return auc(recall, precision) | ||
# Return the step function integral | ||
return -np.sum(np.diff(recall) * np.array(precision)[:-1]) |
I think this works because the last entry of precision is guaranteed to be 1, as written in the docstring of `precision_recall_curve`; do you think this warrants a note in a comment here, for posterity?
Agreed
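A small numpy sketch of the step-function integral under discussion, using the toy labels and scores from the precision_recall_curve docstring (an assumption, just for illustration): because recall decreases towards the final (recall = 0, precision = 1) point, the diff is negative, and that trailing precision is dropped by the `[:-1]`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)
# precision = [0.667, 0.5, 1., 1.], recall = [1., 0.5, 0.5, 0.]

# Each precision is weighted by the drop in recall to the next point; the
# trailing precision of 1 only enters through that drop, never directly.
ap_step = -np.sum(np.diff(recall) * precision[:-1])

print(ap_step, average_precision_score(y_true, y_score))  # both ~0.833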
precision = list(reversed(precision)) | ||
recall = list(reversed(recall)) | ||
indices = np.searchsorted(recall, np.arange(0, 1.1, 0.1)) | ||
return np.mean([max(precision[i:]) for i in indices]) |
confused: other than using lists instead of arrays, and other than not ignoring the first p-r pair with p=1.0, this seems identical to the actual implementation in this PR. What am I missing?
Agreed.
Should we remove this test?
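To make the comparison concrete, here is a sketch of the eleven-point scheme both snippets implement: for each target recall r in {0.0, 0.1, ..., 1.0}, take the best precision achieved at recall >= r, then average the eleven values (toy data assumed, same as in the sketch above).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, _ = precision_recall_curve(y_true, y_score)

# For each target recall, the interpolated precision is the best precision
# reachable at that recall or higher; the eleven values are then averaged.
eleven_point = np.mean([precision[recall >= r].max()
                        for r in np.linspace(0, 1, 11)])
print(eleven_point)
```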
assert_almost_equal(auc_prc, 0.25) | ||
# Here we are doing a terrible prediction: we are always getting | ||
# it wrong, hence the average_precision_score is the accuracy at | ||
# change: 50% |
chance
LGTM, I haven't recalculated the test results though. |
Remove the eleven-point average precision score. Add better tests.
After discussion with @vene, I have removed the eleven-point average precision from this PR. Sorry @ndingwall: we feel that getting the fixes of this PR merged in is very important, but we are not sure that we can warrant the correctness of the eleven-point variant. I will issue a new PR so that the work does not get lost, and @ndingwall can take over. |
doc/whats_new.rst
Outdated
@@ -6,7 +6,7 @@ Release history | |||
=============== | |||
|
|||
Version 0.19 | |||
============ | |||
============== |
what's up here?
@vene: fixed |
Merged |
I think this needs more of a loud warning in the changelog |
Rebased version of #7356
Fixes #4577 and #6377
What does this implement/fix? Explain your changes.
This adds an optional interpolation parameter to average_precision_score. By default, the value is set to None, which replicates the existing behavior, but there is also an 'eleven_point' option that implements the strategy described in Stanford's Introduction to Information Retrieval.
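As described above, the proposed call would have looked roughly like the sketch below; the interpolation keyword never landed, since the eleven-point option was removed from this PR before merge, and the toy data is for illustration only.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Default: the corrected, non-interpolated average precision.
ap = average_precision_score(y_true, y_score)

# Proposed (and later withdrawn) option: eleven-point interpolation as in
# the IR book / PASCAL VOC.  Kept commented out because it was never merged.
# ap11 = average_precision_score(y_true, y_score, interpolation='eleven_point')
```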