
[MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option (7356 rebased) #9017


Closed
wants to merge 3 commits into from

Conversation

@GaelVaroquaux (Member) commented Jun 6, 2017

Rebased version of #7356

Fixes #4577 and #6377

What does this implement/fix? Explain your changes.

This adds an optional interpolation parameter to average_precision_score. By default, the value is set to None, which replicates the existing behavior, but there is also an 'eleven_point' option that implements the strategy described in Stanford's Introduction to Information Retrieval.
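
For readers less familiar with the option, a minimal sketch (not the PR's actual code) of the eleven-point strategy from the IR book: for each target recall in {0.0, 0.1, ..., 1.0}, take the best precision achieved at a recall at least that large, then average the eleven values.

import numpy as np
from sklearn.metrics import precision_recall_curve

def eleven_point_ap(y_true, y_score):
    # precision_recall_curve returns points with decreasing recall; reverse
    # them so that recall is increasing.
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    precision, recall = precision[::-1], recall[::-1]
    interpolated = []
    for target in np.arange(0.0, 1.1, 0.1):
        mask = recall >= target
        # Best precision among operating points with recall >= target.
        interpolated.append(precision[mask].max() if mask.any() else 0.0)
    return np.mean(interpolated)

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
print(eleven_point_ap(y_true, y_score))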


@GaelVaroquaux (Member Author):

I need to address the pending comments of #7356 and make the example pretty again, as the merge with master left it a bit of a mess.

@GaelVaroquaux (Member Author):

I reordered the example a bit (no major change though). I find it more readable in the HTML output.

@GaelVaroquaux (Member Author):

After a lot of back and forth, I have finally convinced myself that I trust this implementation of average_precision_score. I was worried about dealing with duplicates, so I added a stringent test.

This is ready for merge.

Can I have a review, @agramfort, @vene, @amueller?
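
(For context, a hedged sketch of the kind of duplicate-handling check described, not the test actually added: with heavily tied scores, average_precision_score should still agree with a step-function sum computed directly from precision_recall_curve.)

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, size=200)
# Draw scores from a small set of values so ties are guaranteed.
y_score = rng.choice([0.1, 0.3, 0.5, 0.9], size=200)

precision, recall, _ = precision_recall_curve(y_true, y_score)
# Step-function integral of the curve (recall is returned in decreasing order).
reference = -np.sum(np.diff(recall) * precision[:-1])
assert np.isclose(average_precision_score(y_true, y_score), reference)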

@@ -6,6 +6,35 @@ Release history
===============

Version 0.19
Version 0.18.2
Member:

Rebase issue: you need to move this to the stuff below.

possible with a recall value at least equal to the target value.
The most common choice is 'eleven point' interpolated precision, where the
desired recall values are [0, 0.1, 0.2, ..., 1.0]. This is the metric used in
`The PASCAL Visual Object Classes (VOC) Challenge <http://citeseerx.ist.psu.edu
Member:

Maybe this metric was? Sometimes? Or just remove that sentence?

For each of the recall values, r, in {0, 0.1, 0.2, ..., 1.0},
compute the arithmetic mean of the first precision value with a
corresponding recall >= r. This is the metric used in the Pascal
Visual Objects Classes (VOC) Challenge and is as described in the
Member:

Maybe not mention Pascal VOC?

Member Author:

Well, they describe it in the docs, though they don't use it in the code. Maybe we should mention the docs.

Member:

yeah I don't know... but the current text could be misleading.


def _interpolated_average_precision_slow(y_true, y_score):
"""A second implementation for the eleven-point interpolated average
precision used by Pascal VOC. This should produce identical results to
Member:

described in the IR book?

@amueller (Member) left a comment:

LGTM, gonna check out the example now

precision_recall_auc = _average_precision_slow(y_true, probas_pred)
interpolated_average_precision = _interpolated_average_precision_slow(
y_true, probas_pred)
assert_array_almost_equal(precision_recall_auc, 0.859, 3)
Member:

where does that number come from

Member Author:

Agreed. Ping @ndingwall: we don't like to have hard-coded numbers in our test suites without being able to rederive them.

assert_equal(p.size, r.size)
assert_equal(p.size, thresholds.size + 1)
# Smoke test in the case of proba having only one value
p, r, thresholds = precision_recall_curve(y_true,
np.zeros_like(probas_pred))
precision_recall_auc = auc(r, p)
assert_array_almost_equal(precision_recall_auc, 0.75, 3)
assert_array_almost_equal(precision_recall_auc, 0.75)
Member:

where does that number come from?

Member Author:

This number is actually meaningless, as the way the AUC of the PR curve is computed above is not the right way: it does not account for edge effects. On a constant prediction, the average precision score is the TPR, here .5.

I am removing these lines. Anyhow, I added a correct test (which passes) at a different location.
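
(A short sketch of the constant-predictor case being discussed, assuming the fixed, non-interpolated behaviour: a constant score gives a single operating point with recall 1 and precision equal to the fraction of positives, so the average precision equals that fraction.)

import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.zeros(100, dtype=int)
y_true[::4] = 1                  # 25% positives, as in the new test
y_score = np.ones(100)           # constant prediction
print(average_precision_score(y_true, y_score))  # 0.25, the fraction of positives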

@@ -510,15 +552,15 @@ def test_precision_recall_curve_toydata():
auc_prc = average_precision_score(y_true, y_score)
assert_array_almost_equal(p, [0.5, 0., 1.])
assert_array_almost_equal(r, [1., 0., 0.])
assert_almost_equal(auc_prc, 0.25)
assert_almost_equal(auc_prc, 0.5)
Member:

maybe add a short explanation? Why is this one right and the other one wasn't?

Member:

explanation still missing ;)
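
(An attempt at the requested explanation, using the arrays this very test asserts, p = [0.5, 0., 1.] and r = [1., 0., 0.]; the assumption is that the old value came from the trapezoidal rule and the new one from the step-function sum.)

import numpy as np
from sklearn.metrics import auc

p = np.array([0.5, 0., 1.])
r = np.array([1., 0., 0.])

# Old value: trapezoidal area under the operating points, as computed by
# auc(recall, precision), which linearly interpolates between them.
print(auc(r, p))                         # 0.25

# New value: step-function sum, each precision weighted by the change in
# recall at its operating point; the final (recall=0, precision=1) point
# contributes nothing here.
print(-np.sum(np.diff(r) * p[:-1]))      # 0.5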


def test_average_precision_constant_values():
# Check the average_precision_score of a constant predictor is
# the tps
Member:

tps? tpr?

Member:

not addressed

y_true[::4] = 1
# And a constant score
y_score = np.ones(100)
# The precision is then the fraction of positive whatever the recall
Member:

"for all thresholds and recall values"?

Member Author:

No, because there is only one threshold.

............

- Added a `'eleven-point'` interpolated average precision option to
:func:`metrics.ranking.average_precision_score` as used in the `PASCAL
Member:

is it?

plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0]))
plt.title('Precision-Recall example: AUC={0:0.2f}'.format(
average_precision["micro"]))
plt.legend(loc="lower left")
plt.show()
Member:

I'd remove the show so both pop up at once.

y_score.ravel())
average_precision["micro"] = average_precision_score(y_test, y_score,
average_precision["micro"] = average_precision_score(Y_test, y_score,
Member:

Is this standard? I'm not sure I understand what that means. Maybe do binary for this example?

Member:

So I think multi-label is not that common, and I'm not sure this is a common strategy for multi-class. as a matter of fact, we don't have multi-class average precision (there is a PR for multi-class AUC and there is a bunch of literature on that!).
So I vote going binary here.

Member Author:

I'll do a binary first, and then illustrate micro

Member:

I have certainly seen this before in NLP even for multiclass.
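
(A hedged sketch of the two options being weighed, binary versus micro-averaged; the dataset and classifier below are illustrative choices, not necessarily what the example ends up using.)

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score

X, y = load_iris(return_X_y=True)
Y = label_binarize(y, classes=[0, 1, 2])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

clf = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X_train, Y_train)
y_score = clf.decision_function(X_test)

# Binary: average precision for a single class.
print(average_precision_score(Y_test[:, 0], y_score[:, 0]))
# Micro-averaged: all (sample, class) pairs pooled into one binary problem.
print(average_precision_score(Y_test, y_score, average="micro"))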

"""Compute average precision (AP) from prediction scores

This score corresponds to the area under the precision-recall curve.
Optionally, this will compute an eleven-point interpolated average
@amueller (Member), Jun 8, 2017:

trailing whitespaces

@GaelVaroquaux force-pushed the pr_7356 branch 2 times, most recently from eecbc86 to bf2830e (June 8, 2017, 11:55)
@GaelVaroquaux force-pushed the pr_7356 branch 3 times, most recently from a554f3f to fec605f (June 8, 2017, 21:22)
@GaelVaroquaux (Member Author):

All green. Can I haz reviews / mrg ?

@@ -6,6 +6,35 @@ Release history
===============

Version 0.19
==============

Member:

Still rebase issue

@amueller (Member) left a comment:

Mostly nitpicks. I think it would be good to have an explanation in the tests and the whatsnew needs to be fixed, but I don't want to delay this because of minor issues in the example.

by the change in recall since the last operating point, as per the
`Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_.
(`#7356 <https://github.com/scikit-learn/scikit-learn/pull/7356>`_). By
`Nick Dingwall`_.

low false negative rate. High scores for both show that the classifier is
returning accurate results (high precision), as well as returning a majority of
all positive results (high recall).
Precision-Recall is a useful measure of success of prediction when the
Member:

Precision and recall? Or the precision-recall-curve?

Member:

yeah, and I would change the example title too: "Using precision and recall for classifier evaluation" or something. It's very awkward now.

And maybe replace below "very imbalanced" by "imbalanced"?

relevant results are returned.

The precision-recall curve shows the tradeoff between precision and
recall for different threshold. A high area under the curve represents
Member:

decision thresholds?

Member:

high area -> large area?

random_state=random_state)

# Create a simple classifier
classifier = svm.SVC(kernel='linear', probability=True,
Member:

Why????? How about LogisticRegression? Or not using probability=True? We don't need probabilities for the precision-recall-curve and many people are not aware of that. So maybe using SVC(linear) is actually good, but we shouldn't use probability=True. What's the reason not to use LinearSVC()?

Member:

We're not even using predict_proba, right?

Member Author:

Aaaaaaaaaargh!
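
(A sketch of the reviewers' point, with an illustrative synthetic dataset: precision_recall_curve only needs a ranking score, so the decision function of a linear SVM is enough and probability=True is unnecessary.)

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_curve, average_precision_score

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LinearSVC(random_state=0).fit(X_train, y_train)
y_score = clf.decision_function(X_test)   # no probabilities needed

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print(average_precision_score(y_test, y_score))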

random_state=random_state)

# We use OneVsRestClassifier for multi-label prediction
from sklearn.multiclass import OneVsRestClassifier
# Run classifier
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
Member:

again, why probability=True, why SVC(linear) instead of LinearSVC

Member Author:

Yup!


print("Target recall Selected recall Precision")
for i in range(11):
print(" >= {} {:3.3f} {:3.3f}".format(i / 10,
Member:

Nitpick: if we're using format strings, maybe we should, you know, actually use them for the padding?
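
(A tiny illustration of the nitpick, with made-up values: let the format spec do the column padding instead of hard-coded runs of spaces.)

# Hypothetical values, only to show width specifiers doing the padding.
target, selected_recall, precision = 0.3, 0.35, 0.82
print("{:<15}{:<17}{:<10}".format("Target recall", "Selected recall", "Precision"))
print("{:<15}{:<17.3f}{:<10.3f}".format(">= {}".format(target), selected_recall, precision))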

plt.fill_between(recall[iris_cls], precision[iris_cls], step='post', alpha=0.1,
color='g')
for i in range(11):
plt.annotate('',
Member:

If you put the '' on the next line you have more space ;)

Member Author:

Yeah, but I don't feel it improves things.

@@ -510,15 +552,15 @@ def test_precision_recall_curve_toydata():
auc_prc = average_precision_score(y_true, y_score)
assert_array_almost_equal(p, [0.5, 0., 1.])
assert_array_almost_equal(r, [1., 0., 0.])
assert_almost_equal(auc_prc, 0.25)
assert_almost_equal(auc_prc, 0.5)
Member:

explanation still missing ;)


def test_average_precision_constant_values():
# Check the average_precision_score of a constant predictor is
# the tps
Member:

not addressed


where :math:`P_n` and :math:`R_n` are the precision and recall at the
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an
*operating point*.
Member:

Maybe mention that this is related to the integral under the curve but there are subtle differences?
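
(One possible wording for the relation the reviewer suggests, a suggestion only and not the PR's final text, in the same RST style as the surrounding docs:)

.. math:: \text{AP} = \sum_n (R_n - R_{n-1}) P_n

This is the area under a step-wise precision-recall curve. It is closely related to, but not the same as, the integral :math:`\int_0^1 p(r)\,dr`: no linear interpolation between operating points is performed, which avoids an overly optimistic estimate of the area.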

@GaelVaroquaux (Member Author):

Addressed all of @amueller's comments.

@GaelVaroquaux (Member Author):

@amueller @vene : 👍 ?

- Added a `'eleven-point'` interpolated average precision option to
:func:`metrics.ranking.average_precision_score` as described in the
`PASCAL
Visual Object Classes (VOC) Challenge <http://citeseerx.ist.psu.edu/viewdoc/
Member:

I guess line wrapping here is weird?

@@ -193,6 +201,13 @@ Enhancements
Bug fixes
.........

- :func:`metrics.ranking.average_precision_score` no longer linearly
interpolates between operating points, and instead weights precisions
Member:

weighs ?


Precision-recall curves are typically used in binary classification to study
the output of a classifier. In order to extend Precision-recall curve and
the output of a classifier. In order to extend the Precision-recall curve and
Member:

lowercase p

# In multi-label settings
# ------------------------
#
# Create multli-label data, fit, and predict
Member:

multi


from sklearn.preprocessing import label_binarize

# Use label_binarize to be multi-label like settings
Member:

wat 😛

Member Author:

yeah :/. I don't like this part of the example, but that's what we have.

Member:

I just mean I don't understand what the comment is saying.

random_state=random_state)

# We use OneVsRestClassifier for multi-label prediction
from sklearn.multiclass import OneVsRestClassifier
Member:

newline after

# Compute Precision-Recall and plot curve

###############################################################################
# The precision-Recall score in multi-label settings
Member:

inconsistent capitalization. I'd suggest precision-recall everywhere.

I'm very confused now about what "the precision-recall score" is. Is it the same thing as average precision? I've never heard the former before. (just as two distinct scores.) The narrative at the top doesn't make it clear

plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0]))
plt.legend(loc="lower left")
plt.show()
plt.title('Precision-Recall micro-averaged over all classes: AUC={0:0.2f}'
Member:

same confusion. Maybe "precision and recall micro-averaged..."?

Member Author:

Or "micro average precision score"?

# ------------------------------
#
# In *interpolated* average precision, a set of desired recall values is
# specified and for each desired value, we average the best precision
Member:

I'd move the comma to after the "and" on this line.

"""Compute average precision (AP) from prediction scores

This score corresponds to the area under the precision-recall curve.
Optionally, this will compute an eleven-point interpolated average
precision score: for each of the 11 evenly-spaced target recall values
Member:

maybe say eleven in words to make it easier to spot the connection to the interpolation arg below

precision, recall, thresholds = precision_recall_curve(
y_true, y_score, sample_weight=sample_weight)
return auc(recall, precision)
# Return the step function integral
return -np.sum(np.diff(recall) * np.array(precision)[:-1])
Member:

I think this works because the last entry of precision is guaranteed to be 1, as written in the docstring of precision_recall_curve; do you think this warrants a note in a comment here, for posterity?

Member Author:

Agreed
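
(For posterity, a hedged sketch of why dropping the last precision value is safe: precision_recall_curve documents that its final point is (recall=0, precision=1), and precision[:-1] removes exactly that point from the sum.)

import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.8, 0.6, 0.55, 0.3])

precision, recall, _ = precision_recall_curve(y_true, y_score)
assert precision[-1] == 1.0 and recall[-1] == 0.0   # documented final point

# The appended precision value (1.0) is dropped by precision[:-1]; the
# appended recall value (0.0) only supplies the R_0 = 0 baseline term.
print(-np.sum(np.diff(recall) * precision[:-1]))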

precision = list(reversed(precision))
recall = list(reversed(recall))
indices = np.searchsorted(recall, np.arange(0, 1.1, 0.1))
return np.mean([max(precision[i:]) for i in indices])
Member:

Confused: other than using lists instead of arrays, and other than not ignoring the first p-r pair with p=1.0, this seems identical to the actual implementation in this PR. What am I missing?

@GaelVaroquaux (Member Author), Jun 9, 2017:

Agreed.

Should we remove this test?

assert_almost_equal(auc_prc, 0.25)
# Here we are doing a terrible prediction: we are always getting
# it wrong, hence the average_precision_score is the accuracy at
# change: 50%
Member:

chance

@amueller (Member) commented Jun 9, 2017:

LGTM, I haven't recalculated the test results though.

Remove the eleven average precision score

Add better tests.
@GaelVaroquaux (Member Author):

After discussion with @vene, I have removed the eleven-point average precision from this PR. Sorry @ndingwall: we feel that getting the fixes of this PR merged is very important, but we are not sure that we can warrant the correctness of the eleven-point variant.

I will issue a new PR so that the work does not get lost, and @ndingwall can take over.

@@ -6,7 +6,7 @@ Release history
===============

Version 0.19
============
==============
Member:

what's up here?

@GaelVaroquaux (Member Author):

@vene: fixed

@GaelVaroquaux (Member Author):

@amueller and @vene gave me 👍 at the bar. Merging

@GaelVaroquaux (Member Author):

Merged

@jnothman (Member):

I think this needs more of a loud warning in the changelog.

Successfully merging this pull request may close these issues.

Precision Recall numbers computed by Scikits are not interpolated (non-standard)
5 participants