[MRG] Bug fix and new feature: fix implementation of average precision score and add eleven-point interpolated option (7356 rebased) #9017
Conversation
I need to address the pending comments of #7356 and make the example pretty again, as it was rendered a bit horrible by the merge to master |
I reordered the example a bit (no major change though). I find it more readable in the HTML output. |
After a lot of back and forth, I have finally convinced myself that I trust this implementation of average_precision_score. I was worried about dealing with duplicates, so I added a stringent test. This is ready for merge. Can I have a review, @agramfort, @vene, @amueller? |
doc/whats_new.rst
Outdated
@@ -6,6 +6,35 @@ Release history | |||
=============== | |||
|
|||
Version 0.19 | |||
Version 0.18.2 |
Rebase issue: you need to move this to the stuff below.
possible with a recall value at least equal to the target value. | ||
The most common choice is 'eleven point' interpolated precision, where the | ||
desired recall values are [0, 0.1, 0.2, ..., 1.0]. This is the metric used in | ||
`The PASCAL Visual Object Classes (VOC) Challenge <http://citeseerx.ist.psu.edu |
Maybe this metric was? Sometimes? Or just remove that sentence?
sklearn/metrics/ranking.py
Outdated
For each of the recall values, r, in {0, 0.1, 0.2, ..., 1.0}, | ||
compute the arithmetic mean of the first precision value with a | ||
corresponding recall >= r. This is the metric used in the Pascal | ||
Visual Objects Classes (VOC) Challenge and is as described in the |
Maybe not mention Pascal VOC?
Well, they describe it in the docs, though they don't use it in the code. Maybe we should mention the docs.
yeah I don't know... but the current text could be misleading.
|
||
def _interpolated_average_precision_slow(y_true, y_score): | ||
"""A second implementation for the eleven-point interpolated average | ||
precision used by Pascal VOC. This should produce identical results to |
described in ir-book?
LGTM, gonna check out the example now
precision_recall_auc = _average_precision_slow(y_true, probas_pred) | ||
interpolated_average_precision = _interpolated_average_precision_slow( | ||
y_true, probas_pred) | ||
assert_array_almost_equal(precision_recall_auc, 0.859, 3) |
where does that number come from?
Agreed. Ping @ndingwall: we don't like to have hard-coded numbers in our test suites without being able to rederive them.
assert_equal(p.size, r.size) | ||
assert_equal(p.size, thresholds.size + 1) | ||
# Smoke test in the case of proba having only one value | ||
p, r, thresholds = precision_recall_curve(y_true, | ||
np.zeros_like(probas_pred)) | ||
precision_recall_auc = auc(r, p) | ||
assert_array_almost_equal(precision_recall_auc, 0.75, 3) | ||
assert_array_almost_equal(precision_recall_auc, 0.75) |
where does that number come from?
This number is actually meaningless, as the way above to compute the AUC of the PR curve is not the right way: it does not cater for edge effects. On a constant prediction, the average precision score is the TPR, here .5.
I am removing these lines. Anyhow, I added a correct test (which passes) at a different location.
@@ -510,15 +552,15 @@ def test_precision_recall_curve_toydata(): | |||
auc_prc = average_precision_score(y_true, y_score) | |||
assert_array_almost_equal(p, [0.5, 0., 1.]) | |||
assert_array_almost_equal(r, [1., 0., 0.]) | |||
assert_almost_equal(auc_prc, 0.25) | |||
assert_almost_equal(auc_prc, 0.5) |
maybe add a short explanation? Why is this one right and the other one wasn't?
explanation still missing ;)
|
||
def test_average_precision_constant_values(): | ||
# Check the average_precision_score of a constant predictor is | ||
# the tps |
tps? tpr?
not addressed
y_true[::4] = 1 | ||
# And a constant score | ||
y_score = np.ones(100) | ||
# The precision is then the fraction of positive whatever the recall |
"for all thresholds and recall values"?
No, because there is only one threshold.
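A short sketch of the test under discussion (sample sizes assumed from the quoted diff: 100 samples, every fourth one positive, constant scores); with the corrected average_precision_score, the score equals the fraction of positives:

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# A quarter of the labels are positive, the scores are all identical.
y_true = np.zeros(100, dtype=int)
y_true[::4] = 1
y_score = np.ones(100)

# Only one threshold exists, so the curve has a single real operating point
# with precision = fraction of positives (0.25) and recall = 1.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
print(precision, recall, thresholds)   # [0.25, 1.] [1., 0.] [1.]

# The corrected average precision therefore equals that fraction of positives.
print(average_precision_score(y_true, y_score))   # 0.25
```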
doc/whats_new.rst
Outdated
............ | ||
|
||
- Added a `'eleven-point'` interpolated average precision option to | ||
:func:`metrics.ranking.average_precision_score` as used in the `PASCAL |
is it?
plt.xlabel('Recall') | ||
plt.ylabel('Precision') | ||
plt.ylim([0.0, 1.05]) | ||
plt.xlim([0.0, 1.0]) | ||
plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0])) | ||
plt.title('Precision-Recall example: AUC={0:0.2f}'.format( | ||
average_precision["micro"])) | ||
plt.legend(loc="lower left") | ||
plt.show() |
I'd remove the show so both pop up at once.
y_score.ravel()) | ||
average_precision["micro"] = average_precision_score(y_test, y_score, | ||
average_precision["micro"] = average_precision_score(Y_test, y_score, |
Is this standard? I'm not sure I understand what that means. Maybe do binary for this example?
So I think multi-label is not that common, and I'm not sure this is a common strategy for multi-class. as a matter of fact, we don't have multi-class average precision (there is a PR for multi-class AUC and there is a bunch of literature on that!).
So I vote going binary here.
I'll do a binary first, and then illustrate micro
I have certainly seen this before in NLP even for multiclass.
sklearn/metrics/ranking.py
Outdated
"""Compute average precision (AP) from prediction scores | ||
|
||
This score corresponds to the area under the precision-recall curve. | ||
Optionally, this will compute an eleven-point interpolated average |
trailing whitespaces
Force-pushed from eecbc86 to bf2830e.
I've addressed the comments and rewritten the example to start with a simple case. The example can be seen here: |
Force-pushed from a554f3f to fec605f.
All green. Can I haz reviews / mrg ? |
doc/whats_new.rst
Outdated
@@ -6,6 +6,35 @@ Release history | |||
=============== | |||
|
|||
Version 0.19 | |||
============== | |||
|
Still rebase issue
Mostly nitpicks. I think it would be good to have an explanation in the tests and the whatsnew needs to be fixed, but I don't want to delay this because of minor issues in the example.
doc/whats_new.rst
Outdated
by the change in recall since the last operating point, as per the | ||
`Wikipedia entry <http://en.wikipedia.org/wiki/Average_precision>`_. | ||
(`#7356 <https://github.com/scikit-learn/scikit-learn/pull/7356>`_). By | ||
`Nick Dingwall`_. |
And @GaelVaroquaux ?
low false negative rate. High scores for both show that the classifier is | ||
returning accurate results (high precision), as well as returning a majority of | ||
all positive results (high recall). | ||
Precision-Recall is a useful measure of success of prediction when the |
Precision and recall? Or the precision-recall-curve?
yeah, and I would change the example title too: "Using precision and recall for classifier evaluation" or something. It's very awkward now.
And maybe replace below "very imbalanced" by "imbalanced"?
relevant results are returned. | ||
|
||
The precision-recall curve shows the tradeoff between precision and | ||
recall for different threshold. A high area under the curve represents |
decision thresholds?
high area -> large area?
random_state=random_state) | ||
|
||
# Create a simple classifier | ||
classifier = svm.SVC(kernel='linear', probability=True, |
Why????? How about `LogisticRegression`? Or not using `probability=True`? We don't need probabilities for the precision-recall curve and many people are not aware of that. So maybe using `SVC(linear)` is actually good, but we shouldn't use `probability=True`. What's the reason not to use `LinearSVC()`?
We're not even using `predict_proba`, right?
Aaaaaaaaaargh!
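A rough sketch of the reviewers' point: precision_recall_curve only needs a ranking score, so decision_function is enough and probability=True is unnecessary. The binary iris subset and LinearSVC below are assumptions for illustration, not the example's exact setup.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve, average_precision_score

# Binary problem: keep only two of the three iris classes.
X, y = datasets.load_iris(return_X_y=True)
X, y = X[y < 2], y[y < 2]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No probability=True needed: decision_function gives a perfectly good
# ranking score for the precision-recall curve.
classifier = svm.LinearSVC().fit(X_train, y_train)
y_score = classifier.decision_function(X_test)

precision, recall, thresholds = precision_recall_curve(y_test, y_score)
print(average_precision_score(y_test, y_score))
```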
random_state=random_state) | ||
|
||
# We use OneVsRestClassifier for multi-label prediction | ||
from sklearn.multiclass import OneVsRestClassifier | ||
# Run classifier | ||
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, |
again, why `probability=True`, why `SVC(linear)` instead of `LinearSVC`?
Yup!
|
||
print("Target recall Selected recall Precision") | ||
for i in range(11): | ||
print(" >= {} {:3.3f} {:3.3f}".format(i / 10, |
Nitpick: if we're using format strings, maybe we should, you know, actually use them for the padding?
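A sketch of that nitpick, letting the format spec itself handle alignment; `targets`, `selected_recalls` and `precisions` are placeholder values standing in for the numbers computed in the example.

```python
# Placeholder data; in the example these come from the precision-recall curve.
targets = [i / 10 for i in range(11)]
selected_recalls = [min(1.0, t + 0.03) for t in targets]
precisions = [1.0 - 0.04 * i for i in range(11)]

print("{:>13} {:>16} {:>10}".format("Target recall", "Selected recall", "Precision"))
for target, rec, prec in zip(targets, selected_recalls, precisions):
    # Width and float precision live in the format spec, so no manual spaces.
    print("{:>13} {:>16.3f} {:>10.3f}".format(">= {:.1f}".format(target), rec, prec))
```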
plt.fill_between(recall[iris_cls], precision[iris_cls], step='post', alpha=0.1, | ||
color='g') | ||
for i in range(11): | ||
plt.annotate('', |
If you put the '' on the next line you have more space ;)
Yeah, but I don't feel it improves things.
@@ -510,15 +552,15 @@ def test_precision_recall_curve_toydata(): | |||
auc_prc = average_precision_score(y_true, y_score) | |||
assert_array_almost_equal(p, [0.5, 0., 1.]) | |||
assert_array_almost_equal(r, [1., 0., 0.]) | |||
assert_almost_equal(auc_prc, 0.25) | |||
assert_almost_equal(auc_prc, 0.5) |
explanation still missing ;)
|
||
def test_average_precision_constant_values(): | ||
# Check the average_precision_score of a constant predictor is | ||
# the tps |
not addressed
|
||
where :math:`P_n` and :math:`R_n` are the precision and recall at the | ||
nth threshold. A pair :math:`(R_k, P_k)` is referred to as an | ||
*operating point*. |
Maybe mention that this is related to the integral under the curve but there are subtle differences?
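For reference, the definition being discussed (the one cited from the Wikipedia entry in the whats_new text above) weights each precision by the increase in recall from the previous threshold, which is a step-function integral under the curve rather than, say, a trapezoidal one:

```latex
\text{AP} = \sum_n (R_n - R_{n-1})\, P_n
```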
Addressed all of @amueller's comments |
doc/whats_new.rst
Outdated
- Added a `'eleven-point'` interpolated average precision option to | ||
:func:`metrics.ranking.average_precision_score` as described in the | ||
`PASCAL | ||
Visual Object Classes (VOC) Challenge <http://citeseerx.ist.psu.edu/viewdoc/ |
I guess line wrapping here is weird?
doc/whats_new.rst
Outdated
@@ -193,6 +201,13 @@ Enhancements | |||
Bug fixes | |||
......... | |||
|
|||
- :func:`metrics.ranking.average_precision_score` no longer linearly | |||
interpolates between operating points, and instead weights precisions |
weighs?
|
||
Precision-recall curves are typically used in binary classification to study | ||
the output of a classifier. In order to extend Precision-recall curve and | ||
the output of a classifier. In order to extend the Precision-recall curve and |
lowercase p
# In multi-label settings | ||
# ------------------------ | ||
# | ||
# Create multli-label data, fit, and predict |
multi
|
||
from sklearn.preprocessing import label_binarize | ||
|
||
# Use label_binarize to be multi-label like settings |
wat 😛
yeah :/. I don't like this part of the example, but that's what we have.
I just mean I don't understand what the comment is saying.
random_state=random_state) | ||
|
||
# We use OneVsRestClassifier for multi-label prediction | ||
from sklearn.multiclass import OneVsRestClassifier |
newline after
# Compute Precision-Recall and plot curve | ||
|
||
############################################################################### | ||
# The precision-Recall score in multi-label settings |
inconsistent capitalization. I'd suggest `precision-recall` everywhere.
I'm very confused now about what "the precision-recall score" is. Is it the same thing as average precision? I've never heard the former before. (just as two distinct scores.) The narrative at the top doesn't make it clear
plt.title('Precision-Recall example: AUC={0:0.2f}'.format(average_precision[0])) | ||
plt.legend(loc="lower left") | ||
plt.show() | ||
plt.title('Precision-Recall micro-averaged over all classes: AUC={0:0.2f}' |
same confusion. Maybe "precision and recall micro-averaged..."?
Or "micro average precision score"?
# ------------------------------ | ||
# | ||
# In *interpolated* average precision, a set of desired recall values is | ||
# specified and for each desired value, we average the best precision |
I'd move the comma to after the "and" on this line.
sklearn/metrics/ranking.py
Outdated
"""Compute average precision (AP) from prediction scores | ||
|
||
This score corresponds to the area under the precision-recall curve. | ||
Optionally, this will compute an eleven-point interpolated average | ||
precision score: for each of the 11 evenly-spaced target recall values |
maybe say eleven in words to make it easier to spot the connection to the interpolation arg below
sklearn/metrics/ranking.py
Outdated
precision, recall, thresholds = precision_recall_curve( | ||
y_true, y_score, sample_weight=sample_weight) | ||
return auc(recall, precision) | ||
# Return the step function integral | ||
return -np.sum(np.diff(recall) * np.array(precision)[:-1]) |
I think this works because the last entry of precision is guaranteed to be 1, as written in the docstring of `precision_recall_curve`; do you think this warrants a note in a comment here, for posterity?
Agreed
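A small numpy sketch of the step-function integral under discussion, using the toy labels and scores from the precision_recall_curve docstring (an assumption, just for illustration): because recall decreases towards the final (recall = 0, precision = 1) point, the diff is negative, and that trailing precision is dropped by the `[:-1]`.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, _ = precision_recall_curve(y_true, y_score)
# precision = [0.667, 0.5, 1., 1.], recall = [1., 0.5, 0.5, 0.]

# Each precision is weighted by the drop in recall to the next point; the
# trailing precision of 1 only enters through that drop, never directly.
ap_step = -np.sum(np.diff(recall) * precision[:-1])

print(ap_step, average_precision_score(y_true, y_score))  # both ~0.833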
precision = list(reversed(precision)) | ||
recall = list(reversed(recall)) | ||
indices = np.searchsorted(recall, np.arange(0, 1.1, 0.1)) | ||
return np.mean([max(precision[i:]) for i in indices]) |
confused: other than using lists instead of arrays, and other than not ignoring the first p-r pair with p=1.0, this seems identical to the actual implementation in this PR. What am I missing?
Agreed.
Should we remove this test?
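To make the comparison concrete, here is a sketch of the eleven-point scheme both snippets implement: for each target recall r in {0.0, 0.1, ..., 1.0}, take the best precision achieved at recall >= r, then average the eleven values (toy data assumed, same as in the sketch above).

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])
precision, recall, _ = precision_recall_curve(y_true, y_score)

# For each target recall, the interpolated precision is the best precision
# reachable at that recall or higher; the eleven values are then averaged.
eleven_point = np.mean([precision[recall >= r].max()
                        for r in np.linspace(0, 1, 11)])
print(eleven_point)
```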
assert_almost_equal(auc_prc, 0.25) | ||
# Here we are doing a terrible prediction: we are always getting | ||
# it wrong, hence the average_precision_score is the accuracy at | ||
# change: 50% |
chance
LGTM, I haven't recalculated the test results though. |
Remove the eleven-point average precision score. Add better tests.
After discussion with @vene, I have removed the eleven-point average precision from this PR. Sorry @ndingwall: we feel that getting the fixes of this PR merged in is very important, but we are not sure that we can warrant the correctness of the eleven-point variant. I will issue a new PR so that the work does not get lost, and @ndingwall can take over. |
doc/whats_new.rst
Outdated
@@ -6,7 +6,7 @@ Release history | |||
=============== | |||
|
|||
Version 0.19 | |||
============ | |||
============== |
what's up here?
@vene: fixed |
Merged |
I think this needs more of a loud warning in the changelog |
Rebased version of #7356
Fixes #4577 and #6377
What does this implement/fix? Explain your changes.
This adds an optional interpolation parameter to average_precision_score. By default, the value is set to None, which replicates the existing behavior, but there is also an 'eleven_point' option that implements the strategy described in Stanford's Introduction to Information Retrieval.
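As described above, the proposed call would have looked roughly like the sketch below; the interpolation keyword never landed, since the eleven-point option was removed from this PR before merge, and the toy data is for illustration only.

```python
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Default: the corrected, non-interpolated average precision.
ap = average_precision_score(y_true, y_score)

# Proposed (and later withdrawn) option: eleven-point interpolation as in
# the IR book / PASCAL VOC.  Kept commented out because it was never merged.
# ap11 = average_precision_score(y_true, y_score, interpolation='eleven_point')
```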