Implemented step-function interpolation for average_precision_score #6065
The current implementation of `average_precision_score` (in `metrics`) computes the area under the curve by linearly interpolating between operating points. This systematically mis-estimates the area under the precision-recall curve. As an example, consider a 'do-nothing' `y_score` of 0.5 for all examples. This is guaranteed to have an average precision score greater than 0.5 (assuming at least one ground-truth positive datum) under the current implementation, because it linearly interpolates between the two operating points (0, 1) and (1, x), where x is the fraction of positive examples in the sample.

This is particularly damaging in the case of a sparse sample with very few positive examples. It is very difficult to beat an average precision score of 0.5 with any genuine model, because we have to turn more false negatives into true positives than true negatives into false positives. (See this ppt for a concrete example.)
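As a quick check of the claim above (the 10% positive rate is an assumption chosen for illustration), the linear interpolation between (0, 1) and (1, x) forms a trapezoid whose area exceeds 0.5 for any x > 0:

```python
# With a constant y_score there is a single operating point (recall=1,
# precision=x), plus the conventional endpoint (recall=0, precision=1).
# Linear interpolation between them forms a trapezoid over recall [0, 1]:
x = 0.1  # hypothetical: 10% of the sample is positive
area = (1 + x) / 2
print(area)  # 0.55 -- strictly greater than 0.5 whenever x > 0
```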
I propose a first-step solution, which is to interpolate using the precision value of the next operating point (i.e. fill in gaps with a horizontal line from the operating point to the right). There are two justifications for this. First, if the underlying `y_scores` are continuous and we only have rounded versions of them, the best guess we can make for the 'original' `y_scores` is to order them randomly; with a large sample, a random ordering of `y_scores` approximates a horizontal line (with precision equal to the positive rate in the subsample). Second, this interpolation is supported in the literature: see for example the discussion on the LingPipe blog or the chapter in Manning et al.'s Information Retrieval book. The latter proposes a more extreme solution, choosing the highest p' for all r' > r, but this may lead to unexpected behavior since many operating points would be ignored.
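The proposed right-sided step interpolation can be sketched as follows (the function name `step_average_precision` is a hypothetical helper, not part of this PR's diff; it assumes arrays in the order returned by `precision_recall_curve`, with recall decreasing from 1 to 0):

```python
import numpy as np

def step_average_precision(precision, recall):
    """Area under the PR curve using right-sided step interpolation:
    each recall gap is filled with a horizontal line at the precision
    of the operating point to its right (higher recall)."""
    precision, recall = np.asarray(precision), np.asarray(recall)
    # recall decreases along the array, so np.diff(recall) is negative;
    # negate the sum to get a positive area.
    return -np.sum(np.diff(recall) * precision[:-1])

# A one-operating-point curve (constant score, 10% positives):
print(step_average_precision([0.1, 1.0], [1.0, 0.0]))  # 0.1
```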
To recreate the problem, consider:
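A minimal sketch of the constant-score case described above (the sample size of 50 with 5 positives is an assumption for illustration; the original snippet contained two inputs, of which this reproduces the constant-score one and computes both interpolations side by side):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical data: 5 positives among 50 examples (10% positive rate),
# with a 'do-nothing' constant score of 0.5 for every example.
y_true = np.array([1] * 5 + [0] * 45)
y_score = np.full(50, 0.5)

# Yields precision = [0.1, 1.0], recall = [1.0, 0.0]: a single operating
# point plus the conventional (recall=0, precision=1) endpoint.
precision, recall, _ = precision_recall_curve(y_true, y_score)

# Linear (trapezoidal) interpolation -- the current behavior:
ap_linear = -np.sum(np.diff(recall) * (precision[:-1] + precision[1:]) / 2)

# Right-sided step interpolation -- the proposed behavior:
ap_step = -np.sum(np.diff(recall) * precision[:-1])

print(ap_linear, ap_step)  # 0.55 0.1
```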
The current implementation gives average precision scores of 0.55... and 0.10... respectively, while my implementation gives 0.10... for both.
A problem with making this change is that the graph produced by naively plotting the recall and precision lists from `precision_recall_curve` would show a curve whose AUC is clearly different from the reported `average_precision_score`. The amended docstring should help to clarify the difference.