FIX remove np.divide with where and without out argument in precision_recall_curve
#24382
What does this implement/fix? Explain your changes.
`np.divide` with `where` and without an `out` argument has uninitialised values outside of the `where` condition. After #24245, I think this PR removes the only place where we use this in scikit-learn. See https://numpy.org/doc/stable/reference/ufuncs.html#ufunc and search for "where" (sorry, no direct link).
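A minimal illustration of the hazard (the array values here are just for the demo):

```python
import numpy as np

num = np.array([1.0, 2.0, 3.0])
den = np.array([2.0, 0.0, 4.0])

# Hazard: without `out`, entries where the condition is False are left
# uninitialized -- they hold whatever garbage was in the freshly
# allocated memory, so unsafe[1] is arbitrary.
unsafe = np.divide(num, den, where=den != 0)

# Safe pattern: pass an explicitly initialized `out` array, so entries
# outside the `where` condition are a well-defined 0.0.
safe = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
# safe == [0.5, 0.0, 0.75]
```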
Any other comments?
This `np.divide` was introduced in #19085 to fix the recall when `y_true` has only zeros; it did not affect precision. The behaviour was to return zeros, so I used that as the initialisation. Having looked at the code, I am fairly confident that `tps + fps` actually cannot be zero (unless there are strictly negative sample weights), but it seems a bit tricky to convince yourself of this, so maybe we should leave the `np.divide(..., where=denominator != 0)`? Alternatively, I could add a comment trying to summarise the following (not that easy).

The code is in `_binary_clf_curve`:

scikit-learn/sklearn/metrics/_ranking.py, line 703 in fd0e815
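A simplified sketch of the unweighted computation at that location (the toy labels are illustrative assumptions; real `_binary_clf_curve` derives `threshold_idxs` from the distinct sorted scores):

```python
import numpy as np

# Assume distinct scores sorted in descending order, so that
# threshold_idxs is simply [0, 1, ..., n-1].
y_true = np.array([1, 0, 1, 1])
tps = np.cumsum(y_true)                  # cumulative true positives
threshold_idxs = np.arange(len(y_true))
fps = 1 + threshold_idxs - tps           # cumulative false positives

# By construction tps + fps == 1 + threshold_idxs, which is always >= 1.
# Here: tps + fps == [1, 2, 3, 4]
```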
In the no-`sample_weight` case, `tps + fps = 1 + threshold_idxs`, which is never zero.

The `sample_weight` case is a bit more tricky:

- if `(sample_weight > 0).all()`, then `tps + fps > 0` and we are all good;
- if `(sample_weight == 0).any()`, the zero `sample_weight` entries actually get masked out in scikit-learn/sklearn/metrics/_ranking.py, lines 748 to 755 in fd0e815;
- if `(sample_weight < 0).any()`, `tps + fps` could be zero, but the consensus is that it does not make any sense to have negative sample weights 🤷. There seem to be some use cases in High Energy Physics, but I assume that this is more for fitting an estimator than for computing a precision-recall curve. Let's leave this discussion for a separate PR anyway; this is a can full of worms.

I tried pragmatically to find a case where `tps + fps == 0`, and the only way I managed to get this is with a negative `sample_weight`. Here is a snippet if you want to try for yourself:
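The snippet itself was not captured above; a minimal reconstruction of the idea, mimicking the weighted cumulative sums in `_binary_clf_curve` (the data values are illustrative assumptions):

```python
import numpy as np

# Illustrative data (assumed): one positive sample and one negative
# sample carrying a negative weight, with scores already sorted in
# descending order, as _binary_clf_curve does internally.
y_true = np.array([1, 0])
sample_weight = np.array([1.0, -1.0])

# Weighted cumulative true/false positives.
tps = np.cumsum(y_true * sample_weight)        # [1., 1.]
fps = np.cumsum((1 - y_true) * sample_weight)  # [0., -1.]

# The precision denominator tps + fps hits zero at the second threshold:
# tps + fps == [1., 0.]
```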