FIX remove `np.divide` with `where` and without `out` argument in `precision_recall_curve` #24382
Conversation
LGTM.
To summarize, `tps + fps` can never be 0 (unless using negative sample weights, which don't really make sense here). We still use `np.divide` instead of a plain division in case someone really wants to use negative sample weights (although not officially supported). So this fix should not impact the behavior in the officially supported case and thus doesn't require a what's new entry.
I'm fine with that.
For my education: if I can't quite convince myself either that …
LGTM with an inline comment:

I quickly checked when we tackled the problem of getting some non-finite values, and it seems it was in this PR: #9980. Without negative sample weights, I would only foresee the case where we are in a CV. +1 for the change even though we might never end up in this situation.
What does this implement/fix? Explain your changes.
`np.divide` with `where` and without an `out` argument leaves uninitialised values outside of the `where` condition. After #24245, I think this PR removes the only place where we used this pattern in scikit-learn. See https://numpy.org/doc/stable/reference/ufuncs.html#ufunc and search for "where" (sorry, no direct link).
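For illustration, a tiny sketch of the pitfall (toy values, not taken from the PR):

```python
import numpy as np

num = np.array([1.0, 2.0])
den = np.array([0.0, 4.0])

# Without `out`, the entry where the condition is False is whatever
# happened to be in the freshly allocated result buffer.
unsafe = np.divide(num, den, where=den != 0)

# Pre-initialising the output makes the masked entry well defined.
safe = np.divide(num, den, out=np.zeros_like(num), where=den != 0)
print(safe)  # [0.   0.5]
```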
Any other comments?
This `np.divide` was introduced in #19085 to fix the recall when `y_true` has only zeros; precision was not affected. The behaviour was to produce zeros, so I used that as the initialisation. Having looked at the code, I am fairly confident that `tps + fps` can in fact never be zero (unless some sample weights are strictly negative), but it is a bit tricky to convince yourself of this, so maybe we should keep the `np.divide(..., where=denominator != 0)`? Alternatively, I could add a comment trying to summarise the following (not that easy).

The code is in `_binary_clf_curve` (`scikit-learn/sklearn/metrics/_ranking.py`, line 703 at fd0e815):
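For reference, a simplified sketch of that logic (the real function also validates its inputs and handles more cases; the helper name and data handling below are illustrative, not the actual scikit-learn implementation):

```python
import numpy as np

def binary_clf_curve_sketch(y_true, y_score, sample_weight=None):
    # Zero-weight samples are masked out up front, so any weight that
    # reaches the cumulative sums below is nonzero.
    if sample_weight is not None:
        nonzero = sample_weight != 0
        y_true, y_score = y_true[nonzero], y_score[nonzero]
        weight = sample_weight[nonzero]
    else:
        weight = np.ones_like(y_true, dtype=float)

    # Sort by decreasing score, then accumulate weighted positives
    # and negatives.
    order = np.argsort(y_score)[::-1]
    y_true, y_score, weight = y_true[order], y_score[order], weight[order]

    # Thresholds sit at the last index of each run of equal scores.
    distinct = np.where(np.diff(y_score))[0]
    threshold_idxs = np.r_[distinct, y_true.size - 1]

    tps = np.cumsum(y_true * weight)[threshold_idxs]
    fps = np.cumsum((1 - y_true) * weight)[threshold_idxs]
    # Hence tps + fps == cumsum(weight)[threshold_idxs]; without
    # sample_weight this is exactly 1 + threshold_idxs.
    return fps, tps, y_score[threshold_idxs]
```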
- Without `sample_weight`: `tps + fps = 1 + threshold_idxs`, which is never zero (cf. the cumulative sums in the sketch above).
- The `sample_weight` case is a bit more tricky:
  - if `(sample_weight > 0).all()`, then `tps + fps > 0` and we are all good;
  - if `(sample_weight == 0).any()`, zero sample weights actually get masked out in `scikit-learn/sklearn/metrics/_ranking.py`, lines 748 to 755 at fd0e815;
  - if `(sample_weight < 0).any()`, `tps + fps` could be zero, but the consensus is that negative sample weights do not make any sense here 🤷. There seem to be some use cases in High Energy Physics, but I assume that is more about fitting an estimator than computing a precision-recall curve. Let's leave that discussion for a separate PR anyway; it is a can full of worms.

I tried pragmatically to find a case where `tps + fps == 0`, and the only way I managed to get it is with a negative `sample_weight`. Here is a snippet if you want to try for yourself:
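Something along these lines (the exact data and weights are illustrative, chosen so the cumulative weight hits zero at one threshold):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Three distinct scores; the negative weight cancels the first sample,
# so tps + fps (the cumulative weight) is exactly zero at the second
# threshold.
y_true = np.array([1, 0, 1])
y_score = np.array([0.9, 0.8, 0.7])
sample_weight = np.array([1.0, -1.0, 1.0])

precision, recall, thresholds = precision_recall_curve(
    y_true, y_score, sample_weight=sample_weight
)
print("precision:", precision)
print("recall:", recall)
```

With the `where`-only `np.divide`, the precision entry at the zero-denominator threshold is read from an uninitialised buffer, so the printed output is not deterministic across runs; with a zero-initialised `out` it is always `0`.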