
FIX remove np.divide with where and without out argument in precision_recall_curve #24382


Merged — 3 commits merged into scikit-learn:main from np-divide-precision-recall on Sep 7, 2022

Conversation

Member

@lesteve lesteve commented Sep 7, 2022

What does this implement/fix? Explain your changes.

np.divide with where and without an out argument leaves uninitialized values outside of the where condition. After #24245, I think this PR removes the only place where we use this pattern in scikit-learn.

See https://numpy.org/doc/stable/reference/ufuncs.html#ufunc and search for "where" (sorry, no direct link):

Note that if an uninitialized return array is created, values of where=False will leave those values uninitialized.
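
A minimal sketch of the pitfall (toy arrays of my own, not the scikit-learn code):

import numpy as np

numerator = np.array([1.0, 2.0, 3.0])
denominator = np.array([2.0, 0.0, 4.0])

# Without `out`, np.divide allocates an uninitialized result array and only
# writes to it where the condition holds, so result[1] may contain arbitrary
# memory garbage rather than 0 or NaN.
result = np.divide(numerator, denominator, where=denominator != 0)
print(result)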

Any other comments?

This np.divide was introduced in #19085 to fix the recall when y_true has only zeros; it did not affect precision. The existing behaviour was to produce zeros there, so I used zeros as the initialization.
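
Concretely, the fix keeps the np.divide call but writes into a zero-initialized output array; roughly (a sketch with toy arrays, not the exact diff):

import numpy as np

# Toy tps/fps arrays standing in for the outputs of _binary_clf_curve.
tps = np.array([0.0, 1.0, 2.0])
fps = np.array([1.0, 1.0, 2.0])

ps = tps + fps
# Pre-initialized output: entries where ps == 0 keep the value 0 instead of
# whatever uninitialized memory np.divide would otherwise leave behind.
precision = np.zeros_like(tps)
np.divide(tps, ps, out=precision, where=ps != 0)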

Having looked at the code, I am fairly confident that tps + fps cannot actually be zero (unless some sample weights are strictly negative), but it is a bit tricky to convince yourself of this, so maybe we should keep the np.divide(..., where=denominator != 0)? Alternatively, I could add a comment trying to summarise the following reasoning (not that easy).

The code is in _binary_clf_curve:

def _binary_clf_curve(y_true, y_score, pos_label=None, sample_weight=None):

  • no sample_weight case: tps + fps = 1 + threshold_idxs, which is never zero
  • sample_weight case is a bit more tricky:
    tps + fps = stable_cumsum(y_true * weight) + stable_cumsum((1 - y_true) * weight)
    
    • if (sample_weight > 0).all(), then tps + fps > 0 and we are all good
    • if (sample_weight == 0).any(), zero-weight samples actually get masked out in

          # Filter out zero-weighted samples, as they should not impact the result
          if sample_weight is not None:
              sample_weight = column_or_1d(sample_weight)
              sample_weight = _check_sample_weight(sample_weight, y_true)
              nonzero_weight_mask = sample_weight != 0
              y_true = y_true[nonzero_weight_mask]
              y_score = y_score[nonzero_weight_mask]
              sample_weight = sample_weight[nonzero_weight_mask]

      so you never see them in the cumulative sum
    • if (sample_weight < 0).any(), tps + fps could be zero, but the consensus is that negative sample weights do not make any sense 🤷. There seem to be some use cases in High Energy Physics, but I assume that is more about fitting an estimator than computing a precision-recall curve. Let's leave this discussion for a separate PR anyway; this is a can of worms.

I tried pragmatically to find a case where tps + fps == 0, and the only way I managed to get one was with negative sample_weight. Here is a snippet if you want to try for yourself:

from sklearn.metrics._ranking import _binary_clf_curve, precision_recall_curve

y_true = [0, 0, 0, 0]
probas_pred = [0.7, 0.6, 0.5, 0.4]
sample_weight = [-1, 1, -1, 1]
fps, tps, thresholds = _binary_clf_curve(
    y_true, probas_pred, pos_label=1, sample_weight=sample_weight
)
print(f"{fps=}")
print(f"{tps=}")
print(f"{thresholds=}")
print(f"{fps+tps == 0=}")

precision, recall, threshold = precision_recall_curve(
    y_true, probas_pred, pos_label=1, sample_weight=sample_weight
)
print(f"{precision=}")

Output:

fps=array([-1.,  0., -1.,  0.])
tps=array([-0.,  0.,  0.,  0.])
thresholds=array([0.7, 0.6, 0.5, 0.4])
fps+tps == 0=array([False,  True, False,  True])

Member

@jeremiedbb jeremiedbb left a comment


LGTM.

To summarize, tps + fps can never be 0 (unless using negative sample weight, which doesn't really make sense here). We still use np.divide instead of just dividing in case someone really wants to use negative sample weights (although not officially supported). So this fix should not impact the behavior in the officially supported case and thus doesn't require a what's new entry.
I'm fine with that.

Member

betatim commented Sep 7, 2022

For my education: if ps can't be zero, why does the where= exist?

I can't quite convince myself either that ps can never be zero (without using negative weights), but somehow it makes sense from looking at the code of _binary_clf_curve(). The bit that makes me think it can't be zero is that fps is (⚠️ massive simplification) calculated as 1 - tps.
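
For context, the unweighted core of _binary_clf_curve boils down to something like the following (a paraphrase with toy inputs, not the exact source):

import numpy as np

# Toy inputs, already sorted by decreasing score.
y_true = np.array([0, 1, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5])

# Indices where the score changes, plus the last index.
distinct_value_indices = np.where(np.diff(y_score))[0]
threshold_idxs = np.r_[distinct_value_indices, y_true.size - 1]

tps = np.cumsum(y_true)[threshold_idxs]
fps = 1 + threshold_idxs - tps  # hence tps + fps == 1 + threshold_idxs >= 1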

Member

@ogrisel ogrisel left a comment


LGTM with an inline comment:

@glemaitre
Member

I quickly checked when we tackled the problem of getting some non-finite values, and it seems it was in this PR: #9980

Without negative sample weights, I could foresee the case where we are in a CV split and sum(sample_weight) == 0 (i.e. the split selects only samples with null weights). I would be surprised if nothing broke during the fit in that case, though.

+1 for the change even though we might never end up in this situation.

@jeremiedbb jeremiedbb merged commit fb22b4f into scikit-learn:main Sep 7, 2022
@lesteve lesteve deleted the np-divide-precision-recall branch September 7, 2022 12:29
glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Sep 12, 2022
…n_recall_curve` (scikit-learn#24382)

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>