Skip to content

Fix calibration_curve docstring for empty bins #12926

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kms15
Copy link
Contributor

@kms15 kms15 commented Jan 4, 2019

The current docstring for calibration.calibration_curve states that the return values are of shape (n_bins,) The code for calibration_curve however, will remove bins that are empty (which is not mentioned in the current docstring), thus the number of bins in the return value may be smaller than n_bins. This commit fixes the docstring to document the (existing) behavior of removing empty bins.

What does this implement/fix? Explain your changes.

When a bin is empty, calibration_curve will remove it from the list of bins, resulting in return values with less than n_bins:

>>> from sklearn.calibration import calibration_curve
>>> y_true = [True, True, False]
>>> y_prob = [0.9, 0.8, 0.2]
>>> prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=3)
>>> prob_true.shape
(2,)

The existing documentation for this function makes no mention of removing empty bins, however, and states that the return values with be of shape (n_bins,):

    prob_true : array, shape (n_bins,)
        The true probability in each bin (fraction of positives).
 
    prob_pred : array, shape (n_bins,)
        The mean predicted probability in each bin.

Obviously it's best of the documentation matches the behavior of the code, and changing the behavior of the code would risk breaking existing uses of the code. This check in thus updates the documentation to include the behavior of dropping empty bins.

Any other comments?

I encountered this in with real data when trying to generate confidence intervals for a calibration curve using cross validation. The model used weakly correlating features and thus high predicted probabilities were rare and not present in some folds. This lead to differing numbers of bins in each fold causing errors downstream, and the undocumented behavior slowed down the debugging process.

It some cases (such as mine) it would be better to support the behavior of the original documentation (always returning the requested number of bins, even if some are empty) by returning NaNs for empty bins which the end user can then handle appropriately. This could be safely added with a parameter such as drop_empty which could default to True for compatibility. I would be happy to create the pull request if there is interest, but I am assuming there is not enough interest to justify maintaining and testing another control path in the code.

The code for calibration_curve will remove bins that are empty, thus the
number of bins in the return value may be smaller than n_bins. This
commit fixes the docstring to document this (existing) behavior.
Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!


Returns
-------
prob_true : array, shape (n_bins,)
The true probability in each bin (fraction of positives).
prob_true : array, shape (n_non_empty_bins,)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be expressed as (n_bins,) or smaller to avoid confusion. or (n_bins,) with a comment below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think (n_bins,) or smaller would work well. If we just had (n_bins,) on the first line, however, I'd worry that it would be easy to miss the caveat/comment below. A footnote would be another option, but your first suggestion seems clearest to me. Should I modify the pull request?

@jnothman
Copy link
Member

jnothman commented Jan 8, 2019 via email

Changed `shape (n_non_empty_bins)` to `shape (n_bins,) or smaller` based
on reviewer feedback of pull request.
@jnothman jnothman merged commit f3b2579 into scikit-learn:master Jan 13, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants