Fix calibration_curve docstring for empty bins #12926

kms15 · 2019-01-04T18:17:13Z

The current docstring for calibration.calibration_curve states that the return values are of shape (n_bins,). However, the code for calibration_curve removes empty bins (which the current docstring does not mention), so the number of bins in the return value may be smaller than n_bins. This commit fixes the docstring to document the (existing) behavior of removing empty bins.

What does this implement/fix? Explain your changes.

When a bin is empty, calibration_curve removes it from the list of bins, so the return values can have fewer than n_bins elements:

>>> from sklearn.calibration import calibration_curve
>>> y_true = [True, True, False]
>>> y_prob = [0.9, 0.8, 0.2]
>>> prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=3)
>>> prob_true.shape
(2,)
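To see why only two values come back, the per-bin counts can be sketched without any dependencies. This assumes uniform-width bins on [0, 1], which approximates but does not reproduce scikit-learn's exact bin-edge handling; `bin_counts` is a hypothetical helper, not part of the library.

```python
def bin_counts(y_prob, n_bins):
    """Count how many predictions fall into each of n_bins uniform bins on [0, 1].

    Hypothetical helper for illustration only; not scikit-learn API.
    """
    counts = [0] * n_bins
    for prob in y_prob:
        # A probability of exactly 1.0 is assigned to the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts


counts = bin_counts([0.9, 0.8, 0.2], n_bins=3)
print(counts)  # [1, 0, 2]
```

The middle bin receives no predictions, which is the bin that is dropped from the output in the example above.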

The existing documentation for this function makes no mention of removing empty bins, however, and states that the return values will be of shape (n_bins,):

    prob_true : array, shape (n_bins,)
        The true probability in each bin (fraction of positives).
 
    prob_pred : array, shape (n_bins,)
        The mean predicted probability in each bin.

Obviously it's best if the documentation matches the behavior of the code, and changing the behavior of the code would risk breaking existing uses of it. This commit thus updates the documentation to describe the behavior of dropping empty bins.

Any other comments?

I encountered this with real data when trying to generate confidence intervals for a calibration curve using cross-validation. The model used weakly correlated features, so high predicted probabilities were rare and not present in some folds. This led to differing numbers of bins in each fold, causing errors downstream, and the undocumented behavior slowed down the debugging process.

In some cases (such as mine) it would be better to support the behavior of the original documentation (always returning the requested number of bins, even if some are empty) by returning NaNs for empty bins, which the end user can then handle appropriately. This could be safely added with a parameter such as drop_empty, which could default to True for compatibility. I would be happy to create the pull request if there is interest, but I am assuming there is not enough interest to justify maintaining and testing another control path in the code.
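As a rough illustration of the drop_empty idea, here is a minimal dependency-free sketch. The function name `calibration_curve_fixed` and its uniform-width binning are assumptions made for this example; it is not the scikit-learn implementation, and edge cases at bin boundaries may differ from the library's behavior.

```python
import math


def calibration_curve_fixed(y_true, y_prob, n_bins=5, drop_empty=True):
    """Hypothetical sketch of a calibration curve with optional NaN padding.

    Not scikit-learn API. With drop_empty=False, empty bins are reported
    as NaN so that both returned lists always have length n_bins.
    """
    sum_true = [0.0] * n_bins
    sum_pred = [0.0] * n_bins
    counts = [0] * n_bins
    for label, prob in zip(y_true, y_prob):
        # Uniform-width bins on [0, 1]; a probability of 1.0 goes in the last bin.
        idx = min(int(prob * n_bins), n_bins - 1)
        sum_true[idx] += float(label)
        sum_pred[idx] += prob
        counts[idx] += 1

    prob_true, prob_pred = [], []
    for s_true, s_pred, n in zip(sum_true, sum_pred, counts):
        if n > 0:
            prob_true.append(s_true / n)  # fraction of positives in the bin
            prob_pred.append(s_pred / n)  # mean predicted probability in the bin
        elif not drop_empty:
            prob_true.append(math.nan)
            prob_pred.append(math.nan)
    return prob_true, prob_pred


y_true = [True, True, False]
y_prob = [0.9, 0.8, 0.2]
dropped, _ = calibration_curve_fixed(y_true, y_prob, n_bins=3)
padded, _ = calibration_curve_fixed(y_true, y_prob, n_bins=3, drop_empty=False)
print(len(dropped), len(padded))  # 2 3
```

With the padded form, results from different cross-validation folds have the same length and can be stacked, and NaN-aware reductions can then skip the empty bins.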

The code for calibration_curve will remove bins that are empty, thus the number of bins in the return value may be smaller than n_bins. This commit fixes the docstring to document this (existing) behavior.

jnothman

Thanks!

jnothman · 2019-01-07T20:14:25Z

sklearn/calibration.py


    Returns
    -------
-    prob_true : array, shape (n_bins,)
-        The true probability in each bin (fraction of positives).
+    prob_true : array, shape (n_non_empty_bins,)


I wonder if this should be expressed as `(n_bins,) or smaller` to avoid confusion, or as `(n_bins,)` with a comment below.

kms15 · 2019-01-07

I think `(n_bins,) or smaller` would work well. If we just had `(n_bins,)` on the first line, however, I'd worry that it would be easy to miss the caveat/comment below. A footnote would be another option, but your first suggestion seems clearest to me. Should I modify the pull request?

jnothman · 2019-01-08T08:02:29Z

Yes, please add a commit

Changed `shape (n_non_empty_bins)` to `shape (n_bins,) or smaller` based on reviewer feedback of pull request.

Fix calibration_curve docstring for empty bins

2a9d3f7

jnothman approved these changes Jan 7, 2019

jnothman reviewed Jan 7, 2019

Improve clarity of docstring for calibration_curve

ef3a16b

jnothman approved these changes Jan 13, 2019

jnothman merged commit f3b2579 into scikit-learn:master Jan 13, 2019

jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 19, 2019

DOC mention handlng of empty bins in calibration_curve docstring (scikit-learn#12926)

18f156b

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

DOC mention handlng of empty bins in calibration_curve docstring (scikit-learn#12926)

f1e78a6

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "DOC mention handlng of empty bins in calibration_curve docstring (scikit-learn#12926)"

62a7620

This reverts commit f1e78a6.

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019

Revert "DOC mention handlng of empty bins in calibration_curve docstring (scikit-learn#12926)"

6c938fb

This reverts commit f1e78a6.

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019

DOC mention handlng of empty bins in calibration_curve docstring (scikit-learn#12926)

0370945
