Calibration/infinite probability #17758


Conversation

yashika51
Contributor

Reference Issues/PRs

Fixes #10903

What does this implement/fix? Explain your changes.

This fixes the issue of infinite probabilities returned by predict_proba. I have added np.clip() to clip values to the range [0, 1].
Changes are in sklearn/calibration.py.
A test is also added in sklearn/tests/test_isotonic.py.

Any other comments?

@yashika51 yashika51 marked this pull request as draft June 27, 2020 14:59
@yashika51 yashika51 marked this pull request as ready for review June 28, 2020 19:50
@NicolasHug
Member

Thanks for the PR @yashika51

The proposed test does not fail on master so it doesn't seem like a proper regression test. Were you able to reproduce the original issue from #10903 ?

@yashika51
Contributor Author

yashika51 commented Jun 28, 2020

Thanks for the PR @yashika51

The proposed test does not fail on master so it doesn't seem like a proper regression test. Were you able to reproduce the original issue from #10903 ?

Hi @NicolasHug,
Yes, I got infinite probabilities returned multiple times, but the issue is only reproducible with a large dataset. With a small array, it might not fail. I picked some of the values involved in returning infinity and put them in the test function, as I read in the comments that the dataset cannot be included in the repository.

@yashika51
Contributor Author

yashika51 commented Jun 29, 2020

Hi @NicolasHug, it's difficult to find a very small array that returns infinity with predict_proba. However, it is clearly visible with a larger dataset.
cc: @flosincapite

@yashika51
Contributor Author

Hi @NicolasHug, adding a test with a small array to reproduce the issue is not working.
Could you please review the Colab link? I have reproduced the same error with a larger dataset there.
Thanks.

@yashika51
Contributor Author

Hi @NicolasHug and @thomasjpfan, it would be great if you could review this PR and suggest changes, if any :)

Member

@thomasjpfan thomasjpfan left a comment


Is this happening because there are some elements in np.sum(proba, axis=1) that are zero? (In predict_proba)

@@ -444,7 +444,8 @@ def predict_proba(self, X):
proba[np.isnan(proba)] = 1. / n_classes

# Deal with cases where the predicted probability minimally exceeds 1.0
proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0

np.clip(proba, 0.0, 1.0, proba)
Member


Suggested change
np.clip(proba, 0.0, 1.0, proba)
np.clip(proba, 0.0, 1.0, out=proba)

Member


I would even say:

Suggested change
np.clip(proba, 0.0, 1.0, proba)
np.clip(proba, 0, 1, out=proba)
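For reference, a quick standalone sketch (not part of this PR's diff) of what the suggested in-place call does:

```python
import numpy as np

# Toy probability vector with values slightly outside [0, 1]
proba = np.array([-0.1, 0.5, 1.2])

# out=proba clips in place instead of allocating a new array
np.clip(proba, 0, 1, out=proba)
print(proba)  # [0.  0.5 1. ]
```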

@@ -444,7 +444,8 @@ def predict_proba(self, X):
proba[np.isnan(proba)] = 1. / n_classes

# Deal with cases where the predicted probability minimally exceeds 1.0
Member


We should remove this comment now that it does not apply

@@ -535,3 +536,19 @@ def test_isotonic_thresholds(increasing):
assert all(np.diff(y_thresholds) >= 0)
else:
assert all(np.diff(y_thresholds) <= 0)


def test_infinite_probabilities():
Member


Currently this test passes on master.

Contributor Author


Is this happening because there are some elements in np.sum(proba, axis=1) that are zero? (In predict_proba)

That might be possible. I am trying to figure out the values with which the issue can be recreated, but it only shows up with a large dataset.
What alternatives do we have?

Member


See if np.sum(proba, axis=1) is zero for the large dataset and see why this is happening.

Contributor Author


okay

Contributor Author

@yashika51 yashika51 Jul 27, 2020


Hi @Thomas9292, so it seems like the infinity is returned by the classifiers

for calibrated_classifier in self.calibrated_classifiers_:
    proba = calibrated_classifier.predict_proba(X)
    mean_proba += proba

One way to show this issue is by serializing. Would that be a good idea?

Member

@thomasjpfan thomasjpfan Jul 28, 2020


I am trying to guide us to: "How do we generate a dataset so we can have a non-regression test?" We cannot use a serialized model, because we do not support deserializing between different versions of sklearn. Understanding the underlying issue can help us generate this dataset.

If the goal is to nail down exactly where this problem first arises, then the alternative will be a fair amount of archaeology--digging through the implementation to discern the ultimate source of these infinities.

Yes, this is the case here. Seeing where the bug came from would help us figure out if clipping is the correct solution.

After going through the data, it looks like np.inf comes from IsotonicRegression:

from sklearn.isotonic import IsotonicRegression
import numpy as np

X = np.array([0., 4.1e-320, 4.4e-314, 1.])
y = np.array([0.42, 0.42, 0.44, 0.44])

iso = IsotonicRegression().fit(X, y)

iso.predict(np.array([0, 2.1e-319, 5.4e-316, 1e-10]))
# array([0.42,  inf,  inf, 0.44])

In this case the np.clip from this PR would map the middle two values to 1, which would be incorrect.
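To illustrate the concern with a standalone sketch: np.clip maps +inf to the upper bound, so the spurious infinities would silently become 1.0 rather than something near the surrounding 0.42-0.44 values.

```python
import numpy as np

# Mimics the inf-contaminated output from the isotonic example above
proba = np.array([0.42, np.inf, np.inf, 0.44])

clipped = np.clip(proba, 0.0, 1.0)
print(clipped)  # [0.42 1.   1.   0.44]
```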

Contributor Author

@yashika51 yashika51 Jul 29, 2020


Thanks for clarifying that, @thomasjpfan. I am currently working on the categorical feature support issue; I'll try to resolve this later. Thanks for the reviews!

Member


My username is @thomasjpfan 😅

Contributor Author


Oops sorry about that :(

Contributor


No problem @yashika51, always happy to help and even happier to get credit for @thomasjpfan’s work 😉

@glemaitre
Member

Is there any connection with: #16321

@lucyleeow
Member

lucyleeow commented Aug 22, 2020

Is there any connection with: #16321

I actually just had a look at this, and using the solution in #17790 with strict=True does not fix the example Thomas gave above. Though if the example data had another 'step', it might fix it. Will look into it.

@lucyleeow
Member

lucyleeow commented Aug 27, 2020

I've had a look into this and, as mentioned earlier, I think it's due to isotonic regression, specifically scipy.interpolate.interp1d, not dealing well with very small numbers. Specifically here:

slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

we end up dividing by a very small number, resulting in an inf result.
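A minimal sketch of that overflow, using scalar values of the same magnitude as in the failing example (not the exact interp1d internals):

```python
import numpy as np

# x_hi - x_lo is a subnormal double, so the quotient overflows to inf
y_hi, y_lo = np.float64(0.44), np.float64(0.42)
x_hi, x_lo = np.float64(1e-320), np.float64(0.0)

with np.errstate(over="ignore"):
    slope = (y_hi - y_lo) / (x_hi - x_lo)
print(slope)  # inf
```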

In the example presented in #10903, setting random_state=1 for consistent results (sss = StratifiedShuffleSplit(n_splits = 10, test_size = 0.2, random_state=1)) gives us inf and -inf for 2 samples. These are both the result of very small proba values from GaussianNB (2.17e-322 and 1.67e-319). The cause of the -inf is because for binary classification we calculate the proba of the negative class:

if n_classes == 2:
    proba[:, 0] = 1. - proba[:, 1]

and 1 - np.inf is -inf.
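A standalone sketch of that sign flip:

```python
import numpy as np

# One sample whose positive-class probability came back as inf
proba = np.array([[0.0, np.inf]])

# The binary-classification branch quoted above
proba[:, 0] = 1. - proba[:, 1]
print(proba)  # [[-inf  inf]]
```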

Although there is some debate about the use of interp1d (see scipy/scipy#4304 (comment)), UnivariateSpline has the same problem but returns nan instead of inf.

I propose if method='isotonic', inf probas should be changed to 0.

WDYT @glemaitre @ogrisel ?

@ogrisel
Member

ogrisel commented Sep 7, 2020

Thanks for your analysis @lucyleeow.

I propose if method='isotonic', inf probas should be changed to 0.

Unfortunately I don't think it would work. Replacing the inf by 0 in the following example would put the prediction outside of the correct range:

>>> from sklearn.isotonic import IsotonicRegression
... import numpy as np
...
... X = np.array([0., 1e-320, 1e-314, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])
... IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316, 1e-313, 1e-10]))
array([0.42, 0.42,  inf,  inf, 0.44, 0.44])

We need to find a way to either unbreak the interpolation or preprocess the input of the interpolator to prevent this from happening.

Here is another example that also fails, but I am not sure why:

>>> from sklearn.isotonic import IsotonicRegression
... import numpy as np
...
... X = np.array([0., 1e-315, 1e-315, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])
... IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316, 1e-313, 1e-10]))
array([0.42,  inf,  inf,  inf, 0.43, 0.43])

@lucyleeow
Member

lucyleeow commented Sep 8, 2020

would make the prediction outside of the correct range:

Could we replace it with y_min instead of 0? Never mind, that won't work.

@lucyleeow
Member

lucyleeow commented Sep 10, 2020

Here is another example that also fails, but I am not sure why:

Here any number between 0 and 1e-315 will be inf.

In the first example:

... X = np.array([0., 1e-320, 1e-314, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])

any number between 0 and 1e-314 will be inf.

It seems that the issue is that small numbers below and above 0 will remain after the _make_unique function:

unique_X, unique_y, unique_sample_weight = _make_unique(
    X, y, sample_weight)

e.g.,
if X is array([0.e+000, 1.e-320, 1.e-314, 1.e+000]),
unique_X is array([0.e+000, 1.e-320, 1.e-314, 1.e+000])

if X is array([-1, (0. - 1e-315), 0, 0.1, 0.2, 1.])
unique_X is array([-1.e+000, -1.e-315, 0.e+000, 1.e-001, 2.e-001, 1.e+000])

whereas if X is array([0., 0.1, 0.2, (0.2 + 1e-320), (0.2 + 1e-320 + 1e-310), 1.])
unique_X is array([0. , 0.1, 0.2, 1. ])

(I think replacing with the y_min value would actually be a reasonable solution)
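_make_unique is a private helper, but plain np.unique shows the same floating-point behaviour: the subnormal offsets near 0.2 are far below its ULP (~2.8e-17) and round away, while subnormals near zero are representable and survive.

```python
import numpy as np

# Offsets near 0.2 round back to 0.2, so the duplicates collapse
X = np.array([0., 0.1, 0.2, 0.2 + 1e-320, 0.2 + 1e-320 + 1e-310, 1.])
print(np.unique(X))  # [0.  0.1 0.2 1. ]

# Near zero the subnormals are distinct values and all survive
X2 = np.array([0., 1e-320, 1e-314, 1.])
print(len(np.unique(X2)))  # 4
```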

@lucyleeow
Member

Actually I realised that changing inf outputs to y_min wouldn't work if the range of X includes negative numbers.

Thoughts @ogrisel ?

@lucyleeow
Member

So I think this is a limitation of the representation of floating-point numbers:

>>> 0 + 1e-310 == 0                                                                
False

whereas

>>> 3 + 1e-310 == 3
True

We could change inf values to predict(0), or ...?
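np.spacing makes this asymmetry visible: it returns the distance to the next representable double (the ULP) at a given value. A standalone sketch:

```python
import numpy as np

# ULP near 0 is the smallest subnormal (~4.9e-324), so 0 + 1e-310 is distinct
assert 0 + 1e-310 != 0
print(np.spacing(0.0))

# ULP of 3.0 is 2**-51 (~4.4e-16), so 3 + 1e-310 rounds straight back to 3
assert 3 + 1e-310 == 3
print(np.spacing(3.0))
```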

@ogrisel
Member

ogrisel commented Sep 16, 2020

In the method _call_linear (https://github.com/scipy/scipy/blob/3b4a30cbf580a1ea07bf48abf4bbea708b2018dd/scipy/interpolate/interpolate.py#L589-L616), what are the values of x_new, x_hi, x_lo, y_hi and y_low when the resulting y_new is inf?

@lucyleeow
Member

lucyleeow commented Sep 16, 2020


Using:

>>> X = np.array([0., 1e-320, 1e-314, 1.])
>>> y = np.array([0.42, 0.42, 0.44, 0.44])
>>> IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316]))
array([0.42, 0.42,  inf,  inf])

The values in:

        slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

are:

y_hi = array([[0.42], [0.42], [0.44], [0.44]])
y_lo = array([[0.42], [0.42], [0.42], [0.42]])
x_hi = array([1.e-320, 1.e-320, 1.e-314, 1.e-314])
x_lo = array([0.e+000, 0.e+000, 1.e-320, 1.e-320])

Note that if certain conditions are met, scipy.interpolate.interp1d delegates to numpy.interp for speed (I think the numpy version is written in C). In this case I forced it to use the scipy _call_linear path (by setting fill_value='extrapolate'), but the calculation should be exactly the same as in numpy.
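Plugging those values into the slope formula (a standalone reconstruction with 1-D arrays, not the interp1d code itself) reproduces the infinities directly:

```python
import numpy as np

y_hi = np.array([0.42, 0.42, 0.44, 0.44])
y_lo = np.array([0.42, 0.42, 0.42, 0.42])
x_hi = np.array([1e-320, 1e-320, 1e-314, 1e-314])
x_lo = np.array([0.0, 0.0, 1e-320, 1e-320])

# y_hi - y_lo is ~0.02 but x_hi - x_lo is subnormal: the division overflows
with np.errstate(over="ignore"):
    slope = (y_hi - y_lo) / (x_hi - x_lo)
print(slope)  # [ 0.  0. inf inf]
```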

@ogrisel
Member

ogrisel commented Oct 27, 2020

The source problem was fixed in #18639. Thanks for your contribution @yashika51 !

@ogrisel ogrisel closed this Oct 27, 2020
Labels
Superseded: PR has been replaced by a newer PR

Successfully merging this pull request may close these issues.

CalibratedClassifierCV with mode = 'isotonic' has predict_proba return infinite probabilities
8 participants