Calibration/infinite probability #17758


Conversation

yashika51
Contributor

Reference Issues/PRs

Fixes #10903

What does this implement/fix? Explain your changes.

This fixes the issue of infinite probabilities returned by predict_proba. I have added np.clip() to clip values to the range [0, 1].
Changes are in sklearn/calibration.py.
A test is also added in sklearn/tests/test_isotonic.py.

Any other comments?

@yashika51 yashika51 marked this pull request as draft June 27, 2020 14:59
@yashika51 yashika51 marked this pull request as ready for review June 28, 2020 19:50
@NicolasHug
Member

Thanks for the PR @yashika51

The proposed test does not fail on master so it doesn't seem like a proper regression test. Were you able to reproduce the original issue from #10903 ?

@yashika51
Contributor Author

yashika51 commented Jun 28, 2020

Thanks for the PR @yashika51

The proposed test does not fail on master so it doesn't seem like a proper regression test. Were you able to reproduce the original issue from #10903 ?

Hi @NicolasHug,
Yes, I got infinite probabilities returned multiple times, but the issue is only reproducible with a large dataset. With a small array, it might not fail. I picked some of the values involved in returning infinity and put them in the test function, as I read in the comments that the dataset cannot be included in the repository.

@yashika51
Contributor Author

yashika51 commented Jun 29, 2020

Hi @NicolasHug, it's difficult to find a very small array that returns infinity with predict_proba. However, it is clearly visible with a larger dataset.
cc: @flosincapite

@yashika51
Contributor Author

Hi @NicolasHug, adding a test with a small array to reproduce the issue is not working.
Could you please review the Colab link? I have reproduced the same error with a larger dataset there.
Thanks.

@yashika51
Contributor Author

Hi @NicolasHug and @thomasjpfan, it would be great if you could review this PR and suggest changes, if any :)

Member

@thomasjpfan thomasjpfan left a comment


Is this happening because there are some elements in np.sum(proba, axis=1) that are zero? (In predict_proba)

@@ -444,7 +444,8 @@ def predict_proba(self, X):
proba[np.isnan(proba)] = 1. / n_classes

# Deal with cases where the predicted probability minimally exceeds 1.0
proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0

np.clip(proba, 0.0, 1.0, proba)
Member


Suggested change
np.clip(proba, 0.0, 1.0, proba)
np.clip(proba, 0.0, 1.0, out=proba)

Member


I would even say:

Suggested change
np.clip(proba, 0.0, 1.0, proba)
np.clip(proba, 0, 1, out=proba)
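For reference, a quick standalone sketch (not part of this PR's diff) of what the suggested in-place call does:

```python
import numpy as np

# Toy probability vector with values slightly outside [0, 1]
proba = np.array([-0.1, 0.5, 1.2])

# out=proba clips in place instead of allocating a new array
np.clip(proba, 0, 1, out=proba)
print(proba)  # [0.  0.5 1. ]
```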

@@ -444,7 +444,8 @@ def predict_proba(self, X):
proba[np.isnan(proba)] = 1. / n_classes

# Deal with cases where the predicted probability minimally exceeds 1.0
Member


We should remove this comment now that it does not apply

@@ -535,3 +536,19 @@ def test_isotonic_thresholds(increasing):
assert all(np.diff(y_thresholds) >= 0)
else:
assert all(np.diff(y_thresholds) <= 0)


def test_infinite_probabilities():
Member


Currently this test passes on master.

Contributor Author


Is this happening because there are some elements in np.sum(proba, axis=1) that are zero? (In predict_proba)

That might be possible. I am trying to figure out the values with which the issue can be recreated, but it only shows up with a large dataset.
What alternatives do we have?

Member


See if np.sum(proba, axis=1) is zero for the large dataset and see why this is happening.

Contributor Author


okay

Contributor Author

@yashika51 yashika51 Jul 27, 2020


Hi @Thomas9292, so it seems like the infinity is returned by the classifiers

for calibrated_classifier in self.calibrated_classifiers_:
    proba = calibrated_classifier.predict_proba(X)
    mean_proba += proba

One way to show this issue is by serializing. Would that be a good idea?

Member

@thomasjpfan thomasjpfan Jul 28, 2020


I am trying to guide us to: "How do we generate a dataset so we can have a non-regression test?" We cannot use a serialized model, because we do not support deserializing between different versions of sklearn. Understanding the underlying issue can help us generate this dataset.

If the goal is to nail down exactly where this problem first arises, then the alternative will be a fair amount of archaeology--digging through the implementation to discern the ultimate source of these infinities.

Yes, this is the case here. Seeing where the bug came from would help us figure out if clipping is the correct solution.

After going through the data, it looks like np.inf comes from IsotonicRegression:

from sklearn.isotonic import IsotonicRegression
import numpy as np

X = np.array([0., 4.1e-320, 4.4e-314, 1.])
y = np.array([0.42, 0.42, 0.44, 0.44])

iso = IsotonicRegression().fit(X, y)

iso.predict(np.array([0, 2.1e-319, 5.4e-316, 1e-10]))
# array([0.42,  inf,  inf, 0.44])

In this case the np.clip from this PR would map the middle two values to 1, which would be incorrect.
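To illustrate the concern with a standalone sketch: np.clip maps +inf to the upper bound, so the spurious infinities would silently become 1.0 rather than something near the surrounding 0.42-0.44 values.

```python
import numpy as np

# Mimics the inf-contaminated output from the isotonic example above
proba = np.array([0.42, np.inf, np.inf, 0.44])

clipped = np.clip(proba, 0.0, 1.0)
print(clipped)  # [0.42 1.   1.   0.44]
```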

Contributor Author

@yashika51 yashika51 Jul 29, 2020


Thanks for clarifying that, @thomasjpfan. I am currently working on the categorical feature support issue; I'll try to resolve this later. Thanks for the reviews!

Member


My username is @thomasjpfan 😅

Contributor Author


Oops sorry about that :(

Contributor


No problem @yashika51, always happy to help and even happier to get credit for @thomasjpfan’s work 😉

@glemaitre
Member

Is there any connection with: #16321

@lucyleeow
Member

lucyleeow commented Aug 22, 2020

Is there any connection with: #16321

I actually just had a look at this, and using the solution in #17790 with strict=True does not fix the example Thomas gave above. Though if the example data had another 'step', it might fix it. Will look into it.

@lucyleeow
Member

lucyleeow commented Aug 27, 2020

I've had a look into this and, as mentioned earlier, I think it's due to isotonic regression, specifically scipy.interpolate.interp1d, not dealing well with very small numbers. Specifically here:

slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

we end up dividing by a very small number, resulting in an inf result.
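A minimal sketch of that overflow, using scalar values of the same magnitude as in the failing example (not the exact interp1d internals):

```python
import numpy as np

# x_hi - x_lo is a subnormal double, so the quotient overflows to inf
y_hi, y_lo = np.float64(0.44), np.float64(0.42)
x_hi, x_lo = np.float64(1e-320), np.float64(0.0)

with np.errstate(over="ignore"):
    slope = (y_hi - y_lo) / (x_hi - x_lo)
print(slope)  # inf
```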

In the example presented in #10903, setting random_state=1 for consistent results (sss = StratifiedShuffleSplit(n_splits = 10, test_size = 0.2, random_state=1)) gives us inf and -inf for 2 samples. These are both the result of very small proba values from GaussianNB (2.17e-322 and 1.67e-319). The cause of the -inf is because for binary classification we calculate the proba of the negative class:

if n_classes == 2:
    proba[:, 0] = 1. - proba[:, 1]

and 1 - np.inf is -inf.
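A standalone sketch of that sign flip:

```python
import numpy as np

# One sample whose positive-class probability came back as inf
proba = np.array([[0.0, np.inf]])

# The binary-classification branch quoted above
proba[:, 0] = 1. - proba[:, 1]
print(proba)  # [[-inf  inf]]
```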

Although there is some debate about the use of interp1d (see scipy/scipy#4304 (comment)), UnivariateSpline has the same problem but returns nan instead of inf.

I propose if method='isotonic', inf probas should be changed to 0.

WDYT @glemaitre @ogrisel ?

@ogrisel
Member

ogrisel commented Sep 7, 2020

Thanks for your analysis @lucyleeow.

I propose if method='isotonic', inf probas should be changed to 0.

Unfortunately I don't think it would work. Replacing the inf by 0 in the following example would put the prediction outside of the correct range:

>>> from sklearn.isotonic import IsotonicRegression
... import numpy as np
...
... X = np.array([0., 1e-320, 1e-314, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])
... IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316, 1e-313, 1e-10]))
array([0.42, 0.42,  inf,  inf, 0.44, 0.44])

We need to find a way to either unbreak the interpolation or preprocess the input of the interpolator to prevent this from happening.

Here is another example that also fails, but I am not sure why:

>>> from sklearn.isotonic import IsotonicRegression
... import numpy as np
...
... X = np.array([0., 1e-315, 1e-315, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])
... IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316, 1e-313, 1e-10]))
array([0.42,  inf,  inf,  inf, 0.43, 0.43])

@lucyleeow
Member

lucyleeow commented Sep 8, 2020

would make the prediction outside of the correct range:

Could we replace it with y_min instead of 0? Never mind, that won't work.

@lucyleeow
Member

lucyleeow commented Sep 10, 2020

Here is another example that also fails, but I am not sure why:

Here any number between 0 and 1e-315 will be inf.

In the first example:

... X = np.array([0., 1e-320, 1e-314, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])

any number between 0 and 1e-314 will be inf.

It seems that the issue is that small numbers below and above 0 will remain after the _make_unique function:

unique_X, unique_y, unique_sample_weight = _make_unique(
    X, y, sample_weight)

e.g.,
if X is array([0.e+000, 1.e-320, 1.e-314, 1.e+000]),
unique_X is array([0.e+000, 1.e-320, 1.e-314, 1.e+000])

if X is array([-1, (0. - 1e-315), 0, 0.1, 0.2, 1.])
unique_X is array([-1.e+000, -1.e-315, 0.e+000, 1.e-001, 2.e-001, 1.e+000])

whereas if X is array([0., 0.1, 0.2, (0.2 + 1e-320), (0.2 + 1e-320 + 1e-310), 1.])
unique_X is array([0. , 0.1, 0.2, 1. ])

(I think replacing with the y_min value would actually be a reasonable solution)
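_make_unique is a private helper, but plain np.unique shows the same floating-point behaviour: the subnormal offsets near 0.2 are far below its ULP (~2.8e-17) and round away, while subnormals near zero are representable and survive.

```python
import numpy as np

# Offsets near 0.2 round back to 0.2, so the duplicates collapse
X = np.array([0., 0.1, 0.2, 0.2 + 1e-320, 0.2 + 1e-320 + 1e-310, 1.])
print(np.unique(X))  # [0.  0.1 0.2 1. ]

# Near zero the subnormals are distinct values and all survive
X2 = np.array([0., 1e-320, 1e-314, 1.])
print(len(np.unique(X2)))  # 4
```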

@lucyleeow
Member

Actually I realised that changing inf outputs to y_min wouldn't work if the range of X includes negative numbers.

Thoughts @ogrisel ?

@lucyleeow
Member

So I think this is a limitation of the representation of floating-point numbers:

>>> 0 + 1e-310 == 0                                                                
False

whereas

>>> 3 + 1e-310 == 3
True

We could change inf values to predict(0), or ...?
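np.spacing makes this asymmetry visible: it returns the distance to the next representable double (the ULP) at a given value. A standalone sketch:

```python
import numpy as np

# ULP near 0 is the smallest subnormal (~4.9e-324), so 0 + 1e-310 is distinct
assert 0 + 1e-310 != 0
print(np.spacing(0.0))

# ULP of 3.0 is 2**-51 (~4.4e-16), so 3 + 1e-310 rounds straight back to 3
assert 3 + 1e-310 == 3
print(np.spacing(3.0))
```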

@ogrisel
Member

ogrisel commented Sep 16, 2020

In the method _call_linear (https://github.com/scipy/scipy/blob/3b4a30cbf580a1ea07bf48abf4bbea708b2018dd/scipy/interpolate/interpolate.py#L589-L616), what are the values of x_new, x_hi, x_lo, y_hi and y_low when the resulting y_new is inf?

@lucyleeow
Member

lucyleeow commented Sep 16, 2020


Using:

>>> X = np.array([0., 1e-320, 1e-314, 1.])
>>> y = np.array([0.42, 0.42, 0.44, 0.44])
>>> IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316]))
array([0.42, 0.42,  inf,  inf])

The values in:

        slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

are:

y_hi = array([[0.42], [0.42], [0.44], [0.44]])
y_lo = array([[0.42], [0.42], [0.42], [0.42]])
x_hi = array([1.e-320, 1.e-320, 1.e-314, 1.e-314])
x_lo = array([0.e+000, 0.e+000, 1.e-320, 1.e-320])

Note that if certain conditions are met, scipy.interpolate.interp1d delegates to numpy.interp for speed (I think the numpy version is written in C). In this case I forced it to use the scipy _call_linear path (by setting fill_value='extrapolate'), but the calculation should be exactly the same as in numpy.
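Plugging those values into the slope formula (a standalone reconstruction with 1-D arrays, not the interp1d code itself) reproduces the infinities directly:

```python
import numpy as np

y_hi = np.array([0.42, 0.42, 0.44, 0.44])
y_lo = np.array([0.42, 0.42, 0.42, 0.42])
x_hi = np.array([1e-320, 1e-320, 1e-314, 1e-314])
x_lo = np.array([0.0, 0.0, 1e-320, 1e-320])

# y_hi - y_lo is ~0.02 but x_hi - x_lo is subnormal: the division overflows
with np.errstate(over="ignore"):
    slope = (y_hi - y_lo) / (x_hi - x_lo)
print(slope)  # [ 0.  0. inf inf]
```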

@ogrisel
Member

ogrisel commented Oct 27, 2020

The source problem was fixed in #18639. Thanks for your contribution @yashika51 !

@ogrisel ogrisel closed this Oct 27, 2020
Labels
Superseded: PR has been replaced by a newer PR

Successfully merging this pull request may close these issues.

CalibratedClassifierCV with mode = 'isotonic' has predict_proba return infinite probabilities
8 participants