Calibration/infinite probability #17758
Conversation
Thanks for the PR @yashika51. The proposed test does not fail on master, so it doesn't seem like a proper regression test. Were you able to reproduce the original issue from #10903?
Hi @NicolasHug, it's difficult to find a very small array that returns infinity with predict_proba. However, it is clearly visible with a larger dataset.
Hi @NicolasHug, adding a test with a small array to reproduce the issue is not working.
Hi @NicolasHug and @thomasjpfan, it would be great if you could review this PR and suggest changes, if any :)
Is this happening because there are some elements in np.sum(proba, axis=1) that are zero? (In predict_proba.)
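If some rows of proba sum to zero, a row-wise renormalization would indeed produce nan. A minimal sketch of that failure mode with plain numpy (an illustration only, not the actual calibration code):

import numpy as np

# An all-zero row makes the row-wise normalization divide by zero.
proba = np.array([[0.0, 0.0], [0.2, 0.6]])
with np.errstate(divide="ignore", invalid="ignore"):
    normalized = proba / np.sum(proba, axis=1)[:, np.newaxis]
print(normalized)
# [[ nan  nan]
#  [0.25 0.75]]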
@@ -444,7 +444,8 @@ def predict_proba(self, X):
        proba[np.isnan(proba)] = 1. / n_classes

        # Deal with cases where the predicted probability minimally exceeds 1.0
        proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0

        np.clip(proba, 0.0, 1.0, proba)
Suggested change:
- np.clip(proba, 0.0, 1.0, proba)
+ np.clip(proba, 0.0, 1.0, out=proba)
I would even say:

Suggested change:
- np.clip(proba, 0.0, 1.0, proba)
+ np.clip(proba, 0, 1, out=proba)
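For reference, np.clip can write its result in place when the array is passed as out, avoiding an intermediate copy. A tiny standalone demo (not part of the diff):

import numpy as np

proba = np.array([[-0.1, 0.5], [1.2, 0.4]])
np.clip(proba, 0, 1, out=proba)  # clips in place; returns the same array
print(proba)
# [[0.  0.5]
#  [1.  0.4]]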
@@ -444,7 +444,8 @@ def predict_proba(self, X):
        proba[np.isnan(proba)] = 1. / n_classes

        # Deal with cases where the predicted probability minimally exceeds 1.0
We should remove this comment now that it does not apply
@@ -535,3 +536,19 @@ def test_isotonic_thresholds(increasing):
        assert all(np.diff(y_thresholds) >= 0)
    else:
        assert all(np.diff(y_thresholds) <= 0)


def test_infinite_probabilities():
Currently this test passes on master.
Is this happening because there are some elements in np.sum(proba, axis=1) that are zero? (In predict_proba.)

That might be possible. I am trying to figure out the values with which the issue can be recreated, but it only shows up with a large dataset. What alternatives can we take?
See if np.sum(proba, axis=1) is zero for the large dataset and see why this is happening.
okay
Hi @Thomas9292, so it seems like the infinity is returned by the classifiers:

scikit-learn/sklearn/calibration.py, lines 345 to 347 in f642ff7:

for calibrated_classifier in self.calibrated_classifiers_:
    proba = calibrated_classifier.predict_proba(X)
    mean_proba += proba
One way to show this issue is by serializing. Would that be a good idea?
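Once any ensemble member returns inf, the accumulated mean is inf as well. A toy illustration (not the library code):

import numpy as np

# If one calibrated classifier emits inf, averaging cannot recover.
member_probas = [np.array([[0.6, 0.4]]), np.array([[np.inf, 0.0]])]
mean_proba = sum(member_probas) / len(member_probas)
print(mean_proba)  # [[inf 0.2]]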
I am trying to guide us to: "How do we generate a dataset so we can have a non-regression test?" We cannot use a serialized model, because we do not support unserializing between different versions of sklearn. Understanding the underlying issue can help us generate this dataset.
If the goal is to nail down exactly where this problem first arises, then the alternative will be a fair amount of archaeology: digging through the implementation to discern the ultimate source of these infinities.
Yes, this is the case here. Seeing where the bug came from would help us figure out if clipping is the correct solution.
After going through the data, it looks like the np.inf comes from IsotonicRegression:

from sklearn.isotonic import IsotonicRegression
import numpy as np

X = np.array([0., 4.1e-320, 4.4e-314, 1.])
y = np.array([0.42, 0.42, 0.44, 0.44])

iso = IsotonicRegression().fit(X, y)
iso.predict(np.array([0, 2.1e-319, 5.4e-316, 1e-10]))
# array([0.42, inf, inf, 0.44])
In this case the np.clip from this PR would map the middle two values to 1, which would be incorrect.
Thanks for clarifying that @thomasjpfan. I am working on the categorical feature support issue currently; I'll try to resolve this later. Thanks for the reviews!
My username is @thomasjpfan 😅
Oops sorry about that :(
No problem @yashika51, always happy to help and even happier to get credit for @thomasjpfan's work 😉
Is there any connection with #16321?
I've had a look into this and, as mentioned prior, I think it's due to isotonic regression. Specifically, in

slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

we end up dividing by a very small number, resulting in an inf. In the example presented in #10903, setting … (see scikit-learn/sklearn/calibration.py, lines 493 to 494 in 19c8b19) …

Although there is some debate about the use of …, I propose …

WDYT @glemaitre @ogrisel?
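The slope overflow is easy to reproduce standalone. A minimal sketch, with np.diff standing in for the interpolator's internal slope computation and the denormal knots borrowed from the earlier repro:

import numpy as np

# The gap between these x-knots is ~4.4e-314, a subnormal double.
x = np.array([4.1e-320, 4.4e-314])
y = np.array([0.42, 0.44])

with np.errstate(over="ignore"):
    slope = np.diff(y) / np.diff(x)
print(slope)  # [inf]: 0.02 / ~4.4e-314 is ~4.5e311, beyond float64 max (~1.8e308)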
Thanks for your analysis @lucyleeow.

Unfortunately I don't think it would work. Replacing the inf by 0 in the following example would put the prediction outside of the correct range:

>>> from sklearn.isotonic import IsotonicRegression
... import numpy as np
...
... X = np.array([0., 1e-320, 1e-314, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])
... IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316, 1e-313, 1e-10]))
array([0.42, 0.42, inf, inf, 0.44, 0.44])

We need to find a way to either unbreak the interpolation or preprocess the input of the interpolator to prevent this from happening. Here is another example that also fails, but I am not sure why:

>>> from sklearn.isotonic import IsotonicRegression
... import numpy as np
...
... X = np.array([0., 1e-315, 1e-315, 1.])
... y = np.array([0.42, 0.42, 0.44, 0.44])
... IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316, 1e-313, 1e-10]))
array([0.42, inf, inf, inf, 0.43, 0.43])
Here any number between 0 and 1e-315 will be inf. In the first example, any number between 0 and 1e-314 will be inf.

It seems that the issue is that small numbers below and above 0 will remain after

scikit-learn/sklearn/isotonic.py, lines 269 to 270 in fd1ff73

e.g., if … whereas if …
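One plausible reading of the above, sketched with plain numpy rather than the actual private helper: if the duplicated x-knot is collapsed to a single knot with the averaged y, the first segment's slope overflows. This is a hypothetical reconstruction of the deduplication step, for illustration only:

import numpy as np

# X = [0, 1e-315, 1e-315, 1]: collapsing the duplicate x to one knot with the
# averaged y (assumed behavior of the dedup step) gives these knots:
x = np.array([0.0, 1e-315, 1.0])
y = np.array([0.42, (0.42 + 0.44) / 2, 0.44])  # middle y averaged to 0.43

with np.errstate(over="ignore"):
    slopes = np.diff(y) / np.diff(x)
print(slopes)  # [inf, 0.01]: 0.01 / 1e-315 overflows float64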
Actually I realised that changing inf outputs to y_min wouldn't work if the range of … Thoughts @ogrisel?
So I think this is a limitation of the representation of floating point numbers:

…

whereas

…

We could change inf values to …
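For concreteness, the float64 limits involved (a standalone snippet, not from the thread):

import numpy as np

info = np.finfo(np.float64)
print(info.tiny)  # ~2.225e-308, smallest normal positive double
print(info.max)   # ~1.798e+308, largest finite double

# 1e-315 and 1e-320 are subnormal (below tiny); dividing an ordinary
# number by them overflows past max, yielding inf:
with np.errstate(over="ignore"):
    print(np.float64(0.02) / np.float64(1e-315))  # inf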
In the method …
(Edited) Using:

>>> X = np.array([0., 1e-320, 1e-314, 1.])
>>> y = np.array([0.42, 0.42, 0.44, 0.44])
>>> IsotonicRegression().fit(X, y).predict(np.array([0, 1e-321, 1e-317, 1e-316]))
array([0.42, 0.42, inf, inf])

the values in

slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

are: …

Note that if certain conditions are met, …
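Those slope values can be approximated outside the interpolator. A sketch with np.diff, assuming the fitted knots are the four training points:

import numpy as np

X = np.array([0., 1e-320, 1e-314, 1.])
y = np.array([0.42, 0.42, 0.44, 0.44])

with np.errstate(over="ignore"):
    slopes = np.diff(y) / np.diff(X)
print(slopes)  # [0., inf, 0.]: the middle slope (~0.02 / 1e-314) overflows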
The source problem was fixed in #18639. Thanks for your contribution @yashika51!
Reference Issues/PRs
Fixes #10903
What does this implement/fix? Explain your changes.
This fixes the issue of infinite probabilities returned by predict_proba. I have added np.clip() to clip values to the range [0, 1]. The changes are in sklearn/calibration.py.
One test is also added in sklearn/tests/test_isotonic.py.
Any other comments?