Calibration/infinite probability #17758
Changes from all commits: dbd0efb, f2c3c4a, 09f164c, b4b7c3a, 604e7f6, 9405d2d, 1420344, 2527597
```diff
@@ -444,7 +444,8 @@ def predict_proba(self, X):
         proba[np.isnan(proba)] = 1. / n_classes

         # Deal with cases where the predicted probability minimally exceeds 1.0
         proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0

+        np.clip(proba, 0.0, 1.0, proba)

         return proba
```

Review comments on `np.clip(proba, 0.0, 1.0, proba)`:

> Suggested change

> I would even say:
>
> Suggested change
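As a standalone sketch (mine, not part of the PR's code), the in-place `np.clip` call used in the patch maps any value outside `[0, 1]` — tiny floating-point overshoot, negative underflow, or even `inf` — onto the nearest bound:

```python
import numpy as np

# Hypothetical probability row mixing valid values with the kinds of
# out-of-range values discussed in this PR.
proba = np.array([0.25, 1.0 + 1e-12, np.inf, -1e-16])

# Passing the array itself as the third (output) argument makes the
# clipping happen in place, as in the line added by the diff above.
np.clip(proba, 0.0, 1.0, proba)
print(proba)  # inf -> 1.0, 1.0 + 1e-12 -> 1.0, -1e-16 -> 0.0
```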
```diff
@@ -1,3 +1,4 @@
 import warnings
 import numpy as np
 import pickle

@@ -13,7 +14,8 @@
     assert_array_almost_equal,
     assert_warns_message, assert_no_warnings)
 from sklearn.utils import shuffle
 from sklearn.naive_bayes import GaussianNB
 from sklearn.calibration import CalibratedClassifierCV
 from scipy.special import expit

@@ -525,7 +527,6 @@ def test_isotonic_thresholds(increasing):
     # this random data)
     assert X_thresholds.shape[0] < X.shape[0]
     assert np.in1d(X_thresholds, X).all()

     # Output thresholds lie in the range of the training set:
     assert y_thresholds.max() <= y.max()
     assert y_thresholds.min() >= y.min()

@@ -535,3 +536,19 @@ def test_isotonic_thresholds(increasing):
         assert all(np.diff(y_thresholds) >= 0)
     else:
         assert all(np.diff(y_thresholds) <= 0)
```
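For reference, a small sketch (not from the diff) of what the `np.in1d` assertion above checks: it returns a boolean mask marking which elements of its first array occur anywhere in the second array.

```python
import numpy as np

X = np.array([0.0, 0.5, 1.0, 2.0])    # training inputs
X_thresholds = np.array([0.5, 2.0])   # candidate thresholds

# True for each threshold that is also a training point.
mask = np.in1d(X_thresholds, X)
print(mask)  # [ True  True]
assert mask.all()
```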
The new non-regression test:

```python
def test_infinite_probabilities():
    # Test from https://github.com/scikit-learn/scikit-learn/issues/10903
    X_train = np.array([[1.9, 1.18], [1.34, 1.06], [2.22, 6.8],
                        [-1.37, 0.87], [0.12, -2.94]])
    X_test = np.array([[-1.28, 0.23], [1.67, -1.36], [1.82, -2.92]])
    y_train = np.array([1, 0, 1, 1, 0])

    clf_c = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=2)
    clf_fit = clf_c.fit(X_train, y_train)
    y_pred = clf_fit.predict_proba(X_test)[:, 1]
    assert np.all(y_pred >= 0)
    assert np.all(y_pred <= 1)
```

Review discussion on this test:

> Currently this test passes on master.

> That might be possible. I am trying to figure out the values with which the issue can be recreated, but it only shows up with a large dataset.

> See if

> okay

> Hi @Thomas9292, so it seems like the infinity is returned by the classifiers (scikit-learn/sklearn/calibration.py, lines 345 to 347 in f642ff7).

> One way to show this issue is by serializing. Would that be a good idea?

> I am trying to guide us to: "How do we generate a dataset so we can have a non-regression test?" We cannot use a serialized model, because we do not support unserializing between different versions of sklearn. Understanding the underlying issue can help us generate this dataset.

> Yes, this is the case here. Seeing where the bug came from would help us figure out if clipping is the correct solution. After going through the data, it looks like
>
> ```python
> from sklearn.isotonic import IsotonicRegression
> import numpy as np
>
> X = np.array([0., 4.1e-320, 4.4e-314, 1.])
> y = np.array([0.42, 0.42, 0.44, 0.44])
> iso = IsotonicRegression().fit(X, y)
> iso.predict(np.array([0, 2.1e-319, 5.4e-316, 1e-10]))
> # array([0.42, inf, inf, 0.44])
> ```
>
> In this case the

> Thanks for clarifying that @thomasjpfan. I am working on the categorical feature support issue currently. I'll try to resolve this later. Thanks for the reviews!

> My username is @thomasjpfan 😅

> Oops sorry about that :(

> No problem @yashika51, always happy to help and even happier to get credit for @thomasjpfan's work 😉
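A plausible mechanism for the `inf` in the `IsotonicRegression` reproduction above, consistent with the thread (my own sketch, using the thread's values): the distinct x-values are subnormal ("denormal") doubles, so the slope of a linear interpolation segment between them divides by a subnormal denominator and overflows the double range.

```python
import numpy as np

# Two distinct but subnormal x-values from the reproduction above.
x0, x1 = np.float64(4.1e-320), np.float64(4.4e-314)
y0, y1 = 0.42, 0.44

# The interpolation slope is roughly 0.02 / 4.4e-314 ~ 4.5e311, which
# exceeds the largest finite double (~1.8e308) and overflows to inf.
with np.errstate(over='ignore'):
    slope = (y1 - y0) / (x1 - x0)
print(slope)  # -> inf
```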
> We should remove this comment now that it does not apply.