
Calibration/infinite probability #17758


Closed
3 changes: 2 additions & 1 deletion sklearn/calibration.py
@@ -444,7 +444,8 @@ def predict_proba(self, X):
proba[np.isnan(proba)] = 1. / n_classes

# Deal with cases where the predicted probability minimally exceeds 1.0
Member:

We should remove this comment now that it does not apply

proba[(1.0 < proba) & (proba <= 1.0 + 1e-5)] = 1.0

np.clip(proba, 0.0, 1.0, proba)
Member:

Suggested change
np.clip(proba, 0.0, 1.0, proba)
np.clip(proba, 0.0, 1.0, out=proba)

Member:

I would even say:

Suggested change
np.clip(proba, 0.0, 1.0, proba)
np.clip(proba, 0, 1, out=proba)
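
As a quick aside (a standalone sketch, not part of the PR diff): the trailing positional argument in the original call is np.clip's out parameter, so both spellings clip the array in place; the keyword form just makes that explicit.

import numpy as np

# Standalone illustration of in-place clipping; the array values are made up.
proba = np.array([[-0.2, 1.3],
                  [0.4, 0.6]])
np.clip(proba, 0, 1, out=proba)  # writes the clipped values back into proba
print(proba)  # roughly [[0.  1. ], [0.4 0.6]]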


return proba

21 changes: 19 additions & 2 deletions sklearn/tests/test_isotonic.py
@@ -1,3 +1,4 @@

import warnings
import numpy as np
import pickle
@@ -13,7 +14,8 @@
assert_array_almost_equal,
assert_warns_message, assert_no_warnings)
from sklearn.utils import shuffle

from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from scipy.special import expit


@@ -525,7 +527,6 @@ def test_isotonic_thresholds(increasing):
    # this random data)
    assert X_thresholds.shape[0] < X.shape[0]
    assert np.in1d(X_thresholds, X).all()

    # Output thresholds lie in the range of the training set:
    assert y_thresholds.max() <= y.max()
    assert y_thresholds.min() >= y.min()
@@ -535,3 +536,19 @@ def test_isotonic_thresholds(increasing):
        assert all(np.diff(y_thresholds) >= 0)
    else:
        assert all(np.diff(y_thresholds) <= 0)


def test_infinite_probabilities():
Member:

Currently this test passes on master.

Contributor Author:

Is this happening because some elements of np.sum(proba, axis=1) are zero? (In predict_proba)

That might be possible. I am trying to figure out the values with which the issue can be recreated, but it only shows up with a large dataset.
What alternatives can we take?

Member:

See if np.sum(proba, axis=1) is zero for the large dataset and see why this is happening.
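
As a standalone illustration of why that check matters (toy values, not from the PR): if a row of proba sums to zero, dividing by np.sum(proba, axis=1) during normalization produces non-finite entries.

import numpy as np

# Toy example: the first row sums to zero.
proba = np.array([[0.0, 0.0],
                  [0.2, 0.6]])
row_sums = np.sum(proba, axis=1)  # -> array([0. , 0.8])
with np.errstate(divide="ignore", invalid="ignore"):
    normalized = proba / row_sums[:, np.newaxis]
print(normalized)  # first row becomes [nan, nan]; a nonzero numerator over 0 would give inf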

Contributor Author:

okay

Contributor Author (@yashika51, Jul 27, 2020):

Hi @Thomas9292, so it seems like the infinity is returned by the classifiers

for calibrated_classifier in self.calibrated_classifiers_:
    proba = calibrated_classifier.predict_proba(X)
    mean_proba += proba

One way to show this issue is by serializing the model. Would that be a good idea?
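
A hedged debugging sketch of that idea (not from the PR; clf and X_test are assumptions standing in for a fitted CalibratedClassifierCV and some held-out data): loop over the fitted per-fold calibrators and flag any that emit non-finite probabilities.

import numpy as np

# Hypothetical inspection loop; `clf` and `X_test` are illustrative names.
for i, calibrated_classifier in enumerate(clf.calibrated_classifiers_):
    proba = calibrated_classifier.predict_proba(X_test)
    bad_rows = ~np.isfinite(proba).all(axis=1)
    if bad_rows.any():
        print(f"calibrator {i} produced non-finite probabilities:")
        print(proba[bad_rows])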

Member (@thomasjpfan, Jul 28, 2020):

I am trying to guide us to: "How do we generate a dataset so we can have a non-regression test?" We cannot use a serialized model, because we do not support deserializing between different versions of sklearn. Understanding the underlying issue can help us generate this dataset.

If the goal is to nail down exactly where this problem first arises, then the alternative is a fair amount of archaeology: digging through the implementation to discern the ultimate source of these infinities.

Yes, this is the case here. Seeing where the bug came from would help us figure out whether clipping is the correct solution.

After going through the data, it looks like np.inf comes from IsotonicRegression:

from sklearn.isotonic import IsotonicRegression
import numpy as np

X = np.array([0., 4.1e-320, 4.4e-314, 1.])
y = np.array([0.42, 0.42, 0.44, 0.44])

iso = IsotonicRegression().fit(X, y)

iso.predict(np.array([0, 2.1e-319, 5.4e-316, 1e-10]))
# array([0.42,  inf,  inf, 0.44])

In this case the np.clip from this PR would map the middle two values to 1, which would be incorrect.
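
To make the point concrete, a small sketch (values chosen for illustration, not from the PR): with the same y but well-scaled X, the interpolated predictions stay between 0.42 and 0.44, which is roughly where the non-finite values above should have landed, nowhere near 1.0.

from sklearn.isotonic import IsotonicRegression
import numpy as np

# Same y as above, but X values far from the denormal range.
X = np.array([0.0, 1e-3, 1e-2, 1.0])
y = np.array([0.42, 0.42, 0.44, 0.44])

iso = IsotonicRegression().fit(X, y)
iso.predict(np.array([0.0, 5e-3, 0.5, 1.0]))
# roughly array([0.42, 0.429, 0.44, 0.44]); every value stays within [0.42, 0.44]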

Contributor Author (@yashika51, Jul 29, 2020):

Thanks for clarifying that, @thomasjpfan. I am currently working on the categorical feature support issue, so I'll try to resolve this later. Thanks for the reviews!

Member:

My username is @thomasjpfan 😅

Contributor Author:

Oops sorry about that :(

Contributor:

No problem @yashika51, always happy to help and even happier to get credit for @thomasjpfan's work 😉

    # Test from https://github.com/scikit-learn/scikit-learn/issues/10903

    X_train = np.array([[1.9, 1.18], [1.34, 1.06], [2.22, 6.8],
                        [-1.37, 0.87], [0.12, -2.94]])
    X_test = np.array([[-1.28, 0.23], [1.67, -1.36], [1.82, -2.92]])
    y_train = np.array([1, 0, 1, 1, 0])

    clf_c = CalibratedClassifierCV(GaussianNB(), method='isotonic', cv=2)
    clf_fit = clf_c.fit(X_train, y_train)
    y_pred = clf_fit.predict_proba(X_test)[:, 1]
    assert np.all(y_pred >= 0)
    assert np.all(y_pred <= 1)