FIX CalibratedClassifierCV with sigmoid and large confidence scores #26913
Conversation
What is the root cause of the lbfgs failure? I would like to understand that better before introducing a fallback solver (use …).
Will using …?
Yes, I think so (maybe play with the verbosity level).
@lorentzenchr I don't think we are using lbfgs here, but simply bfgs. Also, I don't think there is any verbose option in scipy.optimize.minimize. This is how it is done in the example: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.minimize.html
This shows that `disp: True` is the option for getting verbose output.
@OmarManzoor You're right. What I meant was `scipy.optimize.minimize(func, x0, method="L-BFGS-B", options={"iprint": 101})`. This only works for lbfgs, but I would switch from bfgs to lbfgs anyway.
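For reference, a minimal standalone sketch (my own, with a toy quadratic objective rather than the calibration code) of what such a verbose L-BFGS-B call looks like; the `iprint` option follows the SciPy versions discussed in this thread:

```python
import numpy as np
from scipy.optimize import minimize

def func(x):
    # simple quadratic stand-in for the Platt objective
    return np.sum((x - 3.0) ** 2)

def grad(x):
    return 2.0 * (x - 3.0)

res = minimize(func, np.zeros(2), jac=grad, method="L-BFGS-B",
               options={"iprint": 101})
print(res.x, res.status, res.message)
```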
@lorentzenchr For simplicity I used the public function to keep it consistent with the file:

```python
AB0 = np.array([0.0, log((prior0 + 1.0) / (prior1 + 1.0))])
AB_ = fmin_l_bfgs_b(objective, AB0, fprime=grad, iprint=101)
return AB_[0], AB_[1]
```

Here is the output for this script:
It seems something is going on there. The ValueError in the output is a good hint.
Okay, I fixed it. It was a simple matter of picking the first element of the result:

```python
AB_ = fmin_l_bfgs_b(objective, AB0, fprime=grad, iprint=101)
return AB_[0][0], AB_[0][1]
```

Here is the output:
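For context, `fmin_l_bfgs_b` returns a tuple `(x, f, d)`, which is why the extra `[0]` is needed. A tiny standalone sketch (my own, toy objective) illustrating the return value:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def objective(x):
    # returns (value, gradient), since no fprime is given and approx_grad is False
    return float(np.sum((x - 1.0) ** 2)), 2.0 * (x - 1.0)

x_opt, f_opt, info = fmin_l_bfgs_b(objective, x0=np.zeros(2))
print(x_opt)             # close to [1., 1.]
print(info["warnflag"])  # 0 means convergence
```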
@OmarManzoor @lorentzenchr Thanks for looking into this. Since the original submission of the issue, I took another look at what's going on in the optimization. The failure comes from within https://github.com/scipy/scipy/blob/v1.11.1/scipy/optimize/_linesearch.py#L539. Increasing `maxiter` in scipy's `_zoom` line-search helper (from 10 to 100) is what the debug patch below tries:

```python
from scipy.optimize._linesearch import _cubicmin, _quadmin

def _zoom_debug(a_lo, a_hi, phi_lo, phi_hi, derphi_lo,
                phi, derphi, phi0, derphi0, c1, c2, extra_condition):
    """Zoom stage of approximate linesearch satisfying strong Wolfe conditions.

    Part of the optimization algorithm in `scalar_search_wolfe2`.

    Notes
    -----
    Implements Algorithm 3.6 (zoom) in Wright and Nocedal,
    'Numerical Optimization', 1999, pp. 61.
    """
    maxiter = 100  # DEBUG: Increase from 10 to 100
    i = 0
    delta1 = 0.2  # cubic interpolant check
    delta2 = 0.1  # quadratic interpolant check
    phi_rec = phi0
    a_rec = 0
    while True:
        # interpolate to find a trial step length between a_lo and
        # a_hi Need to choose interpolation here. Use cubic
        # interpolation and then if the result is within delta *
        # dalpha or outside of the interval bounded by a_lo or a_hi
        # then use quadratic interpolation, if the result is still too
        # close, then use bisection
        dalpha = a_hi - a_lo
        if dalpha < 0:
            a, b = a_hi, a_lo
        else:
            a, b = a_lo, a_hi

        # minimizer of cubic interpolant
        # (uses phi_lo, derphi_lo, phi_hi, and the most recent value of phi)
        #
        # if the result is too close to the end points (or out of the
        # interval), then use quadratic interpolation with phi_lo,
        # derphi_lo and phi_hi if the result is still too close to the
        # end points (or out of the interval) then use bisection
        if (i > 0):
            cchk = delta1 * dalpha
            a_j = _cubicmin(a_lo, phi_lo, derphi_lo, a_hi, phi_hi,
                            a_rec, phi_rec)
        if (i == 0) or (a_j is None) or (a_j > b - cchk) or (a_j < a + cchk):
            qchk = delta2 * dalpha
            a_j = _quadmin(a_lo, phi_lo, derphi_lo, a_hi, phi_hi)
            if (a_j is None) or (a_j > b - qchk) or (a_j < a + qchk):
                a_j = a_lo + 0.5 * dalpha

        # Check new value of a_j
        phi_aj = phi(a_j)
        if (phi_aj > phi0 + c1 * a_j * derphi0) or (phi_aj >= phi_lo):
            phi_rec = phi_hi
            a_rec = a_hi
            a_hi = a_j
            phi_hi = phi_aj
        else:
            derphi_aj = derphi(a_j)
            if abs(derphi_aj) <= -c2 * derphi0 and extra_condition(a_j, phi_aj):
                a_star = a_j
                val_star = phi_aj
                valprime_star = derphi_aj
                break
            if derphi_aj * (a_hi - a_lo) >= 0:
                phi_rec = phi_hi
                a_rec = a_hi
                a_hi = a_lo
                phi_hi = phi_lo
            else:
                phi_rec = phi_lo
                a_rec = a_lo
            a_lo = a_j
            phi_lo = phi_aj
            derphi_lo = derphi_aj
        i += 1
        if (i > maxiter):
            # Failed to find a conforming step size
            print('Failed to find a conforming step size')  # DEBUG: maxiter exceeded
            a_star = None
            val_star = None
            valprime_star = None
            break

    return a_star, val_star, valprime_star

import scipy
scipy.optimize._linesearch._zoom = _zoom_debug

model = CalibratedClassifierCV(SGDClassifier(loss='squared_hinge', random_state=42))
print('Logistic calibration: ', cross_val_score(model, X_train, y_train, scoring='roc_auc').mean())
```

Expected output:
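As a side note on the root cause: here is a rough standalone illustration (my own, with made-up scores of the magnitudes reported elsewhere in this thread) of how badly scaled the problem becomes. The gradient with respect to the slope A is weighted by F itself, so with scores around 1e13 the two parameters differ by roughly thirteen orders of magnitude in sensitivity, which is exactly the kind of ill-conditioning that makes the Wolfe line search fail:

```python
import numpy as np
from scipy.special import expit, xlogy

# made-up scores with magnitudes similar to the ones reported in this thread
F = np.array([-7.2e6, -7.2e6, 2.5e13, 2.5e13])
T = np.array([0.01, 0.01, 0.99, 0.99])  # Platt-style smoothed targets

def objective(AB):
    P = expit(-(AB[0] * F + AB[1]))
    return -(xlogy(T, P) + xlogy(1 - T, 1 - P)).sum()

def grad(AB):
    P = expit(-(AB[0] * F + AB[1]))
    return np.array([np.dot(T - P, F), np.sum(T - P)])

print(grad(np.array([0.0, 0.0])))  # dA is ~1e13 while dB is ~0
```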
Please have a look at how the inner maxiter is set for lbfgs in scikit-learn/sklearn/linear_model/_glm/glm.py (line 265 at 594475a).
Maybe that solves the issue.
Is there anything to do in this PR then? If not, maybe we can close it.
Yes. The added test is valuable. And I would like to see the current solver changed to the GLM way of doing it (copy & paste), hoping that the tests pass.
@lorentzenchr I tried using this. All the tests pass aside from the one that we added in this PR. Even with this we get a cross-val score of 0.5.

```python
opt_res = minimize(
    objective,
    AB0,
    method="L-BFGS-B",
    jac=grad,
    options={
        "maxiter": 100,
        "maxls": 100,
        "gtol": 1e-6,
        # The constant 64 was found empirically to pass the test suite.
        # The point is that ftol is very small, but a bit larger than
        # machine precision for float64, which is the dtype used by lbfgs.
        "ftol": 64 * np.finfo(float).eps,
    },
)
AB_ = opt_res.x
return AB_[0], AB_[1]
```
@OmarManzoor Could you try replacing the objective and grad by the ones from the `_loss` module, e.g.:

```python
def loss_grad(AB):
    l, g = loss.loss_gradient(y_true=T, raw_prediction=(AB[0] * F + AB[1]))
    grad = [F @ g, g.sum()]
    return l.sum(), grad

opt_res = minimize(
    loss_grad,
    AB0,
    method="L-BFGS-B",
    jac=True,
    options={"iprint": 1},  # "gtol": tol, "maxiter": max_iter
)
AB_ = opt_res.x
```

It might help to compare to …
Code:

```python
loss = HalfBinomialLoss()

def loss_grad(AB):
    l, g = loss.loss_gradient(y_true=T, raw_prediction=-(AB[0] * F + AB[1]))
    grad = np.array([F @ g, g.sum()])
    return l.sum(), grad

opt_res = minimize(
    loss_grad,
    AB0,
    method="L-BFGS-B",
    jac=True,
    options={
        "iprint": 1,
        "maxiter": 15000,
        "maxls": 100,
        "gtol": 1e-6,
        "ftol": 64 * np.finfo(float).eps,
    },
)
AB_ = opt_res.x
return AB_[0], AB_[1]
```

Output of the test: output.txt
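One thing worth double-checking in the snippet above (my own observation, not raised in the thread): with the negated `raw_prediction = -(AB[0] * F + AB[1])`, the chain rule flips the sign of the gradient, so the jacobian of the summed loss would be `[-(F @ g), -g.sum()]` rather than `[F @ g, g.sum()]`. A small `check_grad` sketch under that assumption:

```python
import numpy as np
from scipy.optimize import check_grad
from sklearn._loss import HalfBinomialLoss

rng = np.random.default_rng(0)
F = rng.normal(size=20)
T = rng.uniform(0.01, 0.99, size=20)  # fractional targets, as in Platt scaling
loss = HalfBinomialLoss()

def f(AB):
    l, _ = loss.loss_gradient(y_true=T, raw_prediction=-(AB[0] * F + AB[1]))
    return l.sum()

def g_as_written(AB):
    _, g = loss.loss_gradient(y_true=T, raw_prediction=-(AB[0] * F + AB[1]))
    return np.array([F @ g, g.sum()])

def g_chain_rule(AB):
    _, g = loss.loss_gradient(y_true=T, raw_prediction=-(AB[0] * F + AB[1]))
    return np.array([-(F @ g), -g.sum()])

AB = np.array([0.5, -0.2])
print(check_grad(f, g_as_written, AB))   # large discrepancy
print(check_grad(f, g_chain_rule, AB))   # tiny discrepancy
```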
@OmarManzoor Thanks for testing. The idea of using the `_loss` module was … What happens if we fit a `LogisticRegression` …? @ogrisel pointed out that we might also supply the Hessian, as it is very easy to calculate. This corresponds to … Another idea, already mentioned somewhere, is to re-scale `F`. After that I run out of ideas.
I tried this:

```python
lg = LogisticRegression(solver="newton-cholesky", penalty=None).fit(T[:, None], F[:, None])
```

This is the error we get:

```
ValueError: Unknown label type: continuous. Maybe you are trying to fit a classifier,
which expects discrete classes on a regression target with continuous values.
```

This is a preview of the T array:

```
[0.99090909 0.99090909 0.99090909 0.99090909 0.99090909 0.99090909
 0.99090909 0.99090909 0.99090909 0.99090909 ...]
```

This is a preview of the F array:

```
[3.65029719e+13 3.65029719e+13 3.65029719e+13 3.65029719e+13
 3.65029719e+13 3.65029719e+13 3.65029719e+13 3.65029719e+13
 3.65029719e+13 3.65029719e+13 ...]
```
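The `ValueError` comes from passing the continuous scores as the classification target; presumably the intent (my assumption, not the author's exact code) was to use `F` as the single feature and the binary labels `y` as the target, with `F` and `y` as defined in `_sigmoid_calibration`:

```python
from sklearn.linear_model import LogisticRegression

# F: confidence scores from the fold, y: the corresponding binary labels
lg = LogisticRegression(solver="newton-cholesky", penalty=None).fit(F[:, None], y)
print(lg.intercept_, lg.coef_)
```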
@lorentzenchr @OmarManzoor @ogrisel The way I see it, if we are to keep the quasi-Newton approach (instead of a derivative-free approach), the only robust fix would be to scale the confidence scores if they are too large. We can scale only if the scores are really large (e.g., larger in absolute value than the threshold `M0` below), so that ordinary inputs are left untouched. For example, the following code should work (feel free to adjust according to the sklearn coding standard):

```python
def _sigmoid_calibration(predictions, y, sample_weight=None):
    """Probability Calibration with sigmoid method (Platt 2000)

    Parameters
    ----------
    predictions : ndarray of shape (n_samples,)
        The decision function or predict proba for the samples.

    y : ndarray of shape (n_samples,)
        The targets.

    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights. If None, then samples are equally weighted.

    Returns
    -------
    a : float
        The slope.

    b : float
        The intercept.

    References
    ----------
    Platt, "Probabilistic Outputs for Support Vector Machines"
    """
    predictions = column_or_1d(predictions)
    y = column_or_1d(y)

    F = predictions  # F follows Platt's notations

    ### New code begins
    M0 = 1e7  # or some suitably chosen threshold
    M = np.amax(np.abs(F))
    if M > M0:
        F /= M
    ### New code ends

    # Bayesian priors (see Platt end of section 2.2):
    # It corresponds to the number of samples, taking into account the
    # `sample_weight`.
    mask_negative_samples = y <= 0
    if sample_weight is not None:
        prior0 = (sample_weight[mask_negative_samples]).sum()
        prior1 = (sample_weight[~mask_negative_samples]).sum()
    else:
        prior0 = float(np.sum(mask_negative_samples))
        prior1 = y.shape[0] - prior0
    T = np.zeros_like(y, dtype=np.float64)
    T[y > 0] = (prior1 + 1.0) / (prior1 + 2.0)
    T[y <= 0] = 1.0 / (prior0 + 2.0)
    T1 = 1.0 - T

    def objective(AB):
        # From Platt (beginning of Section 2.2)
        P = expit(-(AB[0] * F + AB[1]))
        loss = -(xlogy(T, P) + xlogy(T1, 1.0 - P))
        if sample_weight is not None:
            return (sample_weight * loss).sum()
        else:
            return loss.sum()

    def grad(AB):
        # gradient of the objective function
        P = expit(-(AB[0] * F + AB[1]))
        TEP_minus_T1P = T - P
        if sample_weight is not None:
            TEP_minus_T1P *= sample_weight
        dA = np.dot(TEP_minus_T1P, F)
        dB = np.sum(TEP_minus_T1P)
        return np.array([dA, dB])

    AB0 = np.array([0.0, log((prior0 + 1.0) / (prior1 + 1.0))])
    AB_ = fmin_bfgs(objective, AB0, fprime=grad, disp=False)
    ### Changed code begins
    return AB_[0] / (M if M > M0 else 1), AB_[1]
    ### Changed code ends
```
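A quick numerical check (my own, with illustrative parameter values) of why dividing the fitted slope by `M` undoes the rescaling: a sigmoid fitted on `F / M` with slope `A` is the same function as a sigmoid on the original `F` with slope `A / M`:

```python
import numpy as np
from scipy.special import expit

rng = np.random.default_rng(0)
F = 1e13 * rng.normal(size=5)        # original, unscaled scores
M = np.amax(np.abs(F))
A, B = -2.5, 0.3                     # pretend these were fitted on F / M

p_scaled = expit(-(A * (F / M) + B))       # probabilities from the rescaled fit
p_original = expit(-((A / M) * F + B))     # same sigmoid expressed on the raw scores
print(np.allclose(p_scaled, p_original))   # True
```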
Thank you for sharing. @lorentzenchr What do you suggest?
Let's try the scaling approach and fix the threshold later. Additionally, I'd like to see the loss replaced by the existing `_loss` module, as in the above snippets.
@lorentzenchr It seems to work now. The test passes when using re-scaling.
BTW, continuous targets are supported by the private `_GeneralizedLinearRegressor`:

```python
from sklearn._loss import HalfBinomialLoss
from sklearn.linear_model._glm import _GeneralizedLinearRegressor

class BinomialRegressor(_GeneralizedLinearRegressor):
    def _get_loss(self):
        return HalfBinomialLoss()
```

See https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/linear_model/_glm/tests/test_glm.py.
@lorentzenchr Could you kindly have a look and see if this looks okay? Then we can integrate the `_loss` module methods to replace the current obj and grad functions.
@OmarManzoor Can you mark the resolved comments as resolved, and maybe also add a 👍 in the thread?
Thanks for the PR, I like the minimally invasive scaling solution. I think the tests could be improved as suggested below:
Has anyone tried to see if #26913 (comment) has a similar numerical stability problem in the end? If so, we need to make sure that the error message is informative enough, e.g. to tell the user to scale their data or increase the penalization. But this should be done in a dedicated PR if needed (not to delay the merge of this fix).
```python
import numpy as np

from sklearn._loss import HalfBinomialLoss
from sklearn.linear_model._glm import _GeneralizedLinearRegressor
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score


class BinomialRegressor(_GeneralizedLinearRegressor):
    def _get_loss(self):
        return HalfBinomialLoss()


r = 0.67
N = 1000
y_train = np.array([1] * int(N * r) + [0] * (N - int(N * r)))
X_train = 1e5 * y_train.reshape((-1, 1)) + np.random.default_rng(42).normal(size=N)

sgd = SGDClassifier(loss='squared_hinge', random_state=42).fit(X_train, y_train)
F = sgd.decision_function(X_train)
print(f"{F.min()=}, {F.max()=}")  # (-7221324.4004922705, 25003110945558.523)

glm = BinomialRegressor(alpha=0)
glm.fit(F[:, None], y_train)
print(f"{glm.intercept_=} {glm.coef_}")  # 0.7081850579244856, 3.47115448e-06
```

No errors, no warnings.
```python
scale_constant = max_prediction
# We rescale the features in a copy: inplace rescaling could confuse
# the caller and make the code harder to reason about.
F = F / scale_constant
```
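A tiny standalone demonstration (my own) of the difference this comment is about: `F = F / c` binds `F` to a new array and leaves the caller's `predictions` untouched, whereas the earlier `F /= c` wrote into the caller's buffer:

```python
import numpy as np

predictions = np.array([1e8, 2e8, 3e8])

F = predictions
F = F / 1e8           # new array, caller unaffected
print(predictions)    # [1.e+08 2.e+08 3.e+08]

F = predictions
F /= 1e8              # in-place, caller's array is rescaled too
print(predictions)    # [1. 2. 3.]
```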
Nice 👍 I did not realize that the way I had done it actually modified predictions too. Thanks for catching this! I guess this was the main reason why I was getting wrong results.
LGTM (assuming green CI)!
LGTM
@OmarManzoor Are you interested in a follow-up PR replacing obj and grad by our `_loss`?
@lorentzenchr Yes, sure.
Reference Issues/PRs
Fixes: #26766
What does this implement/fix? Explain your changes.
Any other comments?
CC: @ogrisel Could you kindly check if this looks suitable?