FEA add temperature scaling to CalibratedClassifierCV
#31068
base: main
Conversation
A follow-up to my comment on the Array API: I don't think we can support the Array API here, as scipy.optimize.minimize does not appear to support it.
If I missed anything, please let me know; I'd be happy to investigate further.
sklearn/tests/test_calibration.py
Outdated
@@ -401,6 +413,44 @@ def test_sigmoid_calibration():
        _SigmoidCalibration().fit(np.vstack((exF, exF)), exY)


    def test_temperature_scaling(data):
This test verifies that temperature scaling does not affect accuracy and that the optimised temperature is always positive.
I also noticed that the Brier score may improve or worsen depending on the dataset and the classifier being calibrated. Therefore, I did not include temperature scaling in the test_calibration function.
This seems to align with the remark made on page 3245 of "Classifier calibration: a survey on how to assess and improve predicted class probabilities".
If there are any test cases I should add, please let me know; I'd be happy to include them.
I think (single-parameter) temperature scaling provably improves the Brier score (or log-loss) by reducing only its calibration error term, without changing the refinement error term, on datasets/tasks with very specific structure, e.g. a balanced binary classification problem where the class means are symmetric around zero and the covariances are identical, as explained in section 5 of https://arxiv.org/html/2501.19195v1#S5.
We could write a test based on this analysis.
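A rough sketch of what such a test could look like, assuming the method="temperature" option proposed in this PR; the dataset construction, regularization strength, and test name are illustrative, not the PR's code:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split

    def test_temperature_scaling_improves_brier_on_symmetric_gaussians():
        # Balanced binary problem: class means symmetric around zero, identical covariances.
        rng = np.random.RandomState(42)
        n_per_class = 2000
        mean = np.array([1.0, 1.0])
        X = np.vstack(
            [
                rng.multivariate_normal(mean, np.eye(2), n_per_class),
                rng.multivariate_normal(-mean, np.eye(2), n_per_class),
            ]
        )
        y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # Strong regularization makes the base model under-confident, hence miscalibrated.
        clf = LogisticRegression(C=1e-3).fit(X_train, y_train)
        calibrated = CalibratedClassifierCV(
            LogisticRegression(C=1e-3), method="temperature", cv=5
        ).fit(X_train, y_train)

        bs_uncalibrated = brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1])
        bs_calibrated = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
        assert bs_calibrated <= bs_uncalibrated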
Thanks for the PR. Here is a first pass of feedback:
sklearn/calibration.py
Outdated
        negative_log_likelihood,
        np.array([beta_0]),
        args=(logits, labels, max_logits),
        method="L-BFGS-B",
In cases where there is a single-element parameter array, it is probably much more efficient to use a dedicated scalar optimizer as discussed in #28574 (comment) and subsequent comments.
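For illustration, a minimal sketch of what a dedicated scalar optimizer could look like here; the nll helper, its signature, and the bounds are assumptions rather than the PR's code, and logits/labels are assumed to be an (n_samples, n_classes) array and integer class indices:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import logsumexp

    def nll(beta, logits, labels):
        # Negative log-likelihood of the temperature-scaled softmax probabilities.
        scaled = beta * logits
        log_probs = scaled - logsumexp(scaled, axis=1, keepdims=True)
        return -log_probs[np.arange(labels.shape[0]), labels].sum()

    res = minimize_scalar(
        nll, bounds=(0.1, 10.0), args=(logits, labels), method="bounded"
    )
    beta_hat = res.x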
But this would deserve some benchmarking to confirm.
I ended up choosing minimize over minimize_scalar, even though it's more expensive (e.g., it requires computing the gradient).
This is because minimize_scalar doesn't allow us to provide an initial guess (beta_0 = 1.0) when optimising the inverse temperature beta using method = "Bounded". Even for method = "Brent", it expects the initial guess xb to be a local minimum between two other points, i.e. func(xb) < func(xa) and func(xb) < func(xc), which is impossible to determine beforehand when fitting the calibrator.
Why not pass bracket=(.5, 2) or bracket=(0.1, 10.) since we work on a multiplicative scale around the inverse temperature beta_0=1.0?
EDIT: thinking about this, we could re-parametrize as theta = log(beta) (hence beta = exp(theta)) and set bracket=(-1, 1): this way bisections / additive increments in theta space would naturally map to multiplicative updates in the beta / temperature space, which can probably lead to fewer iterations in the solver.
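A minimal sketch of this re-parametrization, assuming a scalar helper nll(beta, logits, labels) that returns the negative log-likelihood (a hypothetical name, not the PR's function):

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Optimize theta = log(beta) so that additive steps in theta correspond to
    # multiplicative updates of the inverse temperature beta.
    res = minimize_scalar(
        lambda theta: nll(np.exp(theta), logits, labels),
        bracket=(-1.0, 1.0),
        method="brent",
    )
    beta_hat = float(np.exp(res.x))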
I think we should add a verbose attribute to CalibratedClassifierCV to display convergence information (and number of iterations) of the underlying solvers.
sklearn/calibration.py
Outdated
    Parameters
    ----------
    predictions : ndarray of shape (n_samples,) or (n_samples, n_classes)
        The decision function or predict proba for the samples.
Let's be explicit if we expect one or the other and update the code (e.g. variable names) accordingly.
Here I have the impression that this code always expects logits (that is, the output of decision_function and never the output of predict_proba). If it were the latter, we would need to take the log of it (plus an epsilon to avoid nans).
This info is originally present in the method_name variable of:
scikit-learn/sklearn/calibration.py
Lines 440 to 443 in af41352

    method_name = _check_response_method(
        this_estimator,
        ["decision_function", "predict_proba"],
    ).__name__
I think _fit_calibrator should propagate this info to the calibrator to avoid any ambiguity.
Maybe we should also undo pre-processing such as:
scikit-learn/sklearn/calibration.py
Lines 455 to 463 in af41352

    if method_name == "predict_proba":
        # Select the probability column of the positive class
        predictions = _process_predict_proba(
            y_pred=predictions,
            target_type="binary",
            classes=self.classes_,
            pos_label=self.classes_[1],
        )
        predictions = predictions.reshape(-1, 1)
and let the calibrators do what they find most suitable with either kind of input.
Maybe the calibrators could expose explicit tags to make it explicit whether they can handle multiclass natively, or whether the caller should be in charge of reducing a multiclass calibration problem into binary calibration sub-problems with the OvR hack we currently implement for isotonic calibration and Platt scaling.
EDIT: the existing calibrator.__sklearn_tags__().input_tags might be enough to express this:

    tags = calibrator.__sklearn_tags__()
    supports_multiclass_logits = tags.input_tags.two_d_array and not tags.input_tags.one_d_array

as we have the following for IsotonicRegression:
scikit-learn/sklearn/isotonic.py
Lines 513 to 517 in af41352

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.one_d_array = True
        tags.input_tags.two_d_array = False
        return tags
The _SigmoidCalibration class can be updated accordingly.
Alternatively, we can rename the ambiguous "predictions" variable everywhere to "logits" and take the log of predict_proba plus eps as early as possible.
But this would introduce a behavior change.
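For illustration, the early conversion could look roughly like this; the variable names and the choice of eps are assumptions, not the PR's code:

    import numpy as np

    # `proba` is the (n_samples, n_classes) output of predict_proba; offset by a
    # small eps before taking the log to avoid -inf for zero probabilities.
    eps = np.finfo(proba.dtype).eps
    logits = np.log(proba + eps)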
        return beta_minimizer.x[0]


    class _SigmoidCalibration(RegressorMixin, BaseEstimator):
        """Sigmoid regression model.
We should override __sklearn_tags__ to set the same input tags as IsotonicRegression:

scikit-learn/sklearn/isotonic.py
Lines 513 to 517 in af41352

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.one_d_array = True
        tags.input_tags.two_d_array = False
        return tags
See:
scikit-learn/sklearn/utils/_tags.py
Lines 20 to 25 in af41352

    one_d_array : bool, default=False
        Whether the input can be a 1D array.
    two_d_array : bool, default=True
        Whether the input can be a 2D array. Note that most common
        tests currently run only if this flag is set to ``True``.
…fier`. Updated constructor of `_TemperatureScaling` class. Updated `test_temperature_scaling` in `test_calibration.py`. Added `__sklearn_tags__` to `_TemperatureScaling` class.
…Updated doc-strings of temperature scaling in `calibration.py`. Updated formatting.
I'm still working on addressing the feedback, but I also wanted to share some findings related to it and provide an update.
A few computational things seem off.
Update `minimize` in `_temperature_scaling` to `minimize_scalar`. Update `test_calibration.py` to check that the optimised inverse temperature is between 0.1 and 10.
There are some CI failures; I'll fix those shortly.
Also considering adding a verbose parameter to CalibratedClassifierCV to optionally display convergence info when optimising the inverse temperature beta.
        return l.sum()

    beta_minimizer = minimize_scalar(beta_loss, bounds=(0.1, 10.0), method="bounded")
Why not pass bracket=(.5, 2) or bracket=(0.1, 10.) since we work on a multiplicative scale around inverse temperature beta_0=1.0?

I think this might work: I've set bounds=(0.1, 10.0) and used the test_temperature_scaling test function to confirm that the optimised beta stays within that range.
That said, I do wonder if users might ask for a way to set a custom initial guess beta_0 when performing temperature scaling.
    # Ensure raw_prediction has the same dtype as labels using .astype().
    # Without this, dtype promotion rules differ across NumPy versions:
    #
    #   beta = np.float64(0)
    #   logits = np.array([1, 2], dtype=np.float32)
    #
    #   result = beta * logits
    #   - NumPy < 2: result.dtype is float32
    #   - NumPy 2+: result.dtype is float64
    #
    # This can cause dtype mismatch errors downstream (e.g., buffer dtype).
    raw_prediction = xp.astype(beta * logits, dtype_)
And the logits should already have a suitable dtype.

I had to explicitly set the dtype here; otherwise scipy.optimize.minimize_scalar fails when (beta * logits).dtype == np.float32.
This issue is caught by our own test_float32_predict_proba.
I noticed the sigmoid calibration code handles the same issue, so I reused its fix and comment for consistency.
Happy to adjust if I missed anything!
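A minimal repro of the promotion difference described in the code comment above (the values are the ones from that comment; the NumPy < 2 vs NumPy 2+ behaviour follows value-based casting vs NEP 50):

    import numpy as np

    beta = np.float64(0)
    logits = np.array([1, 2], dtype=np.float32)
    result = beta * logits
    # NumPy < 2: value-based casting keeps float32; NumPy >= 2: promotes to float64.
    print(result.dtype)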
…id `method` parameter.
Reference Issues/PRs
Closes #28574
What does this implement/fix? Explain your changes.
This PR adds temperature scaling to scikit-learn's CalibratedClassifierCV.
Temperature scaling can be enabled by setting method = "temperature" in CalibratedClassifierCV.
This method supports both binary and multi-class classification.
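A small usage sketch of the proposed option; the dataset and base estimator are chosen arbitrarily for illustration:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(
        n_samples=1000, n_classes=3, n_informative=6, random_state=0
    )
    calibrated = CalibratedClassifierCV(GaussianNB(), method="temperature", cv=5)
    calibrated.fit(X, y)
    proba = calibrated.predict_proba(X)  # temperature-scaled probabilities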
Any other comments?
Cc @adrinjalali, @lorentzenchr in advance.