
FEA add temperature scaling to CalibratedClassifierCV #31068


Open
wants to merge 19 commits into main

Conversation

@virchan virchan (Member) commented Mar 25, 2025

Reference Issues/PRs

Closes #28574

What does this implement/fix? Explain your changes.

This PR adds temperature scaling to scikit-learn's CalibratedClassifierCV:

Temperature scaling can be enabled by setting method = "temperature" in CalibratedClassifierCV:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.frozen import FrozenEstimator
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(random_state=42)

X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=42)

clf = LinearSVC(random_state=42)
clf.fit(X_train, y_train)

# Fit only the temperature calibrator on the held-out calibration split.
cal_clf = CalibratedClassifierCV(FrozenEstimator(clf), method="temperature")
cal_clf.fit(X_calib, y_calib)

This method supports both binary and multi-class classification.
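
For reference, temperature scaling rescales the logits by a single learned inverse temperature beta = 1 / T before applying the softmax. A minimal NumPy sketch of this transformation (values are illustrative, not taken from the implementation):

import numpy as np
from scipy.special import softmax

# logits: decision_function output, shape (n_samples, n_classes).
logits = np.array([[2.0, 0.5, -1.0], [0.2, 0.1, 0.0]])
beta = 0.5  # learned inverse temperature, i.e. temperature T = 1 / beta = 2

# beta < 1 softens over-confident probabilities; beta > 1 sharpens them.
calibrated_proba = softmax(beta * logits, axis=1)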

Any other comments?

Cc @adrinjalali, @lorentzenchr in advance.

github-actions bot commented Mar 25, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 1a9e307.

@virchan virchan (Member Author) left a comment

A follow-up to my comment on the Array API: I don't think we can support the Array API here, as scipy.optimize.minimize does not appear to support it.

If I missed anything, please let me know—I'd be happy to investigate further.

@@ -401,6 +413,44 @@ def test_sigmoid_calibration():
_SigmoidCalibration().fit(np.vstack((exF, exF)), exY)


def test_temperature_scaling(data):
Member Author

This test verifies that temperature scaling does not affect accuracy and that the optimised temperature is always positive.

I also noticed that the Brier score may improve or worsen depending on the dataset and the classifier being calibrated. Therefore, I did not include temperature scaling in the test_calibration function.

This seems to align with the remark made on page 3245 of "Classifier calibration: a survey on how to assess and improve predicted class probabilities".

If there are any test cases I should add, please let me know—I’d be happy to include them.

Member

I think (single-parameter) temperature scaling provably improves the Brier score (or log loss) by reducing its calibration error term without changing the refinement error term, but only on datasets/tasks with very specific structure, e.g. a balanced binary classification problem where the class means are symmetric around zero and the covariances are identical, as explained in section 5 of https://arxiv.org/html/2501.19195v1#S5.

We could write a test based on this analysis.
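
A rough sketch of such a test, under the assumptions of that analysis (balanced classes, class means symmetric around zero, identical covariances); the deliberately over-confident base model, the FrozenEstimator usage, and the tolerance are illustrative choices, not part of this PR:

import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split


def test_temperature_scaling_brier_on_symmetric_gaussians():
    rng = np.random.RandomState(0)
    n_samples, n_features = 2000, 5
    # Balanced binary problem: means symmetric around zero, identical covariances.
    X_pos = rng.normal(loc=1.0, scale=2.0, size=(n_samples // 2, n_features))
    X_neg = rng.normal(loc=-1.0, scale=2.0, size=(n_samples // 2, n_features))
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(n_samples // 2), np.zeros(n_samples // 2)])
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit an over-confident model on a small subsample, then temperature-scale
    # it on the remaining (held-out) training data.
    clf = LogisticRegression(C=1e4).fit(X_train[:50], y_train[:50])
    cal_clf = CalibratedClassifierCV(
        FrozenEstimator(clf), method="temperature"
    ).fit(X_train[50:], y_train[50:])

    brier_raw = brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1])
    brier_cal = brier_score_loss(y_test, cal_clf.predict_proba(X_test)[:, 1])

    # Per the cited analysis, scaling should reduce (or at least not increase)
    # the calibration term of the Brier score on this kind of data.
    assert brier_cal <= brier_raw + 1e-3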

@virchan virchan marked this pull request as ready for review March 25, 2025 10:55
@ogrisel ogrisel (Member) left a comment

Thanks for the PR. Here is a first pass of feedback:

negative_log_likelihood,
np.array([beta_0]),
args=(logits, labels, max_logits),
method="L-BFGS-B",
Member

In cases where there is a single-element parameter array, it is probably much more efficient to use a dedicated scalar optimizer as discussed in #28574 (comment) and subsequent comments.

Member

But this would deserve some benchmarking to confirm.

Member Author

I ended up choosing minimize over minimize_scalar, even though it's more expensive (e.g., it requires computing the gradient).

This is because minimize_scalar doesn't let us provide the initial guess beta_0 = 1.0 when optimising the inverse temperature beta with method="bounded". Even with method="brent", it expects a bracket (xa, xb, xc) whose middle point satisfies func(xb) < func(xa) and func(xb) < func(xc), which is hard to determine beforehand when fitting the calibrator.
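
A small self-contained sketch of the difference (the quadratic stand-in loss is illustrative, not the actual NLL):

import numpy as np
from scipy.optimize import minimize, minimize_scalar

def nll(beta):
    # Stand-in for the temperature-scaling negative log-likelihood in beta.
    return np.sum((np.asarray(beta) - 2.5) ** 2)

# minimize: accepts an explicit initial guess beta_0 = 1.0, at the cost of a
# gradient-based method on a one-element parameter array.
res_lbfgsb = minimize(nll, x0=np.array([1.0]), method="L-BFGS-B")

# minimize_scalar, method="bounded": no initial guess, only bounds.
res_bounded = minimize_scalar(nll, bounds=(0.1, 10.0), method="bounded")

# minimize_scalar, method="brent": a bracket (xa, xb, xc) must satisfy
# nll(xb) < nll(xa) and nll(xb) < nll(xc), which is hard to guarantee up front.
res_brent = minimize_scalar(nll, bracket=(0.1, 1.0, 10.0), method="brent")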

@ogrisel ogrisel (Member) commented Apr 15, 2025

Why not pass bracket=(.5, 2) or bracket=(0.1, 10.) since we work on a multiplicative scale around the inverse temperature beta_0 = 1.0?

EDIT: thinking about this, we could re-parametrize as theta = log(beta) (hence beta = exp(theta)) and set bracket=(-1, 1): this way, bisections / additive increments in theta space naturally map to multiplicative updates in the beta / temperature space, which could lead to fewer solver iterations.

I think we should add a verbose parameter to CalibratedClassifierCV to display convergence information (and the number of iterations) of the underlying solvers.
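
A minimal sketch of that re-parametrization, with a stand-in loss in place of the real NLL:

import numpy as np
from scipy.optimize import minimize_scalar

def beta_loss(beta):
    # Stand-in for the temperature-scaling loss in the inverse temperature beta.
    return (np.log(beta) - 0.7) ** 2

# theta = log(beta), so beta = exp(theta): additive solver steps in theta map
# to multiplicative updates of beta, and bracket=(-1, 1) corresponds to
# beta in (1/e, e) around the initial guess beta_0 = 1.
res = minimize_scalar(lambda theta: beta_loss(np.exp(theta)), bracket=(-1, 1))
beta_hat = np.exp(res.x)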

Parameters
----------
predictions : ndarray of shape (n_samples,) or (n_samples, n_classes)
The decision function or predict proba for the samples.
@ogrisel ogrisel (Member) commented Mar 25, 2025

Let's be explicit if we expect one or the other and update the code (e.g. variable names) accordingly.

Here I have the impression that this code always expects logits (that is, the output of decision_function and never the output of predict_proba). If the latter, we would need to take the log of it (plus an epsilon to avoid NaNs).

This info is originally present in the method_name variable of:

method_name = _check_response_method(
this_estimator,
["decision_function", "predict_proba"],
).__name__

I think _fit_calibrator should propagate this info to the calibrator to avoid any ambiguity.

Maybe we should also undo pre-processing such as:

if method_name == "predict_proba":
# Select the probability column of the positive class
predictions = _process_predict_proba(
y_pred=predictions,
target_type="binary",
classes=self.classes_,
pos_label=self.classes_[1],
)
predictions = predictions.reshape(-1, 1)

and let the calibrator do what it finds most suitable with either kind of input.

Maybe the calibrators could expose explicit tags stating whether they can handle multiclass natively, or whether the caller should be in charge of reducing a multiclass calibration problem into binary calibration sub-problems with the OvR hack we currently implement for isotonic calibration and Platt scaling.

EDIT: the existing calibrator.__sklearn_tags__().input_tags might be enough to express this:

tags = calibrator.__sklearn_tags__()
supports_multiclass_logits = tags.input_tags.two_d_array and not tags.input_tags.one_d_array

as we have the following for IsotonicRegression:

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.one_d_array = True
tags.input_tags.two_d_array = False
return tags

The _SigmoidCalibration calibrator can be updated accordingly.

Member

Alternatively, we can rename the ambiguous "predictions" variable everywhere to "logits" and take the log of predict_proba + eps as early as possible.

But this would introduce a behavior change.
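
A minimal sketch of that conversion (the estimator and variable names are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)
proba = LogisticRegression().fit(X, y).predict_proba(X)

# Map predict_proba output onto the log scale, guarding against log(0).
eps = np.finfo(proba.dtype).eps
logits = np.log(proba + eps)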



return beta_minimizer.x[0]


class _SigmoidCalibration(RegressorMixin, BaseEstimator):
"""Sigmoid regression model.
Member

We should override __sklearn_tags__ to set the same input tags as IsotonicRegression:

def __sklearn_tags__(self):
tags = super().__sklearn_tags__()
tags.input_tags.one_d_array = True
tags.input_tags.two_d_array = False
return tags

See:

one_d_array : bool, default=False
Whether the input can be a 1D array.
two_d_array : bool, default=True
Whether the input can be a 2D array. Note that most common
tests currently run only if this flag is set to ``True``.

virchan added 4 commits March 27, 2025 18:14
…fier`.

Updated constructor of `_TemperatureScaling` class.
Updated `test_temperature_scaling` in `test_calibration.py`.
Added `__sklearn_tags__` to `_TemperatureScaling` class.
@virchan virchan (Member Author) left a comment

I'm still working on addressing the feedback, but I also wanted to share some findings related to it and provide an update.


@lorentzenchr lorentzenchr (Member) left a comment

A few computational things seem off.


virchan added 2 commits April 25, 2025 22:16
Update `minimize` in `_temperature_scaling` to `minimize_scalar`.
Update `test_calibration.py` to check that the optimised inverse temperature is between 0.1 and 10.
@virchan virchan (Member Author) left a comment

There are some CI failures; I'll fix those shortly.

I'm also considering adding a verbose parameter to CalibratedClassifierCV to optionally display convergence info when optimising the inverse temperature beta.
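
A rough sketch of what that could look like internally (the helper and its parameters are hypothetical, not code from this PR):

from scipy.optimize import minimize_scalar

def _fit_inverse_temperature(beta_loss, verbose=0):
    # Stand-in for the internal fit of the temperature calibrator.
    res = minimize_scalar(beta_loss, bounds=(0.1, 10.0), method="bounded")
    if verbose:
        print(
            f"temperature scaling: success={res.success}, "
            f"nfev={res.nfev}, beta={res.x:.4f}"
        )
    return res.x

# Example: a toy loss with its minimum at beta = 2.
_fit_inverse_temperature(lambda beta: (beta - 2.0) ** 2, verbose=1)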


return l.sum()

beta_minimizer = minimize_scalar(beta_loss, bounds=(0.1, 10.0), method="bounded")
Member Author

> Why not pass bracket=(.5, 2) or bracket=(0.1, 10.) since we work on a multiplicative scale around inverse temperature beta_0=1.0?

I think this might work: I've set bounds=(0.1, 10.0) and used the test_temperature_scaling test function to confirm that the optimised beta stays within that range.

That said, I do wonder whether users might ask for a way to set a custom initial guess beta_0 when performing temperature scaling.

Comment on lines +1093 to +1104
# Ensure raw_prediction has the same dtype as labels using .astype().
# Without this, dtype promotion rules differ across NumPy versions:
#
# beta = np.float64(0)
# logits = np.array([1, 2], dtype=np.float32)
#
# result = beta * logits
# - NumPy < 2: result.dtype is float32
# - NumPy 2+: result.dtype is float64
#
# This can cause dtype mismatch errors downstream (e.g., buffer dtype).
raw_prediction = xp.astype(beta * logits, dtype_)
Member Author

> And the logits should already have a suitable dtype.

I had to explicitly set the dtype here; otherwise scipy.optimize.minimize_scalar fails when (beta * logits).dtype == np.float32.

This issue is caught by our own test_float32_predict_proba.

I noticed that the sigmoid calibration code handles the same issue, so I reused its fix and comment for consistency.

Happy to adjust if I missed anything!
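
For reference, the promotion difference can be reproduced directly:

import numpy as np

beta = np.float64(0.5)
logits = np.array([1.0, 2.0], dtype=np.float32)

# NumPy < 2 (value-based casting): float32.  NumPy >= 2 (NEP 50): float64.
print((beta * logits).dtype)

# Casting explicitly keeps the dtype consistent across NumPy versions.
raw_prediction = (beta * logits).astype(logits.dtype)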

Development

Successfully merging this pull request may close these issues:

Implement temperature scaling for (multi-class) calibration