FEA add temperature scaling to CalibratedClassifierCV
#31068
base: main
Conversation
A follow-up to my comment on the Array API: I don't think we can support the Array API here, as scipy.optimize.minimize does not appear to support it.
If I missed anything, please let me know; I'd be happy to investigate further.
sklearn/tests/test_calibration.py
Outdated
@@ -401,6 +413,44 @@ def test_sigmoid_calibration():
        _SigmoidCalibration().fit(np.vstack((exF, exF)), exY)


    def test_temperature_scaling(data):
This test verifies that temperature scaling does not affect accuracy and that the optimised temperature is always positive.
I also noticed that the Brier score may improve or worsen depending on the dataset and the classifier being calibrated. Therefore, I did not include temperature scaling in the test_calibration function.
This seems to align with the remark made on page 3245 of "Classifier calibration: a survey on how to assess and improve predicted class probabilities".
If there are any test cases I should add, please let me know; I'd be happy to include them.
I think (single-parameter) temperature scaling provably improves the Brier score (or log-loss) by reducing only its calibration error term, without changing the refinement error term, on datasets/tasks with very specific structure, e.g. a balanced binary classification problem where the class means are symmetric around zero and the covariances are identical, as explained in section 5 of https://arxiv.org/html/2501.19195v1#S5.
We could write a test based on this analysis.
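A rough sketch of what such a test could look like, assuming the method="temperature" option proposed in this PR; the dataset construction, regularization strength, and test name are illustrative, not the PR's code:

    import numpy as np
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import brier_score_loss
    from sklearn.model_selection import train_test_split

    def test_temperature_scaling_improves_brier_on_symmetric_gaussians():
        # Balanced binary problem: class means symmetric around zero, identical covariances.
        rng = np.random.RandomState(42)
        n_per_class = 2000
        mean = np.array([1.0, 1.0])
        X = np.vstack(
            [
                rng.multivariate_normal(mean, np.eye(2), n_per_class),
                rng.multivariate_normal(-mean, np.eye(2), n_per_class),
            ]
        )
        y = np.concatenate([np.ones(n_per_class), np.zeros(n_per_class)])
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        # Strong regularization makes the base model under-confident, hence miscalibrated.
        clf = LogisticRegression(C=1e-3).fit(X_train, y_train)
        calibrated = CalibratedClassifierCV(
            LogisticRegression(C=1e-3), method="temperature", cv=5
        ).fit(X_train, y_train)

        bs_uncalibrated = brier_score_loss(y_test, clf.predict_proba(X_test)[:, 1])
        bs_calibrated = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
        assert bs_calibrated <= bs_uncalibrated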
Thanks for the PR. Here is a first pass of feedback:
sklearn/calibration.py
Outdated
        negative_log_likelihood,
        np.array([beta_0]),
        args=(logits, labels, max_logits),
        method="L-BFGS-B",
In cases where there is a single-element parameter array, it is probably much more efficient to use a dedicated scalar optimizer as discussed in #28574 (comment) and subsequent comments.
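For illustration, a minimal sketch of what a dedicated scalar optimizer could look like here; the nll helper, its signature, and the bounds are assumptions rather than the PR's code, and logits/labels are assumed to be an (n_samples, n_classes) array and integer class indices:

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import logsumexp

    def nll(beta, logits, labels):
        # Negative log-likelihood of the temperature-scaled softmax probabilities.
        scaled = beta * logits
        log_probs = scaled - logsumexp(scaled, axis=1, keepdims=True)
        return -log_probs[np.arange(labels.shape[0]), labels].sum()

    res = minimize_scalar(
        nll, bounds=(0.1, 10.0), args=(logits, labels), method="bounded"
    )
    beta_hat = res.x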
But this would deserve some benchmarking to confirm.
I ended up choosing minimize over minimize_scalar, even though it's more expensive (e.g., it requires computing the gradient).
This is because minimize_scalar doesn't allow us to provide an initial guess (beta_0 = 1.0) when optimising the inverse temperature beta using method = "Bounded". Even for method = "Brent", it expects the initial guess xb to be a local minimum between two other points, i.e. func(xb) < func(xa) and func(xb) < func(xc), which is impossible to determine beforehand when fitting the calibrator.
Why not pass bracket=(.5, 2) or bracket=(0.1, 10.) since we work on a multiplicative scale around the inverse temperature beta_0=1.0?
EDIT: thinking about this, we could re-parametrize as theta = log(beta) (hence beta = exp(theta)) and set bracket=(-1, 1): this way bisections / additive increments in theta space would naturally map to multiplicative updates in the beta / temperature space, which can probably lead to fewer iterations in the solver.
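A minimal sketch of this re-parametrization, assuming a scalar helper nll(beta, logits, labels) that returns the negative log-likelihood (a hypothetical name, not the PR's function):

    import numpy as np
    from scipy.optimize import minimize_scalar

    # Optimize theta = log(beta) so that additive steps in theta correspond to
    # multiplicative updates of the inverse temperature beta.
    res = minimize_scalar(
        lambda theta: nll(np.exp(theta), logits, labels),
        bracket=(-1.0, 1.0),
        method="brent",
    )
    beta_hat = float(np.exp(res.x))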
I think we should add a verbose attribute to CalibratedClassifierCV to display convergence information (and number of iterations) of the underlying solvers.
sklearn/calibration.py
Outdated
    Parameters
    ----------
    predictions : ndarray of shape (n_samples,) or (n_samples, n_classes)
        The decision function or predict proba for the samples.
Let's be explicit if we expect one or the other and update the code (e.g. variable names) accordingly.
Here I have the impression that this code always expects logits (that is, the output of decision_function and never the output of predict_proba). If it were the latter, we would need to take the log of it (plus an epsilon to avoid nans).
This info is originally present in the method_name variable of:
scikit-learn/sklearn/calibration.py
Lines 440 to 443 in af41352

    method_name = _check_response_method(
        this_estimator,
        ["decision_function", "predict_proba"],
    ).__name__
I think _fit_calibrator should propagate this info to the calibrator to avoid any ambiguity.
Maybe we should also undo pre-processing such as:
scikit-learn/sklearn/calibration.py
Lines 455 to 463 in af41352

    if method_name == "predict_proba":
        # Select the probability column of the positive class
        predictions = _process_predict_proba(
            y_pred=predictions,
            target_type="binary",
            classes=self.classes_,
            pos_label=self.classes_[1],
        )
        predictions = predictions.reshape(-1, 1)
and let the calibrators do what they find most suitable with either kind of input.
Maybe the calibrators could expose explicit tags to make it explicit whether they can handle multiclass natively, or whether the caller should be in charge of reducing a multiclass calibration problem into binary calibration sub-problems with the OvR hack we currently implement for isotonic calibration and Platt scaling.
EDIT: the existing calibrator.__sklearn_tags__().input_tags might be enough to express this:

    tags = calibrator.__sklearn_tags__()
    supports_multiclass_logits = tags.input_tags.two_d_array and not tags.input_tags.one_d_array

as we have the following for IsotonicRegression:
scikit-learn/sklearn/isotonic.py
Lines 513 to 517 in af41352

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.one_d_array = True
        tags.input_tags.two_d_array = False
        return tags
The _SigmoidCalibration class can be updated accordingly.
Alternatively, we can rename the ambiguous "predictions" variable everywhere to "logits" and take the log of predict_proba plus eps as early as possible.
But this would introduce a behavior change.
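For illustration, the early conversion could look roughly like this; the variable names and the choice of eps are assumptions, not the PR's code:

    import numpy as np

    # `proba` is the (n_samples, n_classes) output of predict_proba; offset by a
    # small eps before taking the log to avoid -inf for zero probabilities.
    eps = np.finfo(proba.dtype).eps
    logits = np.log(proba + eps)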
        return beta_minimizer.x[0]


    class _SigmoidCalibration(RegressorMixin, BaseEstimator):
        """Sigmoid regression model.
We should override __sklearn_tags__ to set the same input tags as IsotonicRegression:

scikit-learn/sklearn/isotonic.py
Lines 513 to 517 in af41352

    def __sklearn_tags__(self):
        tags = super().__sklearn_tags__()
        tags.input_tags.one_d_array = True
        tags.input_tags.two_d_array = False
        return tags
See:
scikit-learn/sklearn/utils/_tags.py
Lines 20 to 25 in af41352

    one_d_array : bool, default=False
        Whether the input can be a 1D array.
    two_d_array : bool, default=True
        Whether the input can be a 2D array. Note that most common
        tests currently run only if this flag is set to ``True``.
…fier`. Updated constructor of `_TemperatureScaling` class. Updated `test_temperature_scaling` in `test_calibration.py`. Added `__sklearn_tags__` to `_TemperatureScaling` class.
…Updated doc-strings of temperature scaling in `calibration.py`. Updated formatting.
I'm still working on addressing the feedback, but I also wanted to share some findings related to it and provide an update.
A few computational things seem off.
Update `minimize` in `_temperature_scaling` to `minimize_scalar`. Update `test_calibration.py` to check that the optimised inverse temperature is between 0.1 and 10.
There are some CI failures; I'll fix those shortly.
Also considering adding a verbose parameter to CalibratedClassifierCV to optionally display convergence info when optimising the inverse temperature beta.
        return l.sum()

    beta_minimizer = minimize_scalar(beta_loss, bounds=(0.1, 10.0), method="bounded")
Why not pass bracket=(.5, 2) or bracket=(0.1, 10.) since we work on a multiplicative scale around inverse temperature beta_0=1.0?

I think this might work: I've set bounds=(0.1, 10.0) and used the test_temperature_scaling test function to confirm that the optimised beta stays within that range.
That said, I do wonder if users might ask for a way to set a custom initial guess beta_0 when performing temperature scaling.
    # Ensure raw_prediction has the same dtype as labels using .astype().
    # Without this, dtype promotion rules differ across NumPy versions:
    #
    #   beta = np.float64(0)
    #   logits = np.array([1, 2], dtype=np.float32)
    #
    #   result = beta * logits
    #   - NumPy < 2: result.dtype is float32
    #   - NumPy 2+: result.dtype is float64
    #
    # This can cause dtype mismatch errors downstream (e.g., buffer dtype).
    raw_prediction = xp.astype(beta * logits, dtype_)
And the logits should already have a suitable dtype.

I had to explicitly set the dtype here; otherwise scipy.optimize.minimize_scalar fails when (beta * logits).dtype == np.float32.
This issue is caught by our own test_float32_predict_proba.
I noticed the sigmoid calibration code handles the same issue, so I reused its fix and comment for consistency.
Happy to adjust if I missed anything!
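A minimal repro of the promotion difference described in the code comment above (the values are the ones from that comment; the NumPy < 2 vs NumPy 2+ behaviour follows value-based casting vs NEP 50):

    import numpy as np

    beta = np.float64(0)
    logits = np.array([1, 2], dtype=np.float32)
    result = beta * logits
    # NumPy < 2: value-based casting keeps float32; NumPy >= 2: promotes to float64.
    print(result.dtype)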
…id `method` parameter.
Reference Issues/PRs
Closes #28574
What does this implement/fix? Explain your changes.
This PR adds temperature scaling to scikit-learn's CalibratedClassifierCV.
Temperature scaling can be enabled by setting method = "temperature" in CalibratedClassifierCV.
This method supports both binary and multi-class classification.
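A small usage sketch of the proposed option; the dataset and base estimator are chosen arbitrarily for illustration:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(
        n_samples=1000, n_classes=3, n_informative=6, random_state=0
    )
    calibrated = CalibratedClassifierCV(GaussianNB(), method="temperature", cv=5)
    calibrated.fit(X, y)
    proba = calibrated.predict_proba(X)  # temperature-scaled probabilities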
Any other comments?
Cc @adrinjalali, @lorentzenchr in advance.