### Describe the bug
In `CalibratedClassifierCV`, when a train split contains only 2 classes (binary) but the full data contains more (>= 3) classes, we assume the problem is binary:

scikit-learn/sklearn/calibration.py, lines 605 to 607 in d20e0b9

and we end up fitting only one calibrator:

scikit-learn/sklearn/calibration.py, lines 620 to 621 in d20e0b9
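A minimal numpy sketch of why this produces degenerate probabilities (this is an illustration, not the actual `calibration.py` internals): a single calibrator is fit for the first class in `pos_class_indices`, its output fills one column of the full `(n_samples, n_classes)` array, and row normalization then sends that column to 1 everywhere.

```python
import numpy as np

# Hedged sketch: the train split is binary, so only one calibrator is fit,
# for a single "positive" class. Its calibrated probabilities land in one
# column of the full probability matrix; the complement `1 - p` is never
# assigned to the other class present in the split.
n_classes = 3                   # classes seen by CalibratedClassifierCV
pos_class_index = 1             # the one class a calibrator was fit for
p = np.array([0.3, 0.9, 0.6])   # calibrated probabilities for that class

proba = np.zeros((len(p), n_classes))
proba[:, pos_class_index] = p

# Normalizing rows makes the lone non-zero column equal to 1 everywhere,
# matching the all-[0. 1. 0.] output in the report below.
proba /= proba.sum(axis=1, keepdims=True)
print(proba)
```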
Context: noticed when looking at #29545 and trying to update `test_calibration_less_classes`.
### Steps/Code to Reproduce
```python
import numpy as np

from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X = np.random.randn(12, 5)
y = [0, 0, 0, 0] + [1, 1, 1, 1] + [2, 2, 2, 2]

clf = DecisionTreeClassifier(random_state=7)
cal_clf = CalibratedClassifierCV(
    clf, method="sigmoid", cv=KFold(3), ensemble=True
)
cal_clf.fit(X, y)

for i in range(3):
    print(f"Fold: {i}")
    proba = cal_clf.calibrated_classifiers_[i].predict_proba(X)
    print(proba)
```
### Expected Results

`proba` should be 0 *only* for the class not present in the train subset.
### Actual Results
```
Fold: 0  # train contains classes 1 and 2; we take the first `pos_class_indices` (1) as the positive class
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
Fold: 1  # train contains classes 0 and 2; 0 is the first `pos_class_indices`
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
Fold: 2  # train contains classes 0 and 1; 0 is the first `pos_class_indices`
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
```
A reasonable fix is to check when `CalibratedClassifierCV.classes_` contains more classes than `estimator.classes_`, and in that case output both `proba` and `1 - proba` (assuming we can know which class the estimator deemed to be the positive class).

It also raises the question of whether we should warn when this happens..?
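A hedged numpy sketch of that suggestion (illustrative only; the column names and the way the positive class is identified are assumptions, not the actual fix): assign `p` and `1 - p` to the two classes the estimator actually saw, leaving 0 only for the class absent from the split.

```python
import numpy as np

# Hedged sketch of the suggested fix: the split is binary but the full
# problem has 3 classes, so fill in both `p` and `1 - p` for the two
# classes present in the split (assuming we know which class the
# estimator treated as positive).
n_classes = 3
classes_in_split = np.array([1, 2])  # classes present in this train split
pos_class = 2                        # hypothetical: estimator's positive class
p = np.array([0.3, 0.9, 0.6])        # calibrated P(class == pos_class)

proba = np.zeros((len(p), n_classes))
proba[:, pos_class] = p
# The complement goes to the other class seen in the split:
neg_class = classes_in_split[classes_in_split != pos_class][0]
proba[:, neg_class] = 1 - p
print(proba)
```

With this, only the class missing from the train split (class 0 here) gets probability 0, matching the expected results above.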
### Versions

Used `main`.