BUG Problem when CalibratedClassifierCV train contains 2 classes but data contains more #29551

Open
@lucyleeow

Description

Describe the bug

In CalibratedClassifierCV, when a train split contains 2 classes (binary) but the full data contains more (>=3) classes, we assume the problem is binary:

if predictions.ndim == 1:
    # Reshape binary output from `(n_samples,)` to `(n_samples, 1)`
    predictions = predictions.reshape(-1, 1)

and we only end up fitting one calibrator:

`n_classes` (i.e. `len(clf.classes_)`) calibrators are fitted.
However, if `n_classes` equals 2, one calibrator is fitted.
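The misdetection can be illustrated in isolation: the reshape heuristic treats any 1-D prediction array as binary output, regardless of how many classes exist in the full dataset. This is a minimal sketch (the variable names here are illustrative, not sklearn internals):

```python
import numpy as np

n_classes_full = 3  # classes present in the full data

# 1-D output from a sub-estimator whose train split saw only 2 classes
predictions = np.random.randn(6)

if predictions.ndim == 1:
    # sklearn's heuristic assumes a binary problem here and reshapes to
    # (n_samples, 1), so only one calibrator column exists even though
    # n_classes_full > 2
    predictions = predictions.reshape(-1, 1)

print(predictions.shape)  # (6, 1) -> a single calibrator is fitted
```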

Context: noticed when looking at #29545 and trying to update `test_calibration_less_classes`

Steps/Code to Reproduce

import numpy as np

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV

X = np.random.randn(12, 5)
y = [0, 0, 0, 0] + [1, 1, 1, 1] + [2, 2, 2, 2]
clf = DecisionTreeClassifier(random_state=7)
cal_clf = CalibratedClassifierCV(
    clf, method="sigmoid", cv=KFold(3), ensemble=True
)
cal_clf.fit(X, y)
for i in range(3):
    print(f'Fold: {i}')
    proba = cal_clf.calibrated_classifiers_[i].predict_proba(X)
    print(proba)

Expected Results

Expect proba to be 0 ONLY for the class not present in the train subset.

Actual Results

Fold: 0  # train contains classes 1 and 2; we take the first `pos_class_indices` (1) to be the positive class
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
Fold: 1  # train contains classes 0 and 2; 0 is the first `pos_class_indices`
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
Fold: 2  # train contains classes 0 and 1; 0 is the first `pos_class_indices`
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]

A reasonable fix is to check when `CalibratedClassifierCV.classes_` contains more classes than `estimator.classes_` and output both `proba` and `1 - proba` (assuming we can know which class the estimator deemed to be the positive class).
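As a rough sketch of what that fix could look like (`expand_binary_proba`, `split_classes`, and `full_classes` are hypothetical names, not existing sklearn API): spread `proba` and `1 - proba` over the two class columns the sub-estimator actually saw, leaving the missing class at 0.

```python
import numpy as np

def expand_binary_proba(proba, split_classes, full_classes):
    """Map binary calibrated probabilities onto the full class set.

    proba : shape (n_samples,), calibrated probability of the positive
        class, i.e. split_classes[1].
    split_classes : the 2 classes present in the train split.
    full_classes : all classes in the full data.
    """
    out = np.zeros((proba.shape[0], len(full_classes)))
    neg_idx = list(full_classes).index(split_classes[0])
    pos_idx = list(full_classes).index(split_classes[1])
    out[:, pos_idx] = proba
    out[:, neg_idx] = 1 - proba  # class absent from the split stays at 0
    return out

# Train split saw classes 1 and 2 out of {0, 1, 2}
p = expand_binary_proba(np.array([0.2, 0.9]), [1, 2], [0, 1, 2])
print(p)  # [[0.  0.8 0.2]
          #  [0.  0.1 0.9]]
```

With this, only the class truly absent from the train subset gets probability 0, matching the expected results above.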

It does raise the question of whether we should warn when this happens.

Versions

Using the `main` branch
