
BUG Problem when CalibratedClassifierCV train contains 2 classes but data contains more #29551


Open

lucyleeow opened this issue Jul 24, 2024 · 7 comments

@lucyleeow
Member

Describe the bug

In CalibratedClassifierCV, when a train split contains 2 classes (binary) but the data contains more (>=3) classes, we assume the data is binary:

if predictions.ndim == 1:
    # Reshape binary output from `(n_samples,)` to `(n_samples, 1)`
    predictions = predictions.reshape(-1, 1)

and we only end up fitting one calibrator:

`n_classes` (i.e. `len(clf.classes_)`) calibrators are fitted.
However, if `n_classes` equals 2, one calibrator is fitted.
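
To make the mechanism concrete, here is a minimal sketch (using LogisticRegression, which is not part of the original report): a fold estimator that saw only 2 of the 3 classes produces 1-D decision scores, so `predictions.ndim == 1` holds even though the full problem is multiclass.

import numpy as np
from sklearn.linear_model import LogisticRegression

# A fold estimator that only ever saw classes 1 and 2 (class 0 missing).
X = np.random.randn(8, 5)
y = [1, 1, 1, 1, 2, 2, 2, 2]

est = LogisticRegression().fit(X, y)
print(est.classes_)                   # [1 2]
print(est.decision_function(X).ndim)  # 1 -> the reshape above treats this as binary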

Context: noticed when looking at #29545 and trying to update `test_calibration_less_classes`.

Steps/Code to Reproduce

import numpy as np

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV

X = np.random.randn(12, 5)
y = [0, 0, 0, 0] + [1, 1, 1, 1] + [2, 2, 2, 2]
clf = DecisionTreeClassifier(random_state=7)
cal_clf = CalibratedClassifierCV(
    clf, method="sigmoid", cv=KFold(3), ensemble=True
)
cal_clf.fit(X, y)
for i in range(3):
    print(f'Fold: {i}')
    proba = cal_clf.calibrated_classifiers_[i].predict_proba(X)
    print(proba)

Expected Results

Expect proba to be 0 ONLY for the class not present in the train subset.

Actual Results

Fold: 0  # train contains class 1 and 2, we take the first `pos_class_indices` (1) to be the positive class
[[0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 1. 0.]]
Fold: 1  # train contains class 0 and 2, 0 is the first `pos_class_indices`
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
Fold: 2  # train contains class 0 and 1, `0` is the first `pos_class_indices`
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]

A reasonable fix is to check when `CalibratedClassifierCV.classes_` contains more classes than `estimator.classes_` and output both `proba` and `1 - proba` (assuming we can know which class the estimator deemed to be the positive class).
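
A minimal sketch of that idea; the helper is hypothetical (not existing scikit-learn API) and assumes `all_classes` is sorted, as `classes_` arrays are in scikit-learn:

import numpy as np

def expand_binary_proba(p_pos, est_classes, all_classes):
    # Hypothetical helper: `p_pos` is the calibrated probability of the fold
    # estimator's positive class (`est_classes[1]`). Spread `p_pos` and
    # `1 - p_pos` onto the matching columns of the full probability array;
    # classes unseen during training keep probability 0.
    proba = np.zeros((len(p_pos), len(all_classes)))
    proba[:, np.searchsorted(all_classes, est_classes[1])] = p_pos
    proba[:, np.searchsorted(all_classes, est_classes[0])] = 1 - p_pos
    return proba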

It does raise the question of whether we should warn when this happens.

Versions

Used main
lucyleeow added the Bug and Needs Triage labels on Jul 24, 2024
@glemaitre
Member

I'm under the impression that we should raise an error, not just a warning, because the estimator obtained is meaningless.

I can imagine that we could detect this case and raise an error in CalibratedClassifierCV. But it looks like this is also related to the cross-validation strategy: all CV strategies could check that all classes are seen at training time, or raise a warning mentioning that there is either a rare class or an ordered target and that the split is probably inappropriate. One might want to shuffle the data, create larger folds, or add more data.
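
For illustration (this snippet is not from the thread): on an ordered target, plain KFold loses a class from every train split, while StratifiedKFold keeps them all.

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
X = np.zeros((12, 1))

for name, cv in [("KFold", KFold(3)), ("StratifiedKFold", StratifiedKFold(3))]:
    train_classes = [sorted(set(y[train].tolist())) for train, _ in cv.split(X, y)]
    print(name, train_classes)
# KFold:           [[1, 2], [0, 2], [0, 1]] -> one class missing per split
# StratifiedKFold: [[0, 1, 2], [0, 1, 2], [0, 1, 2]]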

But I'm not sure which road we should take.

glemaitre removed the Needs Triage label on Jul 25, 2024
@lucyleeow
Member Author

lucyleeow commented Jul 25, 2024

I totally agree with you that missing one (or more) classes in a train split is not ideal - you end up with 0 proba, which is going to skew the calibration.

But in #7796 and the fix #7799, we did not even question why a user was using LOO cv or why we allow a train split to not contain all classes, so maybe I am missing something?

Maybe we want to be more permissive? e.g., if there is one rare class - indeed a user asks about this here: #29000 (comment)

all CV strategies could make sure that we are seeing all class at training or should raise a warning mentioning

We could also consider checking that the test split contains all classes as well. Training a calibrator on data that is missing a class is also not ideal...
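
A rough sketch of such a check; the function is hypothetical, not existing scikit-learn API:

import warnings

import numpy as np

def warn_on_missing_classes(cv, X, y):
    # Hypothetical pre-fit check: warn if any train or test split is missing
    # one of the classes present in the full target.
    y = np.asarray(y)
    classes = set(y.tolist())
    for i, (train_idx, test_idx) in enumerate(cv.split(X, y)):
        for name, idx in (("train", train_idx), ("test", test_idx)):
            missing = classes - set(y[idx].tolist())
            if missing:
                warnings.warn(f"Fold {i}: {name} split is missing classes {missing}")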

@Siddharth-Latthe-07

This comment was marked as spam.

@lucyleeow
Member Author

Thanks @Siddharth-Latthe-07, we have a way forward if we want to 'fix' the code.

The question here is rather whether we should allow this situation at all (training subset has fewer classes than full data). The options would be:

  • Check when `CalibratedClassifierCV.classes_` contains more classes than `estimator.classes_` and output both `proba` and `1 - proba` in this case - we should probably give a warning
  • Error when `CalibratedClassifierCV.classes_` contains more classes than `estimator.classes_`

But I think we would want some more (core) devs to weigh in on this.

But it looks like this is also related to the cross-validation strategy: all CV strategies could check that all classes are seen at training time, or raise a warning mentioning that there is either a rare class or an ordered target and that the split is probably inappropriate.

@glemaitre are you suggesting that this should be a check we add to our cv splitters? (as a separate issue?)

@glemaitre
Member

glemaitre commented Sep 2, 2024

Opening an issue for the CV makes sense, and we should probably discuss this problem in a drafting meeting or developer meeting.

@lucyleeow
Member Author

I see you have opened an issue about this already: #29558, so just linking.

@glemaitre
Member

Oh yes, indeed. I did not get much feedback. I should bring it up in a dev discussion.
