BUG Problem when CalibratedClassifierCV train contains 2 classes but data contains more #29551
Comments
I'm under the impression that we should raise an error, not just a warning, because the estimator obtained is meaningless. I imagine we could detect this case and raise an error, but I'm not sure which road we should take.
I totally agree with you that missing one (or more) classes in a train split is not ideal - you end up with a 0 probability, which is going to skew the calibration. But in #7796 and the fix #7799, we do not even question why a user is using LOO CV or why we allow a train split to not contain all classes, so maybe I am missing something? Maybe we want to be more permissive? e.g., if there is one rare class - indeed a user questions this here: #29000 (comment)
We could also consider checking that the test split contains all classes as well. Training a calibrator on data that is missing a class is also not ideal...
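A minimal sketch of the kind of check discussed above, assuming we would validate the (train, test) splits up front before fitting anything. This is hypothetical helper code, not anything that exists in scikit-learn; the function name `check_splits_contain_all_classes` is made up:

```python
import numpy as np

def check_splits_contain_all_classes(y, cv_splits):
    """Raise if any train or test subset is missing one of the classes in y.

    Hypothetical helper illustrating the 'error rather than warning' idea
    discussed above, applied to both the train and test side of each split.
    """
    y = np.asarray(y)
    classes = np.unique(y)
    for split_idx, (train_idx, test_idx) in enumerate(cv_splits):
        for side, idx in (("train", train_idx), ("test", test_idx)):
            missing = np.setdiff1d(classes, np.unique(y[idx]))
            if missing.size:
                raise ValueError(
                    f"Split {split_idx}: the {side} subset is missing "
                    f"class(es) {missing.tolist()}."
                )
```

Running such a check on the splits produced by the CV object would turn the silent 0-probability situation into an explicit error.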
Thanks @Siddharth-Latthe-07, we have a way forward if we wanted to 'fix' the code. The question here is rather whether we should allow this situation at all (training subset has fewer classes than full data). The options would be:
But I think we would want some more (core) devs to weigh in on this.
@glemaitre are you suggesting that this should be a check we add to our cv splitters? (as a separate issue?)
Opening an issue for the CV would make sense, and we should probably discuss this problem in a drafting meeting or developer meeting.
I see you have opened an issue about this already: #29558, so just linking it here.
Oh yes indeed. I did not get much feedback. I should put it in a dev discussion.
Describe the bug
In CalibratedClassifierCV, when a train split contains 2 classes (binary) but the data contains more (>= 3) classes, we assume the data is binary:
scikit-learn/sklearn/calibration.py, lines 605 to 607 in d20e0b9
and we only end up fitting one calibrator:
scikit-learn/sklearn/calibration.py, lines 620 to 621 in d20e0b9
Context: noticed when looking at #29545 and trying to update test_calibration_less_classes.
Steps/Code to Reproduce
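The original snippet is not reproduced here; below is a minimal hypothetical sketch that should trigger the same situation, using a hand-crafted `cv` so that the single train split only sees classes 0 and 1 while `y` also contains class 2:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.randn(12, 2)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])

# Single hand-crafted split: the train indices cover only classes 0 and 1,
# while the test indices (and the full data) also contain class 2.
cv = [(np.arange(8), np.arange(12))]

calibrated = CalibratedClassifierCV(LogisticRegression(), method="sigmoid", cv=cv)
calibrated.fit(X, y)
print(calibrated.classes_)          # [0 1 2]
print(calibrated.predict_proba(X))  # with the bug, the split is treated as binary
                                    # and the columns do not line up with classes_
```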
Expected Results
Expect proba to be 0 ONLY for the class not present in the train subset.
Actual Results
A reasonable fix is to check when CalibratedClassifierCV.classes_ is greater than estimator.classes_ and output both proba and 1 - proba (assuming we can know which class the estimator deemed to be the positive class). It does raise the question of whether we should warn when this happens...?
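A minimal sketch of that idea, assuming the positive class of the binary sub-estimator is known; the function name `place_binary_proba` and its signature are illustrative only, not the actual code in `calibration.py`:

```python
import numpy as np

def place_binary_proba(pos_proba, pos_class, neg_class, all_classes):
    """Spread a binary sub-estimator's calibrated output over the full class set.

    pos_proba   : shape (n_samples,), calibrated probability of ``pos_class``
    pos_class   : the class the sub-estimator deemed positive
    neg_class   : the other class present in the train split
    all_classes : CalibratedClassifierCV.classes_ (sorted); absent classes keep 0
    """
    full = np.zeros((pos_proba.shape[0], len(all_classes)))
    full[:, np.searchsorted(all_classes, pos_class)] = pos_proba
    full[:, np.searchsorted(all_classes, neg_class)] = 1.0 - pos_proba
    return full
```

Classes that were absent from the train split keep a probability of 0, matching the expected result above.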
Versions