Issue with CalibratedClassifierCV with multiclass classification problems #18709
More details about the existing OvR case: we found out that a naive softmax normalization of the raw decision function of the OvR LinearSVC was often competitive with Platt / isotonic calibration of the binary classifiers followed by a simple normalization.

I also think using the log loss to evaluate multiclass calibration was not necessarily a good idea: for imperfectly calibrated models, there is no guarantee that the model with the "best" but imperfect calibration has the lowest log loss. I changed the multiclass calibration test in deb75fc to use a multiclass version of the Brier loss, which seems slightly more stable (e.g. try changing the seed), but it also seems like an imperfect calibration metric (see #10883). Maybe we should try to extend the Expected Calibration Error (#11096) to the multiclass setting, but I am not sure whether this is common practice or not.

Another baseline we could compare to: stacking the uncalibrated model with a multinomial logistic regression or multinomial gradient boosting, possibly with positivity constraints (for LR) or monotonicity constraints (for GBRT). If the latter baseline proves to work in extensive benchmarks with various base classifiers on various multiclass classification datasets, it would be worth documenting it as an example and linking this strategy as an alternative to CalibratedClassifierCV.
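For concreteness, here is a minimal sketch (not the actual test code) of the two ideas above: the naive softmax baseline on LinearSVC's raw OvR decision function, and a multiclass Brier loss computed as the mean squared error between one-hot encoded labels and predicted probabilities. The dataset and its parameters are illustrative assumptions:

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

# Illustrative dataset (the parameters here are arbitrary assumptions).
X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive baseline: softmax-normalize the raw OvR decision function of
# LinearSVC instead of calibrating each binary problem.
svc = LinearSVC(dual=False).fit(X_train, y_train)
proba = softmax(svc.decision_function(X_test), axis=1)

# Multiclass Brier loss: squared error between the one-hot encoded labels
# and the predicted probabilities, summed over classes, averaged over samples.
y_onehot = label_binarize(y_test, classes=svc.classes_)
brier = np.mean(np.sum((y_onehot - proba) ** 2, axis=1))
print(f"multiclass Brier loss of the softmax baseline: {brier:.4f}")
```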
Mentioning @lucyleeow @dsleo @samronsin as you might be interested in this and may want to share your own insights.
We could even introduce an additional temperature hyperparameter in the multinomial loss of LR / HGBRT. It would be set to 1.0 by default to get regular LR, but could be grid searched with multiclass ECE when those models are used as second-stage calibrators (instead of relying on regularization alone).
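As a rough illustration of that idea, the temperature would simply rescale the logits before the softmax. The helper below is hypothetical, not an existing scikit-learn option:

```python
import numpy as np
from scipy.special import softmax

def temperature_scaled_proba(logits, temperature=1.0):
    """Hypothetical post-hoc temperature scaling of raw logits.

    temperature=1.0 recovers the regular multinomial probabilities;
    temperature > 1 flattens the distribution, temperature < 1 sharpens it.
    """
    return softmax(np.asarray(logits) / temperature, axis=1)

# The temperature could then be grid searched on held-out data against a
# multiclass calibration metric (e.g. a multiclass ECE or Brier loss).
```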
Cross-referencing a survey of our community on Twitter: https://twitter.com/ogrisel/status/1322119718334013443 with very relevant references in the replies.
While reviewing #17856, @ogrisel, @lucyleeow, and I found some weird things going on in CalibratedClassifierCV.

EDIT by Olivier: in particular, we found out that our existing multiclass calibration test was very brittle and had a high likelihood of failing when changing the random seed.
The paper used as a reference is the following:
Zadrozny, Bianca, and Charles Elkan. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the Eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002.
The issues are linked to the way probabilities are combined in multiclass settings.
Issue with classifiers natively supporting multiclass problems
The paper mentions that the multiclass problem should be tackled as a set of binary problems. However, classifiers that natively support multiclass problems, i.e. without using one-vs-rest, are not decoupled into one-vs-rest binary problems. Instead, we calibrate the different probabilities that the classifier outputs. So we do not implement what is written in the reference paper.
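To make the setup concrete, here is a small usage sketch (dataset parameters are illustrative) of wrapping a natively multiclass classifier, where the classifier's own probability outputs are calibrated per class and renormalized rather than the problem being decomposed into one-vs-rest binary problems:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=6, random_state=0
)

# RandomForestClassifier handles multiclass natively: no one-vs-rest
# decomposition happens here, its predict_proba columns are calibrated.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=3
).fit(X, y)
print(calibrated.predict_proba(X[:3]).sum(axis=1))  # rows sum to 1
```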
Issue with the performance of one-vs-rest and normalization
The paper describes three strategies to handle the multiclass case. We implement the third one: calibrating each one-vs-rest binary problem and normalizing the resulting probabilities so that they sum to 1 (see the sketch below). However, we should revisit this approach with extensive testing and reproduce the experiment shown in the paper with the Brier score (MSE).
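For reference, here is a minimal sketch of what this third strategy amounts to, assuming it means fitting one sigmoid (Platt) calibrator per one-vs-rest problem and then renormalizing. The helper names are hypothetical and this is not the actual CalibratedClassifierCV code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ovr_calibrators(scores_cal, y_cal, classes):
    """Fit one sigmoid (Platt) calibrator per one-vs-rest problem.

    scores_cal: held-out raw scores of shape (n_samples, n_classes).
    """
    return [
        LogisticRegression().fit(scores_cal[:, [k]], (y_cal == c).astype(int))
        for k, c in enumerate(classes)
    ]

def predict_ovr_proba(calibrators, scores):
    """Apply each per-class calibrator, then renormalize the rows.

    Per-class calibration ignores the sum-to-one constraint, hence the
    final normalization step (the step under discussion in this issue).
    """
    proba = np.column_stack([
        cal.predict_proba(scores[:, [k]])[:, 1]
        for k, cal in enumerate(calibrators)
    ])
    return proba / proba.sum(axis=1, keepdims=True)
```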