Issue with CalibratedClassifierCV with multiclass classification problems #18709
More details about the existing OvR case: we found out that a naive softmax normalization of the raw decision function of the OvR LinearSVC was often competitive with Platt / isotonic calibration of the binary classifiers followed by a simple normalization.

I also think using the log loss to evaluate multiclass calibration was not necessarily a good idea: for imperfectly calibrated models, there is no guarantee that the model with the "best" but imperfect calibration has the lowest log loss. I changed the multiclass calibration test in deb75fc to use a multiclass version of the Brier loss, which seems slightly more stable (e.g. try changing the seed), but it also seems like an imperfect calibration metric (see #10883). Maybe we should try to extend the Expected Calibration Error (#11096) to the multiclass setting, but I am not sure whether this is common practice or not.

Another baseline we could compare to: stacking the uncalibrated model with a multinomial logistic regression or multinomial gradient boosting, possibly with positivity constraints (for LR) or monotonicity constraints (for GBRT). If the latter baseline proves to work in extensive benchmarks with various base classifiers on various multiclass classification datasets, it would be worth documenting it as an example and linking this strategy as an alternative to CalibratedClassifierCV.
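For concreteness, here is a minimal sketch (not the actual test code) of the two ideas above: the naive softmax baseline on LinearSVC's raw OvR decision function, and a multiclass Brier loss computed as the mean squared error between one-hot encoded labels and predicted probabilities. The dataset and its parameters are illustrative assumptions:

```python
import numpy as np
from scipy.special import softmax
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.svm import LinearSVC

# Illustrative dataset (the parameters here are arbitrary assumptions).
X, y = make_classification(
    n_samples=2000, n_classes=3, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Naive baseline: softmax-normalize the raw OvR decision function of
# LinearSVC instead of calibrating each binary problem.
svc = LinearSVC(dual=False).fit(X_train, y_train)
proba = softmax(svc.decision_function(X_test), axis=1)

# Multiclass Brier loss: squared error between the one-hot encoded labels
# and the predicted probabilities, summed over classes, averaged over samples.
y_onehot = label_binarize(y_test, classes=svc.classes_)
brier = np.mean(np.sum((y_onehot - proba) ** 2, axis=1))
print(f"multiclass Brier loss of the softmax baseline: {brier:.4f}")
```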
Mentioning @lucyleeow @dsleo @samronsin as you might be interested in this and may want to share your own insights.
We could even introduce an additional temperature hyperparameter in the multinomial loss of LR / HGBRT. It would be set to 1.0 by default to get regular LR, but could be grid searched with multiclass ECE when those models are used as second-stage calibrators (instead of relying on regularization alone).
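As a rough illustration of that idea, the temperature would simply rescale the logits before the softmax. The helper below is hypothetical, not an existing scikit-learn option:

```python
import numpy as np
from scipy.special import softmax

def temperature_scaled_proba(logits, temperature=1.0):
    """Hypothetical post-hoc temperature scaling of raw logits.

    temperature=1.0 recovers the regular multinomial probabilities;
    temperature > 1 flattens the distribution, temperature < 1 sharpens it.
    """
    return softmax(np.asarray(logits) / temperature, axis=1)

# The temperature could then be grid searched on held-out data against a
# multiclass calibration metric (e.g. a multiclass ECE or Brier loss).
```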
Cross-referencing a survey of our community on Twitter: https://twitter.com/ogrisel/status/1322119718334013443 with very relevant references in the replies.
While reviewing #17856, @ogrisel, @lucyleeow, and I found some weird things going on in CalibratedClassifierCV.

EDIT by Olivier: in particular, we found out that our existing multiclass calibration test was very brittle and had a high likelihood of failing when changing the random seed.
The paper used as a reference is the following:
Zadrozny, Bianca, and Charles Elkan. "Transforming classifier scores into accurate multiclass probability estimates." Proceedings of the Eighth ACM SIGKDD international conference on Knowledge discovery and data mining. 2002.
The issues are linked to the way probabilities are combined in multiclass settings.
Issue with classifiers natively supporting multiclass problems
The paper mentions that the multiclass problem should be tackled as a set of binary problems. However, classifiers that natively support multiclass problems, i.e. without using one-vs-rest, are not decoupled into one-vs-rest binary problems. Instead, we calibrate the different probabilities that the classifier outputs. So we do not implement what is written in the reference paper.
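To make the setup concrete, here is a small usage sketch (dataset parameters are illustrative) of wrapping a natively multiclass classifier, where the classifier's own probability outputs are calibrated per class and renormalized rather than the problem being decomposed into one-vs-rest binary problems:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_classes=3, n_informative=6, random_state=0
)

# RandomForestClassifier handles multiclass natively: no one-vs-rest
# decomposition happens here, its predict_proba columns are calibrated.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=0), method="sigmoid", cv=3
).fit(X, y)
print(calibrated.predict_proba(X[:3]).sum(axis=1))  # rows sum to 1
```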
Issue with the performance of one-vs-rest and normalization
The paper describes three strategies to handle the multiclass case. We implement the third one: calibrating each one-vs-rest binary problem and normalizing the resulting probabilities so that they sum to 1 (see the sketch below). However, we should revisit this approach with extensive testing and reproduce the experiment shown in the paper with the Brier score (MSE).
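For reference, here is a minimal sketch of what this third strategy amounts to, assuming it means fitting one sigmoid (Platt) calibrator per one-vs-rest problem and then renormalizing. The helper names are hypothetical and this is not the actual CalibratedClassifierCV code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ovr_calibrators(scores_cal, y_cal, classes):
    """Fit one sigmoid (Platt) calibrator per one-vs-rest problem.

    scores_cal: held-out raw scores of shape (n_samples, n_classes).
    """
    return [
        LogisticRegression().fit(scores_cal[:, [k]], (y_cal == c).astype(int))
        for k, c in enumerate(classes)
    ]

def predict_ovr_proba(calibrators, scores):
    """Apply each per-class calibrator, then renormalize the rows.

    Per-class calibration ignores the sum-to-one constraint, hence the
    final normalization step (the step under discussion in this issue).
    """
    proba = np.column_stack([
        cal.predict_proba(scores[:, [k]])[:, 1]
        for k, cal in enumerate(calibrators)
    ])
    return proba / proba.sum(axis=1, keepdims=True)
```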