
DOC: improve plot_compare_calibration.py #28231


Merged: 18 commits merged from improve-calibration-example into scikit-learn:main on Feb 21, 2024

Conversation

@ogrisel (Member) commented Jan 23, 2024

Related to the ongoing discussion on improving the user guide section that is based on this example: #28171.

I realized that using the default value of C=1 was yielding under-confident predictions, contrary to what was said in the user guide and in the text of the example. I decided to tune the value of C using LogisticRegressionCV based on a proper scoring rule (I tried both the Brier score and the negative log-likelihood, and both select the same level of regularization for C). It seems to lead to better-calibrated logistic regression models for most dataset seeds. I think it's good practice, and I therefore decided to update the example accordingly.
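
For reference, the tuning described here boils down to something like the following sketch; the dataset construction is an illustrative assumption, and the example's actual change appears in the diff discussed further down:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Illustrative dataset; the example uses its own make_classification settings.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tune C over a log-spaced grid with 10-fold CV and a proper scoring rule.
# Swapping scoring="neg_log_loss" for scoring="neg_brier_score" selects a
# similar regularization level, as noted above.
lr = LogisticRegressionCV(
    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1_000
)
lr.fit(X_train, y_train)
print(lr.C_)  # the selected value(s) of C
```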

I also noticed that some things we explained in the text of the example would not hold for other choices of the dataset seed (or other dataset parameters that should not matter). So I revised the text accordingly.

@ogrisel (Member, Author) commented Jan 23, 2024

Let's wait for the CI to render the plot so we can compare the impact of tuning C against the plot in main.

EDIT:

  • calibration plot with tuned C (this PR):

[image]

  • calibration plot without tuning C:

[image]
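
For readers without access to the rendered CI artifacts: a calibration curve like the ones above can be drawn for any fitted classifier with CalibrationDisplay. A minimal sketch, where the model and data are placeholders rather than the example's exact setup:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Plot the observed fraction of positives against the mean predicted
# probability per bin; a perfectly calibrated model follows the diagonal.
CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)
plt.show()
```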

github-actions (bot) commented Jan 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 1f503fb.

@ogrisel (Member, Author) commented Jan 23, 2024

With the tuned C, the LR model is slightly under-confident, while it was over-confident previously.

Still, it looks like an improvement to me.

@ogrisel ogrisel marked this pull request as draft January 24, 2024 10:30
@ogrisel (Member, Author) commented Jan 24, 2024

I found a bug in this example: the non-IID dataset generation and the unshuffled train/test split make the analysis of this PR wrong.

EDIT: Actually, the shuffle parameter of make_classification itself has a default value of True, so the train/test split does not require shuffling for the IID assumption to hold for the train and test samples.

I think this PR is fine for review as it is.
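
To make the EDIT above concrete, a minimal sketch (values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# make_classification shuffles the samples by default (shuffle=True), so
# even an unshuffled train/test split still yields IID train and test sets.
X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
```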

@ogrisel ogrisel marked this pull request as ready for review January 24, 2024 17:14
ogrisel and others added 2 commits February 2, 2024 20:08
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@ogrisel (Member, Author) commented Feb 5, 2024

Thanks for the reviews. I think I addressed all the comments. The rendering looks good now.

# curve is the closest to the diagonal among the four models.
#
# Logistic regression is trained by minimizing the log-loss which is a strictly
# proper scoring rule: in the limit of infinite training data, strictly proper
@lorentzenchr (Member) commented Feb 5, 2024

We are getting closer ;-)

I'm a bit undecided whether or not to mention asymptotic arguments. We don't need to: the expectation of a proper scoring rule is minimized by the true probabilities. The asymptotic argument only comes in when estimating this expectation.

Note that the RF also uses proper scoring rules, e.g. the Gini criterion is the squared error evaluated with the mean of the node as the prediction.
I guess that it is more the overfitting difference that we see.

The fact that LR uses a canonical loss-link combination is also very important for the calibration results.

Arrrg, difficult to keep it simple.
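
For reference, the equivalence mentioned above can be spelled out: in a node where a fraction $p$ of the samples are positive, predicting the node mean $p$ gives the mean squared error

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - p)^2 = p(1-p)^2 + (1-p)\,p^2 = p(1-p),$$

which is half the binary Gini impurity $G(p) = 2p(1-p)$. Minimizing the Gini criterion is therefore equivalent to minimizing the squared error (Brier score) with the node mean as the prediction.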

@ogrisel (Member, Author) commented Feb 6, 2024

> I'm a bit undecided whether or not to mention asymptotic arguments. We don't need to: the expectation of a proper scoring rule is minimized by the true probabilities. The asymptotic argument only comes in when estimating this expectation.

I think it's important because most of the time machine learning practitioners (our main audience) are interested in out-of-sample predictions. This is why we plot the calibration curve on a held-out prediction set, and why the interaction with regularization-related hyperparameters is important in practice.

> Note that the RF also uses proper scoring rules, e.g. the Gini criterion is the squared error evaluated with the mean of the node as the prediction.

Yes, I was thinking about updating the item about the RF but was worried about making it too complex.

> The fact that LR uses a canonical loss-link combination is also very important for the calibration results.

I can try to fit this in somewhere, but since we do not really expose custom link functions for classifiers, I am worried it would make the message more complex without a justification in terms of hyperparameter choices.

@ogrisel (Member, Author) commented Feb 6, 2024

I expanded the sections on RFs. I also conducted experiments on synthetic data (outside of this example). I confirm I can reproduce the effect reported by Niculescu-Mizil and Caruana (by playing with max_features and a dataset with a large number of uninformative features), so it's not just a theoretical artifact.

However, an RF can also be strongly over-confident if the number of trees is not large enough to counteract the natural overfitting/over-confident behavior of a small number of unbounded trees. So the calibration picture can be really complex for RFs.

EDIT: sorry, my push did not go through yesterday when I posted this comment. It's there now: 55398ea.
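
A hedged sketch of the kind of experiment described above; the parameter values are illustrative assumptions, not the exact ones used:

```python
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Many uninformative features, as in the Niculescu-Mizil & Caruana setting.
X, y = make_classification(
    n_samples=50_000, n_features=100, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Subsampling few features at each split can push the RF towards the
# under-confident regime; too few trees tends towards over-confidence instead.
rfc = RandomForestClassifier(n_estimators=300, max_features=2, random_state=0)
rfc.fit(X_train, y_train)
CalibrationDisplay.from_estimator(rfc, X_test, y_test, n_bins=10)
```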

#
# Feel free to re-run this example with different random seeds and other
# dataset generation parameters to see how different the calibration plots can
# look. In general, Logistic Regression and Random Forest will tend to be the
Review comment (Member):

Maybe mention here that both use a proper scoring rule.

Reply (Member, Author):

In 55398ea, I did that in the item related to RF instead. I don't think we need to repeat it here.

@glemaitre glemaitre added this to the 1.4.1 milestone Feb 6, 2024
ogrisel and others added 2 commits February 6, 2024 14:14
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@glemaitre glemaitre self-requested a review February 12, 2024 13:45
# under-confident: the predicted probabilities are a bit too close to 0.5
# compared to the true fraction of positive samples.
#
# The other methods all output worse-calibrated probabilities:
Review comment (Member):

Something weird in this sentence.

@glemaitre (Member) left a comment:

I like the improvements. I'm wondering if we should have a proper tutorial-based example that goes into more detail when it comes to the models. I kind of like having the current example because it offers a 5-minute introduction to calibration. But the topic is complex and it could be beneficial to go deeper.

Comment on lines 93 to 104:

```diff
-lr = LogisticRegression()
+lr = LogisticRegressionCV(
+    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1_000
+)
 gnb = GaussianNB()
 svc = NaivelyCalibratedLinearSVC(C=1.0, dual="auto")
-rfc = RandomForestClassifier()
+rfc = RandomForestClassifier(random_state=42)
```
Review comment (Member):

As a user, it's confusing why one model does a CV search and the others don't. If we need to do a search, shouldn't we be doing it for all models, each with its corresponding search space?

Reply (Member, Author):

I agree that ideally, all methods should have their hyperparameters tuned with a proper scoring rule.

It's very easy (and quite efficient) to do for Logistic Regression, while it would require more effort for the other methods and would make the example much slower and more verbose.

I could add an inline comment to say so.
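
For the record, tuning one of the other models with a proper scoring rule would look roughly like the following; the grid is an illustrative assumption, and this is exactly the extra cost and verbosity mentioned above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space for the RF; each model would need its own grid.
param_grid = {
    "n_estimators": [100, 300, 1000],
    "min_samples_leaf": [1, 5, 25],
}
rfc = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="neg_log_loss",  # a proper scoring rule
    cv=5,
    n_jobs=-1,
)
# rfc.fit(X_train, y_train)  # with the example's training data
```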

Reply (Member, Author):

Also note: tuning C for the naively adapted SVC model would not fix its calibration (I tried).

ogrisel and others added 3 commits February 13, 2024 19:45
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@adrinjalali (Member) left a comment:

Just a nit. Thanks Olivier.

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@ogrisel (Member, Author) commented Feb 16, 2024

In retrospect I agree with @adrinjalali that it would be relevant to study the impact of tuning the parameters of the other models.

In particular, adjusting the smoothing of the Gaussian NB model can make it as well calibrated as Logistic Regression (see the sketch after this comment).

For RFs, adjusting the number of trees and the leaf size can take the model from over-confident (a few overfitting trees) to slightly under-confident. However, the RF parametrization with the best log-loss has a calibration curve very similar to the one currently displayed in this example.

The SVC is always badly calibrated whatever the regularization, and that is expected because of the naive min-max scaling of the decision function values into the [0, 1] range.

Still, this would be a lot of work: instead of a single joint plot with 4 models at once, I would be in favor of a more tutorial-style example that studies one model class at a time (with one or more plots per model class) and analyses the impact of the most important hyperparameters on calibration and log-loss.

Therefore I am in favor of merging the current state of this PR as is, so as to better back the arguments developed in #28171 and incrementally improve our message on model calibration. We can later consider refactoring the 2 examples on model calibration to make them more tutorial-ish.
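
As an illustration of the Gaussian NB point above, the smoothing can be tuned against a proper scoring rule via the var_smoothing parameter; the grid values here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Search the portion of the largest feature variance added to all variances
# for numerical stability / smoothing.
gnb = GridSearchCV(
    GaussianNB(),
    {"var_smoothing": np.logspace(-12, 0, 25)},
    scoring="neg_log_loss",
    cv=5,
)
# gnb.fit(X_train, y_train)  # with the example's training data
# gnb.best_params_ then gives the smoothing level with the best log-loss.
```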

ogrisel and others added 2 commits February 21, 2024 16:02
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@ogrisel (Member, Author) commented Feb 21, 2024

@lorentzenchr I think I addressed all the points of your last review.

@lorentzenchr lorentzenchr merged commit 3a3e746 into scikit-learn:main Feb 21, 2024
@lorentzenchr (Member) commented:

Those doc updates often take longer than anticipated 😏

@ogrisel ogrisel deleted the improve-calibration-example branch February 22, 2024 06:37
@glemaitre (Member) commented:

I'll backport this one

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 22, 2024
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
glemaitre added a commit that referenced this pull request Feb 22, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>