
DOC: improve plot_compare_calibration.py #28231


Merged: 18 commits merged from improve-calibration-example into scikit-learn:main on Feb 21, 2024

Conversation

@ogrisel (Member) commented Jan 23, 2024

Related to the ongoing discussion on improving the user guide section that is based on this example: #28171.

I realized that using the default value of C=1 was yielding under-confident predictions, contrary to what was said in the user guide and in the text of the example. I decided to tune the value of C using LogisticRegressionCV based on a proper scoring rule (I tried both the Brier score and the negative log-likelihood, and both select the same level of regularization for C). It seems to lead to better-calibrated logistic regression models for most dataset seeds. I think it's good practice, and I therefore decided to update the example accordingly.
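
For reference, the tuning described here boils down to something like the following sketch; the dataset construction is an illustrative assumption, and the example's actual change appears in the diff discussed further down:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split

# Illustrative dataset; the example uses its own make_classification settings.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Tune C over a log-spaced grid with 10-fold CV and a proper scoring rule.
# Swapping scoring="neg_log_loss" for scoring="neg_brier_score" selects a
# similar regularization level, as noted above.
lr = LogisticRegressionCV(
    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1_000
)
lr.fit(X_train, y_train)
print(lr.C_)  # the selected value(s) of C
```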

I also noticed that some things we explained in the text of the example would not hold for other choices of the dataset seed (or other dataset parameters that should not matter). So I revised the text accordingly.

@ogrisel (Member, Author) commented Jan 23, 2024

Let's wait for the CI to render the plot so we can compare the impact of tuning C against the plot in main.

EDIT:

  • calibration plot with tuned C (this PR):

[image]

  • calibration plot without tuning C:

[image]
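
For readers without access to the rendered CI artifacts: a calibration curve like the ones above can be drawn for any fitted classifier with CalibrationDisplay. A minimal sketch, where the model and data are placeholders rather than the example's exact setup:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Plot the observed fraction of positives against the mean predicted
# probability per bin; a perfectly calibrated model follows the diagonal.
CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)
plt.show()
```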

github-actions (bot) commented Jan 23, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit 1f503fb.

@ogrisel (Member, Author) commented Jan 23, 2024

With the tuned C, the LR model is slightly under-confident, while it was over-confident previously.

Still, it looks like an improvement to me.

@ogrisel ogrisel marked this pull request as draft January 24, 2024 10:30
@ogrisel (Member, Author) commented Jan 24, 2024

I found a bug in this example: the non-IID dataset generation and the unshuffled train/test split make the analysis of this PR wrong.

EDIT: Actually, the shuffle parameter of make_classification itself has a default value of True, so the train/test split does not require shuffling for the IID assumption to hold for the train and test samples.

I think this PR is fine for review as it is.
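
To make the EDIT above concrete, a minimal sketch (values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# make_classification shuffles the samples by default (shuffle=True), so
# even an unshuffled train/test split still yields IID train and test sets.
X, y = make_classification(n_samples=1_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
```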

@ogrisel ogrisel marked this pull request as ready for review January 24, 2024 17:14
ogrisel and others added 2 commits February 2, 2024 20:08
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
@ogrisel (Member, Author) commented Feb 5, 2024

Thanks for the reviews. I think I addressed all the comments. The rendering looks good now.

# curve is the closest to the diagonal among the four models.
#
# Logistic regression is trained by minimizing the log-loss which is a strictly
# proper scoring rule: in the limit of infinite training data, strictly proper
@lorentzenchr (Member) commented Feb 5, 2024

We are getting closer ;-)

I'm a bit undecided whether or not to mention asymptotic arguments. We don't need to: the expectation of a proper scoring rule is minimized by the true probabilities. The asymptotic argument only comes in when estimating this expectation.

Note that the RF also uses proper scoring rules, e.g. the Gini criterion is the squared error evaluated with the mean of the node as the prediction.
I guess that it is more the overfitting difference that we see.

The fact that LR uses a canonical loss-link combination is also very important for the calibration results.

Arrrg, difficult to keep it simple.
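
For reference, the equivalence mentioned above can be spelled out: in a node where a fraction $p$ of the samples are positive, predicting the node mean $p$ gives the mean squared error

$$\frac{1}{n}\sum_{i=1}^{n} (y_i - p)^2 = p(1-p)^2 + (1-p)\,p^2 = p(1-p),$$

which is half the binary Gini impurity $G(p) = 2p(1-p)$. Minimizing the Gini criterion is therefore equivalent to minimizing the squared error (Brier score) with the node mean as the prediction.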

@ogrisel (Member, Author) commented Feb 6, 2024

> I'm a bit undecided whether or not to mention asymptotic arguments. We don't need to: the expectation of a proper scoring rule is minimized by the true probabilities. The asymptotic argument only comes in when estimating this expectation.

I think it's important because most of the time machine learning practitioners (our main audience) are interested in out-of-sample predictions. This is why we plot the calibration curve on a held-out prediction set, and why the interaction with regularization-related hyperparameters is important in practice.

> Note that the RF also uses proper scoring rules, e.g. the Gini criterion is the squared error evaluated with the mean of the node as the prediction.

Yes, I was thinking about updating the item about the RF but was worried about making it too complex.

> The fact that LR uses a canonical loss-link combination is also very important for the calibration results.

I can try to fit this in somewhere, but since we do not really expose custom link functions for classifiers, I am worried it would make the message more complex without a justification in terms of hyperparameter choices.

@ogrisel (Member, Author) commented Feb 6, 2024

I expanded the sections on RFs. I also conducted experiments on synthetic data (outside of this example). I confirm I can reproduce the effect reported by Niculescu-Mizil and Caruana (by playing with max_features and a dataset with a large number of uninformative features), so it's not just a theoretical artifact.

However, an RF can also be strongly over-confident if the number of trees is not large enough to counteract the natural overfitting/over-confident behavior of a small number of unbounded trees. So the calibration picture can be really complex for RFs.

EDIT: sorry, my push did not go through yesterday when I posted this comment. It's there now: 55398ea.
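
A hedged sketch of the kind of experiment described above; the parameter values are illustrative assumptions, not the exact ones used:

```python
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Many uninformative features, as in the Niculescu-Mizil & Caruana setting.
X, y = make_classification(
    n_samples=50_000, n_features=100, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Subsampling few features at each split can push the RF towards the
# under-confident regime; too few trees tends towards over-confidence instead.
rfc = RandomForestClassifier(n_estimators=300, max_features=2, random_state=0)
rfc.fit(X_train, y_train)
CalibrationDisplay.from_estimator(rfc, X_test, y_test, n_bins=10)
```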

#
# Feel free to re-run this example with different random seeds and other
# dataset generation parameters to see how different the calibration plots can
# look. In general, Logistic Regression and Random Forest will tend to be the
Review comment (Member):

Maybe mention here that both use a proper scoring rule.

Reply (Member, Author):

In 55398ea, I did that in the item related to RF instead. I don't think we need to repeat it here.

@glemaitre glemaitre added this to the 1.4.1 milestone Feb 6, 2024
ogrisel and others added 2 commits February 6, 2024 14:14
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@glemaitre glemaitre self-requested a review February 12, 2024 13:45
# under-confident: the predicted probabilities are a bit too close to 0.5
# compared to the true fraction of positive samples.
#
# The other methods all output worse-calibrated probabilities:
Review comment (Member):

Something weird in this sentence.

@glemaitre (Member) left a comment:

I like the improvements. I'm wondering if we should have a proper tutorial-based example that goes into more detail when it comes to the models. I kind of like having the current example because it offers a 5-minute introduction to calibration. But the topic is complex and it could be beneficial to go deeper.

Comment on lines 93 to 104:

```diff
-lr = LogisticRegression()
+lr = LogisticRegressionCV(
+    Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1_000
+)
 gnb = GaussianNB()
 svc = NaivelyCalibratedLinearSVC(C=1.0, dual="auto")
-rfc = RandomForestClassifier()
+rfc = RandomForestClassifier(random_state=42)
```
Review comment (Member):

As a user, it's confusing why one model does a CV search and the others don't. If we need to do a search, shouldn't we be doing it for all models, each with its corresponding search space?

Reply (Member, Author):

I agree that ideally, all methods should have their hyperparameters tuned with a proper scoring rule.

It's very easy (and quite efficient) to do for Logistic Regression, while it would require more effort for the other methods and would make the example much slower and more verbose.

I could add an inline comment to say so.
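
For the record, tuning one of the other models with a proper scoring rule would look roughly like the following; the grid is an illustrative assumption, and this is exactly the extra cost and verbosity mentioned above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space for the RF; each model would need its own grid.
param_grid = {
    "n_estimators": [100, 300, 1000],
    "min_samples_leaf": [1, 5, 25],
}
rfc = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="neg_log_loss",  # a proper scoring rule
    cv=5,
    n_jobs=-1,
)
# rfc.fit(X_train, y_train)  # with the example's training data
```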

Reply (Member, Author):

Also note: tuning C for the naively adapted SVC model would not fix its calibration (I tried).

ogrisel and others added 3 commits February 13, 2024 19:45
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@adrinjalali (Member) left a comment:

Just a nit. Thanks Olivier.

Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
@ogrisel (Member, Author) commented Feb 16, 2024

In retrospect I agree with @adrinjalali that it would be relevant to study the impact of tuning the parameters of the other models.

In particular, adjusting the smoothing of the Gaussian NB model can make it as well calibrated as Logistic Regression (see the sketch after this comment).

For RFs, adjusting the number of trees and the leaf size can take the model from over-confident (a few overfitting trees) to slightly under-confident. However, the RF parametrization with the best log-loss has a calibration curve very similar to the one currently displayed in this example.

The SVC is always badly calibrated whatever the regularization, and that is expected because of the naive min-max scaling of the decision function values into the [0, 1] range.

Still, this would be a lot of work: instead of a single joint plot with 4 models at once, I would be in favor of a more tutorial-style example that studies one model class at a time (with one or more plots per model class) and analyses the impact of the most important hyperparameters on calibration and log-loss.

Therefore I am in favor of merging the current state of this PR as is, so as to better back the arguments developed in #28171 and incrementally improve our message on model calibration. We can later consider refactoring the 2 examples on model calibration to make them more tutorial-ish.
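
As an illustration of the Gaussian NB point above, the smoothing can be tuned against a proper scoring rule via the var_smoothing parameter; the grid values here are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

# Search the portion of the largest feature variance added to all variances
# for numerical stability / smoothing.
gnb = GridSearchCV(
    GaussianNB(),
    {"var_smoothing": np.logspace(-12, 0, 25)},
    scoring="neg_log_loss",
    cv=5,
)
# gnb.fit(X_train, y_train)  # with the example's training data
# gnb.best_params_ then gives the smoothing level with the best log-loss.
```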

ogrisel and others added 2 commits February 21, 2024 16:02
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@ogrisel (Member, Author) commented Feb 21, 2024

@lorentzenchr I think I addressed all the points of your last review.

@lorentzenchr lorentzenchr merged commit 3a3e746 into scikit-learn:main Feb 21, 2024
@lorentzenchr (Member) commented:

Those doc updates often take longer than anticipated 😏

@ogrisel ogrisel deleted the improve-calibration-example branch February 22, 2024 06:37
@glemaitre (Member) commented:

I'll backport this one

glemaitre added a commit to glemaitre/scikit-learn that referenced this pull request Feb 22, 2024
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
glemaitre added a commit that referenced this pull request Feb 22, 2024
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>