DOC: improve plot_compare_calibration.py #28231
Conversation
With the tuned C, the LR model is slightly under-confident, while it was over-confident previously. Still, it looks like an improvement to me.
EDIT: Actually, I think this PR is fine for review as it is.
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Thanks for the reviews. I think I addressed all the comments. The rendering looks good now.
# curve is the closest to the diagonal among the four models.
#
# Logistic regression is trained by minimizing the log-loss which is a strictly
# proper scoring rule: in the limit of infinite training data, strictly proper |
We are getting closer ;-)
I'm a bit undecided whether or not to mention asymptotic arguments. We don't need to: the expectation of a proper scoring rule is minimized by the true probabilities. The asymptotic argument only comes in when estimating this expectation.
Note that the RF also uses proper scoring rules, e.g. the Gini criterion is the squared error evaluated with the mean of that node as the prediction.
I guess that what we see is more the difference in overfitting.
The fact that LR uses a canonical loss-link combination is also very important for the calibration results.
Arrrg, difficult to keep it simple.
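For reference, a minimal numerical sketch of the point about proper scoring rules (not part of the example; the value of p below is an arbitrary choice): the expected log-loss for a binary outcome with true positive-class probability p is minimized at a predicted probability of exactly p, so no asymptotic argument is needed at the population level.

```python
# Minimal sketch: the *expected* log-loss is minimized at the true probability.
# Asymptotics only enter when this expectation is replaced by an average over
# a finite sample.
import numpy as np

p = 0.3  # true probability of the positive class (arbitrary choice)
q = np.linspace(1e-6, 1 - 1e-6, 100_001)  # candidate predicted probabilities

expected_log_loss = -(p * np.log(q) + (1 - p) * np.log(1 - q))
print(q[np.argmin(expected_log_loss)])  # close to 0.3, i.e. the true probability
```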
> I'm a bit undecided whether or not to mention asymptotic arguments. We don't need to: the expectation of a proper scoring rule is minimized by the true probabilities. The asymptotic argument only comes in when estimating this expectation.

I think it's important because most of the time machine learning practitioners (our main audience) are interested in out-of-sample predictions. This is why we plot the calibration curve on a held-out set (see the sketch after this comment) and why the interaction with regularization-related hyperparameters is important in practice.

> Note that the RF also uses proper scoring rules, e.g. the Gini criterion is the squared error evaluated with the mean of that node as the prediction.

Yes, I was thinking about updating the item about the RF but was worried about making it too complex.

> The fact that LR uses a canonical loss-link combination is also very important for the calibration results.

I can try to fit this in somewhere, but since we do not really expose custom link functions for classifiers, I am afraid it would make the message more complex without a justification in terms of hyperparameter choices.
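To make the held-out evaluation concrete, here is a hedged sketch (not taken from the example; the dataset parameters and the plain LogisticRegression are placeholders) of plotting a calibration curve on a held-out split with CalibrationDisplay:

```python
# Hedged sketch: compute the calibration curve on data not seen during fit,
# so it reflects out-of-sample calibration. Dataset parameters are placeholders.
import matplotlib.pyplot as plt

from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
CalibrationDisplay.from_estimator(clf, X_test, y_test, n_bins=10)
plt.show()
```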
I expanded the sections on RFs. I also conducted experiments on synthetic data (outside of this example). I confirm I can reproduce the effect reported by Niculescu-Mizil and Caruana (by playing with max_features and a dataset with a large number of uninformative features), so it's not just a theoretical artifact; a sketch of this kind of experiment is included below.
However, RF can also be strongly over-confident if the number of trees is not large enough to counteract the natural overfitting/over-confidence of a small number of unbounded trees. So the calibration picture can be really complex for RFs.
EDIT: sorry, my push did not go through yesterday when I posted this comment. It's now there: 55398ea.
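For reference, a hedged sketch of the kind of side experiment described in this comment; the dataset and forest parameters are illustrative guesses rather than the ones actually used:

```python
# Hedged sketch: many uninformative features plus a small max_features, to see
# whether the random forest's calibration curve changes shape. Parameter values
# are guesses, not the ones behind the reported experiments.
import matplotlib.pyplot as plt

from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=20_000,
    n_features=100,
    n_informative=5,  # most of the remaining features are uninformative
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

fig, ax = plt.subplots()
for max_features in [2, None]:  # None means "use all features at each split"
    rf = RandomForestClassifier(
        n_estimators=300, max_features=max_features, random_state=0, n_jobs=-1
    ).fit(X_train, y_train)
    CalibrationDisplay.from_estimator(
        rf, X_test, y_test, n_bins=15, name=f"RF (max_features={max_features})", ax=ax
    )
plt.show()
```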
#
# Feel free to re-run this example with different random seeds and other
# dataset generation parameters to see how different the calibration plots can
# look. In general, Logistic Regression and Random Forest will tend to be the |
Maybe mention here that both use a proper scoring rule.
In 55398ea, I did that in the item related to RF instead. I don't think we need to repeat it here.
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
# under-confident: the predicted probabilities are a bit too close to 0.5
# compared to the true fraction of positive samples.
#
# The other methods all output worse-calibrated probabilities: |
Something weird in this sentence.
I like the improvements. I'm wondering if we should have a proper tutorial-style example to go into more detail about the models. I kind of like having the current example because it offers a 5-minute introduction to calibration. But the topic is complex and it could be beneficial to go deeper into it.
- lr = LogisticRegression()
+ lr = LogisticRegressionCV(
+     Cs=np.logspace(-6, 6, 101), cv=10, scoring="neg_log_loss", max_iter=1_000
+ )
  gnb = GaussianNB()
  svc = NaivelyCalibratedLinearSVC(C=1.0, dual="auto")
- rfc = RandomForestClassifier()
+ rfc = RandomForestClassifier(random_state=42)
As a user, it's confusing why one model gets a CV search while the others don't. If we need to do a search, shouldn't we do one for all models, each with its own search space?
I agree that ideally, all methods should have their hyperparameters tuned with a proper scoring rule.
It's very easy (and quite efficient) to do for Logistic Regression, while it would require more effort for the other methods and would make the example much slower and more verbose.
I could add an inline comment to say so.
Also note: tuning C for the naively adapted SVC model would not fix its calibration (I tried).
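A hedged sketch of that check: the class below only approximates the example's NaivelyCalibratedLinearSVC helper (min-max scaling of the decision function, redefined here to keep the snippet self-contained), and the dataset and grid of C values are placeholders.

```python
# Hedged sketch: grid-search C for a naively "calibrated" LinearSVC under the
# log-loss. The helper class approximates the one defined in the example.
import numpy as np

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC


class NaivelyCalibratedLinearSVC(LinearSVC):
    """LinearSVC exposing min-max scaled decision values as probabilities."""

    def fit(self, X, y):
        super().fit(X, y)
        df = self.decision_function(X)
        self.df_min_ = df.min()
        self.df_max_ = df.max()
        return self

    def predict_proba(self, X):
        df = self.decision_function(X)
        scaled = (df - self.df_min_) / (self.df_max_ - self.df_min_)
        # clip away from 0 and 1 so that the log-loss stays finite
        proba_pos = np.clip(scaled, 1e-6, 1 - 1e-6)
        return np.column_stack([1 - proba_pos, proba_pos])


X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
search = GridSearchCV(
    NaivelyCalibratedLinearSVC(dual="auto", max_iter=10_000),
    param_grid={"C": np.logspace(-4, 4, 9)},
    scoring="neg_log_loss",
    cv=5,
).fit(X, y)
print(search.best_params_)
# Even the best C leaves the min-max scaled scores poorly calibrated, which can
# be checked with CalibrationDisplay on a held-out split.
```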
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Just a nit. Thanks Olivier.
Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
In retrospect I agree with @adrinjalali that it would be relevant to study the impact of tuning the parameters for the other models. In particular, adjusting the smoothing for the Gaussian NB model can make it as well calibrated as Logistic Regression. For RFs, adjusting the number of trees and the leaf size can move the model from over-confident (a few overfitting trees) to slightly under-confident. However, the RF parametrization with the best log-loss has a calibration curve very similar to the one currently displayed in this example. The SVC is always badly calibrated, whatever the regularization, and that is expected because of the naive min-max scaling of the decision function values into the [0, 1] range.

Still, this would be a lot of work: instead of doing a single joint plot with 4 models at once, I would be in favor of a more tutorial-style example that studies one model class at a time (with one or more plots per model class) and analyses the impact of the most important hyperparameters on calibration and log-loss (a sketch of such per-model tuning follows this comment).

Therefore I am in favor of merging the current state of this PR as is, so as to be able to better back the arguments developed in #28171 and incrementally improve our message on model calibration, and later consider refactoring the 2 examples on model calibration to make them more tutorial-ish.
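A hedged sketch of the per-model tuning mentioned in this comment; the parameter grids and the dataset are illustrative placeholders, not the settings behind the observations reported above.

```python
# Hedged sketch: tune Gaussian NB smoothing and a couple of random forest
# hyperparameters under a proper scoring rule (the log-loss). Grids are
# illustrative placeholders.
import numpy as np

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

searches = {
    "gnb": GridSearchCV(
        GaussianNB(),
        param_grid={"var_smoothing": np.logspace(-12, 0, 13)},
        scoring="neg_log_loss",
        cv=5,
    ),
    "rf": GridSearchCV(
        RandomForestClassifier(random_state=0, n_jobs=-1),
        param_grid={"n_estimators": [30, 100, 300], "min_samples_leaf": [1, 10, 50]},
        scoring="neg_log_loss",
        cv=5,
    ),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, search.best_score_)
```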
Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com>
@lorentzenchr I think I addressed all the points of your last review.
Those doc updates often take longer than anticipated 😏
I'll backport this one
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org> Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com> Co-authored-by: Christian Lorentzen <lorentzen.ch@gmail.com> Co-authored-by: Adrin Jalali <adrin.jalali@gmail.com>
Related to the ongoing discussion on improving the user guide section that is based on this example: #28171.
I realized that using the default value of C=1 was yielding under-confident predictions, contrary to what was said in the user guide and in the text of the example. I decided to tune the value of C using LogisticRegressionCV based on a proper scoring rule (I tried both the Brier score and the negative log-likelihood; both find the same level for C). It seems to lead to better calibrated logistic regression models for most dataset seeds. I think it's good practice and therefore decided to update the example accordingly.

I also noticed that some of the things we explained in the text of the example would not hold for other choices of the dataset seed (or other dataset parameters that should not matter), so I revised the text accordingly.
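For reference, a hedged sketch of the comparison between the two scoring rules mentioned above; the dataset parameters are placeholders rather than the example's actual settings.

```python
# Hedged sketch: tuning C under the log-loss and under the Brier score should
# select a similar regularization level. Dataset parameters are placeholders.
import numpy as np

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)

for scoring in ("neg_log_loss", "neg_brier_score"):
    lr = LogisticRegressionCV(
        Cs=np.logspace(-6, 6, 101), cv=10, scoring=scoring, max_iter=1_000
    ).fit(X, y)
    print(scoring, lr.C_)
```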