Add PAV algorithm for calibration_curve/reliability diagrams #23132

Open
lorentzenchr opened this issue Apr 14, 2022 · 8 comments · May be fixed by #23824

lorentzenchr commented Apr 14, 2022

Describe the workflow you want to enable

import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0.1, 0.2, 0.3, 0.4, 0.65, 0.7, 0.8, 0.9,  1.])
prob_true, prob_pred = calibration_curve(y_true, y_pred, strategy="pav")

Describe your proposed solution

Add the strategy PAV, as in [1] and [2] (there called CORP), to calibration_curve. This basically applies isotonic regression as the binning strategy, which we already have in scikit-learn; a rough sketch follows the references below.

[1] Dimitriadis, T., Gneiting, T., & Jordan, A.I. (2021). Stable reliability diagrams for probabilistic classifiers. Proceedings of the National Academy of Sciences of the United States of America, 118. https://doi.org/10.1073/pnas.2016191118
[2] https://cran.r-project.org/package=reliabilitydiag
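
For concreteness, here is a minimal sketch of the idea using only the existing sklearn.isotonic.IsotonicRegression API (the helper name pav_calibration_curve and the block extraction below are illustrative, not a final implementation):

import numpy as np
from sklearn.isotonic import IsotonicRegression

def pav_calibration_curve(y_true, y_pred):
    # Fit isotonic regression (PAV) of the outcomes on the predictions;
    # the fit is piecewise constant and its blocks define the bins.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_pred)
    y_cal = IsotonicRegression(y_min=0.0, y_max=1.0).fit_transform(
        y_pred[order], y_true[order]
    )
    # One point per constant block: prob_true is the block value (the
    # mean of y_true in the block), prob_pred the mean of y_pred in it.
    block_ids = np.r_[0, np.cumsum(np.diff(y_cal) != 0)]
    blocks = np.unique(block_ids)
    prob_true = np.array([y_cal[block_ids == b].mean() for b in blocks])
    prob_pred = np.array([y_pred[order][block_ids == b].mean() for b in blocks])
    return prob_true, prob_pred

On the example above, this yields two blocks: prob_true = [0, 1] and prob_pred = [0.25, 0.81].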

Describe alternatives you've considered, if relevant

No response

Additional context

Given the recency of the paper, it clearly does not have many citations (yet). But I have the impression that this is a good strategy for reliability diagrams, with good theoretical and practical properties.

To my knowledge, this strategy is not available anywhere in the Python ecosystem as of now.

lorentzenchr added the New Feature, Needs Triage, module:calibration and Needs Decision - Include Feature labels Apr 14, 2022
lorentzenchr commented:

@aijordan For your information.

lorentzenchr removed the Needs Triage label Apr 14, 2022

ogrisel commented Apr 20, 2022

It's possible that using Centered Isotonic Regression (#21454) would make the reliability diagram look even better but might break the theoretical results of the paper you linked above.
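
For concreteness, a rough sketch of that centered variant (this is not the #21454 implementation; the helper name and details are illustrative): replace each flat PAV block by a single knot at the block's mean x, then interpolate linearly between knots.

import numpy as np
from sklearn.isotonic import IsotonicRegression

def centered_isotonic_fit(x, y):
    x = np.asarray(x, dtype=float)
    order = np.argsort(x)
    x_sorted = x[order]
    # Standard isotonic (PAV) fit; piecewise constant in x.
    y_fit = IsotonicRegression().fit_transform(
        x_sorted, np.asarray(y, dtype=float)[order]
    )
    # One knot per constant block: (mean of x in the block, block value).
    block_ids = np.r_[0, np.cumsum(np.diff(y_fit) != 0)]
    blocks = np.unique(block_ids)
    knots_x = np.array([x_sorted[block_ids == b].mean() for b in blocks])
    knots_y = np.array([y_fit[block_ids == b][0] for b in blocks])
    # Linear interpolation removes the flat steps (mostly strictly monotonic).
    return np.interp(x, knots_x, knots_y)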

lorentzenchr commented:

@ogrisel As pointed out in #21454, centered isotonic regression seems to be an invalid option because it is not itself calibrated, which is a crucial property of (standard) isotonic regression and is particularly important when assessing the (auto-)calibration of a model, as is done in a reliability diagram.

lorentzenchr commented:

Posted by @ogrisel in #23767 (comment)

My main concern with the CORP reliability diagrams is that they are very square-looking on finite size test sets, see for instance:

[image: square-looking CORP reliability diagrams on finite test sets]

Those diagrams would probably qualitatively look very different in the asymptotic regime of large test sets.

To avoid this finite test sample artifact, we might also want to consider methods such as the one implemented in:

https://github.com/apple/ml-calibration

Smooth ECE: Principled Reliability Diagrams via Kernel Smoothing
Jarosław Błasiok, Preetum Nakkiran
https://arxiv.org/abs/2309.12236

[image: smoothed reliability diagram produced by kernel smoothing]

lorentzenchr commented:

@ogrisel

My main concern with the CORP reliability diagrams is that they are very square-looking on finite size test sets

As Prof. Simon Wood says:

Statistics is the honest interpretation of data

The reliability diagram is a statistical diagnostic/verification tool; it does not need to be pleasing to the eye, but it should be easy to interpret.
The zigzag is a mere consequence of the underlying (small-sample) uncertainty in the estimation of $E[Y \mid \text{prediction}]$.

BTW, I never understood why the PAV (=CORP) approach is good enough for calibration of classifiers (modifies actual model predictions), but not good enough in a reliability diagram (diagnostics) - within scikit-learn 🤨


ogrisel commented Dec 10, 2023

The reliability diagram is a statistical diagnostics/verification tool and not something that needs to be pleasant for the eye, but easy to interpret.

It's not a matter of being pleasing to the eye but of being misleading about the shape of the asymptotic curve. The asymptotic curve will be smooth most of the time, and the finite-sample CORP estimate can lead the reader into thinking otherwise, which I find misleading and a potential source of confusion for our users.

BTW, I never understood why the PAV (=CORP) approach is good enough for calibration of classifiers (modifies actual model predictions), but not good enough in a reliability diagram (diagnostics) - within scikit-learn

I actually have the exact same concern with isotonic regression as a post hoc calibrator. I would much rather use centered isotonic regression as the post hoc calibrator: it is mostly strictly monotonic (and as a result would not introduce an unexpected change in pure ranking metrics such as ROC AUC / Gini index) and converges to the same solution as the step-wise constant calibrator in the large-sample limit.
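
To illustrate the ranking concern with a self-contained toy example (not from this thread; data and split sizes are arbitrary): the step-wise constant isotonic fit introduces tied scores on held-out data, which typically lowers ROC AUC slightly, whereas a strictly increasing map would leave it unchanged.

import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores whose ranking carries real signal: y ~ Bernoulli(p).
def sample(n):
    p = rng.uniform(size=n)
    return p, rng.binomial(1, p)

p_cal, y_cal = sample(500)
p_test, y_test = sample(5000)

# Fit the step function on a calibration split, apply it to the test split.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)
p_test_iso = iso.predict(p_test)

print(roc_auc_score(y_test, p_test))      # AUC of the raw scores
print(roc_auc_score(y_test, p_test_iso))  # typically slightly lower: plateaus tie scores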

lorentzenchr commented:

The asymptotic curve will be smooth most of the time

That's not correct. For instance, tree-based models or GLMs with categorical features do not produce smooth predictions.


ogrisel commented Dec 11, 2023

That's not correct. For instance, tree-based models or GLMs with categorical features do not produce smooth predictions.

Indeed, that might be the case. Although, depending on the size of the training set, I suspect they are still much smoother than what the CORP reliability diagram suggests. To settle this debate we will need some experiments with a few large datasets where we can subsample both the training set and the test set used to estimate the reliability curve, and compare the small-test-sample CORP/smoothed reliability curves to the CORP curve on the full test set.
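
A rough sketch of such an experiment (the dataset, model, and subsample sizes are placeholders): count how many PAV blocks, i.e. CORP bins, the reliability curve has at each test size; a few wide blocks are what make the small-sample curve look square.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import train_test_split

# Placeholder data and model; any large dataset would do.
X, y = make_classification(n_samples=100_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
y_pred = HistGradientBoostingClassifier().fit(X_train, y_train).predict_proba(X_test)[:, 1]

rng = np.random.default_rng(0)
for n in (500, 5_000, len(y_test)):
    idx = rng.choice(len(y_test), size=n, replace=False)
    y_cal = IsotonicRegression().fit_transform(y_pred[idx], y_test[idx])
    # Number of distinct PAV levels = number of CORP bins at this test size.
    print(n, np.unique(y_cal).size)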

We could also give the reliability diagram a user-settable option to choose the strategy (fixed binning as we do now, CORP-induced bin edges, or some smooth estimate). Still, comparing the methods on a few canonical datasets would help us make informed recommendations in the docstring of that parameter.
