[MRG] Implement Centered Isotonic Regression #21454
Conversation
Thanks @mathijs02. This sounds like a nice way to deal with #16321. If confirmed, we might want to enable centering by default when isotonic regression is used as a classification calibrator. The only problem is that strict monotonicity is not guaranteed near the edges (and beyond the edges) of the training data input range.
That is indeed the case. In the implementation by the original authors, the focus is on the application in dose-response studies, where this choice matches the expected behaviour (I am not in that field, so not an expert). There are two reasons why there is no strict monotonicity at the edges:
The implementation from the paper was not aimed at classifier calibration. I think CIR might still have fewer problems (of the type described in #16321) with constant output domains than regular IR has, but I'm happy with any choice we make about using it as the calibration default or not.
Would you be interested in starting a new PR that builds on top of this one (branched off of it) to explore using a variant (with an additional, non-default constructor parameter) that ensures strict monotonicity, in the context of using centered isotonic regression for strictly monotonic calibration of probabilistic classifiers?
I would be happy to build further on this. I have limited familiarity with the scikit-learn codebase and with the calibration methods, so I'm afraid I need a bit more context for your suggestion. If I understand correctly, CIR is guaranteed to give outputs in the [0, 1] range when trained on binary labels (just like regular IR is). So if we use the implementation of CIR in this PR (#21454), I think no additional parameter is required.

By your proposal for using it for calibration, do you mean branching from this PR (#21454) to add CIR as one of the calibration options?

To double check: do you think that the current PR (i.e. the method from paper [1]) is sufficiently 'strictly monotonic' for calibration, or do you think the flat areas at the edges pose a problem that would require us to think of adaptations of the algorithm?
I think they are still problematic, but it's worth trying empirically to see how it goes in practice.
I am still interested in making Centered Isotonic Regression part of sklearn. Would it be ok to proceed with this PR, and consider the question of which isotonic regression method is best for calibration as a separate issue to resolve later (i.e. maintaining the status quo for calibration for now)?
Yes, sorry for the slow reply. I will need to put this PR review back at the top of my personal review backlog.
I use this functionality quite regularly, so I would still be interested in merging this PR, or some other implementation of Centered Isotonic Regression.
Can you explain what we are aiming to test with this assert? I can't completely follow, and I don't see how we expect it to be true (the assert isn't true in general).
I had a typo, sorry. I meant `assert np.sum(y) == np.sum(iso_model.transform(X))`.
I checked: this holds for non-strict/regular isotonic regression when no weights are used, but not for centered isotonic regression (or for regular isotonic regression with weights). Do you have any reference as to why we would expect this to be true?
This is basically the first-order condition of Ordinary Least Squares and goes by different names: score equation, balance property, calibration on average, and unconditional calibration. Just note that the property also holds within each constant interval of the fit.

I have to say, if this property is violated by the centered/strict version of isotonic regression, I'm inclined (but not yet decided) to vote -1 for this feature.
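The balance property under discussion is easy to check against the released `IsotonicRegression` (this snippet is an illustration, not code from the PR): since the PAV algorithm replaces each violating block with its mean, the sum of the fitted values over the training points reproduces the sum of the targets.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

iso_model = IsotonicRegression()
iso_model.fit(X, y)

# PAV pools the violating pair (1, 0) into a block with mean 0.5, so the
# fit is [0, 0.5, 0.5, 1, 1, 1]; each block keeps its mean, hence the
# sums match (the unweighted balance property).
assert np.isclose(np.sum(iso_model.transform(X)), np.sum(y))
```

This is exactly the assert questioned above; for CIR the block means are moved to new x-positions and re-interpolated, which is why the property is lost there.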
Thanks, that is good background info to have! Indeed the centered/strict version does not have this property. Can you explain why this violation makes you inclined to vote against the feature? There are other (non-linear) regressors and classifiers in scikit-learn that may not have this property either. The centered/strict version seems to be well-used within a specific field (dosimetry), and I have found it useful many times for calibration purposes.
Failing to have this property, CIR is systematically biased. For instance, if you use it to calibrate a classifier, you'll get a biased result: the observed frequency (the probability of y being the positive class) will systematically differ from the mean predicted probability.
I checked, and the mean train predictions of a Hist GBRT model (even with the default least squares loss) aren't exactly equal to the mean of the training targets. This does not happen with our traditional gradient boosting implementation. For CIR the difference can be even larger.
@ogrisel Have you set
See also #22892. |
I tried a bunch of regressors, and indeed they all seem to (nearly) have the balance property, linear regression without an intercept being an exception. I understand this means that CIR is inherently biased, which for the user is a trade-off against the advantage it brings (i.e. being strictly increasing). Maybe we can add warnings for scenarios in which the balance property is not met, like CIR, but maybe also linear regression without an intercept and other scenarios in which this can happen?
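The intercept exception mentioned above is easy to reproduce; this standalone check (not part of the PR) compares `LinearRegression` with and without an intercept on a small toy problem:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 2.0])

# With an intercept, the OLS normal equations include a constant column,
# which forces the residuals to sum to zero: the balance property holds.
with_icpt = LinearRegression().fit(X, y)
assert np.isclose(np.sum(with_icpt.predict(X)), np.sum(y))

# Without an intercept there is no constant column in the design matrix,
# so the residual sum is generally non-zero and the property is lost.
no_icpt = LinearRegression(fit_intercept=False).fit(X, y)
print(np.sum(no_icpt.predict(X)) - np.sum(y))  # non-zero in general
```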
If you think it would be ok to implement CIR with an appropriate warning about violating the balance property, I would still be interested in bringing this pull request to a successful completion.
My very personal opinion is that the CIR algorithm has a flaw in that it produces biased results. My guess is that it could be fixed by shifting the x-coordinates. All this and the low citation count considered, I'm currently -1 for inclusion.
I am pretty sure that in practice the train-set bias observed does not translate into a test-set bias: for most natural data distributions, the ground truth conditional expectation is unlikely to be exactly piecewise constant.
While I would like to have such a method available, I think such an analysis should have been done in research papers. I'm very much interested in the use cases.
I went through the CIR paper again, and there is a section in the Supplemental Information that investigates the bias of CIR vs IR. Indeed CIR has a larger bias than IR. In addition, I found a paper exploring bias in IR, which suggests that asymptotically, for large datasets, the IR bias goes to zero. I cannot tell from the CIR paper whether the bias of CIR also tends to zero for large samples, but the authors write the following, suggesting that it might not: "Conversely, this also means that CIR's […]"

My main use case is to fit (C)IR models in order to make non-parametric estimates of (univariate) cumulative distribution functions in general, not for a particular context. (C)IR fits this context because the only assumption I want to make is that the CDF is (strictly) monotonically increasing. In particular, I am interested in the inverse CDF. The constant regions in IR are a bit of a problem, because they cause non-unique mappings and/or high variance in the inverse CDF. For a graphical example of this, see the plot at the top of this page. CIR doesn't have this problem.

I think that the use case of dose-response curves is similar to the CDF use case. I am not very familiar with dose-response curves, but a quick look at the literature suggests that they are typically univariate, though there are some papers about multivariate dose-response curves.

It would be possible for me to explore the bias in CIR further, for example as a function of sample size, empirically. However, preparing an actual publication on this or working on a method that would satisfy the balance equation (e.g. through shifting the x-coordinates) is a bit beyond the scope of this PR for me. I understand that ideally we want to base scikit-learn implementations on published results. I think it would make sense to determine whether the CIR algorithm in its current form (i.e. lacking the balance property) would fit in scikit-learn or not.
If yes, I will invest some additional time in verifying some properties regarding bias, and possibly extend the documentation that discusses this. If not, I will probably close this PR or mark it as stalled. According to the contribution documentation, two core developers would need to agree on an addition. I don't see any details in the documentation on how such a 'voting' process works. Or is it fair to say that the -1 recommendation from @lorentzenchr means that, in its current form, CIR should not be part of scikit-learn?
@lorentzenchr @ogrisel Shall we make a (final) decision on whether CIR in its current form (i.e. without the balance property) should be part of scikit-learn or not (see my comment above)? Unfortunately adapting the algorithm is somewhat beyond the scope for me, so if we decide that it shouldn't be part of scikit-learn, I will close this PR.
Summary

Centered Isotonic Regression (CIR) modifies the PAV algorithm such that the resulting interpolation is strictly monotonically increasing. The algorithm/paper clearly does not meet the inclusion criteria and is not well investigated w.r.t. statistical consistency and bias. On the other hand, it would be a small modification of the existing PAV implementation.

Use cases
@scikit-learn/core-devs |
-1 from me, because I think the use cases point more in the direction of scipy or statsmodels (@mathijs02 I could help you place it there) and because the bias is not investigated.
Thanks for the suggestion and offer. Implementing CIR in statsmodels or scipy would mean implementing PAVA itself too, since those libraries currently do not have support for isotonic regression. That would expand the scope significantly. In addition, the scikit-learn API is very nice and makes it easy to combine models with other components of scikit-learn. I am therefore considering making a very small library that simply implements a child class of scikit-learn's `IsotonicRegression`.
@mathijs02 I could imagine that the PAV algorithm would be a very good fit for scipy.
Personally, I still think Centered Isotonic Regression is relevant for scikit-learn as an implicitly regularized variant of vanilla Isotonic Regression. I expect it to show a small but significantly better test performance when used as a post-hoc calibrator for classifiers when the number of samples is limited (making isotonic regression overfit and degrade the resolution component of the decomposition of the NLL or Brier score) while Platt scaling would be mis-specified (for instance if the reliability curve of the original model is not symmetric).

I would be +1 for reopening this PR if someone can come up with an empirical study that demonstrates those expected behaviors. In many settings, data is limited (e.g. health data) but calibration can still be very useful.
Would it make sense to open an issue for it, to make it more visible than a closed PR? (BTW, this could be a nice subject for a bachelor's or master's thesis, wouldn't it?)
I would be +1 for reopening this PR if someone can come up an empirical study that can demonstrate those expected behaviors.
Does it match our inclusion criteria in terms of visible popularity?
I fear the answer is no.
But I would not mind including it despite it not being popular. The striking argument for me was the lack of scientific analysis of its statistical properties and behavior in practice.
I fear the answer is no.
Then I think that we should not include it.
I think that all of us (contributors and users) have things that we would personally like to include that are not popular enough. It becomes hard to draw the line.
Proposed feature
This PR implements 'Centered Isotonic Regression' (CIR). This algorithm is almost identical to regular Isotonic Regression (IR), but makes a small change to how the interpolation function is created. CIR is described in ref [1]. I am not one of the authors of the paper.
Details
The interpolation function of regular IR is not strictly increasing/decreasing: there can be large parts of the domain where the output function is constant/flat. In some applications, it is known that the output function should be strictly increasing/decreasing [2]. CIR changes the output function by shrinking each constant section to a single weighted-average point. For such cases, the resulting CIR function gives a reduction in estimation error. CIR requires no additional parameters.
I have used CIR multiple times already for use cases in which it is known that the output function should be strictly increasing/decreasing. CIR is implemented in an R package, and the implementation in scikit-learn is a relatively minor change. I therefore think this can be a valuable addition to scikit-learn.
The only required change is in the method `_build_y`, where the IR-transformed input data points are converted to points for the scipy interpolation. An extra option is added here, which can trigger using the CIR method for building the points for the interpolation function.

Example
With this new code, CIR can easily be used in the following way:
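The original usage snippet is not reproduced here. As a standalone illustration of the same idea (not the PR's actual API), the CIR post-processing step can be sketched on top of the released `IsotonicRegression`: run regular IR, then shrink every flat segment to its (here unweighted) average point, which the paper then connects by interpolation.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def centered_isotonic_fit(X, y):
    """Return the (x, y) interpolation points of CIR: fit regular IR,
    then collapse each constant segment to a single average point."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(X)
    X, y = X[order], y[order]
    fitted = IsotonicRegression().fit_transform(X, y)
    xs, ys = [], []
    i = 0
    while i < len(X):
        j = i
        # Extend j to the end of the current constant segment.
        while j + 1 < len(X) and fitted[j + 1] == fitted[i]:
            j += 1
        xs.append(X[i : j + 1].mean())  # center of the flat segment
        ys.append(fitted[i])
        i = j + 1
    return np.array(xs), np.array(ys)

X = [1, 2, 3, 4, 5, 6]
y = [0, 1, 0, 1, 1, 1]
xs, ys = centered_isotonic_fit(X, y)
# IR fits [0, 0.5, 0.5, 1, 1, 1]; CIR keeps (1, 0), (2.5, 0.5), (5, 1).
# The ys are strictly increasing, so np.interp(new_x, xs, ys) gives a
# strictly monotonic interpolation inside [xs[0], xs[-1]].
```

Note that the paper uses weighted averages of the x-coordinates when sample weights are present; the unweighted mean above coincides with that for equal weights.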
An example plot comparing IR and CIR, with the points of the interpolation function marked (image omitted).
References
[1] https://www.tandfonline.com/doi/abs/10.1080/19466315.2017.1286256 or https://arxiv.org/abs/1701.05964
[2] https://en.wikipedia.org/wiki/Isotonic_regression#Centered_Isotonic_Regression