
[MRG] Implement Centered Isotonic Regression #21454


Closed
wants to merge 1,438 commits into from

Conversation

mathijs02

Proposed feature

This PR implements 'Centered Isotonic Regression' (CIR). This algorithm is almost identical to regular Isotonic Regression (IR), but makes a small change to how the interpolation function is created. CIR is described in ref [1]. I am not one of the authors of the paper.

Details

The interpolation function of a regular IR is not strictly increasing/decreasing: there can be large parts of the domain where the output function is constant/flat. In some applications, it is known that the output function is strictly increasing/decreasing.[2] CIR changes the output function by shrinking the constant sections of the output function to a single weighted average point. For such cases, the resulting CIR function gives a reduction in estimation error. CIR requires no additional parameters.

I have used CIR multiple times already for use cases in which it is known that the output function should be strictly increasing/decreasing. CIR is implemented in an R package, and the implementation in scikit-learn is a relatively minor change. I therefore think this can be a valuable addition to scikit-learn.

The only required change is in the method _build_y where the IR-transformed input data points are converted to points for the scipy interpolation. An extra option is added here, which can trigger using the CIR method for building the points for the interpolation function.
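For illustration, a minimal sketch of the centering step described above (not the actual code in this PR; the helper name and the omission of the edge points and the 0/1 special cases are my own simplifications). It collapses each constant block of the PAVA output into a single point at the block's weighted mean x:

    import numpy as np

    def centered_interpolation_points(x, y_fit, sample_weight):
        """Collapse each constant block of the isotonic (PAVA) fit `y_fit`
        into one point at the weighted mean x of the block (hypothetical helper)."""
        x, y_fit, w = map(np.asarray, (x, y_fit, sample_weight))
        cx, cy = [], []
        start = 0
        for i in range(1, len(y_fit) + 1):
            # A block ends where the fitted value changes (or at the last sample).
            if i == len(y_fit) or y_fit[i] != y_fit[start]:
                block = slice(start, i)
                cx.append(np.average(x[block], weights=w[block]))
                cy.append(y_fit[start])
                start = i
        return np.array(cx), np.array(cy)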

Example

With this new code, CIR can easily be used in the following way:

ir = IsotonicRegression(centered=True)
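A slightly fuller usage sketch, assuming the centered parameter proposed in this PR (it does not exist in released scikit-learn; the data here is synthetic):

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.RandomState(0)
    x = np.arange(50, dtype=float)
    y = np.clip(x / 50.0 + rng.normal(scale=0.1, size=50), 0.0, 1.0)

    ir = IsotonicRegression(out_of_bounds="clip").fit(x, y)                  # regular IR
    cir = IsotonicRegression(centered=True, out_of_bounds="clip").fit(x, y)  # proposed CIR

    # ir.predict(...) can be constant over wide ranges of x;
    # cir.predict(...) is strictly increasing between its interpolation points.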

An example plot comparing IR and CIR, with the points of the interpolation function marked:
[Figure: IR vs CIR]

References

[1] https://www.tandfonline.com/doi/abs/10.1080/19466315.2017.1286256 or https://arxiv.org/abs/1701.05964
[2] https://en.wikipedia.org/wiki/Isotonic_regression#Centered_Isotonic_Regression

@mathijs02 mathijs02 changed the title [WIP] Implement 'Centered Isotonic Regression' [MRG] Implement 'Centered Isotonic Regression' Oct 25, 2021
@ogrisel
Member

ogrisel commented Oct 25, 2021

Thanks @mathijs02.

This sounds like a nice way to deal with #16321. If confirmed we might want to enable centering by default when isotonic regression is used as a classification calibrator.

The only problem is that strict monotonicity is not guaranteed near the edges (and beyond the edges) of the training data input range.

@mathijs02
Author

The only problem is that strict monotonicity is not guaranteed near the edges of the training data input range.

That is indeed the case. In the implementation of the original authors, the focus is on the application in dose-response studies, where this choice matches the expected behaviour (I am not in that field, so not an expert).

There are two reasons why on the edge there is no strict monotonicity:

  • Outputs of 0 and 1 are treated as special cases, for which strict monotonicity is not a requirement. This is what gives the constant outputs at the edges in the image above. It would be easy to adapt the algorithm to remove this, but I stayed close to the implementation of the paper.
  • When the edge outputs are not 0 or 1, two points are used: (1) the edge of the domain and (2) the weighted average of the constant piece of the function. This creates a constant CIR function at the edges, with half the width of the constant piece in the regular IR function. This can be resolved by only including the point at the edge of the domain (point 1), but this is a slightly different algorithm that might not match the original ideas by the authors (so we might have to give it another name). I'm also afraid that this adaptation would introduce some bias in the function in those regions.

The implementation from the paper was not aimed at classifier calibration. I think CIR might still have fewer problems (of the type described in #16321) with constant output domains than regular IR has, but I'm happy with any choice we make about using it as the calibration default or not.

@ogrisel
Member

ogrisel commented Oct 29, 2021

Would you be interested in starting a new PR that builds on top of this one (branched off of it) to explore the possibility of using a variant (with an additional, non-default constructor param) to ensure strict monotonicity in the context of using centered isotonic regression for strictly monotonic calibration of probabilistic classifiers with predict_proba values guaranteed to lie in the [0, 1] range?

@mathijs02
Author

I would be happy to build further on this. I have limited familiarity with the scikit-learn codebase and with the calibration methods, so I'm afraid I need a bit more context for your suggestion.

If I understand correctly, CIR is guaranteed to give outputs in the [0, 1] range when trained on binary labels (just like regular IR is). So if we use the implementation of CIR in this PR (#21454), I think no additional parameter is required for the IsotonicRegression constructor to indicate when we want to use it for classifier calibration?

By your proposal for using it for calibration, do you mean branching from this PR (#21454) to add CIR as one of the options for the parameter method in CalibratedClassifierCV?

To double check: do you think that the current PR (i.e. the method from paper [1]) is sufficiently 'strictly monotonic' for calibration, or do you think the flat areas at the edges pose a problem that would require us to think of adaptations of the algorithm?

@ogrisel
Member

ogrisel commented Nov 4, 2021

To double check: do you think that the current PR (i.e. the method from paper [1]) is sufficiently 'strictly monotonic' for calibration, or do you think the flat areas at the edges pose a problem that would require us to think of adaptations of the algorithm?

I think they are still problematic but it's worth trying empirically to see how it goes in practice.

@mathijs02
Author

I am still interested in making Centered Isotonic Regression part of sklearn. Would it be ok to proceed with this PR, and consider the question of what Isotonic Regression method is best for calibration as a separate issue to resolve at a later moment (i.e. maintaining for now the status quo for calibration)?

@ogrisel
Member

ogrisel commented Apr 13, 2022

I am still interested in making Centered Isotonic Regression part of sklearn. Would it be ok to proceed with this PR, and consider the question of what Isotonic Regression method is best for calibration as a separate issue to resolve at a later moment (i.e. maintaining for now the status quo for calibration)?

Yes, sorry for the slow reply. Will need to put this PR review back to the top of my personal review backlog.

@mathijs02
Author

I use this functionality quite regularly, so I would still be interested in merging this PR, or some other implementation of Centered Isotonic Regression.

Eschivo and others added 19 commits June 24, 2022 17:48
@mathijs02
Author

Could you add a test that on the training data set, we have

assert np.sum(x) == np.sum(iso_model.transform(X))

on both strategies, centered/strict and un-centered/non-strict?

Can you explain what we are aiming to test with this assert? I can't completely follow, and I don't see how we expect it to be true (the assert isn't true in general).

@lorentzenchr
Member

I had a typo, sorry. I meant

assert np.sum(y) == np.sum(iso_model.transform(X))

@mathijs02
Author

I had a typo, sorry. I meant

assert np.sum(y) == np.sum(iso_model.transform(X))

I checked: this is correct in the non-strict/regular isotonic regression when no weights are used, but not in the centered isotonic regression (or if weights are used in the regular isotonic regression). Do you have any reference as to why we would expect this to be true?

@lorentzenchr
Member

lorentzenchr commented Aug 17, 2022

This is basically Ordinary Least Squares and has different names: score equation, balance property, calibration on average and unconditional calibration. Just note that for one interval in x of constant prediction, the predicted value is just the empirical mean of y, i.e. 1/n sum(y) over all y that belong to that interval of x.
BTW, with weights w it reads np.sum(w*y) == np.sum(w * iso_model.transform(X)).
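For context, a minimal sketch (not a test from this PR) of how this balance property can be checked with the existing IsotonicRegression, using np.isclose instead of exact equality to allow for floating-point error:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.RandomState(0)
    X = np.sort(rng.uniform(size=100))
    y = X + rng.normal(scale=0.2, size=100)
    w = rng.uniform(0.5, 2.0, size=100)

    # Unweighted: the sum of train predictions equals the sum of y.
    iso_model = IsotonicRegression().fit(X, y)
    assert np.isclose(np.sum(y), np.sum(iso_model.transform(X)))

    # Weighted: the weighted sums match instead.
    iso_model_w = IsotonicRegression().fit(X, y, sample_weight=w)
    assert np.isclose(np.sum(w * y), np.sum(w * iso_model_w.transform(X)))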

I have to say, if this property is violated by the centered/strict version of isotonic regression, I'm inclined (but not yet decided) to vote -1 for this feature.

@mathijs02
Author

mathijs02 commented Aug 18, 2022

This is basically Ordinary Least Squares and has different names: score equation, balance property, calibration on average and unconditional calibration. Just note that for one interval in x of constant prediction, the predicted value is just the empirical mean of y, i.e. 1/n sum(y) over all y that belong to that interval of x. BTW, with weights w it reads np.sum(w*y) == np.sum(w * iso_model.transform(X)).

I have to say, if this property is violated by the centered/strict version of isotonic regression, I'm inclined (but not yet decided) to vote -1 for this feature.

Thanks, that is good background info to have! Indeed the centered/strict version does not have this property.

Can you explain why this violation makes you think not to vote in favor of this feature? I think there are other (non-linear) regressors and classifiers in scikit-learn that do not satisfy this property either. The centered/strict version seems to be well-used within a specific field (dosimetry), and I have found it useful many times for calibration purposes.

@lorentzenchr
Member

Failing to have this property, CIR is systematically biased. For instance, if you use it to calibrate a classifier, you'll get a biased result: say you have an observed frequency (probability of y being the positive class) mean(y) = 0.8 but predict mean(predict_proba(X)) = 0.9 (for y in {0, 1}).

@ogrisel
Member

ogrisel commented Aug 22, 2022

I think there are other (non-linear) regressors and classifiers in scikit-learn that do not satisfy this property either.

I checked and the mean train predictions of a Hist GBRT model (even with the default least squares loss) aren't exactly np.mean(y_train) (and the difference can be much larger than machine precision rounding errors, e.g. I observed 1e-6 to 1e-12).

This does not happen for our traditional gradient boosting implementation (GradientBoostingRegressor).

For CIR the difference can be even larger than for HistGradientBoostingRegressor (e.g. more than 1e-4).
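For reference, a sketch of the kind of check described above (a reconstruction under assumed settings, not the original snippet; with this sample size early stopping is off by default, so no validation split is involved):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import HistGradientBoostingRegressor

    X, y = make_regression(n_samples=5000, n_features=10, noise=10.0, random_state=0)
    model = HistGradientBoostingRegressor(random_state=0).fit(X, y)

    # Difference between the mean train prediction and mean(y), relative to the scale of y.
    diff = abs(np.mean(model.predict(X)) - np.mean(y)) / np.std(y)
    print(diff)  # small, but typically well above machine-precision rounding error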

@lorentzenchr
Member

@ogrisel Have you set validation_fraction=None? Or could you share the code snippet with HistGradientBoostingRegressor? If you mean 1e-6 relative difference, that seems a bit large to me, but I'd rather open a new issue for it.

@lorentzenchr
Member

See also #22892.

@mathijs02
Author

I tried a bunch of regressors, and indeed they all seem to (approximately) satisfy the balance property, with linear regression without an intercept being an exception. I understand that this means that CIR is inherently biased, which for the user would be a trade-off against the advantage that it brings (i.e. a strictly increasing fit).

Maybe we can add warnings for scenarios in which the balance property is not met, such as CIR, but possibly also linear regression without an intercept and other scenarios in which it can happen?

@mathijs02
Author

If you think it would be ok to implement CIR with an appropriate warning about violating the balance property, I would still be interested in bringing this pull request to a successful completion.

@lorentzenchr
Member

lorentzenchr commented Oct 3, 2022

My very personal opinion is that the CIR algorithm has a flaw in that it produces biased results. My guess is that it could be fixed by shifting the x-coordinates.
It is not that severe, as this bias does not seem to be systematic in the sense that I can't predict when it overshoots or undershoots, and I think it goes to zero in the large sample limit (much stronger even: I guess it is a consistent estimator).

All this and the low citation count considered, I‘m currently -1 for inclusion.

@ogrisel
Member

ogrisel commented Oct 10, 2022

I am pretty sure that in practice the train-set bias observed does not convert into a test-set bias: for most natural data distributions, the ground-truth conditional E(Y|X) is likely to be smooth (instead of piecewise constant, as is the traditional IR estimate). As a result, I suspect the piecewise linear CIR estimate will generalize slightly better than the IR estimate, and that the difference between the two will be larger than the impact of the non-systematic train-set bias of CIR. But this might require empirical confirmation.

@lorentzenchr
Member

While I would like to have such a method available, I think that such an analysis should have been done in research papers.

I'm very much interested in the use cases:

  • For calibration of classifiers, this bias really matters and will likely result in worse models compared to calibrating with standard PAV/isotonic reg.
  • Dose-response curves:
    • Are they usually univariate or multivariate?
      Note: For univariate strictly monotonic interpolation, scipy might be the better place.
    • In which variable(s) are they monotone?
  • Further applications?

@mathijs02
Author

mathijs02 commented Nov 19, 2022

I went through the CIR paper again, and there is a section in the Supplemental Information that investigates the bias of CIR vs IR. Indeed CIR has a larger bias than IR. In addition, I found a paper exploring bias in IR, which suggests that asymptotically for large datasets IR bias goes to zero.

I cannot extract from the CIR paper whether the bias of CIR also tends to zero for large samples, but the authors write the following thing, suggesting that it might not: "Conversely, this also means that CIR's $\widetilde{\overline{F}}$ using the original PAVA weights should be biased downward near $F=0$, and upwards near $F=1$."

My main use case is to fit (C)IR models in order to make non-parametric estimates of (univariate) cumulative distribution functions, in general, not for a particular context. (C)IR fits in this context, because the only assumption I want to make is that the CDF is (strictly) monotonic increasing. In particular, I am interested in the inverse CDF. The constant regions in IR are a bit of a problem, because they cause non-unique mappings and/or high variance in the inverse CDF. For a graphical example of this, see the plot at the top of this page. CIR doesn't have this problem.
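To illustrate this use case, a sketch (with an invented data-generating setup) using the existing IsotonicRegression to estimate a CDF non-parametrically; the centered variant would be a drop-in replacement here:

    import numpy as np
    from scipy.stats import norm
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.RandomState(0)
    x = rng.uniform(-3, 3, size=300)
    # Binary indicators whose conditional mean is the standard normal CDF.
    y = (rng.uniform(size=300) < norm.cdf(x)).astype(float)

    cdf_model = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip").fit(x, y)
    cdf_hat = cdf_model.predict(np.linspace(-3, 3, 7))

    # The regular IR estimate has flat stretches, so its inverse (the quantile
    # function) is not unique there; a strictly increasing CIR fit would make
    # the inverse CDF well defined.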

I think that the use case of dose-response curves is similar to the CDF use case. I am not very familiar with dose-response curves, but a quick look in the literature suggests that they are typically univariate, though there are some papers about multivariate dose-response curves.

It would be possible for me to explore the bias in CIR further, for example as a function of sample size, empirically. However, preparing an actual publication on this or working on a method that would satisfy the balance equation (e.g. through shifting the x-coordinates) is a bit beyond the scope of this PR for me. I understand that ideally we want to base scikit-learn implementations on published results.

I think it would make sense to determine whether the CIR algorithm in its current form (i.e. lacking the balance property) would fit in scikit-learn or not. If yes, I will invest some additional time in verifying some properties regarding bias, and possibly extending documentation that discusses this. If not, I will probably close this PR or mark it as stalled. According to the contribution documentation, two core developers would need to agree on an addition. I don't see any details in the documentation on how such a 'voting' process would work. Or is it fair to say that the -1 recommendation from @lorentzenchr means that in its current form, CIR should not be part of scikit-learn?

@mathijs02
Author

@lorentzenchr @ogrisel Shall we make a (final) decision on whether CIR in its current form (i.e. without the balance property) should be part of scikit-learn or not (see my comment above)? Unfortunately adapting the algorithm is somewhat beyond the scope for me, so if we decide that it shouldn’t be part of scikit-learn, I will close this PR.

@lorentzenchr
Member

Summary

Centered Isotonic Regression (CIR) modifies the PAV algorithm such that the resulting interpolation is strictly monotonic increasing.

The algorithm/paper clearly does not meet the inclusion criteria, and its statistical consistency and bias are not well investigated. On the other hand, it would be a small modification of PAV/IsotonicRegression.

Use cases

  • Rank conserving calibration of classifiers.
  • Non-parametric, univariate fit of cumulative distribution functions (CDF) and dose-response curves, where in both cases one is ultimately interested in the inverse function.

@scikit-learn/core-devs

@lorentzenchr
Member

lorentzenchr commented Dec 19, 2022

-1 from me because I think that the use cases hint more in the direction of scipy or statsmodels (@mathijs02 I could help you to place it there) and because the bias is not investigated.

@mathijs02
Author

-1 from me because I think that the use cases hint more in the direction of scipy or statsmodels (@mathijs02 I could help you to place it there) and because the bias is not investigated.

Thanks for the suggestion and offer. Implementing CIR in statsmodels or scipy would mean implementing PAVA itself too, since those libraries currently do not have support for Isotonic Regression. That would expand the scope significantly. In addition, the scikit-learn API is very nice and makes it easy to combine models with other components of scikit-learn.

I am therefore considering making a very small library that simply implements a child class of IsotonicRegression and only makes a small change to .fit(). It is relatively straightforward to convert the IR PAVA results to a CIR function. This approach is not as good a practice as making it part of a larger stats/ML library, but it would create a model object that is fully compatible with scikit-learn. Would you want to help with a review of such code @lorentzenchr ?

@mathijs02 mathijs02 closed this Dec 27, 2022
@lorentzenchr
Member

@mathijs02 I could imagine that the PAV algo would be a very good fit for scipy.interpolate. I guess scikit-learn would have happily used their implementation if they had one.
I would be able to help with scipy but not with another, new library.

@lorentzenchr lorentzenchr changed the title [MRG] Implement 'Centered Isotonic Regression' [MRG] Implement Centered Isotonic Regression Mar 22, 2023
@ogrisel
Member

ogrisel commented Mar 22, 2023

Personally I still think Centered Isotonic Regression is relevant for scikit-learn as an implicitly regularized variant of vanilla Isotonic Regression. I expect it to show a small but significantly better test performance when used as a post-hoc calibrator for classifiers when the number of samples is limited (which makes isotonic regression overfit and degrade the resolution component of the decomposition of the NLL or Brier score), while Platt scaling would be mis-specified (for instance if the reliability curve of the original model is not symmetric).

I would be +1 for reopening this PR if someone can come up with an empirical study that demonstrates those expected behaviors.

In many settings, data is limited (e.g. health data) but calibration can still be very useful.

@lorentzenchr
Member

Would it make sense to open an issue for it, to make it more visible than a closed PR? (BTW, this could be a nice subject for a bachelor's or master's thesis, couldn't it?)

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 22, 2023 via email

@lorentzenchr
Member

Does it match our inclusion criteria in terms of visible popularity?

I fear the answer is no.

@lorentzenchr
Member

lorentzenchr commented Mar 22, 2023

But I would not mind including it despite it not being popular. The decisive argument for me was the lack of scientific analysis of its statistical properties and behavior in practice.

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 22, 2023 via email
