[MRG] Binned regression cv #14560
@jnothman this should be easier to review than your #4757. |
How about this? Kohavi (1995) finds that "stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation". This is especially true when datasets are imbalanced. Forman (2010) also recommends stratification and notes that it avoids having zero positives in one or more of the folds, which would lead to undefined recall and undefined AUC. EDIT: Could also include this, taken from this stackexchange thread: |
moved to 0.24 |
Thanks @skeller88 but aren't all those references talking about stratification with categorical (i.e. classification) rather than continuous targets? |
Is there a way to get this past its impasse? Can we task someone with writing a summary of when you might or might not use this? Otherwise I fear its inclusion adds noise and confusion. |
Fair point. I will look for a source that specifically mentions regression. |
It was surprisingly difficult to find a paper that explains the benefits of stratifying on a continuous variable. Chatterjee (2017) might be a good source judging by the abstract, but I don't have access to the paper. However, it's clear that stratified sampling on continuous variables is widely supported in R, so that could be the reference in the documentation for this feature. Boehmke & Greenwell (2020) recommend stratified sampling in certain situations:
The sources I mention above use a quantile-based binning approach. Interestingly, there's also an interleaved or "Venetian blinds" cross-validation approach, implemented by the pls package; Diago et al. (2018) explain the algorithm (see the sketch below). First, the dataset is sorted by the dependent variable. Then:
It's unclear to me if this interleaved approach has advantages over the binned approach. I think that discussion is outside the scope of this PR. |
Edit: updated code with KBinsDiscretizer. It would be interesting to collaborate on a paper that analyzes how this plays out for real-world datasets as well as synthetic ones. |
I suppose for a fair comparison we need to be evaluating "results are much more consistent" or "results are much better" on a held-out dataset that is produced similarly to the training dataset, rather than extracted from it with/without stratification. I think if we can just get a few comments in the PR on when it is likely to be useful, e.g. quoting Boehmke & Greenwell to say "for data sets that have a small sample size and where the response variable deviates strongly from normality", this could be approved... The question of whether this should be the default CV solution for continuous targets remains open. |
I'm not sure where we stand here, do we still want it included? |
I think that we should have this feature in scikit-learn. This looks like a useful tool. My only concern is about the API (do we want a new splitter or not). |
I'm still a bit undecided on this one. Why not just random shuffling? If we go with it, we should just extend StratifiedKFold. I don't like the distinction between things for classification and things for regression! @mayer79 Do you have thoughts on this one? |
I think the reasons for not just using random shuffling are analogous to the reasons for not just using random shuffling with imbalanced classification problems. Perhaps add to StratifiedKFold a splitter kwarg that has appropriate default settings for classification and regression and which can be overridden with an instance of a splitter. Default stratification for continuous targets could be quantile-based, and the splitter kwarg could accept integers for the number of bins; see the sketch below.
|
So I asked @ogrisel and @GaelVaroquaux for their opinions IRL, and I'll report some of the takes we had. The points were:
So I see two potentially opposed directions in which we could go:
So in the end I withdraw my suggestion that "we should implement this feature", because it might not be the right thing to do. But we could come to a consensus on what we think is best. NB: I did not touch on time-series features/targets, which add complexity to the shuffle vs. no-shuffle question. |
Seems like #26821 might solve this then? i.e. the user can decide how to bin / group the data for the regression task. |
Follow up on #4757 with a much simplified implementation.
Fixes #4757.
If anyone has an idea for a better/shorter name I'm all ears!