
[MRG] Binned regression cv #14560

Open
amueller wants to merge 39 commits into main from binned_regression_cv
Conversation

amueller
Member

@amueller amueller commented Aug 2, 2019

Follow up on #4757 with a much simplified implementation.
Fixes #4757.

If anyone has an idea for a better/shorter name I'm all ears!

@amueller amueller force-pushed the binned_regression_cv branch from f22856c to 8f1a1fc Compare August 2, 2019 21:39
@amueller amueller force-pushed the binned_regression_cv branch from 8f1a1fc to 254180e Compare August 2, 2019 21:50
@amueller amueller changed the title [WIP] Binned regression cv [MRG] Binned regression cv Aug 21, 2019
@amueller
Member Author

@jnothman this should be easier to review than your StratifiedKFold rewrite ;)

@skeller88
Contributor

skeller88 commented Apr 8, 2020

@jnothman would love to, should I document it here?

@Dicksonchin93 please comment here (e.g., Why do you want it? When is it useful? Are there any references?)
thanks

How about this?

Kohavi (1995) finds that "stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation". This is especially true when datasets are imbalanced. Forman (2010) also recommends stratification and notes that it avoids having zero positives in one or more of the folds, which would lead to undefined recall and undefined AUC.

EDIT: Could also include this, taken from this stackexchange thread:
Stratified cross-validation violates the principle that the test labels should never have been looked at before the statistics are calculated, but this is generally thought to be innocuous, as the only effect is to balance the folds. Arguably the main reason stratification is important is to address defects in the model training process, as the model could too easily be biased by over- or under-representation of classes.

@adrinjalali adrinjalali modified the milestones: 0.23, 0.24 Apr 21, 2020
@adrinjalali
Member

moved to 0.24

@jnothman
Member

Thanks @skeller88 but aren't all those references talking about stratification with categorical (i.e. classification) rather than continuous targets?

@jnothman
Member

Is there a way to get this past its impasse? Can we task someone with writing a summary of when you might/not use this? Otherwise I fear its inclusion adds noise and confusion.

@skeller88
Contributor

Thanks @skeller88 but aren't all those references talking about stratification with categorical (i.e. classification) rather than continuous targets?

Fair point. I will look for a source that specifically mentions regression.

@skeller88
Contributor

It was surprisingly difficult to find a paper that explains the benefits of stratifying a continuous variable. Chatterjee (2017) might be a good source judging by the abstract, but I don't have access to the paper.

However, it's clear that stratified sampling on continuous variables is widely supported in R. So that could be the reference in the documentation for this feature:


Boehmke & Greenwell (2020) recommend stratified sampling in certain situations:

Stratified sampling [can be applied] to regression problems for data sets that have a small sample size and where the response variable deviates strongly from normality (i.e., positively skewed like Sale_Price). With a continuous response variable, stratified sampling will segment [the response variable] into quantiles and randomly sample from each. Consequently, this will help ensure a balanced representation of the response distribution in both the training and test sets.

In R, stratified sampling of continuous variables is supported by the rsample and caret packages.
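The recipe Boehmke & Greenwell describe (segment the response into quantiles, then sample from each) can be sketched with existing scikit-learn pieces. This is my illustration, not code from this PR: quantile bins via KBinsDiscretizer, then StratifiedKFold on the bin labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
# A positively skewed continuous target (a stand-in for something like Sale_Price).
y = rng.lognormal(mean=3.0, sigma=1.0, size=200)
X = rng.randn(200, 5)

# Segment the response into quantile bins, then stratify on the bin labels.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y_binned):
    # Each fold contains a roughly equal number of samples from every bin,
    # so both splits get a balanced representation of the response distribution.
    pass
```
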


The sources I mention above use a quantile-based binning approach. Interestingly, there's also an interleaved or "venetian blinds" cross-validation approach implemented by the pls package. Diago et al. (2018) explain the algorithm. First, the dataset is sorted by the dependent variable. Then:

In a n-fold venetian blind cross validation, each fold i is built taking samples from the dataset of a n-multiple position until the end of the dataset (samples i, i + n, i + 2n, i + 3n, …). Once the folds are built, a traditional n-fold cross validation is carried out, in which n models are trained with n–1 folds, and tested with the remaining fold, rotating the latter until all of them have been used. The average performance of the n models is finally computed.
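A minimal sketch of the interleaved scheme as described (the function name is mine, and this is not the pls implementation): sort by the target, then assign every n-th sample to the same fold, so each fold spans the full range of y.

```python
import numpy as np

def venetian_blinds_folds(y, n_folds):
    """Interleaved ("venetian blinds") test folds for a continuous target.

    Sort the samples by the target, then take every n-th sample into the
    same fold (samples i, i + n, i + 2n, ... in sorted order).
    """
    order = np.argsort(y)
    return [order[i::n_folds] for i in range(n_folds)]

y = np.array([3.1, 0.2, 9.9, 5.5, 1.7, 7.3, 4.4, 8.8, 2.6, 6.0])
for test_idx in venetian_blinds_folds(y, n_folds=2):
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    # Train on train_idx, evaluate on test_idx, then average the scores.
    pass
```
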

It's unclear to me if this interleaved approach has advantages over the binned approach. I think that discussion is outside the scope of this PR.

@DouglasPatton

DouglasPatton commented Sep 19, 2020

Edit: updated code with KBinsDiscretizer
It looks like I'm late to the party, but.... I implemented an approach that wraps RepeatedStratifiedKFold and creates groupings by quantile or uniform bins using KBinsDiscretizer. I ran some real-world data with and without this kind of stratification, and the CV results are much more consistent. I use the quantile stratification both for the outer CV step, to learn about algorithm performance, and for the estimator's internal GridSearchCV for hyper-parameter tuning.
With a dataset of 900 samples and 40 features, I found that 20 groups/quantiles (a lot) with 5-fold CV repeated twice did poorly, while 5 groups with 10-fold CV repeated twice did much better.
The tighter results after stratified CV came mostly from the folds with poor estimator performance. The occurrence of well-fit folds was not noticeably changed by stratification (as expected), but the folds that fit poorly (i.e., R2 < 0) without stratification ceased to occur altogether.
Here's where I posted the code: (#4757 (comment))
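A rough sketch of the kind of wrapper described above (the class name and details are illustrative, not the code posted in #4757): bin the continuous target with KBinsDiscretizer and delegate the splitting to RepeatedStratifiedKFold on the bin labels.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

class RepeatedStratifiedKFoldReg:
    """Repeated stratified K-fold for continuous targets (illustrative).

    Bins y into n_groups (quantile or uniform bins) with KBinsDiscretizer
    and delegates to RepeatedStratifiedKFold on the bin labels.
    """

    def __init__(self, n_splits=10, n_repeats=2, n_groups=5,
                 strategy="quantile", random_state=None):
        self.n_groups = n_groups
        self.strategy = strategy
        self._cv = RepeatedStratifiedKFold(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)

    def split(self, X, y, groups=None):
        # Discretize the continuous target, then stratify on the bin labels.
        binner = KBinsDiscretizer(n_bins=self.n_groups, encode="ordinal",
                                  strategy=self.strategy)
        y_binned = binner.fit_transform(
            np.asarray(y, dtype=float).reshape(-1, 1)).ravel()
        return self._cv.split(X, y_binned, groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self._cv.get_n_splits(X, y, groups)
```

An instance of such a class can be passed as `cv=` to `cross_val_score` or `GridSearchCV`, which covers both the outer evaluation loop and the inner tuning loop mentioned above.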

It would be interesting to collaborate on a paper that does an analysis of how this plays out for real-world datasets as well as synthetic.

@jnothman
Member

I suppose for a fair comparison we need to be evaluating "results are much more consistent" or "results are much better" on a held-out dataset that is produced similarly to the training dataset, rather than extracted from it with/without stratification.

I think if we can just get a few comments in the PR on when it is likely to be useful, e.g. quoting Boehmke & Greenwell to say "for data sets that have a small sample size and where the response variable deviates strongly from normality", this could be approved...

The question of whether this should be the default CV solution for continuous targets remains open.

@cmarmo cmarmo removed this from the 0.24 milestone Oct 15, 2020
Base automatically changed from master to main January 22, 2021 10:51
@cmarmo cmarmo added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Decision Requires decision labels May 17, 2022
@adrinjalali
Member

I'm not sure where we stand here, do we still want it included?

@glemaitre
Member

I think that we should have this feature in scikit-learn. This looks like a useful tool. My only concern is about the API (do we want a new splitter or not).

@lorentzenchr
Member

I'm still a bit undecided on this one. Why not just random shuffling?
If we go with it, we should just extend StratifiedKFold. I don't like the distinction between things for classification and things for regression!

@mayer79 Do you have thoughts on this one?

@DouglasPatton

DouglasPatton commented Oct 7, 2024 via email

@glemaitre
Member

glemaitre commented Oct 7, 2024

So I asked the opinion of @ogrisel and @GaelVaroquaux IRL and I'll report some of the takes we had. The points were:

  • In expectation, uniform sampling should do the job, so we should not need stratification (in accordance with @lorentzenchr's point above).
  • The "in expectation" argument no longer holds for small sample sizes, but with a small sample you are likely to have a lot of other issues to deal with anyway.
  • Stratification for classification was most likely introduced to counter an engineering problem rather than a statistical one: making sure that all classes appear in the test set (or the training set) (EDIT by @ogrisel: to enforce shape consistency for the output of predict_proba or decision_function).
  • @ogrisel does not like the distinction between classification and regression when it comes to our stratification, and thus agrees with @lorentzenchr.
  • We could do a better job in the documentation of explaining that the cross-validation setup, and in particular the test set distribution, should emulate the setting one would encounter in production, so that a viable performance estimate is reported. This rule then dictates the type of grouping one should apply.

So I see two potentially opposed directions in which we could go:

  • Implement a sort of stratification for regression so that classification and regression problems are handled similarly. This is most probably statistical overkill that solves an engineering problem instead of answering a statistical need.
  • Improve our documentation and present in more detail aspects such as: shuffling vs. not shuffling, test set distribution vs. production distribution, etc. The end of the road might be that we do not want to stratify by default even in the classification case.

So in the end I withdraw my earlier position that "we should implement this feature", because it might not be the right thing to do. But we could come to a consensus on what we think is best.

NB: I did not touch on time-series features/targets, which add complexity to the shuffle vs. no-shuffle question.

@adrinjalali
Member

adrinjalali commented Oct 21, 2024

Seems like #26821 might solve this then? i.e. the user can decide how to bin / group the data for the regression task.

Successfully merging this pull request may close these issues.

extend StratifiedKFold to float for regression