
[MRG] Binned regression cv #14560

Open
amueller wants to merge 39 commits into main from binned_regression_cv
Conversation

amueller
Member

@amueller amueller commented Aug 2, 2019

Follow up on #4757 with a much simplified implementation.
Fixes #4757.

If anyone has an idea for a better/shorter name I'm all ears!

@amueller amueller force-pushed the binned_regression_cv branch from f22856c to 8f1a1fc Compare August 2, 2019 21:39
@amueller amueller force-pushed the binned_regression_cv branch from 8f1a1fc to 254180e Compare August 2, 2019 21:50
@amueller amueller changed the title [WIP] Binned regression cv [MRG] Binned regression cv Aug 21, 2019
@amueller
Member Author

@jnothman this should be easier to review than your StratifiedKFold rewrite ;)

@skeller88
Contributor

skeller88 commented Apr 8, 2020

@jnothman would love to, should I document it here?

@Dicksonchin93 please comment here (e.g., Why do you want it? When is it useful? Are there any references?)
thanks

How about this?

Kohavi (1995) finds that "stratification is generally a better scheme, both in terms of bias and variance, when compared to regular cross-validation". This is especially true when datasets are imbalanced. Forman (2010) also recommends stratification and notes that it avoids having zero positives in one or more of the folds, which would lead to undefined recall and undefined AUC.

EDIT: Could also include this, taken from this stackexchange thread:
Stratified cross-validation violates the principle that the test labels should never have been looked at before the statistics are calculated, but this is generally thought to be innocuous, as the only effect is to balance the folds. Arguably the main reason stratification is important is to address defects in the model training process, as the model could too easily be biased by over- or under-representation of classes.

@adrinjalali adrinjalali modified the milestones: 0.23, 0.24 Apr 21, 2020
@adrinjalali
Member

moved to 0.24

@jnothman
Member

Thanks @skeller88 but aren't all those references talking about stratification with categorical (i.e. classification) rather than continuous targets?

@jnothman
Member

Is there a way to get this past its impasse? Can we task someone with writing a summary of when you might/not use this? Otherwise I fear its inclusion adds noise and confusion.

@skeller88
Contributor

Thanks @skeller88 but aren't all those references talking about stratification with categorical (i.e. classification) rather than continuous targets?

Fair point. I will look for a source that specifically mentions regression.

@skeller88
Contributor

It was surprisingly difficult to find a paper that explains the benefits of stratifying a continuous variable. Chatterjee (2017) might be a good source judging by the abstract, but I don't have access to the paper.

However, it's clear that stratified sampling on continuous variables is widely supported in R. So that could be the reference in the documentation for this feature:


Boehmke & Greenwell (2020) recommend stratified sampling in certain situations:

Stratified sampling [can be applied] to regression problems for data sets that have a small sample size and where the response variable deviates strongly from normality (i.e., positively skewed like Sale_Price). With a continuous response variable, stratified sampling will segment [the response variable] into quantiles and randomly sample from each. Consequently, this will help ensure a balanced representation of the response distribution in both the training and test sets.

In R, stratified sampling of continuous variables is supported by the rsample and caret packages.
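The recipe Boehmke & Greenwell describe (segment the response into quantiles, then sample from each) can be sketched with existing scikit-learn pieces. This is my illustration, not code from this PR: quantile bins via KBinsDiscretizer, then StratifiedKFold on the bin labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(0)
# A positively skewed continuous target (a stand-in for something like Sale_Price).
y = rng.lognormal(mean=3.0, sigma=1.0, size=200)
X = rng.randn(200, 5)

# Segment the response into quantile bins, then stratify on the bin labels.
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
y_binned = binner.fit_transform(y.reshape(-1, 1)).ravel().astype(int)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y_binned):
    # Each fold contains a roughly equal number of samples from every bin,
    # so both splits get a balanced representation of the response distribution.
    pass
```
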


The sources I mention above use a quantile-based binning approach. Interestingly, there's also an interleaved or "venetian blinds" cross-validation approach implemented by the pls package. Diago et al. (2018) explain the algorithm. First, the dataset is sorted by the dependent variable. Then:

In a n-fold venetian blind cross validation, each fold i is built taking samples from the dataset of a n-multiple position until the end of the dataset (samples i, i + n, i + 2n, i + 3n, …). Once the folds are built, a traditional n-fold cross validation is carried out, in which n models are trained with n–1 folds, and tested with the remaining fold, rotating the latter until all of them have been used. The average performance of the n models is finally computed.
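A minimal sketch of the interleaved scheme as described (the function name is mine, and this is not the pls implementation): sort by the target, then assign every n-th sample to the same fold, so each fold spans the full range of y.

```python
import numpy as np

def venetian_blinds_folds(y, n_folds):
    """Interleaved ("venetian blinds") test folds for a continuous target.

    Sort the samples by the target, then take every n-th sample into the
    same fold (samples i, i + n, i + 2n, ... in sorted order).
    """
    order = np.argsort(y)
    return [order[i::n_folds] for i in range(n_folds)]

y = np.array([3.1, 0.2, 9.9, 5.5, 1.7, 7.3, 4.4, 8.8, 2.6, 6.0])
for test_idx in venetian_blinds_folds(y, n_folds=2):
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    # Train on train_idx, evaluate on test_idx, then average the scores.
    pass
```
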

It's unclear to me if this interleaved approach has advantages over the binned approach. I think that discussion is outside the scope of this PR.

@DouglasPatton

DouglasPatton commented Sep 19, 2020

Edit: updated code with KBinsDiscretizer
It looks like I'm late to the party, but.... I implemented an approach that wraps RepeatedStratifiedKFold and creates groupings by quantile or uniform bins using KBinsDiscretizer. I ran some real-world data with and without this kind of stratification, and the CV results are much more consistent. I use the quantile stratification both for the outer CV step, to learn about algorithm performance, and for the estimator's internal GridSearchCV for hyper-parameter tuning.
With a dataset of 900 samples and 40 features, I found that 20 groups/quantiles (a lot) with 5-fold CV repeated twice did poorly, while 5 groups with 10-fold CV repeated twice did much better.
The tighter results after stratified CV came mostly from the folds with poor estimator performance. The occurrence of well-fit folds was not noticeably changed by stratification (as expected), but the folds that fit poorly (i.e., R2 < 0) without stratification ceased to occur altogether.
Here's where I posted the code: (#4757 (comment))
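A rough sketch of the kind of wrapper described above (the class name and details are illustrative, not the code posted in #4757): bin the continuous target with KBinsDiscretizer and delegate the splitting to RepeatedStratifiedKFold on the bin labels.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import KBinsDiscretizer

class RepeatedStratifiedKFoldReg:
    """Repeated stratified K-fold for continuous targets (illustrative).

    Bins y into n_groups (quantile or uniform bins) with KBinsDiscretizer
    and delegates to RepeatedStratifiedKFold on the bin labels.
    """

    def __init__(self, n_splits=10, n_repeats=2, n_groups=5,
                 strategy="quantile", random_state=None):
        self.n_groups = n_groups
        self.strategy = strategy
        self._cv = RepeatedStratifiedKFold(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)

    def split(self, X, y, groups=None):
        # Discretize the continuous target, then stratify on the bin labels.
        binner = KBinsDiscretizer(n_bins=self.n_groups, encode="ordinal",
                                  strategy=self.strategy)
        y_binned = binner.fit_transform(
            np.asarray(y, dtype=float).reshape(-1, 1)).ravel()
        return self._cv.split(X, y_binned, groups)

    def get_n_splits(self, X=None, y=None, groups=None):
        return self._cv.get_n_splits(X, y, groups)
```

An instance of such a class can be passed as `cv=` to `cross_val_score` or `GridSearchCV`, which covers both the outer evaluation loop and the inner tuning loop mentioned above.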

It would be interesting to collaborate on a paper that does an analysis of how this plays out for real-world datasets as well as synthetic.

@jnothman
Member

I suppose for a fair comparison we need to be evaluating "results are much more consistent" or "results are much better" on a held-out dataset that is produced similarly to the training dataset, rather than extracted from it with/without stratification.

I think if we can just get a few comments in the PR on when it is likely to be useful, e.g. quoting Boehmke & Greenwell to say "for data sets that have a small sample size and where the response variable deviates strongly from normality", this could be approved...

The question of whether this should be the default CV solution for continuous targets remains open.

@cmarmo cmarmo removed this from the 0.24 milestone Oct 15, 2020
Base automatically changed from master to main January 22, 2021 10:51
@cmarmo cmarmo added Needs Decision - Include Feature Requires decision regarding including feature and removed Needs Decision Requires decision labels May 17, 2022
@adrinjalali
Member

I'm not sure where we stand here, do we still want it included?

@glemaitre
Member

I think that we should have this feature in scikit-learn. This looks like a useful tool. My only concern is about the API (do we want a new splitter or not).

@lorentzenchr
Member

I'm still a bit undecided on this one. Why not just random shuffling?
If we go with it, we should just extend StratifiedKFold. I don't like the distinction between things for classification and things for regression!

@mayer79 Do you have thoughts on this one?

@DouglasPatton

DouglasPatton commented Oct 7, 2024 via email

@glemaitre
Member

glemaitre commented Oct 7, 2024

So I asked the opinion of @ogrisel and @GaelVaroquaux IRL and I'll report some of the takes we had. The points were:

  • In expectation, uniform sampling should do the job, so we should not need stratification (in accordance with @lorentzenchr's point above).
  • The "in expectation" argument no longer holds for small sample sizes, but with a small sample you are likely to have a lot of other issues to deal with anyway.
  • Stratification for classification was most likely introduced to counter an engineering problem rather than a statistical one: making sure that all classes appear in the test set (or the training set) (EDIT by @ogrisel: to enforce shape consistency for the output of predict_proba or decision_function).
  • @ogrisel does not like the distinction between classification and regression when it comes to our stratification, and thus agrees with @lorentzenchr.
  • We could do a better job in the documentation of explaining that the cross-validation setup, and in particular the test set distribution, should emulate the setting one would encounter in production, so that a viable performance estimate is reported. This rule then dictates the type of grouping one should apply.

So I see two potentially opposed directions in which we could go:

  • Implement a sort of stratification for regression so that classification and regression problems are handled similarly. This is most probably statistical overkill that solves an engineering problem instead of answering a statistical need.
  • Improve our documentation and present in more detail aspects such as: shuffling vs. not shuffling, test set distribution vs. production distribution, etc. The end of the road might be that we do not want to stratify by default even in the classification case.

So in the end I withdraw my earlier position that "we should implement this feature", because it might not be the right thing to do. But we could come to a consensus on what we think is best.

NB: I did not touch on time-series features/targets, which add complexity to the shuffle vs. no-shuffle question.

@adrinjalali
Member

adrinjalali commented Oct 21, 2024

Seems like #26821 might solve this then? i.e. the user can decide how to bin / group the data for the regression task.

Successfully merging this pull request may close these issues.

extend StratifiedKFold to float for regression