MRG+1: Add resample to preprocessing. #1454

erg · 2012-12-09T01:15:42Z

There have been several requests lately for class rebalancing using under/oversampling. This utility function addresses most of the use cases I can think of for sampling with replacement from a dataset.

One thing it does not do is to sample without replacement before sampling with replacement because it changes the code substantially and there is no efficient version of random.sample as per #1438 (comment). I could add that feature eventually.

amueller · 2012-12-09T09:16:33Z

Thanks for the PR. This will be really useful!

amueller · 2012-12-09T09:27:07Z

I am not sure I understand the motivation for sampling with replacement instead of without. Using np.random.permutation or similar that should be easy to do. Have you seen the discussion in #1362?

The issue I have with the sampling with replacement is that it discards ~30% of the samples from the smallest class. I don't see the point in doing that.

amueller · 2012-12-09T09:42:13Z

Btw, you say this is useful for oversampling and undersampling but it does neither.

I just checked and np.random.permutation is slower than random.sample :(
I don't really see how this feature is performance relevant, though. This is just done once, whereas in the random projection case it is done once per feature dimension.

mblondel · 2012-12-09T11:11:30Z

I'm not sure if I would look for this in the preprocessing module. utils/__init__.py has a resample function already so maybe rename your function stratified_resample (or balanced_resample) and move it there?

GaelVaroquaux · 2012-12-09T13:04:49Z

I'm not sure if I would look for this in the preprocessing module. utils/
init.py has a resample function already so maybe rename your function
stratified_resample and move it there?

Maybe that resample function shouldn't be in utils? The rule is that
anything in utils is not for end users, so if we believe that 'resample'
is for end user, we need to expose it elsewhere.

amueller · 2012-12-09T13:11:55Z

+1. Maybe we should also move shuffle out then.

I didn't knew about resample. Is it in the references?

-1 on stratified_resample (I would use that name but in sklearn stratified has a different meaning. balanced_resample sounds ok.

the current resample uses randint btw.

To me it looks like we could maybe even but both in the same function using a keyword argument?

mblondel · 2012-12-09T14:37:34Z

We could have public utils and private utils (function beginning with _).

mblondel · 2012-12-09T14:39:09Z

@amueller Where would you put shuffle? For such a general function, it's hard to choose where to put it.

GaelVaroquaux · 2012-12-09T14:40:19Z

We could have public utils and private utils (function beginning with _).

The logic for not having anything useful for end users in 'utils' is that
it is a non descriptive name. If something is useful to a user, it should
be put in a module/sub-package with a descriptive name.

amueller · 2012-12-09T14:44:51Z

@mblondel I don't have an idea. I just feel that shuffle does something very similar to resample, so they should be close together. Also shuffle should be easily accessible to the user, I use it a lot.

mblondel · 2012-12-09T14:45:52Z

@GaelVaroquaux Utils is divided into submodules with meaningful names such as arrayfuncs, sparsefuncs, validation... For resample and shuffle, we could create a submodule sampling.

FWIW, I'm using the utils quite often in my research code. I think they can be useful to users too.

mblondel · 2012-12-09T14:46:35Z

We should probably stop putting anything in __init__.py.

amueller · 2012-12-09T14:50:23Z

sampling sounds pretty unspecific, too, I think...

GaelVaroquaux · 2012-12-09T14:50:55Z

Also shuffle should be easily accessible to the user, I use it a lot.

I never use it. I just do:

indices = np.random.permutation(n_samples)
X = X[indices]
y = y[indices]

In my experience, it is not a good idea to push people to use
library-specific function to save 3 lines of code, because it does not
teach them how to use basic Python. Also, it makes the code harder to
follow to anyhow who does not know scikit-learn. One should favor Python
and numpy knowledge over knowledge of the scikit.

GaelVaroquaux · 2012-12-09T14:51:11Z

sampling sounds pretty unspecific, too, I think...

More than 'utils'.

GaelVaroquaux · 2012-12-09T14:51:41Z

We should probably stop putting anything in init.py.

There is an 'import *' in there that makes me sick each time I see it.

GaelVaroquaux · 2012-12-09T14:55:27Z

@GaelVaroquaux Utils is divided into submodules with meaningful names such as
arrayfuncs, sparsefuncs, validation...

Flat is better than nested.

For resample and shuffle, we could create a submodule sampling.

Sounds probably useful.

FWIW, I'm using the utils quite often in my research code. I think they can be
useful to users too.

If you are using them to build your own estimators, than I think that it
is fair enough. If you are using them to do end user machine
learning/data mining pipelines, then I think that we have a design
problem.

Think about it this way: you are coming new to a package. You have a
vague idea of what is in there and want to explore quickly to get
something done. 'utils' certainly does not help you to discover
functionality.

amueller · 2012-12-09T14:56:48Z

On 12/09/2012 03:51 PM, Gael Varoquaux wrote:

In my experience, it is not a good idea to push people to use
library-specific function to save 3 lines of code, because it does not
teach them how to use basic Python. Also, it makes the code harder to
follow to anyhow who does not know scikit-learn. One should favor Python
and numpy knowledge over knowledge of the scikit.

—

True. I see more and more people asking for stuff in sklearn that is
just 3 lines of numpy.
It is nice to have convenience functions but people shouldn't have to
rely on them.

GaelVaroquaux · 2012-12-09T14:59:45Z

True. I see more and more people asking for stuff in sklearn that is
just 3 lines of numpy.

Yes, and it is easy to yeild to these requests, but this is a dangerous
line, and a battle that we will loose by exhaustion. People should learn
numpy. If we can teach it to them via tutorials and examples, this is a
better option.

It is nice to have convenience functions but people shouldn't have to
rely on them.

Our convenience functions do much more: they do input validation, and
sensible error messages. But that's for inner usage.

amueller · 2012-12-09T15:19:29Z

I will keep your advice in mind when answering on stackoverflow and the mailing list.

erg · 2012-12-09T17:04:50Z

I am not sure I understand the motivation for sampling with replacement instead of without. Using np.random.permutation or similar that should be easy to do. Have you seen the discussion in #1362?

Bagging, by the literature, uses sampling without replacement. Edit: sampling with replacement

The issue I have with the sampling with replacement is that it discards ~30% of the samples from the smallest class. I don't see the point in doing that.

You are right, so how about a mode that samples without replacement the full set of samples until it's exhausted, then repeats in a loop? So if you have 20 different samples in one class and you want 100, it would just repeat those indices five times and return the shuffle. This way is more deterministic, still gives some randomness if the requested n_samples is not a multiple of the number available, and uses all the samples before repeating.

Btw, you say this is useful for oversampling and undersampling but it does neither.

According to my understanding, it does either depending on the situation. If you have a set of 100 samples with a 30/70 split, and you want to balance them, then it will oversample the 30 to 50 and undersample the 70 to 50. You can also use n or scale to triple or half the size of the dataset.

sampling sounds pretty unspecific, too, I think...

I am -1 on making this utility function its own module, especially one called sampling because of MCMC, Gibbs sampling, rejection sampling, etc.

We should probably stop putting anything in init.py

+2 for this, I've never even thought to look in init.py for any actual code.

I just looked at the current resample and it looks like it can only select some % of data equal or less than the current amount. If the scale factor is over 1.0, you get more samples in the new resample, so you could scale your dataset arbitrarily to test the running time on somewhat real data.

shuffle seems to duplicate np.random.permutation, as pointed out above. Agree that this is bad.

I will add a replace parameter to this pull request and then see where we're at. Do I need to do anything specific to support sparse matrices?

amueller · 2012-12-10T12:35:27Z

Am 09.12.2012 18:04, schrieb erg:

I am not sure I understand the motivation for sampling with
replacement instead of without. Using np.random.permutation or
similar that should be easy to do. Have you seen the discussion in
#1362 <https://github.com/scikit-learn/scikit-learn/issues/1362>?
Bagging, by the literature, uses sampling without replacement.

What has this pull-request to do with bagging?
The issue I have with the sampling with replacement is that it
discards ~30% of the samples from the smallest class. I don't see
the point in doing that.
You are right, so how about a mode that samples without replacement
the full set of samples until it's exhausted, then repeats in a loop?
So if you have 20 different samples in one class and you want 100, it
would just repeat those indices five times and return the shuffle.
This way is more deterministic, still gives some randomness if the
requested n_samples is not a multiple of the number available, and
uses all the samples before repeating.

Not sure. This seems pretty magic.

|shuffle| seems to duplicate |np.random.permutation|, as pointed out
above. Agree that this is bad.

It is not, it does a consistent permutation of different arrays but it
is synonymous with what Gael posted.

I will add a |replace| parameter to this pull request and then see
where we're at.

Sounds good.

Do I need to do anything specific to support sparse matrices?

As you return indices and not masks, it should just work.

amueller · 2012-12-10T12:36:58Z

sklearn/preprocessing.py

+
+def _histogram_indices(seq):
+    """Collects a list of indices for each element occuring in an array-like
+    and returns a dict of lists.


I think this could be implemented more easily by using "sort" or "argsort".

Weird github quirk, your comment is still here now but on the wrong line...

Any idea about which line was this comment for originally? Has it been addressed?

It was for a function that doesn't exist anymore.

amueller · 2012-12-10T12:41:16Z

So should this resample replace the current version? I'm not sure if / how the current version is used.

mblondel · 2012-12-10T12:53:59Z

I also like shuffle. It's a nifty little util and handles sparse matrices for you. Also

X, y = shuffle(X, y)

speaks for itself, so it fits @GaelVaroquaux 's readability requirement.

erg · 2012-12-10T15:01:10Z

Bagging, by the literature, uses sampling without replacement.

What has this pull-request to do with bagging?

Damn, I meant it does sampling with replacement. You were asking why we want that feature and that was the reason, but I got it backwards..........oops.

You are right, so how about a mode that samples without replacement
the full set of samples until it's exhausted, then repeats in a loop?
So if you have 20 different samples in one class and you want 100, it
would just repeat those indices five times and return the shuffle.
This way is more deterministic, still gives some randomness if the
requested n_samples is not a multiple of the number available, and
uses all the samples before repeating.

Not sure. This seems pretty magic.

The current resample checks that the number of samples you want is <= to the number you pass it. However, I think it would be useful to be able to create immense datasets that look like a dataset, like scaling the iris dataset to 100,000 times its size to check the running time of an algorithm and still hope to get good results. So you can sample with replacement and it just works, or sample without replacement and you exhaust the samples quickly.

When the samples run out because replace is False, how do you get more to scale up the data? You could 1) continue sampling with replacement for the rest of the samples, or you could 2) restart the sampling without replacement again and loop the procedure until your dataset is big enough. Those are the two options that make sense to me, but option 1) is basically just setting replace=True for a large enough dataset.

X, y = shuffle(X, y)

This could be useful, I can leave it in and base it off of the new code.

As for where this function should live, it still seems like a preprocessing thing to me. Rebalance labels, standardize, train...

GaelVaroquaux · 2012-12-10T17:12:12Z

X, y = shuffle(X, y)

speaks for itself, so it fits @GaelVaroquaux 's readability requirement.

Yes and no. We I read such code from a library I don't know, I always end
up opening the code of 'shuffle', because I want to know exactly what it
does, and why the author didn't use a two-liner.

erg · 2012-12-11T00:33:42Z

I just rebased and added a replace= feature, renamed n to n_samples, added random_state, and added over/undersample modes that calculate the output size based on the max/min class counts. I really can't think of any more features it doesn't support right now, but be my guest to find some.

I think preprocessing.resample is now a superset of the features of utils.resample.

I did a forced push, so you'll want to reset to my new patches instead of using the old ones.

ogrisel · 2012-12-11T09:48:34Z

sklearn/preprocessing.py

+    return counts
+
+
+def resample(y, method=None, proba=None, n_samples=None, scale=None, replace=False,


Maybe we could rename y to class_labels to be more explicit?

ogrisel · 2012-12-11T09:59:08Z

+1 for keeping the current implementation in sklearn.preprocessing and deprecating sklearn.utils.resample and migrating existing code to the new implementation (beware of import loops though...).

We should also add a new sklearn.preprocessing.ClassRebalancer class implementing the TransformerMixin that woul use the resample function internally to make it possible to use in a Pipeline in collaboration with cross validation and grid search utilities. See #1362.

We could also move the shuffle utility function into sklearn.preprocessing as I often find it useful outside of sklearn, esp. when dealing with sparse matrices in interactive sessions (with deprecation warnings as well for backward compat), but this can be addressed in a separate PR).

Also we could turn sklearn.preprocessing into a package with sub modules as it's growing big. We would also still be importing the public API of the sub modules in the sklearn.preprocessing's __init__ to keep it friendly / flat for the users.

iteritems() -> items() scikit-learn#1454 BUG: Fix dictionary code for python3. DOC: Add narrative docs for resample_labels.

agramfort · 2015-04-19T23:21:09Z

@erg thanks for the quick reaction time. See why travis is unhappy then +1 for merge.

jnothman · 2015-04-26T07:09:33Z

Thinking more about sampling as weights: methods that accept weights do not need to distinguish between undersampling, oversampling, etc. For those that do not accept sample weights we can produce a sample of size N stochastically balancing classes (sampling without replacement) using:

counts = random_state.multinomial(N, sklearn.utils.class_weight.compute_sample_weight('auto', y))
indices = np.repeat(np.arange(y.shape[0], counts)

Here the equal sampling of classes is not exact but it has the advantage of accounting for multioutput/multilabel y.

jnothman · 2016-09-15T04:54:44Z

Well, this is long dormant. It certainly had a lot of effort put into it.

How much do we still want/need this?

I think a Resampler class can be made as a wrapper around the new CV splitters, as I've drafted at #5972 (comment)

jnothman · 2016-09-15T04:58:55Z

I've commented a couple of times about the relationship between resampling and class_weight. it may still be worthwhile providing utilities for resampling, but emphasising that similar can be accomplished with class_weight.

I think progress of this PR would benefit from a fresh PR...

jnothman · 2016-09-26T23:42:23Z

sklearn/preprocessing/resample.py

+            over/undersampling for `len(y)` samples by default.
+        "oversample" grows all classes to the count of the largest class.
+        "undersample" shrinks all classes to the count of the smallest class.
+        dict with pairs of class, probability with values summing to 1


"(class, probability)"

jnothman · 2016-09-26T23:42:25Z

sklearn/preprocessing/resample.py

+    counts = np.repeat(sample_size, n_classes)
+    if sample_size_rem > 0:
+        counts[:sample_size_rem] += 1
+        # Shuffle so the class inbalance varies between runs


"imbalance"

jnothman · 2016-09-26T23:42:27Z

sklearn/preprocessing/resample.py

+def _fair_array_counts(n_samples, n_classes, random_state=None):
+    """Tries to fairly partition n_samples between n_classes.
+    If this cannot be done fairly, +1 is added `remainder` times
+    to the counts for random arrays until a total of `n_samples` is


"random arrays" -> "random classes"

jnothman · 2016-09-26T23:43:35Z

sklearn/preprocessing/resample.py

+        Indices sampled from the dataset respecting a class distribution
+        controlled by this function's parameters.
+
+    Examples


This seems to me a lot of examples, with too much changing between them (i.e. X and y could be constant). Could this be simplified?

jnothman · 2016-09-26T23:43:52Z

I'd like to see this merged. It's been sitting around a while and I think general consensus is that it's more useful than not (it's certainly sought), and fits in scikit-learn's scope. I'd like to see a few changes to it:

More emphasis in documentation on the relationship between resampling and class_weight. I think oen of the benefits of resampling.
A test for equivalence with class_weight with a dict / "balanced"
I'm not so happy with the interface surrounding "undersample" and "oversample". These are really "balanced" with named scaling values. I think these should either be values for scaling, or should be invalid in conjunction with scaling != 1.
Mix of float and int values for scaling is a bit confusing. Elsewhere we allow floats to substitute for int values, but usually that's in the context where a float >1.0 or even >=1.0 is inappropriate/invalid. Here I'm a little worried that allowing 2.0 to differ semantically from 2 is too big a trap for users. A user forgetting to int their ceil will potentially generate an enormous dataset. Can someone point me to other places in the codebase we've done this with a float >1.0? Unfortunately, the solutions I can think of (string '2.0x' or '200%'; additional parameter) are not very consistent with other scikit-learn APIs.
Perhaps a CV meta-splitter that wraps this to perform resampled CV; however, this may be controversial and I'd happily see it as a next step. It merely mitigates the fact that this helper does not fit into the API neatly and best practiced for resampling-style transformations is not established. (Potential controversy may extend from the fact that other transformations altering training set only, such as pertubation or dropout, cannot be achieved with a CV meta-splitter.)
A final review for code simplification may be worthwhile.

Who's with me?

@erg have we sapped you of all enthusiasm, or are you hanging in there? Thanks for your great work and great patience!

scikit-learn#1454

erg · 2018-06-09T04:42:04Z

I ported this patch to the latest scikit-learn and addressed some of the latest critiques.

@jnothman I tried to move oversample and undersample to the scale parameter, but then you can't oversample and scale the dataset in the same call.

Also, method=balance and method=dict are mutually exclusive, I'm not sure how to add the test case you asked for.

It would be nice to merge this PR soon so that we can use it as a starting place and incrementally improve it. It would have already had six years of improvements if merged in 2012. :)

jnothman · 2018-06-09T12:02:55Z

Hahahaha :) Thanks Doug, especially for your patience and persistence. I'd like to see this merged in once we have an interface like #3855 to integrate it into an estimator. See #3855 (comment) in particular as my plan of action... but I've not yet had time to act myself. :\

scikit-learn#1454

adrinjalali · 2019-08-07T12:14:39Z

Superseded by #13269

amueller reviewed Dec 10, 2012
View reviewed changes

ogrisel reviewed Dec 11, 2012
View reviewed changes

erg added a commit to erg/scikit-learn that referenced this pull request Apr 19, 2015

ENH Add resample_labels, docs, and tests to the preprocessing module.

8a1ca24

iteritems() -> items() scikit-learn#1454 BUG: Fix dictionary code for python3. DOC: Add narrative docs for resample_labels.

agramfort changed the title ~~PR: Add resample to preprocessing.~~ MRG+1: Add resample to preprocessing. Apr 19, 2015

amueller added the Waiting for Reviewer label Dec 10, 2015

agramfort mentioned this pull request Mar 21, 2016

Balanced/Weighted Sampling #6568

Open

jnothman mentioned this pull request Sep 15, 2016

Stratifying Across Classes During Training in ShuffleSplit #5965 #5972

Closed

jnothman requested changes Sep 26, 2016

View reviewed changes

raghavrv added Stalled Need Contributor labels Nov 27, 2016

jnothman mentioned this pull request Dec 5, 2016

Determining parameter behavior based on float / int type #7973

Open

lesteve added help wanted and removed Need Contributor labels Oct 18, 2017

erg force-pushed the resample branch from 23653b2 to acd6587 Compare June 8, 2018 14:26

erg added a commit to erg/scikit-learn that referenced this pull request Jun 8, 2018

ENH Add resample_labels, docs, and tests to the preprocessing module.

acd6587

scikit-learn#1454

erg force-pushed the resample branch from acd6587 to 263375c Compare June 9, 2018 04:30

erg added a commit to erg/scikit-learn that referenced this pull request Jun 9, 2018

ENH Add resample_labels, docs, and tests to the preprocessing module.

dbf09c8

scikit-learn#1454

erg added 4 commits June 11, 2018 11:01

ENH Add resample_labels, docs, and tests to the preprocessing module.

f96c35d

scikit-learn#1454

DOC: Fix some typos, clarify docs, fix flake8 warnings.

a77d725

DOC: Trying to write shorter and clearer docs.

b817861

DOC: Fix the doctests according to travisci output.

295a1de

erg force-pushed the resample branch from 263375c to 295a1de Compare June 11, 2018 16:02

orausch mentioned this pull request Feb 26, 2019

[WIP] Resamplers #13269

Open

8 tasks

adrinjalali closed this Aug 7, 2019

		return counts


		def resample(y, method=None, proba=None, n_samples=None, scale=None, replace=False,

MRG+1: Add resample to preprocessing. #1454

MRG+1: Add resample to preprocessing. #1454

Conversation

erg commented Dec 9, 2012

amueller commented Dec 9, 2012

amueller commented Dec 9, 2012

amueller commented Dec 9, 2012

mblondel commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

amueller commented Dec 9, 2012

mblondel commented Dec 9, 2012

mblondel commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

amueller commented Dec 9, 2012

mblondel commented Dec 9, 2012

mblondel commented Dec 9, 2012

amueller commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

amueller commented Dec 9, 2012

GaelVaroquaux commented Dec 9, 2012

amueller commented Dec 9, 2012

erg commented Dec 9, 2012

amueller commented Dec 10, 2012

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

amueller commented Dec 10, 2012

mblondel commented Dec 10, 2012

erg commented Dec 10, 2012

GaelVaroquaux commented Dec 10, 2012

erg commented Dec 11, 2012

Choose a reason for hiding this comment

ogrisel commented Dec 11, 2012

agramfort commented Apr 19, 2015

jnothman commented Apr 26, 2015

jnothman commented Sep 15, 2016

jnothman commented Sep 15, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jnothman commented Sep 26, 2016

erg commented Jun 9, 2018

jnothman commented Jun 9, 2018 via email

adrinjalali commented Aug 7, 2019