
MRG+1: Add resample to preprocessing. #1454


Closed
wants to merge 4 commits

Conversation

erg
Contributor

@erg erg commented Dec 9, 2012

There have been several requests lately for class rebalancing using under/oversampling. This utility function addresses most of the use cases I can think of for sampling with replacement from a dataset.

One thing it does not do is to sample without replacement before sampling with replacement because it changes the code substantially and there is no efficient version of random.sample as per #1438 (comment). I could add that feature eventually.

@amueller
Member

amueller commented Dec 9, 2012

Thanks for the PR. This will be really useful!

@amueller
Member

amueller commented Dec 9, 2012

I am not sure I understand the motivation for sampling with replacement instead of without. Using np.random.permutation or similar, that should be easy to do. Have you seen the discussion in #1362?

The issue I have with the sampling with replacement is that it discards ~30% of the samples from the smallest class. I don't see the point in doing that.

@amueller
Member

amueller commented Dec 9, 2012

Btw, you say this is useful for oversampling and undersampling but it does neither.

I just checked and np.random.permutation is slower than random.sample :(
I don't really see how this feature is performance relevant, though. This is just done once, whereas in the random projection case it is done once per feature dimension.

@mblondel
Member

mblondel commented Dec 9, 2012

I'm not sure if I would look for this in the preprocessing module. utils/__init__.py has a resample function already so maybe rename your function stratified_resample (or balanced_resample) and move it there?

@GaelVaroquaux
Member

I'm not sure if I would look for this in the preprocessing module.
utils/__init__.py has a resample function already so maybe rename your
function stratified_resample and move it there?

Maybe that resample function shouldn't be in utils? The rule is that
anything in utils is not for end users, so if we believe that 'resample'
is for end user, we need to expose it elsewhere.

@amueller
Member

amueller commented Dec 9, 2012

+1. Maybe we should also move shuffle out then.

I didn't know about resample. Is it in the references?

-1 on stratified_resample (I would use that name, but in sklearn "stratified" has a different meaning). balanced_resample sounds ok.

The current resample uses randint, btw.

To me it looks like we could maybe even put both in the same function using a keyword argument?

@mblondel
Member

mblondel commented Dec 9, 2012

We could have public utils and private utils (function beginning with _).

@mblondel
Member

mblondel commented Dec 9, 2012

@amueller Where would you put shuffle? For such a general function, it's hard to choose where to put it.

@GaelVaroquaux
Member

We could have public utils and private utils (function beginning with _).

The logic for not having anything useful for end users in 'utils' is that
it is a non-descriptive name. If something is useful to a user, it should
be put in a module/sub-package with a descriptive name.

@amueller
Member

amueller commented Dec 9, 2012

@mblondel I don't have an idea. I just feel that shuffle does something very similar to resample, so they should be close together. Also, shuffle should be easily accessible to the user; I use it a lot.

@mblondel
Member

mblondel commented Dec 9, 2012

@GaelVaroquaux Utils is divided into submodules with meaningful names such as arrayfuncs, sparsefuncs, validation... For resample and shuffle, we could create a submodule sampling.

FWIW, I'm using the utils quite often in my research code. I think they can be useful to users too.

@mblondel
Member

mblondel commented Dec 9, 2012

We should probably stop putting anything in __init__.py.

@amueller
Member

amueller commented Dec 9, 2012

sampling sounds pretty unspecific, too, I think...

@GaelVaroquaux
Member

Also shuffle should be easily accessible to the user, I use it a lot.

I never use it. I just do:

indices = np.random.permutation(n_samples)
X = X[indices]
y = y[indices]

In my experience, it is not a good idea to push people to use
library-specific functions to save 3 lines of code, because it does not
teach them how to use basic Python. Also, it makes the code harder to
follow for anyone who does not know scikit-learn. One should favor Python
and numpy knowledge over knowledge of the scikit.

@GaelVaroquaux
Member

sampling sounds pretty unspecific, too, I think...

More than 'utils'.

@GaelVaroquaux
Member

We should probably stop putting anything in __init__.py.

There is an 'import *' in there that makes me sick each time I see it.

@GaelVaroquaux
Member

@GaelVaroquaux Utils is divided into submodules with meaningful names such as
arrayfuncs, sparsefuncs, validation...

Flat is better than nested.

For resample and shuffle, we could create a submodule sampling.

Sounds probably useful.

FWIW, I'm using the utils quite often in my research code. I think they can be
useful to users too.

If you are using them to build your own estimators, then I think that it
is fair enough. If you are using them to do end user machine
learning/data mining pipelines, then I think that we have a design
problem.

Think about it this way: you are coming new to a package. You have a
vague idea of what is in there and want to explore quickly to get
something done. 'utils' certainly does not help you to discover
functionality.

@amueller
Member

amueller commented Dec 9, 2012

On 12/09/2012 03:51 PM, Gael Varoquaux wrote:

In my experience, it is not a good idea to push people to use
library-specific functions to save 3 lines of code, because it does not
teach them how to use basic Python. Also, it makes the code harder to
follow for anyone who does not know scikit-learn. One should favor Python
and numpy knowledge over knowledge of the scikit.

True. I see more and more people asking for stuff in sklearn that is
just 3 lines of numpy.
It is nice to have convenience functions but people shouldn't have to
rely on them.

@GaelVaroquaux
Member

True. I see more and more people asking for stuff in sklearn that is
just 3 lines of numpy.

Yes, and it is easy to yield to these requests, but this is a dangerous
line, and a battle that we will lose by exhaustion. People should learn
numpy. If we can teach it to them via tutorials and examples, this is a
better option.

It is nice to have convenience functions but people shouldn't have to
rely on them.

Our convenience functions do much more: they do input validation and give
sensible error messages. But that's for inner usage.

@amueller
Member

amueller commented Dec 9, 2012

I will keep your advice in mind when answering on stackoverflow and the mailing list.

@erg
Contributor Author

erg commented Dec 9, 2012

I am not sure I understand the motivation for sampling with replacement instead of without. Using np.random.permutation or similar that should be easy to do. Have you seen the discussion in #1362?

Bagging, by the literature, uses sampling without replacement. (Edit: I meant sampling with replacement.)

The issue I have with the sampling with replacement is that it discards ~30% of the samples from the smallest class. I don't see the point in doing that.

You are right, so how about a mode that samples without replacement the full set of samples until it's exhausted, then repeats in a loop? So if you have 20 different samples in one class and you want 100, it would just repeat those indices five times and return the shuffle. This way is more deterministic, still gives some randomness if the requested n_samples is not a multiple of the number available, and uses all the samples before repeating.
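Roughly, as a sketch (the helper name is made up, and it is assumed to operate on an index array and a requested count; this is not the PR's code):

import numpy as np

def _exhaustive_oversample(indices, n_samples, random_state=None):
    # Sketch of the proposed mode: draw complete passes over the indices
    # without replacement (each pass a fresh permutation), then fill the
    # remainder with a partial pass, so every sample is used before any
    # sample is repeated.
    rng = np.random.RandomState(random_state)
    indices = np.asarray(indices)
    n_full, remainder = divmod(n_samples, len(indices))
    passes = [rng.permutation(indices) for _ in range(n_full)]
    if remainder:
        passes.append(rng.permutation(indices)[:remainder])
    return np.concatenate(passes) if passes else indices[:0]

With 20 distinct samples and a request for 100, this yields each index exactly five times, in shuffled order.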

Btw, you say this is useful for oversampling and undersampling but it does neither.

According to my understanding, it does either depending on the situation. If you have a set of 100 samples with a 30/70 split, and you want to balance them, then it will oversample the 30 to 50 and undersample the 70 to 50. You can also use n or scale to triple or half the size of the dataset.

sampling sounds pretty unspecific, too, I think...

I am -1 on making this utility function its own module, especially one called sampling because of MCMC, Gibbs sampling, rejection sampling, etc.

We should probably stop putting anything in __init__.py

+2 for this, I've never even thought to look in __init__.py for any actual code.

I just looked at the current resample and it looks like it can only select some % of data equal to or less than the current amount. If the scale factor is over 1.0, you get more samples in the new resample, so you could scale your dataset arbitrarily to test the running time on somewhat real data.

shuffle seems to duplicate np.random.permutation, as pointed out above. Agree that this is bad.

I will add a replace parameter to this pull request and then see where we're at. Do I need to do anything specific to support sparse matrices?

@amueller
Member

On 12/09/2012 18:04, erg wrote:

I am not sure I understand the motivation for sampling with
replacement instead of without. Using np.random.permutation or
similar that should be easy to do. Have you seen the discussion in
#1362 <https://github.com/scikit-learn/scikit-learn/issues/1362>?

Bagging, by the literature, uses sampling without replacement.

What has this pull-request to do with bagging?

The issue I have with the sampling with replacement is that it
discards ~30% of the samples from the smallest class. I don't see
the point in doing that.

You are right, so how about a mode that samples without replacement
the full set of samples until it's exhausted, then repeats in a loop?
So if you have 20 different samples in one class and you want 100, it
would just repeat those indices five times and return the shuffle.
This way is more deterministic, still gives some randomness if the
requested n_samples is not a multiple of the number available, and
uses all the samples before repeating.

Not sure. This seems pretty magic.

`shuffle` seems to duplicate `np.random.permutation`, as pointed out
above. Agree that this is bad.

It is not; it does a consistent permutation of different arrays, but it
is equivalent to what Gael posted.

I will add a |replace| parameter to this pull request and then see
where we're at.

Sounds good.

Do I need to do anything specific to support sparse matrices?

As you return indices and not masks, it should just work.
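A quick illustration of why (not from the PR): integer row indexing works the same way on dense arrays and scipy CSR matrices, so an index-based resampler supports sparse input for free.

import numpy as np
from scipy.sparse import csr_matrix

X_dense = np.arange(12).reshape(4, 3)
X_sparse = csr_matrix(X_dense)
indices = np.array([0, 2, 2, 3])      # e.g. indices returned by a resampler

X_dense[indices]                      # rows 0, 2, 2, 3 of the dense array
X_sparse[indices].toarray()           # the same rows from the sparse matrix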


def _histogram_indices(seq):
"""Collects a list of indices for each element occuring in an array-like
and returns a dict of lists.
Member


I think this could be implemented more easily by using "sort" or "argsort".
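A rough sketch of the argsort-based alternative (the helper name is made up and this is not the PR's code):

import numpy as np

def _histogram_indices_argsort(y):
    # Sketch of the sort/argsort idea: sort the labels once, then split the
    # sorted index array at the boundaries between classes.
    y = np.asarray(y)
    order = np.argsort(y, kind="mergesort")              # stable sort of sample indices
    classes, starts = np.unique(y[order], return_index=True)
    groups = np.split(order, starts[1:])                 # one index array per class
    return dict(zip(classes.tolist(), groups))

# _histogram_indices_argsort([1, 0, 1, 2]) -> {0: [1], 1: [0, 2], 2: [3]}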

Contributor Author


Weird GitHub quirk: your comment is still here now, but on the wrong line...

Member


Any idea about which line was this comment for originally? Has it been addressed?

Contributor Author


It was for a function that doesn't exist anymore.

@amueller
Member

So should this resample replace the current version? I'm not sure if / how the current version is used.

@mblondel
Member

I also like shuffle. It's a nifty little util and handles sparse matrices for you. Also

X, y = shuffle(X, y)

speaks for itself, so it fits @GaelVaroquaux 's readability requirement.

@erg
Contributor Author

erg commented Dec 10, 2012

Bagging, by the literature, uses sampling without replacement.

What has this pull-request to do with bagging?

Damn, I meant it does sampling with replacement. You were asking why we want that feature and that was the reason, but I got it backwards... oops.

You are right, so how about a mode that samples without replacement
the full set of samples until it's exhausted, then repeats in a loop?
So if you have 20 different samples in one class and you want 100, it
would just repeat those indices five times and return the shuffle.
This way is more deterministic, still gives some randomness if the
requested n_samples is not a multiple of the number available, and
uses all the samples before repeating.

Not sure. This seems pretty magic.

The current resample checks that the number of samples you want is <= the number you pass it. However, I think it would be useful to be able to create immense datasets that look like a dataset, like scaling the iris dataset to 100,000 times its size to check the running time of an algorithm and still hope to get good results. So you can sample with replacement and it just works, or sample without replacement and you exhaust the samples quickly.

When the samples run out because replace is False, how do you get more to scale up the data? You could 1) continue sampling with replacement for the rest of the samples, or you could 2) restart the sampling without replacement again and loop the procedure until your dataset is big enough. Those are the two options that make sense to me, but option 1) is basically just setting replace=True for a large enough dataset.

X, y = shuffle(X, y)

This could be useful, I can leave it in and base it off of the new code.

As for where this function should live, it still seems like a preprocessing thing to me. Rebalance labels, standardize, train...

@GaelVaroquaux
Member

X, y = shuffle(X, y)

speaks for itself, so it fits @GaelVaroquaux 's readability requirement.

Yes and no. When I read such code from a library I don't know, I always end
up opening the code of 'shuffle', because I want to know exactly what it
does, and why the author didn't use a two-liner.

@erg
Contributor Author

erg commented Dec 11, 2012

I just rebased and added a replace= feature, renamed n to n_samples, added random_state, and added over/undersample modes that calculate the output size based on the max/min class counts. I really can't think of any more features it doesn't support right now, but be my guest to find some.

I think preprocessing.resample is now a superset of the features of utils.resample.
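For orientation, a hedged sketch of how the proposed calls might look, pieced together from the signature and docstring fragments quoted in this thread; the function was never merged into scikit-learn, so the exact names and semantics here are assumptions:

import numpy as np
from sklearn.preprocessing import resample   # as proposed in this PR; never merged

X = np.random.rand(100, 5)
y = np.array([0] * 70 + [1] * 30)

idx = resample(y, method="balance", random_state=0)             # equalize class counts
idx = resample(y, method="oversample", random_state=0)          # grow all classes to the largest
idx = resample(y, n_samples=300, replace=True, random_state=0)  # scale the dataset up

X_resampled, y_resampled = X[idx], y[idx]    # the function returns indices, not data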

I did a forced push, so you'll want to reset to my new patches instead of using the old ones.

return counts


def resample(y, method=None, proba=None, n_samples=None, scale=None, replace=False,
Member


Maybe we could rename y to class_labels to be more explicit?

@ogrisel
Member

ogrisel commented Dec 11, 2012

+1 for keeping the current implementation in sklearn.preprocessing and deprecating sklearn.utils.resample and migrating existing code to the new implementation (beware of import loops though...).

We should also add a new sklearn.preprocessing.ClassRebalancer class implementing the TransformerMixin that would use the resample function internally to make it possible to use in a Pipeline in collaboration with cross validation and grid search utilities. See #1362.

We could also move the shuffle utility function into sklearn.preprocessing as I often find it useful outside of sklearn, esp. when dealing with sparse matrices in interactive sessions (with deprecation warnings as well for backward compat), but this can be addressed in a separate PR.

Also we could turn sklearn.preprocessing into a package with sub modules as it's growing big. We would also still be importing the public API of the sub modules in the sklearn.preprocessing's __init__ to keep it friendly / flat for the users.

erg added a commit to erg/scikit-learn that referenced this pull request Apr 19, 2015
iteritems() -> items()
scikit-learn#1454
BUG: Fix dictionary code for python3.
DOC: Add narrative docs for resample_labels.
@agramfort
Member

@erg thanks for the quick reaction time. See why Travis is unhappy, then +1 for merge.

@agramfort agramfort changed the title PR: Add resample to preprocessing. MRG+1: Add resample to preprocessing. Apr 19, 2015
@jnothman
Member

Thinking more about sampling as weights: methods that accept weights do not need to distinguish between undersampling, oversampling, etc. For those that do not accept sample weights we can produce a sample of size N stochastically balancing classes (sampling without replacement) using:

counts = random_state.multinomial(N, sklearn.utils.class_weight.compute_sample_weight('auto', y))
indices = np.repeat(np.arange(y.shape[0]), counts)

Here the equal sampling of classes is not exact but it has the advantage of accounting for multioutput/multilabel y.
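Spelled out as a self-contained sketch (the helper name is made up; "balanced" is used in place of the older "auto" spelling, and the weights are normalized so they can be passed to multinomial):

import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

def balanced_sample_indices(y, n_samples, random_state=None):
    # Draw per-sample counts from a multinomial whose probabilities are the
    # normalized balanced sample weights, then repeat each index by its count.
    rng = np.random.RandomState(random_state)
    w = compute_sample_weight("balanced", y)
    counts = rng.multinomial(n_samples, w / w.sum())
    return np.repeat(np.arange(len(y)), counts)

y = np.array([0] * 8 + [1] * 2)
idx = balanced_sample_indices(y, 10, random_state=0)   # roughly 5 indices from each class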

@jnothman
Member

Well, this is long dormant. It certainly had a lot of effort put into it.

How much do we still want/need this?

I think a Resampler class can be made as a wrapper around the new CV splitters, as I've drafted at #5972 (comment)

@jnothman
Member

I've commented a couple of times about the relationship between resampling and class_weight. It may still be worthwhile providing utilities for resampling, but emphasising that something similar can be accomplished with class_weight.

I think progress of this PR would benefit from a fresh PR...

over/undersampling for `len(y)` samples by default.
"oversample" grows all classes to the count of the largest class.
"undersample" shrinks all classes to the count of the smallest class.
dict with pairs of class, probability with values summing to 1
Member


"(class, probability)"

    counts = np.repeat(sample_size, n_classes)
    if sample_size_rem > 0:
        counts[:sample_size_rem] += 1
    # Shuffle so the class inbalance varies between runs
Member


"imbalance"

def _fair_array_counts(n_samples, n_classes, random_state=None):
    """Tries to fairly partition n_samples between n_classes.
    If this cannot be done fairly, +1 is added `remainder` times
    to the counts for random arrays until a total of `n_samples` is
Member


"random arrays" -> "random classes"

Indices sampled from the dataset respecting a class distribution
controlled by this function's parameters.

Examples
Member


This seems to me a lot of examples, with too much changing between them (i.e. X and y could be constant). Could this be simplified?

@jnothman
Member

I'd like to see this merged. It's been sitting around a while and I think general consensus is that it's more useful than not (it's certainly sought), and fits in scikit-learn's scope. I'd like to see a few changes to it:

  • More emphasis in documentation on the relationship between resampling and class_weight. I think one of the benefits of resampling.
  • A test for equivalence with class_weight with a dict / "balanced"
  • I'm not so happy with the interface surrounding "undersample" and "oversample". These are really "balanced" with named scaling values. I think these should either be values for scaling, or should be invalid in conjunction with scaling != 1.
  • Mix of float and int values for scaling is a bit confusing. Elsewhere we allow floats to substitute for int values, but usually that's in the context where a float >1.0 or even >=1.0 is inappropriate/invalid. Here I'm a little worried that allowing 2.0 to differ semantically from 2 is too big a trap for users. A user forgetting to int their ceil will potentially generate an enormous dataset. Can someone point me to other places in the codebase we've done this with a float >1.0? Unfortunately, the solutions I can think of (string '2.0x' or '200%'; additional parameter) are not very consistent with other scikit-learn APIs.
  • Perhaps a CV meta-splitter that wraps this to perform resampled CV; however, this may be controversial and I'd happily see it as a next step. It merely mitigates the fact that this helper does not fit into the API neatly and best practice for resampling-style transformations is not established. (Potential controversy may stem from the fact that other transformations altering the training set only, such as perturbation or dropout, cannot be achieved with a CV meta-splitter.)
  • A final review for code simplification may be worthwhile.

Who's with me?

@erg have we sapped you of all enthusiasm, or are you hanging in there? Thanks for your great work and great patience!

@erg
Contributor Author

erg commented Jun 9, 2018

I ported this patch to the latest scikit-learn and addressed some of the latest critiques.

@jnothman I tried to move oversample and undersample to the scale parameter, but then you can't oversample and scale the dataset in the same call.

Also, method=balance and method=dict are mutually exclusive, so I'm not sure how to add the test case you asked for.

It would be nice to merge this PR soon so that we can use it as a starting place and incrementally improve it. It would have already had six years of improvements if merged in 2012. :)

@jnothman
Member

jnothman commented Jun 9, 2018 via email

@orausch orausch mentioned this pull request Feb 26, 2019
@adrinjalali
Member

Superseded by #13269

@adrinjalali adrinjalali closed this Aug 7, 2019