MRG+1: Add resample to preprocessing. #1454
Conversation
Thanks for the PR. This will be really useful!
I am not sure I understand the motivation for sampling with replacement instead of without. The issue I have with sampling with replacement is that it discards ~30% of the samples from the smallest class. I don't see the point in doing that.
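For context, a quick standalone sketch (not the PR's code) of the effect being described: drawing n samples out of n with replacement leaves out a fraction of the distinct samples approaching 1/e, roughly a third.

```python
import numpy as np

# Sketch: fraction of distinct samples kept when drawing n of n with replacement.
# In the limit this approaches 1 - 1/e, so roughly a third is discarded.
rng = np.random.RandomState(0)
n = 10000
drawn = rng.randint(0, n, size=n)           # one bootstrap draw of n indices
kept = np.unique(drawn).size / float(n)     # fraction of distinct indices kept
print("kept: %.2f, discarded: %.2f" % (kept, 1 - kept))
```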
Btw, you say this is useful for oversampling and undersampling, but it does neither. I just checked.
I'm not sure if I would look for this in the preprocessing module.
Maybe that resample function shouldn't be in utils? The rule is that nothing in utils is meant for end users.
+1. Maybe we should also move shuffle. I didn't know about it before. -1 on the current place in utils. To me it looks like we could maybe even put both in the same function using a keyword argument?
We could have public utils and private utils (functions beginning with an underscore).
@amueller Where would you put it, then?
The logic for not having anything useful for end users in 'utils' is that users should never have to go digging there: anything public should live in a clearly-named module.
@mblondel I don't have an idea. I just feel that utils shouldn't be a place users have to look.
@GaelVaroquaux Utils is divided into submodules with meaningful names such as arrayfuncs, sparsefuncs, validation... FWIW, I'm using the utils quite often in my research code. I think they can be useful to users too.
We should probably stop putting anything in utils' __init__.py, then.
I never use it. I just do:

```python
indices = np.random.permutation(n_samples)
X = X[indices]
y = y[indices]
```

In my experience, it is not a good idea to push people to use such convenience functions.
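For reference, the utility being discussed: sklearn.utils.shuffle applies one permutation to all of its array arguments and accepts a random_state (a sketch with made-up toy data):

```python
import numpy as np
from sklearn.utils import shuffle

X = np.arange(10).reshape(5, 2)   # toy data for illustration
y = np.arange(5)
# One call shuffles X and y with the same permutation.
X, y = shuffle(X, y, random_state=0)
```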
More than 'utils'. |
There is an 'import *' in there that makes me sick each time I see it. |
Flat is better than nested.
Sounds probably useful.
If you are using them to build your own estimators, then I think that it makes sense. Think about it this way: you are coming new to a package. You have a lot to take in already, and a catch-all 'utils' namespace makes it harder to know where to look.
On 12/09/2012 03:51 PM, Gael Varoquaux wrote:
Yes, and it is easy to yield to these requests, but this is a dangerous slope.
Our convenience functions do much more: they do input validation, and they handle sparse matrices.
I will keep your advice in mind when answering on stackoverflow and the mailing list.
Bagging, by the literature, uses sampling without replacement. Edit: sampling with replacement
You are right, so how about a mode that samples without replacement from the full set of samples until it's exhausted, then repeats in a loop? So if you have 20 different samples in one class and you want 100, it would just repeat those indices five times and return them shuffled. This way is more deterministic, still gives some randomness if the requested n_samples is not a multiple of the number available, and uses all the samples before repeating.
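A minimal sketch of that mode, with a hypothetical helper name (_loop_oversample is not the PR's code): tile the available indices until the request is covered, draw the remainder without replacement, then shuffle.

```python
import numpy as np

def _loop_oversample(indices, n_samples, random_state=None):
    # Hypothetical sketch: repeat every index until n_samples is reached,
    # draw the remainder without replacement, then shuffle the result.
    rng = np.random.RandomState(random_state)
    indices = np.asarray(indices)
    n_repeats, remainder = divmod(n_samples, len(indices))
    out = np.tile(indices, n_repeats)
    if remainder:
        out = np.concatenate([out, rng.choice(indices, remainder, replace=False)])
    rng.shuffle(out)
    return out

# 20 samples in one class, 100 requested: each index appears exactly 5 times.
print(np.bincount(_loop_oversample(np.arange(20), 100, random_state=0)))
```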
According to my understanding, it does either depending on the situation. If you have a set of 100 samples with a 30/70 split, and you want to balance them, then it will oversample the 30 to 50 and undersample the 70 to 50. You can also use n_samples or scale to change the total number of samples.
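To make the 30/70 example concrete, a small sketch (assuming "balance" means equal per-class targets):

```python
import numpy as np

y = np.array([0] * 30 + [1] * 70)
classes, counts = np.unique(y, return_counts=True)
target = len(y) // len(classes)   # 50 per class for 100 samples
# Class 0 (30 samples) is oversampled to 50; class 1 (70) is undersampled to 50.
print(dict(zip(classes, counts)), "->", {c: int(target) for c in classes})
```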
I am -1 on making this utility function its own module, especially one called resample.
+2 for this, I've never even thought to look in __init__.py for any actual code. I just looked at the current one and was surprised by what lives there.
I will add a mode for that.
On 09.12.2012 18:04, erg wrote:
sklearn/preprocessing.py
Outdated
def _histogram_indices(seq):
    """Collects a list of indices for each element occurring in an array-like
    and returns a dict of lists.
I think this could be implemented more easily by using "sort" or "argsort".
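For reference, a sketch of the suggested approach (a hypothetical rewrite, not the PR's code): a stable argsort groups equal labels together, and np.unique gives the split points.

```python
import numpy as np

def _histogram_indices_argsort(y):
    # Sketch of the reviewer's suggestion: sort once, then slice runs of
    # equal labels instead of scanning the array once per class.
    y = np.asarray(y)
    order = np.argsort(y, kind="mergesort")       # stable sort of the labels
    classes, starts = np.unique(y[order], return_index=True)
    splits = np.split(order, starts[1:])          # one index block per class
    return dict(zip(classes, splits))

print(_histogram_indices_argsort([2, 0, 1, 0, 2, 2]))
# {0: array([1, 3]), 1: array([2]), 2: array([0, 4, 5])}
```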
Weird GitHub quirk, your comment is still here now but on the wrong line...
Any idea which line this comment was for originally? Has it been addressed?
It was for a function that doesn't exist anymore.
So should this be considered resolved?
I also like shuffle. It's a nifty little util and handles sparse matrices for you. Also, the name speaks for itself, so it fits @GaelVaroquaux's readability requirement.
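A small illustration of the sparse handling mentioned here (a sketch with toy data; shuffle keeps rows aligned across all of its arguments, sparse or dense):

```python
import numpy as np
from scipy import sparse
from sklearn.utils import shuffle

X = sparse.csr_matrix(np.eye(4))   # toy sparse data
y = np.arange(4)
X_s, y_s = shuffle(X, y, random_state=0)
# Rows of X_s still line up with y_s after shuffling.
print(y_s, X_s.toarray().argmax(axis=1))
```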
Damn, I meant it does sampling with replacement. You were asking why we want that feature and that was the reason, but I got it backwards... oops.
The current implementation samples without replacement first. When the samples run out because more were requested than exist, it repeats the indices as described above.
This could be useful, I can leave it in and base it off of the new code. As for where this function should live, it still seems like a preprocessing thing to me. Rebalance labels, standardize, train...
Yes and no. When I read such code from a library I don't know, I always end up having to check what it actually does.
I just rebased and added a few fixes. I think I did a forced push, so you'll want to reset to my new patches instead of using the old ones.
sklearn/preprocessing.py
Outdated
return counts
def resample(y, method=None, proba=None, n_samples=None, scale=None, replace=False, |
Maybe we could rename y to class_labels to be more explicit?
+1 for keeping the current implementation in utils. We should also add a new resample_labels function for the label-resampling use case. We could also move the related helpers next to it. Also, we could turn the over/undersampling behavior into a keyword argument.
iteritems() -> items() (scikit-learn#1454)
BUG: Fix dictionary code for Python 3.
DOC: Add narrative docs for resample_labels.
@erg thanks for the quick reaction time. See why Travis is unhappy, then +1 for merge.
Thinking more about sampling as weights: methods that accept weights do not need to distinguish between undersampling, oversampling, etc. For those that do not accept sample weights we can produce a sample of size n_samples whose expected class proportions match the requested weights.
Here the equal sampling of classes is not exact but it has the advantage of accounting for multioutput/multilabel targets.
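A sketch of the idea, echoing the PR's proba parameter (the helper below is hypothetical): give each sample probability proba[class] / count[class], so the expected class proportions of the draw match proba.

```python
import numpy as np

def _sample_by_class_proba(y, proba, n_samples, random_state=None):
    # Hypothetical sketch: weight each sample by proba[class] / count[class]
    # so the draw's expected class proportions match `proba`.
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    count = dict(zip(classes, counts))
    p = np.array([proba[label] / count[label] for label in y])
    return rng.choice(len(y), size=n_samples, replace=True, p=p)

y = np.array([0] * 30 + [1] * 70)
idx = _sample_by_class_proba(y, {0: 0.5, 1: 0.5}, 100, random_state=0)
print(np.bincount(y[idx]))   # close to [50, 50] in expectation
```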
Well, this is long dormant. It certainly had a lot of effort put into it. How much do we still want/need this? I think a simpler, more focused version might fare better.
I've commented a couple of times about the relationship between resampling and sample weights. I think progress of this PR would benefit from a fresh PR...
sklearn/preprocessing/resample.py
Outdated
over/undersampling for `len(y)` samples by default.
"oversample" grows all classes to the count of the largest class.
"undersample" shrinks all classes to the count of the smallest class.
dict with pairs of class, probability with values summing to 1 |
"(class, probability)"
sklearn/preprocessing/resample.py
Outdated
counts = np.repeat(sample_size, n_classes)
if sample_size_rem > 0:
    counts[:sample_size_rem] += 1
# Shuffle so the class inbalance varies between runs |
"imbalance"
sklearn/preprocessing/resample.py
Outdated
def _fair_array_counts(n_samples, n_classes, random_state=None):
    """Tries to fairly partition n_samples between n_classes.
    If this cannot be done fairly, +1 is added `remainder` times
    to the counts for random arrays until a total of `n_samples` is
"random arrays" -> "random classes"
Indices sampled from the dataset respecting a class distribution
controlled by this function's parameters.
Examples |
This seems to me a lot of examples, with too much changing between them (i.e. X and y could be constant). Could this be simplified?
I'd like to see this merged. It's been sitting around a while and I think general consensus is that it's more useful than not (it's certainly sought), and fits in scikit-learn's scope. I'd like to see a few changes to it first.
Who's with me? @erg have we sapped you of all enthusiasm, or are you hanging in there? Thanks for your great work and great patience!
I ported this patch to the latest scikit-learn and addressed some of the latest critiques. @jnothman I tried to move the function as you suggested. Also, it would be nice to merge this PR soon so that we can use it as a starting place and incrementally improve it. It would have already had six years of improvements if merged in 2012. :)
Hahahaha :)
Thanks Doug, especially for your patience and persistence. I'd like to see this merged in once we have an interface like #3855 to integrate it into an estimator. See #3855 (comment) in particular as my plan of action... but I've not yet had time to act myself. :\
Superseded by #13269
There have been several requests lately for class rebalancing using under/oversampling. This utility function addresses most of the use cases I can think of for sampling with replacement from a dataset.
One thing it does not do is to sample without replacement before sampling with replacement, because it changes the code substantially and there is no efficient version of random.sample, as per #1438 (comment). I could add that feature eventually.
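For readers landing here, a rough standalone sketch of the behavior described above (hypothetical code, not the PR's implementation; the successor work is in #13269):

```python
import numpy as np

def resample_sketch(y, method="balance", random_state=None):
    # Hypothetical sketch: return indices that rebalance classes by sampling
    # with replacement, following the modes described in the docstring above.
    rng = np.random.RandomState(random_state)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    if method == "oversample":
        target = counts.max()               # grow classes to the largest
    elif method == "undersample":
        target = counts.min()               # shrink classes to the smallest
    else:
        target = len(y) // len(classes)     # keep ~len(y), equalize classes
    out = [rng.choice(np.flatnonzero(y == c), target, replace=True)
           for c in classes]
    return rng.permutation(np.concatenate(out))

y = np.array([0] * 30 + [1] * 70)
print(np.bincount(y[resample_sketch(y, random_state=0)]))   # [50 50]
```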