[WIP] Resamplers #13269
Conversation
Many tests from imblearn were adapted.
Did we discuss what we're going to do with this API-wise?
For now the plan is to follow #3855 (comment). The unclear part of the API is how array-like fit parameters, such as `sample_weight`, should be handled.
If you use a `sample_props` dictionary to hold properties mapped to the samples (e.g. `sample_weight`), the pipeline implementation for resampling transformers can be as follows:
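A minimal sketch of that idea, assuming a hypothetical `sample_props` dict and the three-tuple `fit_resample` contract discussed later in this thread (names like `ResamplingPipeline` are illustrative, not actual scikit-learn code):

```python
class ResamplingPipeline:
    """Sketch of a pipeline that understands resamplers (hypothetical)."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None, sample_props=None):
        props = dict(sample_props or {})
        for name, step in self.steps[:-1]:
            if hasattr(step, "fit_resample"):
                # A resampler may add or drop samples, so X, y and every
                # sample-aligned property must be regenerated together.
                X, y, props = step.fit_resample(X, y, **props)
                # A full Pipeline would also recompute the per-step routing
                # of fit params (fit_params_step) on the returned props.
            else:
                X = step.fit_transform(X, y)
        # Pass any remaining sample-aligned props to the final estimator.
        self.steps[-1][1].fit(X, y, **props)
        return self
```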
You will need to again do the routing interpretation (i.e. the computation of `fit_params_step`) on the `fit_params` returned from the resampler.
Since we didn't get to discuss it today, here's a review of (by my understanding) the current plan for resamplers. Let me know if I'm understanding something wrong.
I think a cleaner / simpler way to do it would be to specify which `fit_params` are `sample_props`. This can be accomplished using a list of keys (as per the implementation I have above), but I think your way works too. Also, I think it's better to have a consistent return signature, and it's OK to return `None` for `sample_props` if none were passed into the resampler.
I don't think we want a parameter named `sample_props`, but that's unresolved. In my mind, these props are exactly what we have as `sample_weight` and `groups` now. I also think (but am not sure) a resampler should be permitted to return props (weights especially) without any being passed in.

I'd be tempted to say that resamplers should always generate `(X, y, dict of props)`, but maybe we should initially propose and handle just the `(X, y)` case.

Sorry I've not yet reviewed this.
This is a really good point: a really clean solution would be to check `fit_params` for items that are sample props (those with `len == len(X)`). This avoids passing multiple dictionaries / parameter collections.
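A sketch of that length-based check (a hypothetical helper illustrating the heuristic, not an adopted API):

```python
def split_sample_props(X, fit_params):
    """Split fit_params into sample-aligned props and everything else."""
    n_samples = len(X)
    sample_props, other = {}, {}
    for key, value in fit_params.items():
        try:
            aligned = len(value) == n_samples
        except TypeError:  # scalars and other unsized params
            aligned = False
        (sample_props if aligned else other)[key] = value
    return sample_props, other
```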
After the discussion, I think it would be better to not implement `fit_transform` on resamplers.
OK, so I will review tomorrow and start to write a SLEP, or update one that is already open. We can check that together.
I discussed this briefly with Guillaume last night. As much as it would be great to just try to merge this, I think we now understand a few things about it (e.g. not supporting `fit_transform` or `transform` on the same object; `sample_prop` output, and hence wanting to support either 2- or 3-tuple output) that are worth describing formally and getting wider review on.
I could write a draft SLEP if that's useful (or update an existing one).
@jnothman I've added a function (…). Note that I also found some estimators (…).
@Jude188 the work here is stalled since we had some trouble reaching a decision in scikit-learn/enhancement_proposals#15. I think the next steps would be to sort the SLEP out and get it approved; this PR was more of an exploration/POC anyway.
@Jude188 You can look at imbalanced-learn, which provides compatible tools in the meantime.
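For instance (assuming imbalanced-learn 0.4+, where `fit_resample` replaced `fit_sample`):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Build an imbalanced two-class problem, then rebalance it by undersampling.
X, y = make_classification(weights=[0.9, 0.1], random_state=0)
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y), Counter(y_res))  # the minority class is no longer outnumbered
```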
Taking a look at this after a long break.
"""Mixin class for all outlier detection resamplers in scikit-learn. Child | ||
classes remove outliers from the dataset. | ||
""" | ||
_estimator_type = "outlier_rejector" |
it's problematic to change the type of existing outlier detectors.
""" | ||
_estimator_type = "outlier_rejector" | ||
|
||
def fit_resample(self, X, y, **kws): |
I think we should just be adding this to OutlierMixin rather than adding a new mixin and type.
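A sketch of what that could look like, reusing `fit_predict`; the `kws` handling here is an assumption, mirroring the three-tuple contract discussed in this thread:

```python
import numpy as np

class OutlierMixin:
    _estimator_type = "outlier_detector"

    def fit_predict(self, X, y=None):
        # Existing convention: inlier -> 1, outlier -> -1.
        return self.fit(X).predict(X)

    # Sketch: fit_resample added directly to OutlierMixin, as suggested,
    # instead of introducing a separate "outlier_rejector" type.
    def fit_resample(self, X, y=None, **kws):
        mask = self.fit_predict(X) == 1
        X_new = X[mask]
        y_new = y[mask] if y is not None else None
        # Any sample-aligned kwargs must be masked the same way.
        kws_new = {k: np.asarray(v)[mask] for k, v in kws.items()}
        return X_new, y_new, kws_new
```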
@@ -918,6 +918,19 @@ Class APIs and Estimator Types
    outliers have score below 0. :term:`score_samples` may provide an
    unnormalized score per sample.

outlier rejector |
I think we should remove this and just add a note under outlier detector that if `fit_resample` is provided, it should act as an outlier rejector, returning the training data with outliers removed.
doc/glossary.rst (outdated)
@@ -949,6 +962,10 @@ Class APIs and Estimator Types
    A purely :term:`transductive` transformer, such as
    :class:`manifold.TSNE`, may not implement ``transform``.

resampler
resamplers
An estimator supporting :term:`fit_resample`. |
Suggested change:
An estimator supporting :term:`fit_resample`. This can be used in a
:class:`ResampledTrainer` to resample, augment or reduce the training
dataset passed to another estimator.
doc/glossary.rst (outdated)
A method on :term:`resamplers` which fits the estimator on a passed
dataset, and returns a new dataset. In the new dataset, samples may be
removed or added.
Suggested change:
A method whose presence in an estimator is sufficient and necessary for
it to be a :term:`resampler`.
When called it should fit the estimator and return a new dataset. In the
new dataset, samples may be removed, added or modified. In contrast to
:term:`fit_transform`:

* X, y, and any other sample-aligned data may be generated;
* the samples in the returned dataset need not have any alignment or
  correspondence to the input dataset.

This method has the signature ``fit_resample(X, y, **kw)`` and returns a
3-tuple ``X_new, y_new, kw_new``, where ``kw_new`` is a dict mapping names
to sample-aligned values that should be passed as fit parameters to the
subsequent estimator. Any keyword arguments passed in should be resampled
and returned; if the resampler is not capable of resampling the keyword
arguments, it should raise a ``TypeError``.
Ordinarily, this method is only called by a :class:`ResampledTrainer`,
which acts like a specialised pipeline for cases where the training data
should be augmented or resampled.
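To make the contract concrete, here is a sketch of a minimal resampler obeying it (`RandomSubsampler` is a hypothetical example class, not part of this PR):

```python
import numpy as np
from sklearn.base import BaseEstimator

class RandomSubsampler(BaseEstimator):
    """Keep a random fraction of the training samples (illustrative only)."""

    def __init__(self, fraction=0.5, random_state=None):
        self.fraction = fraction
        self.random_state = random_state

    def fit_resample(self, X, y, **kw):
        rng = np.random.RandomState(self.random_state)
        n_samples = len(X)
        keep = rng.rand(n_samples) < self.fraction
        # Per the contract: resample every sample-aligned kwarg too, and
        # raise TypeError when a kwarg cannot be resampled.
        kw_new = {}
        for key, value in kw.items():
            value = np.asarray(value)
            if len(value) != n_samples:
                raise TypeError("cannot resample fit parameter %r" % key)
            kw_new[key] = value[keep]
        return X[keep], y[keep], kw_new
```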
class NaNFilter(BaseEstimator):
    """
    A resampler that removes samples containing NaN in X.
Is this useful? There are likely to be NaNs in the test set. Unless we can provide a good demonstration of this, I'd leave it out of this PR.
yield check_sample_weights_invariance
yield check_estimators_fit_returns_self
yield partial(check_estimators_fit_returns_self, readonly_memmap=True)
if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'): |
I find this more readable as:

    if hasattr(estimator, 'fit') or not hasattr(estimator, 'fit_resample'):
but I don't really get why we need to have this kind of logic.
I'm wondering whether resamplers should not support `fit`, unless they serve a purpose as another kind of estimator as well. Then without `fit` and without `transform`, `check_estimator` should not do much to them at present.
@orausch, will you be available to continue this, or should we consider passing it onto someone else?
@jnothman I'm happy to contribute some more to this, but the next steps/goals are currently a bit unclear to me (especially because it has been hard to reach consensus in the SLEP). Of course, it's also easier to invest time with a clear plan/requirements for merging. Do you have any particular goals in mind?
I have plans, but I suppose you're right that we should reach consensus on it first!

I think we have been increasing in our shared understanding, but I suppose I could be biased. I think open questions may be related to dealing with fit parameters, and how to name `ResampledTrainer` better.
Reference Issues/PRs
Closes #4143
Closes #9630
Related SLEP: scikit-learn/enhancement_proposals#15
What does this implement/fix? Explain your changes.
Implement estimators that can change the sample size at `fit` time.
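A sketch of the intended usage, assuming this PR's `ResampledTrainer` takes a resampler followed by a predictor, and reusing the hypothetical `RandomSubsampler` sketched above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(random_state=0)

# Hypothetical composition: only the training data seen by the classifier
# is resampled; prediction and scoring see the input data unchanged.
trainer = ResampledTrainer(
    RandomSubsampler(fraction=0.5, random_state=0),
    LogisticRegression(),
)
trainer.fit(X, y)    # the resampler halves the training set here
trainer.predict(X)   # prediction does not resample
```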
Other comments
How could resamplers work with ColumnTransformer or FeatureUnion?
Does the API allow the modification of `y` values? If so, should we reimplement `TransformedTargetRegressor` under this API?
Plan