
[WIP] Resamplers #13269


Open
wants to merge 56 commits into main

Conversation

@orausch

@orausch orausch commented Feb 26, 2019

Reference Issues/PRs

Closes #4143
Closes #9630
Related SLEP:
scikit-learn/enhancement_proposals#15

What does this implement/fix? Explain your changes.

Implement estimators that can change the number of samples during fit

Other comments

How could resamplers work with ColumnTransformer or FeatureUnion?
Does the API allow the modification of y values? If so, should we reimplement TransformedTargetRegressor under this API?

Plan

  • Handle kwargs behavior
  • Rewrite all relevant common_tests involving fit to test fit_resample (see common test failures for FilterNaN)
  • FilterNaN Tests
  • More estimator checks for resamplers
  • More ResampledTrainer tests (including with FilterNaN and outlier rejectors)
  • Docstrings for ResampledTrainer
  • Port RandomOverSampler and RandomUnderSampler from imblearn
  • Sphinx Example(s)

@amueller
Member

did we discuss what we're going to do with this api-wise?

@orausch
Author

orausch commented Feb 26, 2019

did we discuss what we're going to do with this api-wise?

For now the plan is to follow #3855 (comment). The unclear part of the API is how array-like fit parameters, such as sample_weight, will be handled, since they will need to be resampled in pipelines.

@dmbee
Contributor

dmbee commented Feb 27, 2019

If you use a sample_props dictionary to contain properties mapped to the samples (e.g. sample_weight), the pipeline implementation for resampling transformers could be as follows:

for step_idx, (name, transformer) in enumerate(self.steps[:-1]):

    # collect the fit params addressed to this step (e.g. "step__sample_weight")
    fit_params_step = {key.split('__', 1)[1]: fit_params.pop(key)
                       for key in list(fit_params) if key.startswith(name + "__")}

    if isinstance(transformer, ResamplerMixin):
        # sample_props is the list of keys identifying sample-aligned properties
        props = {key: fit_params[key] for key in sample_props}
        X, y, props = transformer.fit_resample(X, y, props, **fit_params_step)
        fit_params.update(props)  # replace props with their resampled versions

    else:
        ...  # is a regular transformer

@jnothman
Member

jnothman commented Feb 27, 2019 via email

@orausch
Author

orausch commented Feb 27, 2019

Since we didn't get to discuss it today, here's a summary of (by my understanding) the current plan for sample_props; a rough code sketch follows at the end of this comment:

  • introduce a parameter to resamplers named sample_props: a dict mapping strings to array-likes of shape (n_samples,)
  • resamplers take sample_props as an argument to fit_resample. Upon resampling, they resample the arrays in sample_props accordingly.
  • resamplers return X, y, sample_props if sample_props was passed. If no sample_props was passed, the resampler returns X, y.
  • while fitting, the pipeline will check the fit signature of the estimator/transformer it is fitting. If an argument name is both a fit parameter and a key in sample_props, the pipeline will pass sample_props[arg] to fit/fit_transform.
  • furthermore, the pipeline will correctly pass along the modified sample_props when resamplers are in the pipeline.

Let me know if I'm understanding something wrong.
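
For concreteness, here is a rough sketch of how a pipeline step could apply these rules. The helper name and call convention are illustrative assumptions on my part, not the agreed API:

import inspect

def _fit_step(estimator, X, y, fit_params_step, sample_props):
    # Hypothetical helper, only illustrating the plan summarised above.
    if hasattr(estimator, "fit_resample"):
        # Resamplers receive sample_props and return resampled versions of
        # X, y and every array in sample_props.
        return estimator.fit_resample(X, y, sample_props=sample_props)
    # For ordinary estimators, forward any sample_props entry whose key
    # matches a parameter of fit (e.g. sample_weight).
    accepted = inspect.signature(estimator.fit).parameters
    extra = {key: value for key, value in sample_props.items() if key in accepted}
    estimator.fit(X, y, **fit_params_step, **extra)
    return X, y, sample_props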

@dmbee
Contributor

dmbee commented Feb 27, 2019

I think a cleaner / simpler way to do it would be to specify which fit_params are sample_props. This can be accomplished using a list of keys (as per the implementation I have above). But I think your way works too.

Also, I think it's better to have a consistent return signature, and it's OK to return None for sample_props if none were passed into the resampler.

@jnothman
Member

jnothman commented Feb 28, 2019 via email

@dmbee
Contributor

dmbee commented Feb 28, 2019

I don't think we want a parameter named sample_props, but that's unresolved. In my mind, these props are exactly what we have as sample_weight and groups now. I also think (but not sure) a resampler should be permitted to return props (weights especially) without any being passed in. I'd be tempted to say that resamplers should always generate (X, y, dict of props) but maybe we should initially propose and handle just the (X, y) case. Sorry I've not yet reviewed this.

This is a really good point - a really clean solution would be to check fit_params for items that are sample_props (i.e. have len equal to len(X)). This avoids passing multiple dictionaries / parameter collections.
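
As a rough illustration of that heuristic (not an API proposal), the sample-aligned items could be split from the remaining fit params by comparing their first dimension with n_samples; note it can misfire if an unrelated parameter happens to have the same length:

import numpy as np

def split_sample_props(X, fit_params):
    # Hypothetical helper: treat any fit param whose first dimension equals
    # n_samples as a sample-aligned property.
    n_samples = len(X)
    props, other = {}, {}
    for key, value in fit_params.items():
        arr = np.asarray(value)
        if arr.ndim >= 1 and arr.shape[0] == n_samples:
            props[key] = arr
        else:
            other[key] = value
    return props, other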

@orausch
Author

orausch commented Feb 28, 2019

After the discussion, I think it would be better to not implement sample_props in this PR (I was not aware that they would be such an API-wide change).

@glemaitre
Member

glemaitre commented Feb 28, 2019 via email

@jnothman
Member

jnothman commented Mar 1, 2019

Ok so I will review tomorrow and start to make a SLEP or update one that is opened. We can check that together.

I discussed this briefly with Guillaume last night. As much as it would be great to just try to merge this, I think we now understand a few things about it (e.g. not supporting fit_transform, or transform in the same object; sample_prop output and hence wanting to support alternatively 2- or 3-tuple output) that are worth describing formally and getting wider review.

@orausch
Author

orausch commented Mar 1, 2019

describing formally and getting wider review

I could write a draft SLEP if that's useful (or update an existing one).

@orausch changed the title from [WIP] Resamplers to [WIP] Outlier Rejection on Mar 1, 2019
@orausch
Author

orausch commented Jul 3, 2019

@jnothman I've added a function check_X_y_kwargs that validates kwargs along with X and y. Since you said all estimators should have fit kwargs of shape (n_samples,), maybe this is something that should be applied across the codebase?

Note that I also found some estimators (KernelRidge, Ridge) that accept a float as sample_weight, so this check would break for them.
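
As a sketch only (the actual check_X_y_kwargs added in this PR may differ), such a check could be composed from existing validation utilities; the scalar sample_weight case mentioned above would indeed fail the length check here:

from sklearn.utils import check_X_y, check_array
from sklearn.utils.validation import check_consistent_length

def check_X_y_kwargs_sketch(X, y, kwargs):
    # Hypothetical sketch: validate X and y, then require every kwarg to be
    # an array-like aligned with the samples.
    X, y = check_X_y(X, y)
    checked = {}
    for key, value in kwargs.items():
        value = check_array(value, ensure_2d=False)
        check_consistent_length(X, value)  # enforces length n_samples
        checked[key] = value
    return X, y, checked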

@judahrand

judahrand commented Mar 27, 2020

@orausch is the pull request still active? This is a feature I dearly need in sklearn. I'm on the verge of implementing it myself in a separate project but, obviously, helping to get it into sklearn would be MASSIVELY preferable!

@jnothman I've heard rumblings that SLEP001 is preferable?

@orausch
Author

orausch commented Mar 27, 2020

@Jude188 the work here is stalled since we had some trouble reaching a decision in scikit-learn/enhancement_proposals#15. I think the next steps would be to sort the SLEP out and get it approved; this PR was more of an exploration/POC anyway.

@glemaitre
Member

@Jude188 You can look at imbalanced-learn, which provides compatible tools in the meantime.

Member

@jnothman jnothman left a comment

Taking a look at this after a long break.

"""Mixin class for all outlier detection resamplers in scikit-learn. Child
classes remove outliers from the dataset.
"""
_estimator_type = "outlier_rejector"
Member

it's problematic to change the type of existing outlier detectors.

"""
_estimator_type = "outlier_rejector"

def fit_resample(self, X, y, **kws):
Member

I think we should just be adding this to OutlierMixin rather than adding a new mixin and type.

@@ -918,6 +918,19 @@ Class APIs and Estimator Types
outliers have score below 0. :term:`score_samples` may provide an
unnormalized score per sample.

outlier rejector
Member

I think we should remove this and just add a note under outlier detector that if fit_resample is provided, it should act as an outlier rejector, returning the training data with outliers removed.
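
A minimal sketch of that behaviour (illustrative only, assuming array inputs and the usual +1/-1 convention of fit_predict):

import numpy as np

class OutlierRejectionSketch:
    # Hypothetical addition: any outlier detector providing fit_predict could
    # expose fit_resample that drops predicted outliers from the training data.
    def fit_resample(self, X, y, **kw):
        X, y = np.asarray(X), np.asarray(y)
        mask = self.fit_predict(X) == 1  # +1 for inliers, -1 for outliers
        kw_new = {key: np.asarray(value)[mask] for key, value in kw.items()}
        return X[mask], y[mask], kw_new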

doc/glossary.rst Outdated
@@ -949,6 +962,10 @@ Class APIs and Estimator Types
A purely :term:`transductive` transformer, such as
:class:`manifold.TSNE`, may not implement ``transform``.

resampler
resamplers
An estimator supporting :term:`fit_resample`.
Member

Suggested change
An estimator supporting :term:`fit_resample`.
An estimator supporting :term:`fit_resample`. This can be used in a
:class:`ResampledTrainer` to resample, augment or reduce the training
dataset passed to another estimator.
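
For context, a stand-in for the ResampledTrainer proposed in this PR could behave roughly as follows; the constructor signature and clone-free fitting are assumptions on my part, not the PR's actual implementation:

from sklearn.base import BaseEstimator

class ResampledTrainerSketch(BaseEstimator):
    # Assumed behaviour: resample the training data, fit the predictor on the
    # resampled data, and predict on unmodified data.
    def __init__(self, resampler, predictor):
        self.resampler = resampler
        self.predictor = predictor

    def fit(self, X, y, **kw):
        X_res, y_res, kw_res = self.resampler.fit_resample(X, y, **kw)
        self.predictor.fit(X_res, y_res, **kw_res)
        return self

    def predict(self, X):
        return self.predictor.predict(X)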

doc/glossary.rst Outdated
Comment on lines 1239 to 1241
A method on :term:`resamplers` which fits the estimator on a passed
dataset, and returns a new dataset. In the new dataset, samples may be
removed or added.
Member

Suggested change
A method on :term:`resamplers` which fits the estimator on a passed
dataset, and returns a new dataset. In the new dataset, samples may be
removed or added.
A method whose presence in an estimator is sufficient and necessary for
it to be a :term:`resampler`.
When called it should fit the estimator and return a new
dataset. In the new dataset, samples may be removed, added or modified.
In contrast to :term:`fit_transform`:
* X, y, and any other sample-aligned data may be generated;
* the samples in the returned dataset need not have any alignment or
correspondence to the input dataset.
This method has the signature ``fit_resample(X, y, **kw)`` and returns
a 3-tuple ``X_new, y_new, kw_new`` where ``kw_new`` is a dict mapping
names to data-aligned values that should be passed as fit parameters
to the subsequent estimator. Any keyword arguments passed in should be
resampled and returned, and if the resampler is not capable of
resampling the keyword arguments, it should raise a TypeError.
Ordinarily, this method is only called by a :class:`ResampledTrainer`,
which acts like a specialised pipeline for cases when the training data
should be augmented or resampled.
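
To make the contract concrete, a toy resampler following this description could look like the following (an illustrative sketch, neither the imblearn implementation nor part of this PR):

import numpy as np
from sklearn.base import BaseEstimator

class ToyUnderSampler(BaseEstimator):
    # Illustrative resampler: randomly under-sample every class to the size of
    # the minority class, resampling any sample-aligned keyword arguments.
    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y, **kw):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        # Keyword arguments are resampled with the same indices; a resampler
        # that cannot handle a given kwarg should raise a TypeError instead.
        kw_new = {key: np.asarray(value)[keep] for key, value in kw.items()}
        return X[keep], y[keep], kw_new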


class NaNFilter(BaseEstimator):
"""
A resampler that removes samples containing NaN in X.
Member

Is this useful? There are likely to be NaNs in the test set. Unless we can provide a good demonstration of this, I'd leave it out of this PR.

yield check_sample_weights_invariance
yield check_estimators_fit_returns_self
yield partial(check_estimators_fit_returns_self, readonly_memmap=True)
if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'):
Member

I find this more readable as

Suggested change
if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'):
if hasattr(estimator, 'fit') or not hasattr(estimator, 'fit_resample'):

but I don't really get why we need to have this kind of logic.

I'm wondering whether resamplers should not support fit, unless they serve a purpose as another kind of estimator as well. Then without fit and without transform, check_estimator should not do much to them at present.

@jnothman
Member

@orausch, will you be available to continue this, or should we consider passing it onto someone else?

@orausch
Author

orausch commented Aug 30, 2020

@jnothman I'm happy to contribute some more to this, but the next steps/goals are currently a bit unclear to me (especially because it has been hard to reach consensus in the SLEP). Of course it's also easier to invest time with a clear plan/requirements for merging.

Do you have any particular goals in mind?

@jnothman
Member

jnothman commented Aug 30, 2020 via email

Labels
Needs Decision (Requires decision)

Projects
None yet

Development

Successfully merging this pull request may close these issues:

  • Feature Request: Pipelining Outlier Removal
  • Allow for Transformers on y

8 participants