
[WIP] Resamplers #13269


Open
wants to merge 56 commits into main

Conversation

@orausch

@orausch orausch commented Feb 26, 2019

Reference Issues/PRs

Closes #4143
Closes #9630
Related SLEP:
scikit-learn/enhancement_proposals#15

What does this implement/fix? Explain your changes.

Implement estimators that can change the number of samples during fit

Other comments

How could resamplers work with ColumnTransformer or FeatureUnion?
Does the API allow the modification of y values? If so, should we reimplement TransformedTargetRegressor under this API?

Plan

  • Handle kwargs behavior
  • Rewrite all relevant common_tests involving fit to test fit_resample (see common test failures for FilterNaN)
  • FilterNaN Tests
  • More estimator checks for resamplers
  • More ResampledTrainer tests (including with FilterNaN and outlier rejectors)
  • Docstrings for ResampledTrainer
  • Port RandomOverSampler and RandomUnderSampler from imblearn
  • Sphinx Example(s)

@amueller
Member

did we discuss what we're going to do with this api-wise?

@orausch
Author

orausch commented Feb 26, 2019

did we discuss what we're going to do with this api-wise?

For now the plan is to follow #3855 (comment). The unclear part of the API is how array-like fit parameters, such as sample_weight, will be handled, since they will need to be resampled in pipelines.

@dmbee
Contributor

dmbee commented Feb 27, 2019

If you use a sample_props dictionary to contain properties mapped to the samples (e.g. sample_weight), the pipeline implementation for resampling transformers could be as follows:

for step_idx, (name, transformer) in enumerate(self.steps[:-1]):

    # collect the fit params addressed to this step (e.g. "step__sample_weight")
    fit_params_step = {key.split('__', 1)[1]: fit_params.pop(key)
                       for key in list(fit_params) if key.startswith(name + "__")}

    if isinstance(transformer, ResamplerMixin):
        # sample_props is the list of keys identifying sample-aligned properties
        props = {key: fit_params[key] for key in sample_props}
        X, y, props = transformer.fit_resample(X, y, props, **fit_params_step)
        fit_params.update(props)  # replace props with their resampled versions

    else:
        ...  # is a regular transformer

@jnothman
Member

jnothman commented Feb 27, 2019 via email

@orausch
Author

orausch commented Feb 27, 2019

Since we didn't get to discuss it today, here's a summary of (by my understanding) the current plan for sample_props; a rough code sketch follows at the end of this comment:

  • introduce a parameter to resamplers named sample_props: a dict mapping strings to array-likes of shape (n_samples,)
  • resamplers take sample_props as an argument to fit_resample. Upon resampling, they resample the arrays in sample_props accordingly.
  • resamplers return X, y, sample_props if sample_props was passed. If no sample_props was passed, the resampler returns X, y.
  • while fitting, the pipeline will check the fit signature of the estimator/transformer it is fitting. If an argument name is both a fit parameter and a key in sample_props, the pipeline will pass sample_props[arg] to fit/fit_transform.
  • furthermore, the pipeline will correctly pass along the modified sample_props when resamplers are in the pipeline.

Let me know if I'm understanding something wrong.
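
For concreteness, here is a rough sketch of how a pipeline step could apply these rules. The helper name and call convention are illustrative assumptions on my part, not the agreed API:

import inspect

def _fit_step(estimator, X, y, fit_params_step, sample_props):
    # Hypothetical helper, only illustrating the plan summarised above.
    if hasattr(estimator, "fit_resample"):
        # Resamplers receive sample_props and return resampled versions of
        # X, y and every array in sample_props.
        return estimator.fit_resample(X, y, sample_props=sample_props)
    # For ordinary estimators, forward any sample_props entry whose key
    # matches a parameter of fit (e.g. sample_weight).
    accepted = inspect.signature(estimator.fit).parameters
    extra = {key: value for key, value in sample_props.items() if key in accepted}
    estimator.fit(X, y, **fit_params_step, **extra)
    return X, y, sample_props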

@dmbee
Contributor

dmbee commented Feb 27, 2019

I think a cleaner / simpler way to do it would be to specify which fit_params are sample_props. This can be accomplished using a list of keys (as per the implementation I have above). But I think your way works too.

Also, I think it's better to have a consistent return signature, and it's OK to return None for sample_props if none were passed into the resampler.

@jnothman
Member

jnothman commented Feb 28, 2019 via email

@dmbee
Contributor

dmbee commented Feb 28, 2019

I don't think we want a parameter named sample_props, but that's unresolved. In my mind, these props are exactly what we have as sample_weight and groups now. I also think (but not sure) a resampler should be permitted to return props (weights especially) without any being passed in. I'd be tempted to say that resamplers should always generate (X, y, dict of props) but maybe we should initially propose and handle just the (X, y) case. Sorry I've not yet reviewed this.

This is a really good point - a really clean solution would be to check fit_params for items that are sample_props (i.e. have len equal to len(X)). This avoids passing multiple dictionaries / parameter collections.
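
As a rough illustration of that heuristic (not an API proposal), the sample-aligned items could be split from the remaining fit params by comparing their first dimension with n_samples; note it can misfire if an unrelated parameter happens to have the same length:

import numpy as np

def split_sample_props(X, fit_params):
    # Hypothetical helper: treat any fit param whose first dimension equals
    # n_samples as a sample-aligned property.
    n_samples = len(X)
    props, other = {}, {}
    for key, value in fit_params.items():
        arr = np.asarray(value)
        if arr.ndim >= 1 and arr.shape[0] == n_samples:
            props[key] = arr
        else:
            other[key] = value
    return props, other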

@orausch
Author

orausch commented Feb 28, 2019

After the discussion, I think it would be better to not implement sample_props in this PR (I was not aware that they would be such an API-wide change).

@glemaitre
Member

glemaitre commented Feb 28, 2019 via email

@jnothman
Member

jnothman commented Mar 1, 2019

Ok so I will review tomorrow and start to make a SLEP or update one that is opened. We can check that together.

I discussed this briefly with Guillaume last night. As much as it would be great to just try to merge this, I think we now understand a few things about it (e.g. not supporting fit_transform, or transform in the same object; sample_prop output and hence wanting to support alternatively 2- or 3-tuple output) that are worth describing formally and getting wider review.

@orausch
Author

orausch commented Mar 1, 2019

describing formally and getting wider review

I could write a draft SLEP if that's useful (or update an existing one).

@orausch changed the title from [WIP] Resamplers to [WIP] Outlier Rejection on Mar 1, 2019
@orausch
Author

orausch commented Jul 3, 2019

@jnothman I've added a function check_X_y_kwargs that validates kwargs along with X and y. Since you said all estimators should have fit kwargs of shape (n_samples,), maybe this is something that should be applied across the codebase?

Note that I also found some estimators (KernelRidge, Ridge) that accept a float as sample_weight, so this check would break for them.
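
As a sketch only (the actual check_X_y_kwargs added in this PR may differ), such a check could be composed from existing validation utilities; the scalar sample_weight case mentioned above would indeed fail the length check here:

from sklearn.utils import check_X_y, check_array
from sklearn.utils.validation import check_consistent_length

def check_X_y_kwargs_sketch(X, y, kwargs):
    # Hypothetical sketch: validate X and y, then require every kwarg to be
    # an array-like aligned with the samples.
    X, y = check_X_y(X, y)
    checked = {}
    for key, value in kwargs.items():
        value = check_array(value, ensure_2d=False)
        check_consistent_length(X, value)  # enforces length n_samples
        checked[key] = value
    return X, y, checked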

@judahrand

judahrand commented Mar 27, 2020

@orausch is the pull request still active? This is a feature I dearly need in sklearn. I'm on the verge of implementing it myself in a separate project but, obviously, helping to get it into sklearn would be MASSIVELY preferable!

@jnothman I've heard rumblings that SLEP001 is preferable?

@orausch
Author

orausch commented Mar 27, 2020

@Jude188 the work here is stalled since we had some trouble reaching a decision in scikit-learn/enhancement_proposals#15. I think the next steps would be to sort the SLEP out and get it approved; this PR was more of an exploration/POC anyway.

@glemaitre
Member

@Jude188 You can look at imbalanced-learn, which provides compatible tools in the meantime.

Member

@jnothman jnothman left a comment

Taking a look at this after a long break.

"""Mixin class for all outlier detection resamplers in scikit-learn. Child
classes remove outliers from the dataset.
"""
_estimator_type = "outlier_rejector"
Member

it's problematic to change the type of existing outlier detectors.

"""
_estimator_type = "outlier_rejector"

def fit_resample(self, X, y, **kws):
Member

I think we should just be adding this to OutlierMixin rather than adding a new mixin and type.

@@ -918,6 +918,19 @@ Class APIs and Estimator Types
outliers have score below 0. :term:`score_samples` may provide an
unnormalized score per sample.

outlier rejector
Member

I think we should remove this and just add a note under outlier detector that if fit_resample is provided, it should act as an outlier rejector, returning the training data with outliers removed.
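
A minimal sketch of that behaviour (illustrative only, assuming array inputs and the usual +1/-1 convention of fit_predict):

import numpy as np

class OutlierRejectionSketch:
    # Hypothetical addition: any outlier detector providing fit_predict could
    # expose fit_resample that drops predicted outliers from the training data.
    def fit_resample(self, X, y, **kw):
        X, y = np.asarray(X), np.asarray(y)
        mask = self.fit_predict(X) == 1  # +1 for inliers, -1 for outliers
        kw_new = {key: np.asarray(value)[mask] for key, value in kw.items()}
        return X[mask], y[mask], kw_new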

doc/glossary.rst Outdated
@@ -949,6 +962,10 @@ Class APIs and Estimator Types
A purely :term:`transductive` transformer, such as
:class:`manifold.TSNE`, may not implement ``transform``.

resampler
resamplers
An estimator supporting :term:`fit_resample`.
Member

Suggested change
An estimator supporting :term:`fit_resample`.
An estimator supporting :term:`fit_resample`. This can be used in a
:class:`ResampledTrainer` to resample, augment or reduce the training
dataset passed to another estimator.
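
For context, a stand-in for the ResampledTrainer proposed in this PR could behave roughly as follows; the constructor signature and clone-free fitting are assumptions on my part, not the PR's actual implementation:

from sklearn.base import BaseEstimator

class ResampledTrainerSketch(BaseEstimator):
    # Assumed behaviour: resample the training data, fit the predictor on the
    # resampled data, and predict on unmodified data.
    def __init__(self, resampler, predictor):
        self.resampler = resampler
        self.predictor = predictor

    def fit(self, X, y, **kw):
        X_res, y_res, kw_res = self.resampler.fit_resample(X, y, **kw)
        self.predictor.fit(X_res, y_res, **kw_res)
        return self

    def predict(self, X):
        return self.predictor.predict(X)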

doc/glossary.rst Outdated
Comment on lines 1239 to 1241
A method on :term:`resamplers` which fits the estimator on a passed
dataset, and returns a new dataset. In the new dataset, samples may be
removed or added.
Member

Suggested change
A method on :term:`resamplers` which fits the estimator on a passed
dataset, and returns a new dataset. In the new dataset, samples may be
removed or added.
A method whose presence in an estimator is sufficient and necessary for
it to be a :term:`resampler`.
When called it should fit the estimator and return a new
dataset. In the new dataset, samples may be removed, added or modified.
In contrast to :term:`fit_transform`:
* X, y, and any other sample-aligned data may be generated;
* the samples in the returned dataset need not have any alignment or
correspondence to the input dataset.
This method has the signature ``fit_resample(X, y, **kw)`` and returns
a 3-tuple ``X_new, y_new, kw_new`` where ``kw_new`` is a dict mapping
names to data-aligned values that should be passed as fit parameters
to the subsequent estimator. Any keyword arguments passed in should be
resampled and returned, and if the resampler is not capable of
resampling the keyword arguments, it should raise a TypeError.
Ordinarily, this method is only called by a :class:`ResampledTrainer`,
which acts like a specialised pipeline for cases when the training data
should be augmented or resampled.
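
To make the contract concrete, a toy resampler following this description could look like the following (an illustrative sketch, neither the imblearn implementation nor part of this PR):

import numpy as np
from sklearn.base import BaseEstimator

class ToyUnderSampler(BaseEstimator):
    # Illustrative resampler: randomly under-sample every class to the size of
    # the minority class, resampling any sample-aligned keyword arguments.
    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y, **kw):
        X, y = np.asarray(X), np.asarray(y)
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        # Keyword arguments are resampled with the same indices; a resampler
        # that cannot handle a given kwarg should raise a TypeError instead.
        kw_new = {key: np.asarray(value)[keep] for key, value in kw.items()}
        return X[keep], y[keep], kw_new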


class NaNFilter(BaseEstimator):
"""
A resampler that removes samples containing NaN in X.
Member

Is this useful? There are likely to be NaNs in the test set. Unless we can provide a good demonstration of this, I'd leave it out of this PR.

yield check_sample_weights_invariance
yield check_estimators_fit_returns_self
yield partial(check_estimators_fit_returns_self, readonly_memmap=True)
if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'):
Member

I find this more readable as

Suggested change
if not hasattr(estimator, 'fit_resample') or hasattr(estimator, 'fit'):
if hasattr(estimator, 'fit') or not hasattr(estimator, 'fit_resample'):

but I don't really get why we need to have this kind of logic.

I'm wondering whether resamplers should not support fit, unless they serve a purpose as another kind of estimator as well. Then without fit and without transform, check_estimator should not do much to them at present.

@jnothman
Member

@orausch, will you be available to continue this, or should we consider passing it onto someone else?

@orausch
Author

orausch commented Aug 30, 2020

@jnothman I'm happy to contribute some more to this, but the next steps/goals are currently a bit unclear to me (especially because it has been hard to reach consensus in the SLEP). Of course it's also easier to invest time with a clear plan/requirements for merging.

Do you have any particular goals in mind?

@jnothman
Member

jnothman commented Aug 30, 2020 via email

Labels
Needs Decision (Requires decision)

Projects
None yet

Development

Successfully merging this pull request may close these issues:

  • Feature Request: Pipelining Outlier Removal
  • Allow for Transformers on y

8 participants