[RFC] Allow transformers on y #4552


Closed
wants to merge 2 commits

Conversation

amueller
Member

@amueller amueller commented Apr 8, 2015

Shot at #4143.

Adds a new method to the interface, with the uninspired name pipe, with the signature

def pipe(self, X=None, y=None):
    ...
    return Xt, yt

example:

from sklearn.preprocessing.label import _LabelTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression


class TargetScaler(StandardScaler, _LabelTransformer):
    pass

X, y = make_regression()
Xt, yt = TargetScaler().fit_pipe(X, y)
print(y.mean())
print(yt.mean())

15.5246170559
-1.11022302463e-18

Another (unnecessary) example:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]
pipe = make_pipeline(LabelEncoder(), DecisionTreeClassifier()).fit(X, y)
print(pipe.score(X, y))
print(pipe.predict(X))

1.0
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

(the labels got mapped to numbers again)

@amueller
Member Author

amueller commented Apr 8, 2015

This doesn't seem very useful to me at the moment. I think I'd like to add the "resampling". I don't care if FeatureUnion breaks, I think that is on the user.
What I do care about is a meaningful distinction between training and prediction behavior.

Are there any other good use-cases for transforming y?

@amueller
Member Author

amueller commented Apr 8, 2015

ping @GaelVaroquaux @jnothman

@ogrisel
Member

ogrisel commented Apr 11, 2015

Why not make transform return (X_transformed, y_transformed) whenever y is not None, instead of introducing a new method? I am a bit worried about introducing a whole new set of methods. On the other hand, this approach is more explicit.
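
(For illustration only: a minimal sketch of this overloaded-return idea. It is not part of this PR or of scikit-learn, and the class name is made up.)

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DropNaNRows(BaseEstimator, TransformerMixin):
    """Drop rows of X containing NaN, together with the matching y entries."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mask = ~np.isnan(X).any(axis=1)
        if y is None:
            return X[mask]
        # Overloaded return type: a tuple whenever y is supplied.
        return X[mask], np.asarray(y)[mask]

# Usage: Xt, yt = DropNaNRows().fit(X, y).transform(X, y)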

@ogrisel
Member

ogrisel commented Apr 11, 2015

I should have read the description of the linked issue before commenting...

@amueller
Member Author

Pinging #3855, which discusses changing n_samples in a pipeline, for visibility.

@amueller amueller changed the title [WIP] Allow transformers on y [RFC] Allow transformers on y May 1, 2015
@amueller
Member Author

@GaelVaroquaux after thinking about it a bit more, I think we shouldn't deprecate transform. In 90% of use-cases, you don't want to return y. We had a hard time coming up with any use cases apart from data loading and resampling.

I can add the resampling as a use-case here, but otherwise I think this is actually good to go.
I have no strong feelings about the name. How about transform_Xy?

@amueller amueller changed the title [RFC] Allow transformers on y [MRG] Allow transformers on y Jun 11, 2015
Parameters
----------
X : iterable
Data to inverse transform. Must fulfill output requirements of the
Member

"Data to inverse transform"?

Member Author

Damn, the docstrings are not finished yet, it seems. I'll fix them now and also add some narrative docs.

@jnothman
Member

  • pipe needs to have capacity for altering other args (e.g. sample_weight).
  • What case, in a pipeline context, is the method pipe intended for, where the target variable is observed but we're not fitting the model?
  • this doesn't help the case where the number of samples should change (although sample_weight support would allow a subset of such changes)
  • it needs to be clear that the pipeline will be scored in terms of the initial y space, although I'm not certain that this is always what's desired.

I think we need a clear set of use-cases (described in terms of what happens at train time, what happens at test), preferably supported by research papers where these techniques are used, to design a careful solution to generalised transform operations. With only fairly trivial use-cases to motivate this PR, I don't think this is it, sadly.

@amueller
Member Author

  1. I think adding other args would be simplest in the context of sample_props. It could for the moment return X, y, fit_params.
  2. You mean when do you need to do pipe(X, y)? For scoring, I think. Otherwise not sure.
  3. why not? [at least without sample weights]
  4. I was operating under the assumption that scoring is done on the transformed y, with the use-case of preprocessing, loading or generating y in mind. What would be a case where you want the original space?

Why is subsampling a trivial use-case? It is a somewhat trivial operation that we can't currently do.

@amueller
Member Author

I agree, though, that the examples I gave at the top are not very inspiring. I should have added the subsampling from the beginning, and maybe a loading example.

@jnothman
Member

Perhaps I misunderstood and this is more widely applicable than I thought. But unless I'm much mistaken, it's not finished.

How do I use a custom scorer on a Pipeline containing one of these? Currently the scorers call predict(X) or decision_function(X), etc. They would need to call the not-yet-extant predict_pipe(X, y) etc. (Maybe the fact that this is only applicable to Pipelines is a reason that custom scorers should be going through an estimator's score function, not be external to it.)

What is correct behaviour when a Piper appears in a FeatureUnion?

@@ -272,6 +278,22 @@ def inverse_transform(self, X):
return Xt

@if_delegate_has_method(delegate='_final_estimator')
Member

No, it's applicable if any of the estimators supports pipe and the last supports transform or pipe.

@AlexisMignon
Contributor

Has it been considered to add a specific mixin, "TargetTransformer"? It would be required to implement "inverse_transform" so that pipelines can apply the transformation back when "predict" is called. And it would not be so hard to modify the pipelines to detect this mixin and apply the specific transforms needed. I see a few reasons to add this new mixin:

  1. Within a pipeline, targets should be treated differently from 'X' data since when "predict" is called on the pipeline the inverse transformations need to be applied on the prediction. Having a specific Mixin makes it easy to detect them and apply ad hoc treatments.

  2. Everything else is kept unchanged. Only the pipeline has to be changed, and specific transformers can easily be derived from standard ones by adding the TargetTransformer mixin.

Another possibility would be to add extra information when building the pipeline:

p = Pipeline([("scaleX", StandardScaler()), ("scaleY", StandardScaler(), "y"),
              ("the_model", MyModel())])

or even

p = Pipeline([("scaleXY", StandardScaler(), "Xy"), ("the_model", MyModel())])

In this case there is no need to change the API; only the pipeline needs to be changed. Again, "inverse_transform" needs to be implemented on transformers applied to targets.

What do you think? Or is the design already chosen?
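
(For illustration only: a rough sketch of the mixin idea. None of this is existing scikit-learn API, and the pipeline-side changes it would need are only hinted at in the comments.)

from sklearn.preprocessing import StandardScaler


class TargetTransformer(object):
    """Marker mixin: a step carrying this class is meant to be applied to y.

    Implementations are expected to provide transform and inverse_transform,
    so that a pipeline can map predictions back to the original target space.
    """


class TargetStandardScaler(StandardScaler, TargetTransformer):
    """A standard scaler reused on y simply by adding the marker mixin."""


# A mixin-aware pipeline could then do, roughly:
#     if isinstance(step, TargetTransformer):
#         y = step.fit_transform(y.reshape(-1, 1)).ravel()
# and call step.inverse_transform(...) on the output of predict.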

@amueller
Member Author

@AlexisMignon The design is definitely not chosen yet. I think your use-case might be better handled with meta-estimators, though.
This PR is mostly to support subsampling and creating labels on the fly.

@amueller
Member Author

@jnothman thank you for your comments, I'll work on them. I have to think about what happens with a custom scorer.

@amueller
Member Author

Meditating on this a bit longer, I'm not sure whether subsampling wouldn't be better done with a meta-estimator.

@amueller
Member Author

@ogrisel @GaelVaroquaux @agramfort @arjoly opening again in light of our discussion.

@amueller amueller reopened this Oct 22, 2015
@jnothman
Member

That sounds ominous.


@amueller
Member Author

@jnothman it's too bad you are not here. We are having a very animated discussion (mostly me and Gael). To give you a very short summary of what we discussed:

  • This is the way forward (there are many use cases, some of which I agree with ;).
  • It is really important for pipelines.
  • The way it will work is adding a new type of object that has fit_pipe and transform_pipe (names up for discussion) and no other methods (see the sketch after this comment). This will make sure that fit-time and transform-time behavior can differ in a consistent way.
  • There will probably not be a fit method, as, say, for undersampling, only fit_pipe will do the undersampling, so calling fit would be pretty much useless.
  • We could add these methods to the existing transformers, but we will not for the moment (because of explosion of methods).

Also, there will now probably be "scikit-learn advancement proposals" (slap) in the spirit of PEPs that have user stories for API changes. @GaelVaroquaux will open a new repo which will store RST files in a minute.
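
(Illustrative sketch only, using the method names from the summary above. Nothing below is existing scikit-learn API, and how a pipeline would call these methods is left open.)

import numpy as np


class RandomUndersampler(object):
    """Resamples only at fit time; has no plain fit/transform methods."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_pipe(self, X, y):
        # Undersample so every class keeps as many samples as the rarest one.
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.where(y == c)[0], n_min, replace=False)
            for c in classes
        ])
        return X[keep], y[keep]

    def transform_pipe(self, X, y=None):
        # At test/prediction time the data is passed through unchanged.
        return X, y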

@amueller
Member Author

This is actually very minimal surgery, as it only makes some changes to the pipeline that are relatively easy to understand.

Why do Scorers need to know about the new method? They don't know about transformers, right?

@jnothman
Member

Oy will I give you a slap.

;)

I'm a bit full up this week, and not able to think about the proposal in detail, but hope to at some later point.


@amueller
Member Author

cool, that would be great :)

@arjoly arjoly changed the title [MRG] Allow transformers on y [RFC] Allow transformers on y Oct 22, 2015
@versatran01

Guys, any updates on this issue?

I was trying to use Pipeline for an object detection application, where my X is a color image and y is a set of binary images that label which pixel is which. Not being able to transform y is very inconvenient for me; I end up having to subclass Pipeline and return (Xt, yt) from each transform function.

@EelcoHoogendoorn

I'd like to second @versatran01. I am also working on object detection. The logical approach is to convert images and their annotations to feature vectors of sliding windows and labels. Of course such processing can be done outside of sklearn, but there are many hyperparameters involved in these transformations, which impact the y-vector as well; and since we would like to perform a grid search on them, we would want these transforms to be part of the pipeline.

Subclassing Pipeline to pass through y seems to be the preferred solution; but is there a good reason not to include this behavior in the general pipeline class? That is, allow transforms to return an X, y tuple, and if a modified y is returned, propagate that through the pipeline?

@amueller
Member Author

amueller commented Oct 8, 2016

@EelcoHoogendoorn yeah, because it breaks the current API and therefore possibly existing user code. There is no transformer in scikit-learn that changes y, so that addition is very unnatural and totally breaks the API contract.

The object detection case is certainly valid, but I think our pipeline is not very well suited for that. Feel free to do a PR to scikit-learn contrib with your transformers that change y and a pipeline that can work with that. There's actually already one in imblearn: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py

@EelcoHoogendoorn @versatran01 do your transformers change y during fit_transform, during transform or both?

@versatran01

@amueller Thanks for the pointer to imbalanced-learn.

I have a class ImagePipeline that subclasses Pipeline, and I modified both fit_transform and transform so that, as long as y is not None, it will apply the transformers to y (whether a transformer changes y depends on its implementation).

I agree with you that the current API is not well suited for this particular case and should not be changed for compatibility reasons. And I'm happy with my current solution. I'd like to see scikit-learn have better support for image-related learning tasks.
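
(A much simplified, stand-alone illustration of that approach. It is not an actual sklearn.pipeline.Pipeline subclass and not part of this PR; the class name is made up.)

class XYPipeline(object):
    """Pipeline-like chain where a transform step may also return a new y."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for name, step in self.steps[:-1]:
            out = step.fit_transform(X, y)
            # A step may return (Xt, yt) to replace the targets as well.
            X, y = out if isinstance(out, tuple) else (out, y)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for name, step in self.steps[:-1]:
            out = step.transform(X)
            X = out[0] if isinstance(out, tuple) else out
        return self.steps[-1][1].predict(X)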

@sv3ndk

sv3ndk commented Oct 18, 2016

Hi all,

Based on your comments above, I coded two very small Pipeline subclasses that let a transformer behave as a predictor and vice versa. This is probably very similar to the ImagePipeline that @versatran01 mentioned.

This seems to do the trick for including a post-processor as part of a cross-validated pipeline, and might even allow coding stacked ensembles as scikit-learn pipelines (not tested yet).

As far as I can tell, many people are interested in this subject, so even if my solution is neither groundbreaking nor new, I blogged about it. I assume it's OK to post the link here? Here it is:

Using model post-processor within scikit-learn pipelines

@amueller
Member Author

@svendx4f thanks! So one thing with your example is that you "only" process after the predictions. Often, for example if you want to fit a log-linear model, you want to transform y before training and then transform back after prediction. That would be possible with your pipeline, but there would be no way to ensure that the transformation and inverse transformation match.

Also this might make your code more readable: #7608
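
(To make the log-linear example concrete, an illustrative sketch only: a small meta-estimator that keeps the forward and inverse transforms of y in one place so they cannot get out of sync. The class below is hypothetical and not part of this PR.)

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class LogTargetRegressor(BaseEstimator, RegressorMixin):
    """Fit the wrapped estimator on log(y) and exponentiate its predictions."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Train on the transformed target.
        self.estimator_ = clone(self.estimator).fit(X, np.log(y))
        return self

    def predict(self, X):
        # Map predictions back to the original target space.
        return np.exp(self.estimator_.predict(X))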

@amueller
Member Author

@svendx4f Actually, I kinda don't like your solution, because you change the meaning of X and y within the nested pipelines, and that's easy to get wrong. It's pretty subtle to see which steps are applied to X and which to y. But I acknowledge that we don't have a nice solution right now and any work-around is better than no solution ;) And you managed to not add a new method to the API, so that's a benefit.

@sv3ndk

sv3ndk commented Oct 19, 2016

Hi @amueller, thanks for the feedback.

I acknowledge my solution probably only works in some specific cases; it's working for me at the moment at least (so far so good).

Maybe I fail to grasp some subtlety between X and y? To me, the output of any predictor/transformer is "its y", which becomes the X of the next one. What is specific to, say, linear regression, such that its transformation of X is called y, whereas the output of PCA or a standard scaler is still called X? In both cases they need to be trained beforehand, and in both cases the dimension of the output is potentially different from that of the input.

I understand the difference from an ML point of view of course, but why is it relevant from a pipeline "plumbing" point of view? Can't the pipeline ignore the semantics of its components and see them all as thingies that need to be trained and can then transform inputs into outputs?

I like your operators, they remove tons of parentheses :)

@jnothman
Member

I think y is characterised by being unseen at prediction time but necessary for model evaluation.

Yes, some predictions can be used as features, and some transformed feature spaces as predictions.


@amueller
Member Author

@svendx4f I found it hard to follow the code because it was implicit whether a transformation was applied to X or y. It's not wrong from a programming point of view, but I find the API confusing and I think it makes it easy to write nonsensical code ;)

@gminorcoles

Is there an implementation of this anywhere that is considered OK? I do not find it convincing that the reason X,y transforms are bad is that all the existing code only knows about X transforms. I need to support arbitrary X,y transformations, and I feel that this is a superset of the features required by X-only transforms. I would prefer not to maintain my own XY pipeline, which is what I am trying to do now.

@jnothman
Member

jnothman commented Mar 28, 2017 via email
