[RFC] Allow transformers on y #4552


Closed
wants to merge 2 commits

Conversation

amueller
Member

@amueller amueller commented Apr 8, 2015

Shot at #4143.

Adds a new method to the interface, with the uninspired name pipe, with the signature

def pipe(self, X=None, y=None):
    ...
    return Xt, yt

example:

from sklearn.preprocessing.label import _LabelTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression


class TargetScaler(StandardScaler, _LabelTransformer):
    pass

X, y = make_regression()
Xt, yt = TargetScaler().fit_pipe(X, y)
print(y.mean())
print(yt.mean())

15.5246170559
-1.11022302463e-18

Another (unnecessary) example:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target_names[iris.target]
pipe = make_pipeline(LabelEncoder(), DecisionTreeClassifier()).fit(X, y)
print(pipe.score(X, y))
print(pipe.predict(X))

1.0
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]

(the labels got mapped to numbers again)

@amueller
Member Author

amueller commented Apr 8, 2015

This doesn't seem very useful to me at the moment. I think I'd like to add the "resampling". I don't care if FeatureUnion breaks, I think that is on the user.
What I do care about is a meaningful distinction between training and prediction behavior.

Are there any other good use-cases for transforming y?

@amueller
Member Author

amueller commented Apr 8, 2015

ping @GaelVaroquaux @jnothman

@ogrisel
Member

ogrisel commented Apr 11, 2015

Why not make transform return (X_transformed, y_transformed) whenever y is not None, instead of introducing a new method? I am a bit worried about introducing a whole new set of methods. On the other hand, this approach is more explicit.
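
(For illustration only: a minimal sketch of this overloaded-return idea. It is not part of this PR or of scikit-learn, and the class name is made up.)

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DropNaNRows(BaseEstimator, TransformerMixin):
    """Drop rows of X containing NaN, together with the matching y entries."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mask = ~np.isnan(X).any(axis=1)
        if y is None:
            return X[mask]
        # Overloaded return type: a tuple whenever y is supplied.
        return X[mask], np.asarray(y)[mask]

# Usage: Xt, yt = DropNaNRows().fit(X, y).transform(X, y)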

@ogrisel
Member

ogrisel commented Apr 11, 2015

I should have read the description of the linked issue before commenting...

@amueller
Member Author

Pinging #3855, which discusses changing n_samples in a pipeline, for visibility.

@amueller amueller changed the title [WIP] Allow transformers on y [RFC] Allow transformers on y May 1, 2015
@amueller
Member Author

@GaelVaroquaux after thinking about it a bit more, I think we shouldn't deprecate transform. In 90% of use-cases, you don't want to return y. We had a hard time coming up with any use cases apart from data loading and resampling.

I can add the resampling as a use-case here, but otherwise I think this is actually good to go.
I have no strong feelings about the name. How about transform_Xy?

@amueller amueller changed the title [RFC] Allow transformers on y [MRG] Allow transformers on y Jun 11, 2015
Parameters
----------
X : iterable
Data to inverse transform. Must fulfill output requirements of the
Member

"Data to inverse transform"?

Member Author

Damn, the docstrings are not finished yet, it seems. I'll fix them now and also add some narrative docs.

@jnothman
Member

  • pipe needs to have capacity for altering other args (e.g. sample_weight).
  • What case, in a pipeline context, is the method pipe intended for, where the target variable is observed but we're not fitting the model?
  • this doesn't help the case where the number of samples should change (although sample_weight support would allow a subset of such changes)
  • it needs to be clear that the pipeline will be scored in terms of the initial y space, although I'm not certain that this is always what's desired.

I think we need a clear set of use-cases (described in terms of what happens at train time, what happens at test), preferably supported by research papers where these techniques are used, to design a careful solution to generalised transform operations. With only fairly trivial use-cases to motivate this PR, I don't think this is it, sadly.

@amueller
Member Author

  1. I think adding other args would be simplest in the context of sample_props. It could for the moment return X, y, fit_params.
  2. You mean when do you need to do pipe(X, y)? For scoring, I think. Otherwise not sure.
  3. why not? [at least without sample weights]
  4. I was operating under the assumption that scoring is done on the transformed y, with the use-case of preprocessing, loading or generating y in mind. What would be a case where you want the original space?

Why is subsampling a trivial use-case? It is a somewhat trivial operation that we can't currently do.

@amueller
Member Author

I agree, though, that the examples I gave at the top are not very inspiring. I should have added the subsampling from the beginning, and maybe a loading example.

@jnothman
Member

Perhaps I misunderstood and this is more widely applicable than I thought. But unless I'm much mistaken, it's not finished.

How do I use a custom scorer on a Pipeline containing one of these? Currently the scorers call predict(X) or decision_function(X), etc. They would need to call the not-yet-extant predict_pipe(X, y) etc. (Maybe the fact that this is only applicable to Pipelines is a reason that custom scorers should be going through an estimator's score function, not be external to it.)

What is correct behaviour when a Piper appears in a FeatureUnion?

@@ -272,6 +278,22 @@ def inverse_transform(self, X):
return Xt

@if_delegate_has_method(delegate='_final_estimator')
Member

No, it's applicable if any of the estimators supports pipe and the last supports transform or pipe.

@AlexisMignon
Contributor

Has it been considered to add a specific mixin, "TargetTransformer"? It would be required to implement "inverse_transform" so that pipelines can apply the transformation back when "predict" is called. And it would not be so hard to modify the pipelines to detect this mixin and apply the specific transforms needed. I see a few reasons to add this new mixin:

  1. Within a pipeline, targets should be treated differently from 'X' data since when "predict" is called on the pipeline the inverse transformations need to be applied on the prediction. Having a specific Mixin makes it easy to detect them and apply ad hoc treatments.

  2. Everything else is kept unchanged. Only the pipeline has to be changed, and specific transformers can easily be derived from standard ones by adding the TargetTransformer mixin.

Another possibility would be to add extra information when building the pipeline:

p = Pipeline([("scaleX", StandardScaler()), ("scaleY", StandardScaler(), "y"),
              ("the_model", MyModel())])

or even

p = Pipeline([("scaleXY", StandardScaler(), "Xy"), ("the_model", MyModel())])

In this case there is no need to change the API; only the pipeline needs to be changed. Again, "inverse_transform" needs to be implemented on transformers applied to targets.

What do you think? Or is the design already chosen?
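
(For illustration only: a rough sketch of the mixin idea. None of this is existing scikit-learn API, and the pipeline-side changes it would need are only hinted at in the comments.)

from sklearn.preprocessing import StandardScaler


class TargetTransformer(object):
    """Marker mixin: a step carrying this class is meant to be applied to y.

    Implementations are expected to provide transform and inverse_transform,
    so that a pipeline can map predictions back to the original target space.
    """


class TargetStandardScaler(StandardScaler, TargetTransformer):
    """A standard scaler reused on y simply by adding the marker mixin."""


# A mixin-aware pipeline could then do, roughly:
#     if isinstance(step, TargetTransformer):
#         y = step.fit_transform(y.reshape(-1, 1)).ravel()
# and call step.inverse_transform(...) on the output of predict.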

@amueller
Member Author

@AlexisMignon The design is definitely not chosen yet. I think your use-case might be better handled with meta-estimators, though.
This PR is mostly to support subsampling and creating labels on the fly.

@amueller
Member Author

@jnothman thank you for your comments, I'll work on them. I have to think about what happens with a custom scorer.

@amueller
Member Author

Meditating on this a bit longer, I'm not sure whether subsampling wouldn't be better done with a meta-estimator.

@amueller
Member Author

@ogrisel @GaelVaroquaux @agramfort @arjoly opening again in light of our discussion.

@amueller amueller reopened this Oct 22, 2015
@jnothman
Member

That sounds ominous.


@amueller
Member Author

@jnothman it's too bad you are not here. We are having a very animated discussion (mostly me and Gael). To give you a very short summary of what we discussed:

  • This is the way forward (there are many use cases, some of which I agree with ;).
  • It is really important for pipelines.
  • The way it will work is adding a new type of object that has fit_pipe and transform_pipe (names up for discussion) and no other methods (see the sketch after this comment). This will make sure that fit-time and transform-time behavior can differ in a consistent way.
  • There will probably not be a fit method, as, say, for undersampling, only fit_pipe will do the undersampling, so calling fit would be pretty much useless.
  • We could add these methods to the existing transformers, but we will not for the moment (because of explosion of methods).

Also, there will now probably be "scikit-learn advancement proposals" (slap) in the spirit of PEPs that have user stories for API changes. @GaelVaroquaux will open a new repo which will store RST files in a minute.
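
(Illustrative sketch only, using the method names from the summary above. Nothing below is existing scikit-learn API, and how a pipeline would call these methods is left open.)

import numpy as np


class RandomUndersampler(object):
    """Resamples only at fit time; has no plain fit/transform methods."""

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_pipe(self, X, y):
        # Undersample so every class keeps as many samples as the rarest one.
        rng = np.random.RandomState(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        n_min = counts.min()
        keep = np.concatenate([
            rng.choice(np.where(y == c)[0], n_min, replace=False)
            for c in classes
        ])
        return X[keep], y[keep]

    def transform_pipe(self, X, y=None):
        # At test/prediction time the data is passed through unchanged.
        return X, y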

@amueller
Member Author

This is actually very minimal surgery, as it only makes some changes to the pipeline that are relatively easy to understand.

Why do Scorers need to know about the new method? They don't know about transformers, right?

@jnothman
Member

Oy will I give you a slap.

;)

I'm a bit full up this week, and not able to think about the proposal in detail, but hope to at some later point.


@amueller
Member Author

cool, that would be great :)

@arjoly arjoly changed the title [MRG] Allow transformers on y [RFC] Allow transformers on y Oct 22, 2015
@versatran01

Guys, any updates on this issue?

I was trying to use Pipeline for an object detection application, where my X is a color image and y is a set of binary images that label which pixel is which. Not being able to transform y is very inconvenient for me; I end up having to subclass Pipeline and return (Xt, yt) from each transform function.

@EelcoHoogendoorn

I'd like to second @versatran01. I am also working on object detection. The logical approach is to convert images and their annotations to feature vectors of sliding windows and labels. Of course such processing can be done outside of sklearn, but there are many hyperparameters involved in these transformations, which impact the y-vector as well; and since we would like to perform a grid search on them, we would want these transforms to be part of the pipeline.

Subclassing Pipeline to pass through y seems to be the preferred solution; but is there a good reason not to include this behavior in the general pipeline class? That is, allow transforms to return an X, y tuple, and if a modified y is returned, propagate that through the pipeline?

@amueller
Member Author

amueller commented Oct 8, 2016

@EelcoHoogendoorn yeah, because it breaks the current API and therefore possibly existing user code. There is no transformer in scikit-learn that changes y, so that addition is very unnatural and totally breaks the API contract.

The object detection case is certainly valid, but I think our pipeline is not very well suited for that. Feel free to do a PR to scikit-learn contrib with your transformers that change y and a pipeline that can work with that. There's actually already one in imblearn: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py

@EelcoHoogendoorn @versatran01 do your transformers change y during fit_transform, during transform or both?

@versatran01

@amueller Thanks for the pointer to imbalanced-learn.

I have a class ImagePipeline that subclasses Pipeline, and I modified both fit_transform and transform so that, as long as y is not None, it will apply the transformers to y (whether a transformer changes y depends on its implementation).

I agree with you that the current API is not well suited for this particular case and should not be changed for compatibility reasons. And I'm happy with my current solution. I'd like to see scikit-learn have better support for image-related learning tasks.
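
(A much simplified, stand-alone illustration of that approach. It is not an actual sklearn.pipeline.Pipeline subclass and not part of this PR; the class name is made up.)

class XYPipeline(object):
    """Pipeline-like chain where a transform step may also return a new y."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        for name, step in self.steps[:-1]:
            out = step.fit_transform(X, y)
            # A step may return (Xt, yt) to replace the targets as well.
            X, y = out if isinstance(out, tuple) else (out, y)
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        for name, step in self.steps[:-1]:
            out = step.transform(X)
            X = out[0] if isinstance(out, tuple) else out
        return self.steps[-1][1].predict(X)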

@sv3ndk

sv3ndk commented Oct 18, 2016

Hi all,

Based on your comments above, I coded two very small Pipeline subclasses that let a transformer behave as a predictor and vice versa. This is probably very similar to the ImagePipeline that @versatran01 mentioned.

This seems to do the trick for including a post-processor as part of a cross-validated pipeline, and might even allow coding stacked ensembles as scikit-learn pipelines (not tested yet).

As far as I can tell, many people are interested in this subject, so even if my solution is neither groundbreaking nor new, I blogged about it. I assume it's OK to post the link here? Here it is:

Using model post-processor within scikit-learn pipelines

@amueller
Member Author

@svendx4f thanks! So one thing with your example is that you "only" process after the predictions. Often, for example if you want to fit a log-linear model, you want to transform y before training and then transform back after prediction. That would be possible with your pipeline, but there would be no way to ensure that the transformation and inverse transformation match.

Also this might make your code more readable: #7608
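
(To make the log-linear example concrete, an illustrative sketch only: a small meta-estimator that keeps the forward and inverse transforms of y in one place so they cannot get out of sync. The class below is hypothetical and not part of this PR.)

import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone


class LogTargetRegressor(BaseEstimator, RegressorMixin):
    """Fit the wrapped estimator on log(y) and exponentiate its predictions."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        # Train on the transformed target.
        self.estimator_ = clone(self.estimator).fit(X, np.log(y))
        return self

    def predict(self, X):
        # Map predictions back to the original target space.
        return np.exp(self.estimator_.predict(X))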

@amueller
Member Author

@svendx4f Actually, I kinda don't like your solution, because you change the meaning of X and y within the nested pipelines, and that's easy to get wrong. It's pretty subtle to see which steps are applied to X and which to y. But I acknowledge that we don't have a nice solution right now and any work-around is better than no solution ;) And you managed to not add a new method to the API, so that's a benefit.

@sv3ndk

sv3ndk commented Oct 19, 2016

Hi @amueller, thanks for the feedback.

I acknowledge my solution probably only works in some specific cases; it's working for me at the moment at least (so far so good).

Maybe I fail to grasp some subtlety between X and y? To me, the output of any predictor/transformer is "its y", which becomes the X of the next one. What is specific to, say, linear regression, such that its transformation of X is called y, whereas the output of PCA or a standard scaler is still called X? In both cases they need to be trained beforehand, and in both cases the dimension of the output is potentially different from that of the input.

I understand the difference from an ML point of view of course, but why is it relevant from a pipeline "plumbing" point of view? Can't the pipeline ignore the semantics of its components and see them all as thingies that need to be trained and can then transform inputs into outputs?

I like your operators, they remove tons of parentheses :)

@jnothman
Member

I think y is characterised by being unseen at prediction time but necessary for model evaluation.

Yes, some predictions can be used as features, and some transformed feature spaces as predictions.


@amueller
Member Author

@svendx4f I found it hard to follow the code because it was implicit whether a transformation was applied to X or y. It's not wrong from a programming point of view, but I find the API confusing and I think it makes it easy to write nonsensical code ;)

@gminorcoles

Is there an implementation of this anywhere that is considered OK? I do not find it convincing that the reason X,y transforms are bad is that all the existing code only knows about X transforms. I need to support arbitrary X,y transformations, and I feel that this is a superset of the features required by X-only transforms. I would prefer not to maintain my own XY pipeline, which is what I am trying to do now.

@jnothman
Member

jnothman commented Mar 28, 2017 via email
