[RFC] Allow transformers on y #4552
Conversation
This doesn't seem very useful to me at the moment. I think I'd like to add the "resampling". I don't care if […]. Are there any other good use-cases for transforming y?
ping @GaelVaroquaux @jnothman
Why not make […]?
I should have read the description of the linked issue before commenting...
Pinging #3855, which discusses changing n_samples in a pipeline, for visibility.
@GaelVaroquaux after thinking about it a bit more, I think we shouldn't deprecate […]. I can add the resampling as a use-case here, but otherwise I think this is actually good to go.
    Parameters
    ----------
    X : iterable
        Data to inverse transform. Must fulfill output requirements of the […]
"Data to inverse transform"?
Damn, the docstrings are not finished yet, it seems. I'll fix them now. Also add some narrative.
I think we need a clear set of use-cases (described in terms of what happens at train time and what happens at test time), preferably supported by research papers where these techniques are used, to design a careful solution to generalised transform operations. With only fairly trivial use-cases to motivate this PR, I don't think this is it, sadly.
Why is subsampling a trivial use-case? It is a somewhat trivial operation that we can't currently do.
I agree, though, that the examples I gave at the top are not very inspiring. I should have added the subsampling from the beginning, and maybe a loading example.
Perhaps I misunderstood and this is more widely applicable than I thought. But unless I'm much mistaken, it's not finished. How do I use a custom scorer on a […]? What is correct behaviour when a Piper appears in a […]?
sklearn/pipeline.py (outdated)

    @@ -272,6 +278,22 @@ def inverse_transform(self, X):
            return Xt

        @if_delegate_has_method(delegate='_final_estimator')
No, it's applicable if any of the estimators supports `pipe` and the last supports `transform` or `pipe`.
Has it been considered to add a specific mixin "TargetTransformer"? It would be required to implement "inverse_transform", so that pipelines can apply the transformation back when "predict" is called. It would not be so hard to modify the pipelines to detect this mixin and apply the specific transforms needed. I see a few reasons to add this new mixin: […]
Another possibility would be to add extra information when building the pipeline:

    p = Pipeline([("scaleX", StandardScaler()), ("scaleY", StandardScaler(), "y"),
                  ("the_model", MyModel())])

or even:

    p = Pipeline([("scaleXY", StandardScaler(), "Xy"), ("the_model", MyModel())])

In this case there is no need to change the API; only the pipeline needs to be changed. Again, "inverse_transform" needs to be implemented on transformers applied to targets. What do you think? Or is the design already chosen?
@AlexisMignon The design is definitely not chosen yet. I think your use-case might be better handled with meta-estimators, though.
@jnothman thank you for your comments, I'll work on them. I have to think about what happens with a custom scorer.
Meditating on this a bit longer, I'm not sure that subsampling isn't better done with a meta-estimator.
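For what it's worth, the meta-estimator version of subsampling is small. A rough sketch (the class and its parameters are made up for illustration, not scikit-learn API):

```python
import numpy as np
from sklearn.base import BaseEstimator, clone


class SubsampledEstimator(BaseEstimator):
    """Fit a wrapped estimator on a random subsample of (X, y).

    Subsampling happens only at fit time; predict is untouched,
    which is why this needs no change to Pipeline.
    """

    def __init__(self, estimator, keep=0.5, random_state=0):
        self.estimator = estimator
        self.keep = keep
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.RandomState(self.random_state)
        mask = rng.rand(len(X)) < self.keep  # keep ~keep fraction of rows
        self.estimator_ = clone(self.estimator).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
```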
@ogrisel @GaelVaroquaux @agramfort @arjoly opening again in light of our discussion.
That sounds ominous.
@jnothman it's too bad you are not here. We are having a very animated discussion (mostly me and Gael). To give you a very short summary of what we discussed: […]
Also, there will now probably be "scikit-learn advancement proposals" (slap) in the spirit of PEPs that have user stories for API changes. @GaelVaroquaux will open a new repo which will store RST files in a minute.
This is actually a very minimal surgery, as it only has some changes to the pipeline that are relatively easy to understand. Why do […]
Oy will I give you a slap. ;) I'm a bit full up this week, and not able to think about the proposal in […]
cool, that would be great :)
Guys, any updates on this issue? I was trying to use Pipeline for an object detection application; my X is a color image and y is a bunch of binary images that label which pixel is which. Not being able to transform y is very inconvenient for me; I end up having to subclass Pipeline and return […]
I'd like to second @versatran01. I am also working on object detection. The logical approach is to convert images and their annotations to feature vectors of sliding windows and labels. Of course such processing can be done outside of sklearn, but there are many hyperparameters involved in these transformations, which impact the y-vector as well; and since we would like to perform a grid search on them, we would want these transforms to be part of the pipeline. Subclassing Pipeline to pass through y seems to be the preferred solution; but is there a good reason not to include this behavior in the general pipeline class? That is, allow transforms to return an X, y tuple, and if a modified y is returned, propagate that through the pipeline?
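A rough sketch of the behaviour being suggested here (the class name and the tuple convention are hypothetical, not scikit-learn API): intermediate steps may return either `Xt` or an `(Xt, yt)` tuple from `fit_transform`, and a modified y propagates to later steps.

```python
from sklearn.base import clone


class XYPipeline:
    """Sketch: a pipeline whose intermediate steps may also modify y."""

    def __init__(self, steps):
        self.steps = steps  # list of (name, estimator) pairs

    def fit(self, X, y=None):
        self.fitted_steps_ = []
        for name, step in self.steps[:-1]:
            step = clone(step)
            result = step.fit_transform(X, y)
            if isinstance(result, tuple):  # the step changed y as well
                X, y = result
            else:
                X = result
            self.fitted_steps_.append((name, step))
        name, final = self.steps[-1]
        self.fitted_steps_.append((name, clone(final).fit(X, y)))
        return self

    def predict(self, X):
        # y is unknown at prediction time, so steps only transform X.
        for name, step in self.fitted_steps_[:-1]:
            X = step.transform(X)
        return self.fitted_steps_[-1][1].predict(X)
```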
@EelcoHoogendoorn yeah, because it breaks the current API and therefore possibly user code. There is no transformer in scikit-learn that changes y, so that addition is very unnatural and totally breaks the API contract. The object detection case is certainly valid, but I think our pipeline is not very well suited for that. Feel free to do a PR to scikit-learn-contrib with your transformers that change y and a pipeline that can work with that. There's actually already one in imblearn: https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/pipeline.py @EelcoHoogendoorn @versatran01 do your transformers change y during fit_transform, during transform, or both?
@amueller Thanks for the pointer to imbalanced-learn. I have a class […]. I agree with you that the current API is not well suited for this particular case and should not be changed for compatibility reasons. And I'm happy with my current solution. I'd like to see scikit-learn have better support for image-related learning tasks.
Hi all, based on your comments above, I coded two very small […]. This seems to do the trick to include a post-processor as part of a cross-validated pipeline, and might even allow coding stacked ensembles as scikit-learn pipelines (not tested yet). As far as I can tell many people are interested in this subject, so even if my solution is neither groundbreaking nor new, I blogged about it. I assume it's OK to document the link here? Here it is: […]
@svendx4f thanks! One thing about your example is that you "only" post-process the predictions. Often, for example if you want to fit a log-linear model, you want to transform y before training and then transform it back after prediction. That would be possible with your pipeline, but there would be no way to ensure that the transformation and inverse transformation match. Also, this might make your code more readable: #7608
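To make the log-linear use-case concrete, here is a small hand-rolled version of transform-then-invert (plain numpy/sklearn on made-up data, nothing from this PR):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = np.exp(X @ np.array([1.0, 2.0, 3.0]))  # multiplicative target

model = LinearRegression().fit(X, np.log(y))  # transform y before training
y_pred = np.exp(model.predict(X))             # invert after prediction
```

The point of the discussion is that nothing here guarantees `np.log` and `np.exp` stay in sync; a y-aware transformer with `inverse_transform` would.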
@svendx4f Actually, I kinda don't like your solution, because you change the meaning of X and y within the nested pipelines, and that's easy to get wrong. It's pretty subtle to see which steps are applied to X and which to y. But I acknowledge that we don't have a nice solution right now and any work-around is better than no solution ;) And you managed to not add a new method to the API, so that's a benefit.
Hi @amueller, thanks for the feedback. I acknowledge my solution probably works only in some specific cases; it's working for me at the moment at least (so far so good). Maybe I fail to grasp some subtlety between X and y? To me the output of any predictor/transformer is "its Y", which becomes the X of the next one. What is specific to, say, linear regression, such that its transformation of X is called Y, whereas the output of PCA or a standard scaler is still called X? In both cases they need to be trained beforehand, and in both cases the dimension of the output is potentially different from that of the input. I understand the difference from an ML point of view of course, but why is it relevant from a pipeline "plumbing" point of view? Can't the pipeline ignore the semantics of its components and see them all as thingies that need to be trained and can then transform inputs into outputs? I like your operators, they remove tons of parentheses :)
I think Y is characterised by being unseen at prediction time, but […]. Yes, some predictions can be used as features, and some transformed features […]
@svendx4f I found it hard to follow the code because it was implicit whether a transformation was applied to X or y. It's not wrong from a programming point of view, but I find the API confusing and I think it makes it easy to write nonsensical code ;)
Is there an implementation of this anywhere that is considered OK? I do not find it convincing that the reason X, y transforms are bad is that all the existing code only knows about X transforms. I need to support arbitrary X, y transformations, and I feel that this is a superset of the features required by X-only transforms. I would prefer not to maintain my own XY pipeline, which I am trying to do now.
There's more discussion at scikit-learn/enhancement_proposals#2, but I'd really appreciate it if you could describe your current use-case and its need for "arbitrary X, y transformations" (and ideally why/how that matches the uses of a pipeline in the context of e.g. grid search).
Shot at #4143.

Adds a new function to the interface, with the uninspired name `pipe`, with signature […]

Example: […]

Another (unnecessary) example: […]

(the labels got mapped to numbers again)
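The PR's actual `pipe` signature is not shown above, so the following is only a guess at the shape of such an API: a `pipe`-style method behaves like `transform` but receives and returns y as well, so a step can, for instance, map string labels to integers (which would explain the remark about labels coming back as numbers):

```python
import numpy as np


class LabelEncodingStep:
    """Hypothetical pipe-style step that encodes string labels as ints."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)  # sorted array of distinct labels
        return self

    def pipe(self, X, y=None):
        if y is None:  # e.g. at prediction time
            return X, None
        # Map each label to its index in the sorted class array.
        return X, np.searchsorted(self.classes_, y)
```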