-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[Proof of concept] Syntactic sugar for pipelines #2589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
really fun :) I am just wondering about names of the steps. What if you do: StandardScaler() | PCA(n_components=2) | StandardScaler() | SVC(random_state=0) are you going to have a name collision? also now it's not super explicit how to use set_params as names are hidden. and it's a bit against the zen of python "There should be one-- and preferably only one --obvious way to do it." |
this will not create one pipeline with multiple steps but rather a sequence of pipelines? If it would create just a single pipeline I'd propose to do naming based on position rather then class name. Otherwise, I'm a friend of syntactic sugar -- even if it contradicts the zen |
Yes. I was thinking of adding an integer suffix (e.g. StandardScaler2 in your example).
This creates a single pipeline. Edit: |
Awesome! |
don't curse please :) |
The question is whether we want to make pipelines just an utility tool or a core element of the scikit-learn "language". |
This looks very neat for interactive experimentation: the list of tuples required for the pipeline constructor makes it exceedingly cumbersome. Yet I agree with @agramfort that it's not very pythonic, and that step naming is an issue. And given his response to other Pipeline syntactic sugar, I expect @GaelVaroquaux will strongly object. |
You absolutely got it. I was waiting a bit to step in, but I do That said, I do like the implicit naming. I think that a syntax in the pipe = Pipeline((StandardScaler(), SVM())) would be useful, more explicit, and get us most of the way of the So am I actually enthousiastic about half of the proposal! |
I know that @GaelVaroquaux is our BDFL but out of curiosity I would still love to hear other developers' opinion :) |
And that's why I waited to answer. But maybe I answered too quickly. Note that I actually do think that this discussion is shaping an |
While I agree this is elegant and handy, I am not convinced this is the most readable thing to do. In particular, for those who are not aware that operators can actually be overriden, I think this would look totally magical. (I talk from experience, as a teaching assistant in various programming courses... Unfortunately, you would be surprised to see that even the most basic programming concepts are not always fully grasped, let alone redefining operators!) That being said, I agree that implicit naming would be a good compromise to improve the pipeline syntax. In its current form, it is too cumbersome to write. |
People use mobile phones every day without knowing anything about signal processing :) People just need to know that |
They have to know what a Pipeline is for understanding the code. They don't with the explicit syntax. |
I don't really like this either. I'm currently using Celery, which uses the same operator with the same meaning, but I always use its explicit syntax. The problem with operator overloading is that if the operator is not an established one, then you have to look up what it means. But you can't type What I would personally like better is a freestanding utility function:
This allows
which is both short and explicit. |
I agree a freestanding utility might be better than trying to implement the On Fri, Nov 22, 2013 at 6:24 AM, Lars Buitinck notifications@github.comwrote:
|
Also it makes the call syntax simpler still by relying on |
But such a util means that you are teaching a new constructor to users that deviates from the stand scikit-learn of doing this, just to save 2 characters. |
Well, no, it externalises complexity from the On Fri, Nov 22, 2013 at 10:20 AM, Gael Varoquaux
|
Hear, hear. We have convenience functions for clustering methods as well. There's nothing wrong with that, as long as the docs explain what's going on in terms of the full-featured API. I'd rather have that than overloading the syntax of the |
For an idea of what else might belong in class transformer(TransformerMixin, BaseEstimator):
def __init__(self, transform):
self.transform = transform
def __repr__(self):
return '{0.__class__.__name__}({0.transform!r})'.format(self) |
Point taken. |
Agreed, although I think that I would like it with a capital 'T', as it's |
The Python stdlib violates this convention itself: it has several classes posing as functions, e.g. |
One difference is that these functions do actual computations and return fitted parameters. Your
A syntactic sugar is a shorter syntax for an existing functionality. Nobody is forced to used it if they don't like it. Personally, if such syntactic sugar were available, I would use it in IPython shells or experiment scripts but probably not in library code. I also don't buy that the syntactic sugar is less readable than |
Agree with
Except when they have to read other people's code. By forcing people to write explicit, readable code, we make their colleagues' lives easier :) Also, on a more fundamental level, this operator overloading has to be done (due to how Python works) on |
@2675 is a tentative implementation of |
Inspired by https://github.com/aht/stream.py by @aht
This PR adds a simple syntactic sugar for easily creating pipelines.
I used the bitwise or operator because of the similarity between pipelines and UNIX pipes but the right shift operator could be nice too (easier to read and convey the idea of direction):