
[Proof of concept] Syntactic sugar for pipelines #2589


Closed
mblondel wants to merge 1 commit into scikit-learn:master from mblondel:pipeline_syntax

Conversation

mblondel
Member

Inspired by https://github.com/aht/stream.py by @aht

This PR adds simple syntactic sugar for easily creating pipelines.

>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> pipeline = StandardScaler() | PCA(n_components=2) | SVC(random_state=0)
>>> pipeline.__class__
<class 'sklearn.pipeline.Pipeline'>
>>> y_pred = pipeline.fit(X, y).predict(X)
>>> np.mean(y == y_pred)
0.92000000000000004

I used the bitwise-or operator because of the similarity between pipelines and UNIX pipes, but the right-shift operator could be nice too (it is easier to read and conveys the idea of direction):

>>> pipeline = StandardScaler() >> PCA(n_components=2) >> SVC(random_state=0)
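
For readers unfamiliar with operator overloading, here is a minimal sketch of how such sugar could work. This is hypothetical illustration code, not necessarily what this PR implements; the PR presumably hooks the operator into BaseEstimator so that every estimator picks it up:

from sklearn.pipeline import Pipeline

def _or(self, other):
    # `a | b` builds a Pipeline from a and b; if the left operand is already
    # a Pipeline, its steps are extended so that a | b | c ends up as a single
    # three-step pipeline rather than nested ones.
    left = list(self.steps) if isinstance(self, Pipeline) else [(type(self).__name__, self)]
    return Pipeline(left + [(type(other).__name__, other)])

# e.g. attached as BaseEstimator.__or__ = _or   (and __rshift__ for the >> variant)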

@coveralls

Coverage Status

Coverage remained the same when pulling 608d3ad on mblondel:pipeline_syntax into a8089b0 on scikit-learn:master.

@agramfort
Member

really fun :)

I am just wondering about the names of the steps. What if you do:

StandardScaler() | PCA(n_components=2) | StandardScaler() | SVC(random_state=0)

are you going to have a name collision?

Also, it's now not very explicit how to use set_params, since the step names are hidden.

And it's a bit against the Zen of Python:

"There should be one-- and preferably only one --obvious way to do it."

@pprett
Member

pprett commented Nov 14, 2013

Won't this create a sequence of pipelines rather than one pipeline with multiple steps?

If it creates just a single pipeline, I'd propose naming based on position rather than class name.

Otherwise, I'm a fan of syntactic sugar -- even if it contradicts the Zen.

@mblondel
Member Author

are you going to have a name collision?

Yes. I was thinking of adding an integer suffix (e.g. StandardScaler2 in your example).

Won't this create a sequence of pipelines rather than one pipeline with multiple steps?

This creates a single pipeline.

Edit: StandardScaler() | PCA(n_components=2) | SVC(random_state=0) creates a single pipeline in the end, but it creates two intermediary pipelines along the way (which get garbage collected).
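
Roughly, the chained expression is evaluated left to right, so any intermediate result goes unreferenced once the final pipeline exists (illustration only, reusing the estimators from the description above):

tmp = StandardScaler() | PCA(n_components=2)   # an intermediate pipeline
pipe = tmp | SVC(random_state=0)               # the final pipeline; tmp can now be garbage collected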

@aht

aht commented Nov 14, 2013

Awesome!

@agramfort
Member

Otherwise, I'm a fan of syntactic sugar -- even if it contradicts the Zen.

don't curse please :)

@mblondel
Member Author

The question is whether we want to make pipelines just a utility tool or a core element of the scikit-learn "language".

@jnothman
Member

This looks very neat for interactive experimentation: the list of tuples required by the Pipeline constructor makes it exceedingly cumbersome. Yet I agree with @agramfort that it's not very pythonic, and that step naming is an issue. And given his response to other Pipeline syntactic sugar, I expect @GaelVaroquaux will strongly object.

@GaelVaroquaux
Member

Yet I agree with @agramfort that it's not very pythonic, and that step naming is an issue. And given his response to other Pipeline syntactic sugar, I expect @GaelVaroquaux will strongly object.

You absolutely got it. I was waiting a bit to step in, but I do absolutely object. If every library in Python were playing these games, our code would be completely unreadable.

That said, I do like the implicit naming. I think that a syntax in the form:

pipe = Pipeline((StandardScaler(), SVM()))

would be useful, more explicit, and get us most of the way to the proposed improvement. It would add roughly 10 characters to the suggested syntax, but what are 10 characters in the face of readability :).

So I am actually enthusiastic about half of the proposal!

@mblondel
Member Author

I know that @GaelVaroquaux is our BDFL, but out of curiosity I would still love to hear other developers' opinions :)

@GaelVaroquaux
Member

And that's why I waited to answer. But maybe I answered too quickly.

Note that I actually do think that this discussion is shaping an improvement to the pipeline API (implicit naming).

@glouppe
Contributor

glouppe commented Nov 15, 2013

While I agree this is elegant and handy, I am not convinced this is the most readable thing to do. In particular, for those who are not aware that operators can actually be overridden, I think this would look totally magical. (I speak from experience, as a teaching assistant in various programming courses... Unfortunately, you would be surprised to see that even the most basic programming concepts are not always fully grasped, let alone redefining operators!)

That being said, I agree that implicit naming would be a good compromise to improve the pipeline syntax. In its current form, it is too cumbersome to write.

@mblondel
Member Author

While I agree this is elegant and handy, I am not convinced this is the most readable thing to do. In particular, for those who are not aware that operators can actually be overridden, I think this would look totally magical.

People use mobile phones every day without knowing anything about signal processing :) People just need to know that a | b | c or a >> b >> c is the pipeline of a, b and c (and if that's not obvious, that's what variables are for).

@glouppe
Contributor

glouppe commented Nov 15, 2013

They have to know what a Pipeline is to understand the code. They don't with the explicit syntax.

@larsmans
Member

I don't really like this either. I'm currently using Celery, which uses the same operator with the same meaning, but I always use its explicit syntax. The problem with operator overloading is that if the operator is not an established one, then you have to look up what it means. But you can't type ?| or |? in IPython, and you can't google it either.

What I would personally like better is a freestanding utility function:

from sklearn.pipeline import Pipeline

def pipeline(*components):
    # derive a step name for each estimator, e.g. from its class name
    names = map(_generate_name, components)
    return Pipeline(zip(names, components))

This allows

pipeline(CountVectorizer(), TruncatedSVD(), LinearSVC())

which is both short and explicit.
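
The _generate_name helper is left unspecified above; one hypothetical version (written over the whole list so that it can also apply the integer-suffix idea mentioned earlier for duplicate classes) might look like:

def _generate_names(components):
    # lowercased class names, with an integer suffix on repeats
    # (e.g. standardscaler, pca, standardscaler2, svc)
    names, counts = [], {}
    for est in components:
        base = type(est).__name__.lower()
        counts[base] = counts.get(base, 0) + 1
        names.append(base if counts[base] == 1 else '%s%d' % (base, counts[base]))
    return names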

@jnothman
Member

I agree a freestanding utility might be better than trying to implement the name generation in Pipeline, which would involve introducing steps_ (a dynamic property, or produced by fit), and requires a change to is_classifier, which tries to use estimator.steps[-1][1]...


@larsmans
Member

Also it makes the call syntax simpler still by relying on *args (no list passing). Hacking that into the Pipeline constructor would make it hard to maintain.

@GaelVaroquaux
Member

But such a util means that you are teaching users a new constructor that deviates from the standard scikit-learn way of doing this, just to save 2 characters.

@jnothman
Member

Well, no, it externalises complexity from the Pipeline class (required because of the scikit-learn convention that constructor parameters are saved as-is) for the sake of a simplified constructor. Other packages (e.g. django) might put such things in a shortcuts namespace; it might make sense for scikit-learn to provide such shortcuts to enable rapid experimentation and concise code, and to separate functionality from syntactic sugar.


@larsmans
Member

"to separate functionality from syntactic sugar"

Hear, hear.

We have convenience functions for clustering methods as well. There's nothing wrong with that, as long as the docs explain what's going on in terms of the full-featured API. I'd rather have that than overloading the syntax of the Pipeline constructor to take a "list of {estimator or (string, estimator) tuple}". That's making things more complicated, both for us and for the few users who do care to read the docs.

@jnothman
Member

For an idea of what else might belong in shortcuts:

from sklearn.base import BaseEstimator, TransformerMixin

class transformer(TransformerMixin, BaseEstimator):
    def __init__(self, transform):
        self.transform = transform
    def fit(self, X, y=None):
        # no-op fit so the wrapper can be dropped into a Pipeline
        return self
    def __repr__(self):
        return '{0.__class__.__name__}({0.transform!r})'.format(self)
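
A hypothetical usage example (relying on the no-op fit added above so the wrapper can sit inside a Pipeline):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('log', transformer(np.log1p)),
                 ('svc', LinearSVC())])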

@GaelVaroquaux
Member

Well, no, it externalises complexity from the Pipeline class (required because of the scikit-learn convention that constructor parameters are saved as-is) for the sake of a simplified constructor.

Point taken.

@GaelVaroquaux
Member

class transformer(TransformerMixin, BaseEstimator):

Agreed, although I think that I would like it with a capital 'T', as it's a class. I know, I know, I am horribly conventional and boring :).

@larsmans
Member

The Python stdlib violates this convention itself: it has several classes posing as functions, e.g. itertools.chain. The intended use is more important than the implementation. We just have to trick Sphinx into treating such classes as functions.

@mblondel
Member Author

We have convenience functions for clustering methods as well

One difference is that these functions do actual computations and return fitted parameters. Your pipeline merely returns an object: it's a function but it actually acts like a constructor. So I think a better name would be make_pipeline.

I don't really like this either. I'm currently using Celery, which uses the same operator with the same meaning, but I always use its explicit syntax.

Syntactic sugar is a shorter syntax for existing functionality. Nobody is forced to use it if they don't like it. Personally, if such syntactic sugar were available, I would use it in IPython shells or experiment scripts but probably not in library code. I also don't buy that the syntactic sugar is less readable than Pipeline. The barrier to entry to scikit-learn is low, but one still needs to know a few scikit-learn concepts, such as pipelines, the way we do grid search, etc. So if you learn what a pipeline is in scikit-learn, you might as well learn its syntactic sugar. Anyway, since most people seem to object, I'll stop fighting...

@larsmans
Member

Agree with make_pipeline. It would also cause less confusion with the pipeline module.

Nobody is forced to use it if they don't like it.

Except when they have to read other people's code. By forcing people to write explicit, readable code, we make their colleagues' lives easier :)

Also, on a more fundamental level, this operator overloading has to be done (due to how Python works) on BaseEstimator. IMHO, that's a violation of the separation of concerns. I personally don't consider an estimator a one-element pipeline.

mblondel closed this Dec 2, 2013
@larsmans
Member

#2675 is a tentative implementation of make_pipeline.
