
[Proof of concept] Syntactic sugar for pipelines #2589


Closed
mblondel wants to merge 1 commit into scikit-learn:master from mblondel:pipeline_syntax

Conversation

mblondel
Member

Inspired by https://github.com/aht/stream.py by @aht

This PR adds simple syntactic sugar for easily creating pipelines.

>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> pipeline = StandardScaler() | PCA(n_components=2) | SVC(random_state=0)
>>> pipeline.__class__
<class 'sklearn.pipeline.Pipeline'>
>>> y_pred = pipeline.fit(X, y).predict(X)
>>> np.mean(y == y_pred)
0.92000000000000004

I used the bitwise-or operator because of the similarity between pipelines and UNIX pipes, but the right-shift operator could be nice too (it is easier to read and conveys the idea of direction):

>>> pipeline = StandardScaler() >> PCA(n_components=2) >> SVC(random_state=0)
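
For readers unfamiliar with operator overloading, here is a minimal sketch of how such sugar could work. This is hypothetical illustration code, not necessarily what this PR implements; the PR presumably hooks the operator into BaseEstimator so that every estimator picks it up:

from sklearn.pipeline import Pipeline

def _or(self, other):
    # `a | b` builds a Pipeline from a and b; if the left operand is already
    # a Pipeline, its steps are extended so that a | b | c ends up as a single
    # three-step pipeline rather than nested ones.
    left = list(self.steps) if isinstance(self, Pipeline) else [(type(self).__name__, self)]
    return Pipeline(left + [(type(other).__name__, other)])

# e.g. attached as BaseEstimator.__or__ = _or   (and __rshift__ for the >> variant)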

@coveralls

Coverage Status

Coverage remained the same when pulling 608d3ad on mblondel:pipeline_syntax into a8089b0 on scikit-learn:master.

@agramfort
Member

really fun :)

I am just wondering about the names of the steps. What if you do:

StandardScaler() | PCA(n_components=2) | StandardScaler() | SVC(random_state=0)

are you going to have a name collision?

Also, it's now not very explicit how to use set_params, since the step names are hidden.

And it's a bit against the Zen of Python:

"There should be one-- and preferably only one --obvious way to do it."

@pprett
Member

pprett commented Nov 14, 2013

Won't this create a sequence of pipelines rather than one pipeline with multiple steps?

If it creates just a single pipeline, I'd propose naming based on position rather than class name.

Otherwise, I'm a fan of syntactic sugar -- even if it contradicts the Zen.

@mblondel
Member Author

are you going to have a name collision?

Yes. I was thinking of adding an integer suffix (e.g. StandardScaler2 in your example).

Won't this create a sequence of pipelines rather than one pipeline with multiple steps?

This creates a single pipeline.

Edit: StandardScaler() | PCA(n_components=2) | SVC(random_state=0) creates a single pipeline in the end, but it creates two intermediary pipelines along the way (which get garbage collected).
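
Roughly, the chained expression is evaluated left to right, so any intermediate result goes unreferenced once the final pipeline exists (illustration only, reusing the estimators from the description above):

tmp = StandardScaler() | PCA(n_components=2)   # an intermediate pipeline
pipe = tmp | SVC(random_state=0)               # the final pipeline; tmp can now be garbage collected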

@aht

aht commented Nov 14, 2013

Awesome!

@agramfort
Member

Otherwise, I'm a fan of syntactic sugar -- even if it contradicts the Zen.

don't curse please :)

@mblondel
Member Author

The question is whether we want to make pipelines just a utility tool or a core element of the scikit-learn "language".

@jnothman
Member

This looks very neat for interactive experimentation: the list of tuples required by the Pipeline constructor makes it exceedingly cumbersome. Yet I agree with @agramfort that it's not very pythonic, and that step naming is an issue. And given his response to other Pipeline syntactic sugar, I expect @GaelVaroquaux will strongly object.

@GaelVaroquaux
Member

Yet I agree with @agramfort that it's not very pythonic, and that step naming is an issue. And given his response to other Pipeline syntactic sugar, I expect @GaelVaroquaux will strongly object.

You absolutely got it. I was waiting a bit to step in, but I do absolutely object. If every library in Python were playing these games, our code would be completely unreadable.

That said, I do like the implicit naming. I think that a syntax in the form:

pipe = Pipeline((StandardScaler(), SVM()))

would be useful, more explicit, and get us most of the way to the proposed improvement. It would add roughly 10 characters to the suggested syntax, but what are 10 characters in the face of readability :).

So I am actually enthusiastic about half of the proposal!

@mblondel
Member Author

I know that @GaelVaroquaux is our BDFL, but out of curiosity I would still love to hear other developers' opinions :)

@GaelVaroquaux
Member

And that's why I waited to answer. But maybe I answered too quickly.

Note that I actually do think that this discussion is shaping an improvement to the pipeline API (implicit naming).

@glouppe
Contributor

glouppe commented Nov 15, 2013

While I agree this is elegant and handy, I am not convinced this is the most readable thing to do. In particular, for those who are not aware that operators can actually be overridden, I think this would look totally magical. (I speak from experience, as a teaching assistant in various programming courses... Unfortunately, you would be surprised to see that even the most basic programming concepts are not always fully grasped, let alone redefining operators!)

That being said, I agree that implicit naming would be a good compromise to improve the pipeline syntax. In its current form, it is too cumbersome to write.

@mblondel
Member Author

While I agree this is elegant and handy, I am not convinced this is the most readable thing to do. In particular, for those who are not aware that operators can actually be overridden, I think this would look totally magical.

People use mobile phones every day without knowing anything about signal processing :) People just need to know that a | b | c or a >> b >> c is the pipeline of a, b and c (and if that's not obvious, that's what variables are for).

@glouppe
Contributor

glouppe commented Nov 15, 2013

They have to know what a Pipeline is to understand the code. They don't with the explicit syntax.

@larsmans
Member

I don't really like this either. I'm currently using Celery, which uses the same operator with the same meaning, but I always use its explicit syntax. The problem with operator overloading is that if the operator is not an established one, then you have to look up what it means. But you can't type ?| or |? in IPython, and you can't google it either.

What I would personally like better is a freestanding utility function:

from sklearn.pipeline import Pipeline

def pipeline(*components):
    # derive a step name for each estimator, e.g. from its class name
    names = map(_generate_name, components)
    return Pipeline(zip(names, components))

This allows

pipeline(CountVectorizer(), TruncatedSVD(), LinearSVC())

which is both short and explicit.
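
The _generate_name helper is left unspecified above; one hypothetical version (written over the whole list so that it can also apply the integer-suffix idea mentioned earlier for duplicate classes) might look like:

def _generate_names(components):
    # lowercased class names, with an integer suffix on repeats
    # (e.g. standardscaler, pca, standardscaler2, svc)
    names, counts = [], {}
    for est in components:
        base = type(est).__name__.lower()
        counts[base] = counts.get(base, 0) + 1
        names.append(base if counts[base] == 1 else '%s%d' % (base, counts[base]))
    return names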

@jnothman
Member

I agree a freestanding utility might be better than trying to implement the name generation in Pipeline, which would involve introducing steps_ (a dynamic property, or produced by fit), and requires a change to is_classifier, which tries to use estimator.steps[-1][1]...


@larsmans
Member

Also it makes the call syntax simpler still by relying on *args (no list passing). Hacking that into the Pipeline constructor would make it hard to maintain.

@GaelVaroquaux
Member

But such a util means that you are teaching users a new constructor that deviates from the standard scikit-learn way of doing this, just to save 2 characters.

@jnothman
Member

Well, no, it externalises complexity from the Pipeline class (required because of the scikit-learn convention that constructor parameters are saved as-is) for the sake of a simplified constructor. Other packages (e.g. django) might put such things in a shortcuts namespace; it might make sense for scikit-learn to provide such shortcuts to enable rapid experimentation and concise code, and to separate functionality from syntactic sugar.


@larsmans
Member

"to separate functionality from syntactic sugar"

Hear, hear.

We have convenience functions for clustering methods as well. There's nothing wrong with that, as long as the docs explain what's going on in terms of the full-featured API. I'd rather have that than overloading the syntax of the Pipeline constructor to take a "list of {estimator or (string, estimator) tuple}". That's making things more complicated, both for us and for the few users who do care to read the docs.

@jnothman
Member

For an idea of what else might belong in shortcuts:

from sklearn.base import BaseEstimator, TransformerMixin

class transformer(TransformerMixin, BaseEstimator):
    def __init__(self, transform):
        self.transform = transform
    def fit(self, X, y=None):
        # no-op fit so the wrapper can be dropped into a Pipeline
        return self
    def __repr__(self):
        return '{0.__class__.__name__}({0.transform!r})'.format(self)
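
A hypothetical usage example (relying on the no-op fit added above so the wrapper can sit inside a Pipeline):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipe = Pipeline([('log', transformer(np.log1p)),
                 ('svc', LinearSVC())])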

@GaelVaroquaux
Member

Well, no, it externalises complexity from the Pipeline class (required because of the scikit-learn convention that constructor parameters are saved as-is) for the sake of a simplified constructor.

Point taken.

@GaelVaroquaux
Member

class transformer(TransformerMixin, BaseEstimator):

Agreed, although I think that I would like it with a capital 'T', as it's a class. I know, I know, I am horribly conventional and boring :).

@larsmans
Member

The Python stdlib violates this convention itself: it has several classes posing as functions, e.g. itertools.chain. The intended use is more important than the implementation. We just have to trick Sphinx into treating such classes as functions.

@mblondel
Member Author

We have convenience functions for clustering methods as well

One difference is that these functions do actual computations and return fitted parameters. Your pipeline merely returns an object: it's a function but it actually acts like a constructor. So I think a better name would be make_pipeline.

I don't really like this either. I'm currently using Celery, which uses the same operator with the same meaning, but I always use its explicit syntax.

Syntactic sugar is a shorter syntax for existing functionality. Nobody is forced to use it if they don't like it. Personally, if such syntactic sugar were available, I would use it in IPython shells or experiment scripts but probably not in library code. I also don't buy that the syntactic sugar is less readable than Pipeline. The barrier to entry to scikit-learn is low, but one still needs to know a few scikit-learn concepts, such as pipelines, the way we do grid search, etc. So if you learn what a pipeline is in scikit-learn, you might as well learn its syntactic sugar. Anyway, since most people seem to object, I'll stop fighting...

@larsmans
Member

Agree with make_pipeline. It would also cause less confusion with the pipeline module.

Nobody is forced to use it if they don't like it.

Except when they have to read other people's code. By forcing people to write explicit, readable code, we make their colleagues' lives easier :)

Also, on a more fundamental level, this operator overloading has to be done (due to how Python works) on BaseEstimator. IMHO, that's a violation of the separation of concerns. I personally don't consider an estimator a one-element pipeline.

mblondel closed this Dec 2, 2013
@larsmans
Member

#2675 is a tentative implementation of make_pipeline.
