[MRG+1] Stacking classifier with pipelines API #8960
Conversation
Forgot to mention: huge thanks to @MLWave for the Kaggle ensemble guide and the initial implementation.
Commits:

- Should fix broken tests
- Make it generic (works with classifiers and regressors); better naming. Also improved testing by using a faster CV.

(force-pushed from e79178d to 75ac0a0)
This looks quite elegant, though -- unaware of the literature in this space -- I feel like an estimator matrix rather than a single stacked layer is probably excessive. I certainly don't see why one should want the matrix to be rectangular. Specifying a single layer short-hand as a flat list should be possible.
You should add narrative documentation to doc/modules/ensemble and an example.
sklearn/ensemble/stacking.py (outdated):

```python
class BlendedEstimator(BaseEstimator, MetaEstimatorMixin, TransformerMixin):
    """Transformer to turn estimators into blended estimators

    This is used for stacking models. Blending an estimator prevents data leaks
```
I think you should define what blending is. It's a simple concept, but shouldn't be presumed knowledge.
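(For readers, a minimal sketch of the blending idea under discussion, using `cross_val_predict`; the dataset and estimator below are placeholders, not part of the PR:)

```python
# Blending in a nutshell: the features fed to the next layer at train time
# are out-of-fold predictions, so no estimator ever predicts on rows it was
# fit on. Illustrative only; this is not the PR's implementation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
base = LogisticRegression()

# Each row of `blended` is predicted by a model fit without that row.
blended = cross_val_predict(base, X, y, cv=5, method="predict_proba")
print(blended.shape)  # (200, 2)
```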
sklearn/ensemble/stacking.py
Outdated
self.n_jobs = n_jobs | ||
|
||
def fit(self, *args, **kwargs): | ||
self.base_estimator = self.base_estimator.fit(*args, **kwargs) |
I think we should raise `NotImplementedError` if this is not going to behave the same as `fit_transform`.
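(To make the concern concrete: a hedged sketch of why `fit(X, y).transform(X)` and `fit_transform(X)` deliberately differ for a blended estimator. The class name and details are hypothetical, not the PR's code:)

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import cross_val_predict

class BlendedRegressorSketch(BaseEstimator, TransformerMixin):
    """Hypothetical blending wrapper, for illustration only."""

    def __init__(self, estimator, cv=3):
        self.estimator = estimator
        self.cv = cv

    def fit(self, X, y=None):
        # Refit on all training data; this model is used at test time.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Test time: plain predictions from the fully fitted estimator.
        return self.estimator_.predict(X).reshape(-1, 1)

    def fit_transform(self, X, y=None):
        # Train time: out-of-fold predictions, so fit_transform(X) is NOT
        # fit(X, y).transform(X) -- which is what the comment above is about.
        self.fit(X, y)
        return cross_val_predict(self.estimator, X, y, cv=self.cv).reshape(-1, 1)
```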
sklearn/ensemble/stacking.py
Outdated
def _identity_transformer(): | ||
"""Contructs a transformer that does nothing""" | ||
return FunctionTransformer(lambda x: x) | ||
|
PEP8: extra blank line
sklearn/ensemble/stacking.py (outdated):

```python
    Parameters
    ----------
    estimators_matrix: 2D matrix with base estimators. Each row will be
```
spaces before colons are required
sklearn/ensemble/stacking.py (outdated):

```python
    Returns
    -------
    p: Pipeline
```
can just say Pipeline here.
sklearn/ensemble/stacking.py
Outdated
for estimator in base_estimators] | ||
if restacking: | ||
estimators.append(_identity_transformer()) | ||
return make_union(*estimators) |
This is not a good idea. It means that the estimators will be named `BlendedEstimator1`, `BlendedEstimator2`, etc. in `get_params`. You should use `FeatureUnion` directly to give better prefixes; and you will need a way for the user to specify names themselves, like what `FeatureUnion` provides.
I agree the names are horrible and should be fixed. About the second part, `FeatureUnion` already exists if someone wants to specify the names. I don't see a reason to create a new API for that. The only arguable point here is that, if someone wants to create a stack layer by hand using `FeatureUnion`, the restacking will have to be implemented by hand as well. I'll give this a thought, but I can't see a clean solution to this yet (other than creating a second function that accepts named estimators instead of unnamed ones).
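(An illustration of the naming point, with stock transformers standing in for the blended estimators:)

```python
# make_union derives parameter prefixes from lowercased class names (with
# numeric suffixes for duplicates), while FeatureUnion lets the user pick
# the names. The transformers here are placeholders.
from sklearn.pipeline import FeatureUnion, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

auto = make_union(StandardScaler(), MinMaxScaler())
print("standardscaler" in auto.get_params())  # True: auto-generated name

named = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
print("std" in named.get_params())  # True: user-chosen name
```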
@jnothman I'm not too familiar with the literature either, but this shows how some big Kaggle players used a stacked model with three layers. I also didn't understand what you meant by "rectangular matrix". I'll fix the rest of the comments. If you think having the stack factory is too much, I'm not opposed to deleting the fn.
I have one doubt: both
This is a more technically correct term
I'm strongly against modifying Pipeline to special-case blending!
@jnothman I'm sorry, I didn't mean modifying the original `Pipeline`.
About adding a new verb: this would be (for the most common use cases) only used internally. Clients wouldn't have to know about it most of the time.
I am not very enthusiastic about it, because it means that code written for scikit-learn will either need to know about it, or will behave in an unintended way.
This way we can modify the Pipeline class to use them instead of the output from `fit_transform`.
Solutions that require modifying the pipeline seem wrong to me.
What is wrong with the solution of using a meta estimator?
@GaelVaroquaux IMO the discussion of choosing between this or #11047 boils down to this: scikit-learn's current API doesn't support a flexible stacking implementation, so we have these choices so far:

As you can see, I'm strongly against making the stacking implementation less flexible, thus leaving two choices, one of which you seem to be strongly against too. Also, I don't see how making the
Just to make sure it's clear, there's nothing stopping a StackingClassifier of #11047 from having multiple layers. That's just a matter of nesting:

```python
StackingClassifier(
    [('x', XClassifier()), ('y', YClassifier()), ('z', ZClassifier())],
    StackingClassifier(
        [('x', XClassifier()), ('y', YClassifier()), ('z', ZClassifier())],
        DecisionTreeClassifier()
    )
)
```

compared to:

```python
Pipeline([
    ('layer1', make_stack_layer(XClassifier(), YClassifier(), ZClassifier())),
    ('layer2', make_stack_layer(XClassifier(), YClassifier(), ZClassifier())),
    ('predict', DecisionTreeClassifier())
])
```

I think the former example is a bit less intuitive as analogous to layers (for people not coming from the Lisp world!). But it does not break any conventions we might want to hold on to. I definitely think that if we go down the StackingClassifier path, an example illustrating usage with multiple layers is necessary. @jorisvandenbossche, I think
@jnothman this is the refactor I wanted to do on `Pipeline`: caioaao@92ec504

It's a bit hacky but I thought it'd be the least intrusive way of making pipeline behavior extendable. This way we can just override the properties that return functions used for caching, and I'll be able to create a stacking-specific pipeline.
I've probably just forgotten something, but can you explain why this is better than just nesting StackingClassifier like in my example?
@jnothman I think I wasn't clear about my approach. I started a PoC that you can check here:
I just realized I moved a lot of code around and it may not be easy to see where I'm going with this implementation, so here it goes. The idea is that we should be able to access and manipulate the stack layers. There are some reasons for this:
What I'm aiming for in the patch is to have a simple API for the base case (the one implemented in #11047) and hiding the new

The base case would be like this:

```python
from sklearn.ensemble import make_stacked_ensemble
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR

layer0 = [LinearRegression(), LinearSVR(random_state=RANDOM_SEED)]
layer1 = [LinearRegression(), LinearSVR(random_state=RANDOM_SEED)]
final_estimator = LinearRegression()

# this will return a StackingPipeline
final_clf = make_stacked_ensemble(layer0, layer1, final_estimator)
final_clf.fit(Xtrain, ytrain)
ypreds = final_clf.predict(Xtest)
```

And for accessing the internal layers:

```python
from sklearn.ensemble import StackingPipeline, make_stack_layer
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR

layer0 = make_stack_layer(LinearRegression(), LinearSVR(random_state=RANDOM_SEED))
layer1 = make_stack_layer(LinearRegression(), LinearSVR(random_state=RANDOM_SEED))
final_estimator = LinearRegression()

final_clf = StackingPipeline([layer0, layer1, final_estimator])
final_clf.fit(Xtrain, ytrain)
ypreds = final_clf.predict(Xtest)
```

Also, this implementation would solve the problem with `fit_transform`:

```python
from sklearn.ensemble import StackableTransformer, StackingLayer
from sklearn.linear_model import LinearRegression

stackable_t = StackableTransformer(LinearRegression())
# stackable_t.fit().transform() === stackable_t.fit_transform()

layer = StackingLayer([stackable_t])
# layer.fit().transform() === layer.fit_transform()

# the stackable transformer passed to StackingLayer must implement `blend`
stackable_t.blend()  # => CV prediction
layer.blend()  # => CV predictions
```

This way we can just swap

TLDR:
Since this PR has been dragging on for a long time and we have some conflicting opinions, I'm thinking about doing it in a separate library. I'd like to merge my pipeline refactor here so I can reuse the code from sklearn, is that ok with you? If not, I can just copy the pipeline code and modify it, but I don't know if that's ok under sklearn's license.
Actually, I forgot to mention earlier that we are going to have a sprint at SciPy next week. I was waiting to see an outcome for this PR either way, and to discuss the fit_transform issue, which will also come up for other estimators in the future. So this feature might not be a priority for 0.20 but should get into 0.21.
@glemaitre anyway, I think it's best to create another library for this, as it'll give me more freedom to do what I want. We can discuss it again and, if doable, integrate the library back into scikit-learn in the future.
I think both making a library of it and getting resolution at the sprint are good ideas. Hopefully the separate library won't be essential for long, or will be able to provide something extra for power users.
Great! Then all I need atm is for this PR to be merged: #11446
I just released a beta version of the stacking framework based on this PR. If you want to check it out, here's the link: http://github.com/caioaao/wolpert

I'm closing this PR for now and will focus on evolving the library instead. Thank you all for your time and comments :)
Sigh. Thanks to you. I still think we need to work out how to resolve this. It seems a silly thing to get stuck on.
Just one comment about the `fit_transform` issue: I believe we can circumvent this problem the same way we did for LOF.
No, it's not the same situation at all. Standard stacking will pass on blended predictions at training time, but prediction of a single model at test time. It's not a different application / setting / modelling situation as in LOF.
It might be best if we solve this with fit_resample. That is explicitly designed for the case where training and prediction transformations differ. (Although for resampling, the transform would usually be the identity function, while here it isn't.)
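(A hypothetical sketch of that idea; `fit_resample` is the proposed verb in this discussion, not an existing scikit-learn API, and the class below is illustrative only:)

```python
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection import cross_val_predict

class BlendingResamplerSketch(BaseEstimator):
    """Hypothetical step whose fit-time output differs from predict-time."""

    def __init__(self, estimator, cv=3):
        self.estimator = estimator
        self.cv = cv

    def fit_resample(self, X, y):
        # Fit time: replace X with out-of-fold predictions, keep y.
        self.estimator_ = clone(self.estimator).fit(X, y)
        Xt = cross_val_predict(self.estimator, X, y, cv=self.cv)
        return Xt.reshape(-1, 1), y

    def transform(self, X):
        # Predict time: not the identity function, unlike usual resamplers.
        return self.estimator_.predict(X).reshape(-1, 1)
```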
I agree with @jnothman here, but I feel using
I'm not sure that's agreeing with me :) We don't have a better verb than "resample" for the case where we want an estimator to change X, y, etc. in a pipeline at fit time. If you can come up with one, it's welcome. But the question of whether an estimator supporting that functionality can also perform transformations at predict time is open; here it would be necessary.
Yeah, I agreed with you about it not being the same situation as LOF.
> It might be best if we solve this with fit_resample.

Yes (though naming is challenging). I think that this is the right situation to explore.
See discussion on #7427 for more info.

TODO:
- predict_proba;

Improvements over #7427:
- (Caio Oliveira);

Possible improvements:

Improvements from review:
- StackLayer;
- Write examples for interesting use cases, including: efficient hyperparam optimization techniques; blending without CV; pre-training base estimators.

Relevant discussions:
- Why to use StackingTransformer instead of implementing classes. Summary: #8960 (comment)
- Other discussions:
  - Is having more than two layers worth it?
  - Factory methods vs Class inheritance