[MRG+1] Stacking classifier with pipelines API #8960
Conversation
Forgot to mention: huge thanks to @MLWave for the Kaggle ensemble guide and the initial implementation.
Commits:

- Should fix broken tests
- Make it generic (works with classifiers and regressors); better naming. Also improved testing by using a faster CV.

(force-pushed from e79178d to 75ac0a0)
This looks quite elegant, though -- unaware of the literature in this space -- I feel like an estimator matrix rather than a single stacked layer is probably excessive. I certainly don't see why one should want the matrix to be rectangular. Specifying a single layer short-hand as a flat list should be possible.
You should add narrative documentation to doc/modules/ensemble and an example.
sklearn/ensemble/stacking.py (outdated):

```python
class BlendedEstimator(BaseEstimator, MetaEstimatorMixin, TransformerMixin):
    """Transformer to turn estimators into blended estimators

    This is used for stacking models. Blending an estimator prevents data leaks
```
I think you should define what blending is. It's a simple concept, but shouldn't be presumed knowledge.
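(For readers, a minimal sketch of the blending idea under discussion, using `cross_val_predict`; the dataset and estimator below are placeholders, not part of the PR:)

```python
# Blending in a nutshell: the features fed to the next layer at train time
# are out-of-fold predictions, so no estimator ever predicts on rows it was
# fit on. Illustrative only; this is not the PR's implementation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=200, random_state=0)
base = LogisticRegression()

# Each row of `blended` is predicted by a model fit without that row.
blended = cross_val_predict(base, X, y, cv=5, method="predict_proba")
print(blended.shape)  # (200, 2)
```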
sklearn/ensemble/stacking.py
Outdated
self.n_jobs = n_jobs | ||
|
||
def fit(self, *args, **kwargs): | ||
self.base_estimator = self.base_estimator.fit(*args, **kwargs) |
I think we should raise `NotImplementedError` if this is not going to behave the same as `fit_transform`.
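(To make the concern concrete: a hedged sketch of why `fit(X, y).transform(X)` and `fit_transform(X)` deliberately differ for a blended estimator. The class name and details are hypothetical, not the PR's code:)

```python
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.model_selection import cross_val_predict

class BlendedRegressorSketch(BaseEstimator, TransformerMixin):
    """Hypothetical blending wrapper, for illustration only."""

    def __init__(self, estimator, cv=3):
        self.estimator = estimator
        self.cv = cv

    def fit(self, X, y=None):
        # Refit on all training data; this model is used at test time.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Test time: plain predictions from the fully fitted estimator.
        return self.estimator_.predict(X).reshape(-1, 1)

    def fit_transform(self, X, y=None):
        # Train time: out-of-fold predictions, so fit_transform(X) is NOT
        # fit(X, y).transform(X) -- which is what the comment above is about.
        self.fit(X, y)
        return cross_val_predict(self.estimator, X, y, cv=self.cv).reshape(-1, 1)
```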
sklearn/ensemble/stacking.py
Outdated
def _identity_transformer(): | ||
"""Contructs a transformer that does nothing""" | ||
return FunctionTransformer(lambda x: x) | ||
|
PEP8: extra blank line
sklearn/ensemble/stacking.py (outdated):

```python
    Parameters
    ----------
    estimators_matrix: 2D matrix with base estimators. Each row will be
```
spaces before colons are required
sklearn/ensemble/stacking.py (outdated):

```python
    Returns
    -------
    p: Pipeline
```
can just say Pipeline here.
sklearn/ensemble/stacking.py
Outdated
for estimator in base_estimators] | ||
if restacking: | ||
estimators.append(_identity_transformer()) | ||
return make_union(*estimators) |
This is not a good idea. It means that the estimators will be named `BlendedEstimator1`, `BlendedEstimator2`, etc. in `get_params`. You should use `FeatureUnion` directly to give better prefixes; and you will need a way for the user to specify names themselves, like what `FeatureUnion` provides.
I agree the names are horrible and should be fixed. About the second part, `FeatureUnion` already exists if someone wants to specify the names. I don't see a reason to create a new API for that. The only arguable point here is that, if someone wants to create a stack layer by hand using `FeatureUnion`, the restacking will have to be implemented by hand as well. I'll give this a thought, but I can't see a clean solution to this yet (other than creating a second function that accepts named estimators instead of unnamed ones).
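(An illustration of the naming point, with stock transformers standing in for the blended estimators:)

```python
# make_union derives parameter prefixes from lowercased class names (with
# numeric suffixes for duplicates), while FeatureUnion lets the user pick
# the names. The transformers here are placeholders.
from sklearn.pipeline import FeatureUnion, make_union
from sklearn.preprocessing import MinMaxScaler, StandardScaler

auto = make_union(StandardScaler(), MinMaxScaler())
print("standardscaler" in auto.get_params())  # True: auto-generated name

named = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
print("std" in named.get_params())  # True: user-chosen name
```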
@jnothman I'm not too familiar with the literature either, but this shows how some big Kaggle players used a stacked model with three layers. I also didn't understand what you meant by "rectangular matrix". I'll fix the rest of the comments. If you think having the stack factory is too much, I'm not opposed to deleting the fn.
I have one doubt: both
This is a more technically correct term
I'm strongly against modifying Pipeline to special-case blending!
@jnothman I'm sorry, I didn't mean modifying the original `Pipeline`.
About adding a new verb: this would be (for the most common use cases) only used internally. Clients wouldn't have to know about it most of the time.
I am not very enthusiastic about it, because it means that code written for scikit-learn will either need to know about it, or will behave in an unintended way.
This way we can modify the Pipeline class to use them instead of the output from `fit_transform`.
Solutions that require modifying the pipeline seem wrong to me.
What is wrong with the solution of using a meta estimator?
@GaelVaroquaux IMO the discussion of choosing between this or #11047 boils down to this: scikit-learn's current API doesn't support a flexible stacking implementation, so we have these choices so far:

As you can see, I'm strongly against making the stacking implementation less flexible, thus leaving two choices, one of which you seem to be strongly against too. Also, I don't see how making the
Just to make sure it's clear, there's nothing stopping a StackingClassifier of #11047 from having multiple layers. That's just a matter of nesting:

```python
StackingClassifier(
    [('x', XClassifier()), ('y', YClassifier()), ('z', ZClassifier())],
    StackingClassifier(
        [('x', XClassifier()), ('y', YClassifier()), ('z', ZClassifier())],
        DecisionTreeClassifier()
    )
)
```

compared to:

```python
Pipeline([
    ('layer1', make_stack_layer(XClassifier(), YClassifier(), ZClassifier())),
    ('layer2', make_stack_layer(XClassifier(), YClassifier(), ZClassifier())),
    ('predict', DecisionTreeClassifier())
])
```

I think the former example is a bit less intuitive as analogous to layers (for people not coming from the Lisp world!). But it does not break any conventions we might want to hold on to. I definitely think that if we go down the StackingClassifier path, an example illustrating usage with multiple layers is necessary. @jorisvandenbossche, I think
@jnothman this is the refactor I wanted to do on `Pipeline`: caioaao@92ec504

It's a bit hacky but I thought it'd be the least intrusive way of making pipeline behavior extendable. This way we can just override the properties that return functions used for caching, and I'll be able to create a stacking-specific pipeline.
I've probably just forgotten something, but can you explain why this is better than just nesting StackingClassifier like in my example?
@jnothman I think I wasn't clear about my approach. I started a PoC that you can check here:
I just realized I moved a lot of code around and it may not be easy to see where I'm going with this implementation, so here it goes. The idea is that we should be able to access and manipulate the stack layers. There are some reasons for this:
What I'm aiming for in the patch is to have a simple API for the base case (the one implemented in #11047) and hiding the new

The base case would be like this:

```python
from sklearn.ensemble import make_stacked_ensemble
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR

layer0 = [LinearRegression(), LinearSVR(random_state=RANDOM_SEED)]
layer1 = [LinearRegression(), LinearSVR(random_state=RANDOM_SEED)]
final_estimator = LinearRegression()

# this will return a StackingPipeline
final_clf = make_stacked_ensemble(layer0, layer1, final_estimator)
final_clf.fit(Xtrain, ytrain)
ypreds = final_clf.predict(Xtest)
```

And for accessing the internal layers:

```python
from sklearn.ensemble import StackingPipeline, make_stack_layer
from sklearn.linear_model import LinearRegression
from sklearn.svm import LinearSVR

layer0 = make_stack_layer(LinearRegression(), LinearSVR(random_state=RANDOM_SEED))
layer1 = make_stack_layer(LinearRegression(), LinearSVR(random_state=RANDOM_SEED))
final_estimator = LinearRegression()

final_clf = StackingPipeline([layer0, layer1, final_estimator])
final_clf.fit(Xtrain, ytrain)
ypreds = final_clf.predict(Xtest)
```

Also, this implementation would solve the problem with `fit_transform`:

```python
from sklearn.ensemble import StackableTransformer, StackingLayer
from sklearn.linear_model import LinearRegression

stackable_t = StackableTransformer(LinearRegression())
# stackable_t.fit().transform() === stackable_t.fit_transform()

layer = StackingLayer([stackable_t])
# layer.fit().transform() === layer.fit_transform()

# the stackable transformer passed to StackingLayer must implement `blend`
stackable_t.blend()  # => CV prediction
layer.blend()  # => CV predictions
```

This way we can just swap

TLDR:
Since this PR has been dragging on for a long time and we have some conflicting opinions, I'm thinking about doing it in a separate library. I'd like to merge my pipeline refactor here so I can reuse the code from sklearn, is that ok with you? If not, I can just copy the pipeline code and modify it, but I don't know if that's ok under sklearn's license.
Actually, I forgot to mention earlier that we are going to have a sprint at SciPy next week. I was waiting to see an outcome for this PR either way, and to discuss the fit_transform issue, which will also come up for other estimators in the future. So this feature might not be a priority for 0.20 but should get into 0.21.
@glemaitre anyway, I think it's best to create another library for this, as it'll give me more freedom to do what I want. We can discuss it again and, if doable, integrate the library back into scikit-learn in the future.
I think both making a library of it and getting resolution at the sprint are good ideas. Hopefully the separate library won't be essential for long, or will be able to provide something extra for power users.
Great! Then all I need atm is for this PR to be merged: #11446
I just released a beta version of the stacking framework based on this PR. If you want to check it out, here's the link: http://github.com/caioaao/wolpert

I'm closing this PR for now and will focus on evolving the library instead. Thank you all for your time and comments :)
Sigh. Thanks to you. I still think we need to work out how to resolve this. It seems a silly thing to get stuck on.
Just one comment about the `fit_transform` issue: I believe we can circumvent this problem the same way we did for LOF.
No, it's not the same situation at all. Standard stacking will pass on blended predictions at training time, but prediction of a single model at test time. It's not a different application / setting / modelling situation as in LOF.
It might be best if we solve this with fit_resample. That is explicitly designed for the case where training and prediction transformations differ. (Although for resampling, the transform would usually be the identity function, while here it isn't.)
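(A hypothetical sketch of that idea; `fit_resample` is the proposed verb in this discussion, not an existing scikit-learn API, and the class below is illustrative only:)

```python
from sklearn.base import BaseEstimator, clone
from sklearn.model_selection import cross_val_predict

class BlendingResamplerSketch(BaseEstimator):
    """Hypothetical step whose fit-time output differs from predict-time."""

    def __init__(self, estimator, cv=3):
        self.estimator = estimator
        self.cv = cv

    def fit_resample(self, X, y):
        # Fit time: replace X with out-of-fold predictions, keep y.
        self.estimator_ = clone(self.estimator).fit(X, y)
        Xt = cross_val_predict(self.estimator, X, y, cv=self.cv)
        return Xt.reshape(-1, 1), y

    def transform(self, X):
        # Predict time: not the identity function, unlike usual resamplers.
        return self.estimator_.predict(X).reshape(-1, 1)
```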
I agree with @jnothman here, but I feel using
I'm not sure that's agreeing with me :) We don't have a better verb than "resample" for the case where we want an estimator to change X, y, etc. in a pipeline at fit time. If you can come up with one, it's welcome. But the question of whether an estimator supporting that functionality can also perform transformations at predict time is open; here it would be necessary.
Yeah, I agreed with you about it not being the same situation as LOF.
> It might be best if we solve this with fit_resample.

Yes (though naming is challenging). I think that this is the right situation to explore.
See discussion on #7427 for more info.

TODO:
- predict_proba;

Improvements over #7427:
- (Caio Oliveira);

Possible improvements:

Improvements from review:
- StackLayer;
- Write examples for interesting use cases, including: efficient hyperparam optimization techniques; blending without CV; pre-training base estimators.

Relevant discussions:
- Why to use StackingTransformer instead of implementing classes. Summary: #8960 (comment)
- Other discussions:
  - Is having more than two layers worth it?
  - Factory methods vs Class inheritance