[MRG] FEA: Stacking estimator for classification and regression #11047


Merged
merged 127 commits into scikit-learn:master on Sep 18, 2019

Conversation

glemaitre
Member

@glemaitre glemaitre commented May 1, 2018

Reference Issues/PRs

closes #4816
closes #8960
closes #7427
closes #6674

requires #14305 to be merged.

What does this implement/fix? Explain your changes.

Implement two meta-estimators to perform stacking, for both classification and regression problems.
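
A minimal usage sketch of the API (shown with the parameter names as eventually merged into sklearn.ensemble; treat the exact signatures as illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # The level-0 estimators are fitted on cross-validation folds and their
    # out-of-fold predictions become the training data of the final estimator.
    clf = StackingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=42)),
                    ('svc', LinearSVC(random_state=42))],
        final_estimator=LogisticRegression(),
    )
    clf.fit(X_train, y_train).score(X_test, y_test)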

Any other comments?

@glemaitre
Member Author

@jnothman @amueller @GaelVaroquaux This is what I had in mind.

@glemaitre
Member Author

glemaitre commented May 1, 2018 via email

self.estimators = estimators
self.meta_estimator = meta_estimator
self.cv = cv
self.method_estimators = method_estimators
Contributor


I believe that method_estimators is not an obvious name. output_confidence, metafeatures, or something like that could be more targeted. In the case of classification, if output_confidence is True then the level-1 estimator should be trained on the (cross-validated or not) outputs of predict_proba or decision_function of the level-0 estimators; otherwise, the level-1 estimator should be trained on the prediction outputs. Another option could be a parameter like metafeatures with several options like:

  • predictions
  • confidences
  • cv_predictions
  • cv_confidences

Of course, we could use both the metafeatures and the original feature space to train the level-1 estimator.

As for the confidence outputs, an option could be to aggregate them using a descriptive statistic like the mean or median, so that the dimensionality of the metafeatures is the same as the number of classes.
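
A rough sketch of the cv_confidences option described above, using a hypothetical build_metafeatures helper (not part of this PR):

    import numpy as np
    from sklearn.model_selection import cross_val_predict

    def build_metafeatures(estimators, X, y, cv=5):
        # Column-stack the cross-validated probability outputs of the
        # level-0 estimators to form the level-1 training data.
        blocks = [cross_val_predict(est, X, y, cv=cv, method='predict_proba')
                  for est in estimators]
        return np.hstack(blocks)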

Member Author


What is the use case in which we want to avoid the CV when fitting the final estimator?
method_estimators is the closest keyword that I found to be in accordance with the method keyword in other estimators (in grid-search, I think).

Member


How about predict_method?

@chkoar
Contributor

chkoar commented May 6, 2018

It is very nice that you started this @glemaitre. If you need any help don't hesitate to ping me.

@glemaitre glemaitre changed the title [WIP] Stacking estimator for classification and regression [MRG] Stacking estimator for classification and regression May 6, 2018
@glemaitre
Member Author

It is very nice that you started this @glemaitre. If you need any help don't hesitate to ping me.

Actually, there were some good PRs which allowed us to fix some issues. I would appreciate it if you could make a review :)

@glemaitre
Member Author

@amueller I drafted the documentation as well as an example. I hope that it will motivate you to give the review a go ;)

NB: we should improve the example once the ColumnTransformer is merged. There is a common use case in which we apply different classifiers to different columns and use those as the estimators. That is actually the use case I am most accustomed to.

@glemaitre
Member Author

Artifacts for the documentation:

I see that there are some glitches to be corrected.

Member

@TomDLT TomDLT left a comment


I directly pushed minor improvements in the example.

Your implementation does not allow passing X unchanged to the final estimator, does it?

@glemaitre
Member Author

Your implementation does not allow passing X unchanged to the final estimator, does it?

It does not. What do you propose? I see two solutions:

  • Create a keyword "passthrough=True/False" and stack X accordingly
  • Create a strategy "input" for the DummyRegressor which would return the input unchanged and could be added to the estimators list.

The first one is easier for a user to discover, I think. WDYT?
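
A sketch of the first option, with passthrough treated as a hypothetical keyword here:

    import numpy as np

    def _concatenate(X, predictions, passthrough=False):
        # With passthrough=True the final estimator is trained on the
        # original features stacked with the level-0 predictions;
        # otherwise it only sees the predictions.
        if passthrough:
            return np.hstack([X, predictions])
        return predictions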

@TomDLT
Member

TomDLT commented May 23, 2018

The first one is easier for a user to discover, I think.

I agree.

@TomDLT
Member

TomDLT commented May 23, 2018

Don't forget the .. versionadded:: 0.20 label in both new classes docstrings.
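
For reference, the numpydoc label goes at the end of the class summary, along these lines (docstring text illustrative):

    class StackingClassifier:
        """Stack of estimators with a final classifier.

        .. versionadded:: 0.20
        """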

@rth
Member

rth commented May 23, 2018

Your implementation does not allow passing X unchanged to the final estimator, does it?

I'm not using stacking extensively, but IMHO this PR is complex enough; since passing X to the final estimator may not be the primary reason people would use stacking, maybe this feature can be proposed in a separate PR?

@thomasjpfan
Member

Currently this PR drops a column when stack_method is predict_proba, and this applies to both binary and multiclass.
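
A sketch of the behaviour being discussed: the class probabilities sum to one, so one column of the predict_proba output is redundant (values made up):

    import numpy as np

    proba = np.array([[0.2, 0.8],
                      [0.7, 0.3]])   # binary predict_proba output
    metafeatures = proba[:, 1:]      # drop the redundant first column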

@qinhanmin2014
Member

Regarding dropping the column when stack_method='predict_proba', should we control which class to drop?

I think we can have such an option, but we should not drop any columns by default.

@glemaitre
Member Author

I am not convinced that not dropping is a good default.

@glemaitre
Member Author

I added the drop parameter.

@glemaitre
Member Author

glemaitre commented Sep 18, 2019

I am still having doubts regarding the dropping. We are discussing which estimator should or should not benefit from it. I am still convinced that dropping would be best because, apart from trees which might use this info (and I am not convinced about it, as mentioned in #11047 (comment)), it is really likely that users have linear and non-linear models in estimators and therefore dropping will be better than not dropping (this is really wrong).

@qinhanmin2014
Member

I still prefer not to drop the first column by default, but since we've introduced the drop parameter and the default final_estimator is a linear model, I think that's acceptable.

@qinhanmin2014
Member

Actually, apart from tree-based models, I'm not sure whether co-linear features are useful for regularized linear models (they're certainly not useful for unregularized linear models).
But I tend to believe that the influence is small, which is why I approve this PR.

Member

@jnothman jnothman left a comment


I'm really not sure about the importance of this dropping business. Without evidence to the contrary, I'd say YAGNI: dropping one of two columns in binary classification is obviously justified, for any final estimator, but otherwise it's unclear to me that we are providing the user with a useful parameter.

@glemaitre writes that "it is really likely that user have linear and non-linear models in estimators and therefore dropping will be better than not dropping". I don't get why having a particular type of estimator in estimators matters. Isn't the dropping all about what the final estimator does?

@glemaitre
Member Author

Isn't the dropping all about what the final estimator does?

Completely true. Wrong reasoning on my side.

@jnothman
Member

jnothman commented Sep 18, 2019 via email

@ogrisel
Member

ogrisel commented Sep 18, 2019

+1 for always dropping for binary classifiers and never dropping for multiclass to keep things simple. The other cases are YAGNI in my opinion. No need for an option.

@jnothman
Member

jnothman commented Sep 18, 2019 via email

@qinhanmin2014
Member

+1 for always dropping for binary classifiers and never dropping for multiclass to keep things simple. The other cases are YAGNI in my opinion. No need for an option.

I think this is a better solution. So there's +3, maybe enough?
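
A sketch of that rule; this matches what the merged implementation ends up doing per level-0 output (function name hypothetical):

    def select_columns(preds, n_classes, stack_method):
        # Drop the first probability column only in the binary case; keep
        # everything for multiclass and for decision_function/predict.
        if stack_method == 'predict_proba' and n_classes == 2:
            return preds[:, 1:]
        return preds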

###############################################################################
# Stack of predictors on a single data set
###############################################################################
# It is sometimes tedious to find the model which will best perform on a given
Member


And somehow, I think the logic here is strange. I'm still not sure whether it's good to throw "bad" estimators at a stacker. It's true that a stacker can somehow filter out "bad" estimators (e.g., by giving them lower weight), but I think these "bad" estimators will still influence the performance of the stacker.
I think one should use cross-validation to check the performance of all the estimators and pass only "good" estimators to a stacker to produce a "better" estimator.
But I'm not sure, and I think this example is acceptable.
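
A sketch of that selection workflow (the data set, candidates, and the 0.3 cutoff are all arbitrary choices for illustration):

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsRegressor

    X, y = load_diabetes(return_X_y=True)
    candidates = [('ridge', Ridge()),
                  ('rf', RandomForestRegressor(random_state=42)),
                  ('knn', KNeighborsRegressor())]

    # Keep only the candidates whose cross-validated R^2 clears the cutoff;
    # `selected` can then be passed as the `estimators` of the stacker.
    selected = [(name, est) for name, est in candidates
                if cross_val_score(est, X, y, cv=5).mean() > 0.3]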

Member


I agree. Stacking is mostly useful to slightly improve accuracy by combining the strengths of good models (hopefully diverse enough that their errors are not too correlated).

Throwing in bad models is likely to negatively impact the performance of the ensemble (and it's computationally wasteful).

Member Author


OK, so I replaced AdaBoostRegressor and KNNRegressor with HistGradientBoostingRegressor and RandomForestRegressor so as to have only strong learners, and changed the conclusion slightly. I will add a bit of timing as well to show that training a stacked learner is computationally expensive.
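
For reference, a sketch of the updated estimator list (hyperparameters illustrative; the experimental import was required for HistGradientBoostingRegressor at the time):

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.experimental import enable_hist_gradient_boosting  # noqa
    from sklearn.ensemble import HistGradientBoostingRegressor

    estimators = [
        ('Random Forest', RandomForestRegressor(random_state=42)),
        ('Gradient Boosting', HistGradientBoostingRegressor(random_state=42)),
    ]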

@glemaitre
Member Author

Works for me

@jnothman
Member

Let's do this... Thanks to @caioaao and @glemaitre both for the great conception of these competing APIs and for persistence in finding a solution we can all accept!!

@jnothman jnothman merged commit bab5926 into scikit-learn:master Sep 18, 2019
@GaelVaroquaux
Member

GaelVaroquaux commented Oct 3, 2019 via email

@wderose

wderose commented Nov 23, 2019

@caioaao @glemaitre : Thank you for this awesome feature. Should the stack_method="auto" priority match that of CalibratedClassifierCV?

decision_function -> proba -> predict

Instead of

proba -> decision_function -> predict
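
For context, a sketch of the equivalent 'auto' logic as merged (StackingClassifier tries these methods in order and uses the first one the estimator provides):

    def resolve_stack_method(estimator):
        # Merged priority: predict_proba, then decision_function, then predict.
        for method in ('predict_proba', 'decision_function', 'predict'):
            if hasattr(estimator, method):
                return method
        raise ValueError('estimator implements none of the candidate methods')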

@jnothman
Member

jnothman commented Nov 23, 2019 via email

@wderose

wderose commented Nov 24, 2019

Raised the predict-function alignment issue in #15711.
