[MRG] FEA: Stacking estimator for classification and regression #11047
Conversation
@jnothman @amueller @GaelVaroquaux It is what I had in mind. Example and doc coming very soon :D
sklearn/ensemble/_stacking.py (outdated diff):

```python
self.estimators = estimators
self.meta_estimator = meta_estimator
self.cv = cv
self.method_estimators = method_estimators
```
I believe that `method_estimators` is not an obvious name. `output_confidence`, `metafeatures`, or something like that could be more targeted. In the classification case, if `output_confidence` is `True`, the level-1 estimator should be trained on the (cross-validated or not) outputs of `predict_proba` or `decision_function` of the level-0 estimators; otherwise the level-1 estimator should be trained on the prediction outputs. Another option could be a parameter like `metafeatures` with several options, such as:

- predictions
- confidences
- cv_predictions
- cv_confidences

Of course, we could use both the metafeatures and the original space to train the level-1 estimator.
As for the confidence outputs, an option could be to aggregate the outputs across estimators using a descriptive statistic like the mean or the median, so that the dimensionality of the metafeatures is the same as the number of classes.
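To make the `cv_confidences` idea concrete, here is a minimal, hypothetical sketch (not this PR's code) that builds metafeatures from cross-validated `predict_proba` outputs of the level-0 estimators and trains a level-1 estimator on them:

```python
# Hypothetical sketch of the "cv_confidences" option discussed above, built only
# on existing scikit-learn tools; it is not the implementation in this PR.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

level0 = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    SVC(probability=True, random_state=0),
]

# Cross-validated confidences of the level-0 estimators become the metafeatures.
metafeatures = np.column_stack([
    cross_val_predict(est, X, y, cv=5, method="predict_proba")
    for est in level0
])

# The level-1 (final) estimator is trained on those metafeatures.
level1 = LogisticRegression(max_iter=1000).fit(metafeatures, y)
```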
What is the use case in which we would want to avoid cross-validation when fitting the final estimator?
`method_estimators` is the closest keyword I found to stay in accordance with the `method` keyword of the other estimators (in grid-search, I think).
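As a side note on why the cross-validated variant matters: if the final estimator is trained on in-sample predictions of the base estimators, the metafeatures it sees are over-optimistic. A small illustrative comparison (toy data and an arbitrary base model, not code from this PR):

```python
# Illustration of why cross-validated predictions are preferable as metafeatures:
# in-sample predictions of an overfitting base model look unrealistically good,
# which would mislead the final estimator.
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
base = DecisionTreeRegressor(random_state=0)

in_sample = base.fit(X, y).predict(X)                # metafeature without CV
out_of_sample = cross_val_predict(base, X, y, cv=5)  # metafeature with CV

# The in-sample column is an almost perfect copy of y; the CV column is not.
print("in-sample MSE:", np.mean((in_sample - y) ** 2))
print("cross-validated MSE:", np.mean((out_of_sample - y) ** 2))
```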
How about `predict_method`?
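For the record, the parameter name that ended up in the released estimators is, as far as I recall, `stack_method`, with values such as `'auto'`, `'predict_proba'`, `'decision_function'` and `'predict'`; treat the exact values here as an assumption. A hedged usage sketch:

```python
# Sketch of the merged API as I recall it; the `stack_method` values are an
# assumption based on the released scikit-learn estimator, not this PR's diff.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

clf = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("svc", LinearSVC(random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    # 'auto' tries predict_proba, then decision_function, then predict for each
    # base estimator; a specific method, e.g. 'predict_proba', can be forced.
    stack_method="auto",
    cv=5,
)
clf.fit(X, y)
```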
It is very nice that you started this @glemaitre. If you need any help, don't hesitate to ping me.
Actually, it was a good PR which allowed us to fix some issues. I would appreciate it if you could make a review :)
@amueller I drafted the documentation as well as an example. I hope that it will motivate you to give the review a go ;) NB: we should improve the example once the …
Artifacts for the documentation: …
I see that there are some glitches to be corrected.
I directly pushed minor improvements to the example.
Your implementation does not allow passing `X` unchanged to the final estimator, does it?
It does not allow it. What do you propose? I see two solutions: …
The first one is less complicated for a user to discover, I think. WDYT?
I agree.
Don't forget the …
I'm not using stacking extensively, but IMHO this PR is complex enough; since passing …
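If memory serves, the "pass `X` unchanged to the final estimator" option discussed here ended up as a boolean `passthrough` flag in the released estimators; the flag name should be treated as my recollection rather than something stated in this thread. A minimal sketch:

```python
# Hedged sketch: `passthrough` is assumed to be the flag that concatenates the
# original features X with the level-0 predictions before fitting the final
# estimator (with False meaning the final estimator sees predictions only).
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge

X, y = load_diabetes(return_X_y=True)

reg = StackingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=Ridge(),
    passthrough=True,  # assumption: feed X alongside the stacked predictions
)
reg.fit(X, y)
```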
Currently this PR is dropping when the `stack_method` is …
I think we can have such an option, but we should not drop any columns by default.
I am not convinced that not dropping is a good default.
I added the …
I still have doubts regarding the dropping. We are discussing which estimators would or would not benefit from it. I am still convinced that dropping would be best because, apart from trees which might be using this info (and I am not convinced about that, as mentioned in #11047 (comment)), …
I still prefer not to drop the first column by default, but since we've introduced a drop parameter and the default `final_estimator` is a linear model, I think that's acceptable.
Actually, apart from tree-based models, I'm not sure whether collinear features are useful for regularized linear models (they're certainly not useful for unregularized linear models).
I'm really not sure about the importance of this dropping business. Without evidence to the contrary, I'd say YAGNI: dropping one of two columns in binary classification is obviously justified, for any final estimator, but otherwise it's unclear to me that we are providing the user with a useful parameter.
@glemaitre writes that "it is really likely that user have linear and non-linear models in estimators and therefore dropping will be better than not dropping". I don't get why having a particular type of estimator in `estimators` matters. Isn't the dropping all about what the final estimator does?
Completely true. Wrong reasoning on my side.
Should we resolve the drop decision at the monthly meeting on Monday?
+1 for always dropping for binary classifiers and never dropping for multiclass to keep things simple. The other cases are YAGNI in my opinion. No need for an option.
That's how I'm feeling too. Happy to see an option added later with empirical support.
I think this is a better solution. So there's +3, maybe enough?
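The binary case everyone agrees on is easy to verify: the two `predict_proba` columns sum to one, so one of them carries no extra information for the final estimator. A small illustration (not part of the PR):

```python
# Why dropping one column is harmless in binary classification: the two
# predict_proba columns are perfectly collinear because they sum to 1.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
proba = LogisticRegression(max_iter=5000).fit(X, y).predict_proba(X)

# Each row sums to 1, so column 1 is fully determined by column 0.
assert np.allclose(proba.sum(axis=1), 1.0)
assert np.allclose(proba[:, 1], 1.0 - proba[:, 0])
```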
```python
###############################################################################
# Stack of predictors on a single data set
###############################################################################
# It is sometimes tedious to find the model which will best perform on a given
```
And somehow, I think the logic here is strange. I'm still not sure whether it's good to throw "bad" estimators into a stacker. It's true that a stacker can somehow filter out "bad" estimators (e.g., by giving them lower weights), but I think these "bad" estimators will still influence the performance of the stacker.
I think one should use cross-validation to check the performance of all the estimators and only feed the "good" estimators to the stacker to produce a "better" estimator.
But I'm not sure, and I think this example is acceptable.
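One way to act on this suggestion, shown purely as an illustration (the threshold and estimators are arbitrary, and this is not code from the PR), is to screen candidates with `cross_val_score` and only stack the ones that clear a bar:

```python
# Hypothetical pre-screening step: keep only the estimators whose cross-validated
# score clears an (arbitrary, illustrative) threshold before building the stack.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

candidates = {
    "ridge": Ridge(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
    "tree": DecisionTreeRegressor(random_state=0),
}

# Screen candidates with 5-fold R^2 and keep the reasonably good ones.
kept = [
    (name, est)
    for name, est in candidates.items()
    if cross_val_score(est, X, y, cv=5).mean() > 0.3  # threshold is illustrative
]

stack = StackingRegressor(estimators=kept, final_estimator=Ridge())
stack.fit(X, y)
```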
I agree. Stacking is mostly useful to slightly improve accuracy by combining the strengths of good models (hopefully diverse enough that their errors are not too correlated).
Throwing in bad models is likely to negatively impact the performance of the ensemble (and it's computationally wasteful).
OK, so I replaced `AdaBoostRegressor` and `KNNRegressor` with `HistGradientBoostingRegressor` and `RandomForestRegressor` so that we only have strong learners, and I changed the conclusion slightly. I will add a bit of timing as well to show that training a stacked learner is computationally expensive.
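A hedged sketch of what such a timing comparison might look like; the estimators mirror the ones mentioned above, but the code and dataset are illustrative rather than the example that was actually committed:

```python
# Illustrative timing of a single strong learner versus a stack of strong
# learners; absolute numbers depend on the machine, and this is not the PR's
# example code.
from time import perf_counter

from sklearn.datasets import load_diabetes
from sklearn.ensemble import (
    HistGradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)

single = HistGradientBoostingRegressor(random_state=0)
stack = StackingRegressor(
    estimators=[
        ("hgb", HistGradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ],
    final_estimator=RidgeCV(),
)

for name, est in [("single HGB", single), ("stacked ensemble", stack)]:
    start = perf_counter()
    est.fit(X, y)
    print(f"{name}: fitted in {perf_counter() - start:.2f} s")
```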
Works for me.
Let's do this... Thanks to @caioaao and @glemaitre, both for the great conception of these competing APIs and for the persistence in finding a solution we can all accept!!
Woooot! I hadn't seen that the merge had happened! This is great!
Congratulations and thank you to all involved.
@caioaao @glemaitre: Thank you for this awesome feature. Should the …
Instead of …
That's a good question. Open a new issue so that we can consider it before the release?
Raised the predict-function alignment issue in #15711.
Reference Issues/PRs
closes #4816
closes #8960
closes #7427
closes #6674
requires #14305 to be merged.
What does this implement/fix? Explain your changes.
Implements two meta-estimators to perform stacking (for classification and regression problems).
Any other comments?