
[WIP] Add new feature StackingClassifier #7427


Closed
wants to merge 11 commits

Conversation

yl565
Contributor

@yl565 yl565 commented Sep 14, 2016

PR to #4816. This is a continuation of #6674.

To-do:



@yl565 yl565 changed the title [WIP] Add new feature StackingClassifier [MRG] Add new feature StackingClassifier Sep 23, 2016
@yl565
Contributor Author

yl565 commented Sep 23, 2016

Stacking classifier implemented. All suggestions are welcome. I'm also considering adding the ability to set estimators in set_params, similar to #7288
@MechCoder You mentioned there are questions regarding API?

Member

@amueller amueller left a comment

This looks nice. Can you add an illustrative example?
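For context, a minimal sketch of the kind of illustrative example being requested, assuming the estimators/meta_estimator constructor arguments used in this PR's tests (the import path is an assumption, since the class only exists on this branch):

# Sketch only: StackingClassifier is the class added by this PR; the import
# path below is assumed, not final.
from sklearn.ensemble import StackingClassifier  # assumption: provided by this PR
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[('lr', LogisticRegression()),
                ('rf', RandomForestClassifier(random_state=0)),
                ('nb', GaussianNB())],
    meta_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))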



class StackingClassifier(BaseEstimator, ClassifierMixin):
""" Stacking classifier for combining unfitted estimators
Member

I know what you mean by unfitted but I feel it is a bit awkward here and in the next sentence.


For integer/None inputs, if the estimator is a classifier and ``y`` is
either binary or multiclass, :class:`StratifiedKFold` is used. In all
other cases, :class:`KFold` is used.
Member

In all other cases? If y is a different format, I'd say

-------
self : object
"""
if isinstance(y, np.ndarray) and len(y.shape) > 1 and y.shape[1] > 1:
Member

maybe use _type_of_target? not sure

Contributor Author

Could you please explain what you mean by _type_of_target?

Member

sklearn.utils.multiclass.type_of_target
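For reference, a quick illustration of what type_of_target returns for different y formats:

from sklearn.utils.multiclass import type_of_target

print(type_of_target([0, 1, 1, 0]))              # 'binary'
print(type_of_target([0, 1, 2]))                 # 'multiclass'
print(type_of_target([[1, 0, 1], [0, 1, 0]]))    # 'multilabel-indicator'
print(type_of_target([[0.5, 2.1], [1.3, 0.2]]))  # 'continuous-multioutput'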

Contributor Author

thanks

raise NotImplementedError('Multilabel and multi-output'
' classification is not supported.')

if self.estimators is None or len(self.estimators) == 0:
Member

How about `if not self.estimators`? Also, please add a `got {}` to the error message.
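For illustration, a standalone sketch of the suggested check (_check_estimators is a hypothetical helper name, not code from this PR):

def _check_estimators(estimators):
    # `if not estimators` covers both None and an empty list, and the error
    # message reports what was actually received, as suggested above.
    if not estimators:
        raise AttributeError(
            'Invalid `estimators` attribute, `estimators` should be a '
            'non-empty list of (name, estimator) tuples; got {!r}'
            .format(estimators))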


if not is_classifier(self.meta_estimator):
raise AttributeError('Invalid `meta_estimator` attribute, '
'`meta_estimator` should be a classifier')
Member

maybe also print the class name?

The meta-estimator to combine the predictions of each individual
estimator

method : string, optional, default='predict_proba'
Member

I think "default" should be "auto" which is predict_proba if it exists and decision_function otherwise.

raise ValueError('Underlying estimator `{0}` does not '
'support `{1}`.'.format(name, param))

self.le_ = LabelEncoder()
Member

self.le_ = LabelEncoder().fit(y) ?

self.meta_estimator_.fit(scores, transformed_y, **kwargs)
return self

def _form_meta_inputs(self, predicted):
Member

maybe this doesn't need to be a separate method? maybe just a loop in _est_predict or something?

@yl565
Contributor Author

yl565 commented Oct 10, 2016

@amueller I have updated the code to reflect your suggestions, though I figured _form_meta_inputs is still needed for transforming the cross-validated scores in this line

@yl565
Contributor Author

yl565 commented Oct 17, 2016

@amueller, do you mean show an example of its usage including the graphs in this issue's conversation?

@amueller
Member

I mean an example to show how to use it and what it does. I haven't really put any thought into it ;)

@yl565
Contributor Author

yl565 commented Nov 17, 2016

@jnothman Should I update this PR with _BaseComposition after #7674 is merged? Or should I open a new PR after this PR is merged?

@jnothman
Member

Sure, you can update it with _BaseComposition after #7674 is merged.


@yl565 yl565 changed the title [MRG] Add new feature StackingClassifier [WIP] Add new feature StackingClassifier Nov 17, 2016
@yl565 yl565 mentioned this pull request Nov 20, 2016
@ivallesp
Contributor

Hi,
As I wrote in #6674, I would like to collaborate on this given my experience on Kaggle.

I have been reviewing the work done by @yl565 and it is impressive. I really like how it is organized and I think it is very well done. However, I would like to add one piece of functionality which may be key to the implementation: some problems, especially Kaggle ones, require training thousands of classifiers; that's why I think the current implementation is a bit monolithic. It would be nice to be able to generate the train and test meta-predictors separately, so that they can be stored to disk and retrieved later to either create a next layer or combine them using one meta-model.

So, what I mean is that, for example, if the meta_estimator parameter of the StackingClassifier class is None, the fit method would return an object with an attribute containing a matrix of all the training meta-predictors, and the predict method would return a matrix with all the predictions of the models trained on the whole training set. In both cases the columns would be in the same order, ensuring a correct correspondence between the datasets.

What do you think about it? Does it make sense to you? If so, I can help develop that functionality.

Best,
Iván

@yl565
Contributor Author

yl565 commented Nov 20, 2016

@ivallesp, I'm not sure I completely understand what you have in mind, but it seems to me the predict method you are thinking about is transform in sklearn convention? I think we could add something like add_estimator(self, estimators, is_trained) and delete_estimator(self, estimator_names) to allow re-using trained sub-estimators. I'm not sure how you could save a trained estimator to disk though...

@jnothman
Member

Hi Iván,

Are you mostly talking about collapsing, for the sake of prediction, a set of predictors, when they only involve matrix multiplication?


@ivallesp
Contributor

ivallesp commented Nov 21, 2016

Sorry, it seems I was not very clear.

Stacked generalization is based on generating, from a list of models, training meta-predictors: out-of-fold predictions are produced with k-fold cross-validation and stacked into vectors (meta-features; one per model) with the same length as the training set. That is what the cross_val_predict method does. The second step consists of training the models on the whole training set to generate the test meta-predictors. On top of this, a meta-model is trained to intelligently combine the meta-predictors into a more powerful prediction which, in theory, will at worst match the best of your individual predictions.

What I mean is that it would be really interesting to be able, once several models have been trained and the new training and test sets composed of meta-predictors have been generated, to retrieve these matrices (or datasets) so they can be treated in a different way; that is, to be able to stop just before applying the meta-model. That way, the user could store these meta-predictors for the training and the test set in order to, for example, build a new stacked generalization (a new layer) on top of this. Another example would be appending the meta-predictors to the original training and test sets and building a model that combines the meta-predictors and the original features.

In addition I would remark that sometimes it may be useful to predict a transformed target variable; for example, in the case of a skewed target variable in a regression problem, you can build a stacker using 20 models with the original target variable, 20 more with the log of the target variable, and 20 more using the Box-Cox of it. The user could then combine these 60 meta-predictors and build a meta-model on top of them. For that, we would need access to the training-set meta-predictors and the test-set meta-predictors.

Am I being clearer now? If not, or not completely, please do not hesitate to let me know and I will try to add more examples.
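For illustration, a minimal sketch of the two steps described above using cross_val_predict; the dataset and models are arbitrary:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
base_models = [RandomForestClassifier(random_state=0), LogisticRegression()]

# Step 1: out-of-fold predictions on the training set (one meta-feature per model).
train_meta = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method='predict_proba')[:, 1]
    for m in base_models])

# Step 2: refit each model on the whole training set to build the test meta-features.
test_meta = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_models])

# The meta-model combines the meta-predictors.
meta = LogisticRegression().fit(train_meta, y_train)
print(meta.score(test_meta, y_test))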

@jnothman
Member

Right. I'm not able to give this enough attention to familiarise myself better with the techniques you are suggesting, but I'm interested in identifying an API that provides maximum flexibility while keeping it simple. One option, as Yichuan suggests, is to have a way to do the stacked classifier learning process, but then provide a transform that bypasses the metaestimator so that you can put it in any pipeline context. I think collapsing multiple estimators into one fast prediction, if you were ever suggesting that, might be something for version 2.


@yl565
Contributor Author

yl565 commented Nov 21, 2016

I will add the transform method; it could be useful for someone.

@ivallesp
Contributor

thank you!

@yl565
Contributor Author

yl565 commented Nov 21, 2016

I'm thinking about something like transform(self, X, is_apply_meta=True) so that when is_apply_meta=True, the transform of the meta-estimator will be called if it exists. Otherwise (is_apply_meta=False), the output will be a matrix whose columns are the outputs of the sub-estimators. @jnothman, what's your opinion?

@jnothman
Member

I think it's fine to assume applying meta is false. Just describe the transformation correctly. After all, that entire meta functionality can be produced with a Pipeline.


@jnothman
Member

jnothman commented Jan 9, 2017

Add to your todo list: narrative documentation (in doc/) and an example in examples/ comparing voting classifier with a couple of stacking meta-classifiers (although the real boon here is that it can be used for regression too)

@jnothman
Member

jnothman commented Jan 9, 2017

We probably want a StackingClassifier and a StackingRegressor (though I think we could in this case build them into one class...)

Member

@jnothman jnothman left a comment

You've not really tested the cv != 1 case.

There should also be a test that the estimators are cloned (i.e. the original inputs are unaffected).

Otherwise, this is looking pretty good!

"""
if any(s in type_of_target(y) for s in ['multilabel', 'multioutput']):
raise NotImplementedError('Multilabel and multi-output'
' classification is not supported.')
Member

why not?

self.le_ = LabelEncoder().fit(y)
self.classes_ = self.le_.classes_

transformed_y = self.le_.transform(y)
Member

I'm not certain why we need to do this.
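For reference, a quick illustration of what the LabelEncoder round-trip provides here:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(['spam', 'ham', 'spam'])
print(le.classes_)                    # ['ham' 'spam']
print(le.transform(['spam', 'ham']))  # [1 0]
print(le.inverse_transform([1, 0]))   # ['spam' 'ham']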

self.classes_ = self.le_.classes_

transformed_y = self.le_.transform(y)
if self.cv == 1:  # Do not cross-validate
Member

this won't work if cv is an array, though that is unlikely.

delayed(_parallel_fit)(clone(clf),
X, transformed_y, kwargs)
for _, clf in self.estimators)
scores = self._est_predict(X)
Member

I don't feel scores is the best name. Perhaps y_pred or y_score or predictions or Xt

self.meta_estimator_.fit(scores, transformed_y, **kwargs)
return self

def _form_meta_inputs(self, clf, predicted):
Member

call this clean_scores or clean_predictions or whatever?



def test_sample_weight():
"""Tests sample_weight parameter of StackingClassifier"""
Member

with nosetests, docstrings in test functions make test transcripts harder to read. make this a comment instead.



def test_classify_iris():
"""Check classification by majority label on dataset iris."""
Member

with nosetests, docstrings in test functions make test transcripts harder to read. make this a comment instead.



def test_predict_on_toy_problem():
"""Manually check predicted class labels for toy dataset."""
Member

with nosetests, docstrings in test functions make test transcripts harder to read. make this a comment instead.


y = np.array([1, 1, 1, 2, 2, 2])

assert_equal(all(clf1.fit(X, y).predict(X)), all([1, 1, 1, 2, 2, 2]))
Member

do you mean assert_array_equal? all will return a boolean, so you're asserting the equality of booleans here.

eclf1 = StackingClassifier(
estimators=[('lr', clf1), ('rf', clf2), ('nb', clf3)],
meta_estimator=clfm
).fit(X, y, sample_weight=np.ones((len(y),)))
Member

In the cv=1 case, at least, it should be possible to test sample_weight as corresponding to a repetition of elements.

@caioaao
Contributor

caioaao commented Mar 4, 2017

maybe I'm late here, but here's my two cents:
Stacking is already "hard": what it must do is help avoid data leakage during training. This implementation looks like it's doing a lot of stuff (I just skimmed through the code, but I saw it's even label-encoding things; I'm not sure it's a good idea to do this much here).
An idea: the first layer in the stacking can be seen as a transformer: it receives a feature set and outputs a new feature set. Each classifier being stacked is independent from the others in the same layer, so it looks like a perfect situation for the pipeline API. The real trick to turn a pipeline into a stacking classifier is the blending, and that's what's missing from sklearn. There's no need to implement a class to do what the pipeline API already does, just a class to blend a classifier and make it suitable for use as a transformer.
I've implemented those ideas here: https://gist.github.com/caioaao/28bf77e9a95ae6b70b14141feacb1f84
It doesn't have tests and it's probably lacking some asserts to make it more robust, so it's not useful for a PR in sklearn, but it may be useful for comparison purposes.
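A rough, self-contained sketch of that idea (not the gist's actual code; the wrapper name is made up): each base classifier is wrapped as a transformer whose fit_transform returns out-of-fold probabilities, the wrappers are combined with a FeatureUnion, and a Pipeline puts the meta-classifier on top.

from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import FeatureUnion, make_pipeline

class BlendedClassifierTransformer(BaseEstimator, TransformerMixin):
    # Hypothetical wrapper: exposes a classifier's predicted probabilities as
    # features so it can live inside a FeatureUnion / Pipeline.
    def __init__(self, estimator, cv=5):
        self.estimator = estimator
        self.cv = cv

    def fit(self, X, y):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def fit_transform(self, X, y, **fit_params):
        # Out-of-fold predictions on the training set avoid label leakage.
        oof = cross_val_predict(clone(self.estimator), X, y,
                                cv=self.cv, method='predict_proba')
        self.fit(X, y)
        return oof

    def transform(self, X):
        # At prediction time, use the estimator refitted on all training data.
        return self.estimator_.predict_proba(X)

X, y = load_iris(return_X_y=True)
layer = FeatureUnion([
    ('rf', BlendedClassifierTransformer(RandomForestClassifier(random_state=0))),
    ('lr', BlendedClassifierTransformer(LogisticRegression()))])
stack = make_pipeline(layer, LogisticRegression())
print(stack.fit(X, y).predict(X[:5]))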

@jnothman
Member

jnothman commented Mar 4, 2017 via email

@caioaao
Contributor

caioaao commented Mar 5, 2017

@jnothman I didn't know about cross_val_predict; that actually makes things simpler. I updated the gist to use it. There are, of course, some improvements that could be made (like being able to pass other parameters to cross_val_predict): https://gist.github.com/caioaao/28bf77e9a95ae6b70b14141feacb1f84

About requiring the user to construct a FeatureUnion: I like a functional approach better and I think the code is clearer when you compose things with functions instead of creating new classes, but nothing stops you from writing a class that just uses FeatureUnion under the hood (instead of using make_stack_layer), rather than doing an ad-hoc implementation that in the end provides basically the same functionality. I'm actually against that choice and think make_stacking_classifier(stacked_estimators, meta_classifier) would be cleaner, but that's a design preference that isn't really aligned with the rest of sklearn's API (an example is LassoCV, RidgeCV, etc.).

@jnothman
Member

jnothman commented Mar 5, 2017 via email

@caioaao
Contributor

caioaao commented Mar 6, 2017

LassoCV was just an example of a class that simply wraps a composition of two classes.
About convenience over functional purity: maybe what I said was misinterpreted. What I meant by function composition is that make_stack_classifier would call make_feature_union and make_pipeline and return the result. I don't see how clf = make_stack_classifier([RandomForest(), LinearSVC()], LogisticRegression()) would be less convenient than clf = StackClassifier([RandomForest(), LinearSVC()], LogisticRegression()), but then again, this is just a design preference and the former isn't as well aligned with the rest of sklearn's API :)
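A sketch of that composition (make_stack_classifier is the hypothetical name from this discussion; it assumes the base estimators are already wrapped as blending transformers, e.g. like the wrapper sketched earlier in the thread):

from sklearn.pipeline import make_pipeline, make_union

def make_stack_classifier(stacked_estimators, meta_classifier):
    # Purely compositional: the first layer is a union of the blended base
    # estimators, and the meta-classifier is the final step of a pipeline.
    return make_pipeline(make_union(*stacked_estimators), meta_classifier)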

@yl565
Contributor Author

yl565 commented Mar 8, 2017

@jnothman Since #7674 does not seem likely to be merged anytime soon, do you think it's better to remove "update with _BaseComposition after #7674 is merged" from the to-do list so we can proceed with this PR?

@jnothman
Member

jnothman commented Mar 8, 2017 via email

@GaelVaroquaux
Member

What's the status of this PR? I'll have some free time next week and was thinking of reviewing it.

@caioaao
Contributor

caioaao commented May 30, 2017

As this looks stale, I'd really like to have a shot at implementing it as I said before. If I can do it before the weekend, would you guys mind taking it into consideration before merging/choosing this one?

@yl565
Contributor Author

yl565 commented May 30, 2017

I have been busy with my thesis. I'll try working on it this week.
I still need to incorporate #7674 into this.
There is also the issue of whether or not to support multilabel and multi-output classification, which would make this PR more complicated.

@jnothman
Member

jnothman commented Jun 1, 2017 via email

@caioaao
Contributor

caioaao commented Jun 1, 2017

I implemented my comments in #8960. As I said there, it's ready to handle several types of estimators (not just classifiers) and the implementation is simpler.

@jnothman
Member

jnothman commented Jun 1, 2017 via email

@AlJohri

AlJohri commented Oct 3, 2017

hi @jnothman, I'm interested in helping out with this PR. I've manually implemented a stacked classifier for my personal project several times at this point but haven't come up with a solution that lets me do GridSearchCV with the stacked classifier.

would anyone mind summarizing what's left to get this PR merged? is the to-do list at the top of the PR still accurate?

EDIT: my mistake, I didn't see that #8960 was more up to date.

Labels
Superseded (PR has been replaced by a newer PR), Waiting for Reviewer