[MRG + 2] EnsembleClassifier implementation #4161
Conversation
I think EnsembleClassifier is probably a bit too general. Perhaps …
I agree, Joel. I like MajorityRuleClassifier; it pretty much summarizes the concept in one phrase.
Except that often you will be relying on the highest vote, not a strict majority.
Not that I've read the exact algorithm.
""" | ||
if self.weights: | ||
avg = self.predict_proba(X) | ||
maj = np.apply_along_axis(lambda x: max(enumerate(x), key=operator.itemgetter(1))[0], axis=1, arr=avg) |
Could you refactor this to two lines maybe? Or perhaps replace it with the below:
maj = np.apply_along_axis(lambda x: max(enumerate(x),
                                        key=operator.itemgetter(1))[0],
                          axis=1, arr=avg)
(PEP 8 says the maximum line length should be 79 characters.)
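For what it's worth, the whole lambda/itemgetter construction just computes a row-wise argmax, so (assuming `avg` is the usual (n_samples, n_classes) array coming back from predict_proba) it could be replaced with a single vectorized call. A sketch with toy numbers:

```python
import numpy as np

# toy stand-in for the averaged probabilities, shape (n_samples, n_classes)
avg = np.array([[0.2, 0.5, 0.3],
                [0.6, 0.1, 0.3]])

# same result as the apply_along_axis/enumerate/itemgetter version,
# but vectorized and comfortably under 79 characters
maj = np.argmax(avg, axis=1)
print(maj)  # -> [1 0]
```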
The algorithm you have implemented is actually called "soft voting" rather than "majority voting". In the former, the prediction is the argmax of the sums of predict_probas; in the latter, the prediction is the argmax of the sums of predictions. By the way, a more general algorithm is "stacking", which basically learns a new estimator from the predictions of others. Shouldn't we try to go in that direction instead, with majority voting/soft voting as special cases?
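To make the distinction concrete, here is a minimal NumPy sketch of the two rules on toy numbers (not the PR's code); note that they can disagree on the same inputs:

```python
import numpy as np

# predicted probabilities from three classifiers for one sample, two classes
probas = np.array([[0.1, 0.9],
                   [0.6, 0.4],
                   [0.6, 0.4]])
labels = np.argmax(probas, axis=1)  # per-classifier labels: [1, 0, 0]

# soft voting: argmax of the summed (optionally weighted) probabilities
soft = np.argmax(probas.sum(axis=0))   # sums [1.3, 1.7] -> class 1

# hard (majority) voting: argmax of the vote counts
hard = np.argmax(np.bincount(labels))  # counts [2, 1] -> class 0

print(soft, hard)  # -> 1 0
```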
+1
It seemed to me that stacking was less used in practice, but Owen (from Databricks, Kaggle #1) said he uses it quite a bit.
Hm, sounds like I implemented both at once then: if weights are provided, it is "soft voting" (based on probas), and without the weights it is "majority voting" (based on class labels). Do you think it would make sense to split the current implementation into a SoftVotingClassifier and a MajorityRuleClassifier for clarity?

About "stacking": I agree with @amueller that this could be implemented additionally later, e.g., by using the attributes from the MajorityRuleClassifier as input for a new classifier. On the other hand, would this need an implementation? Maybe an example would be sufficient that shows how the returned predictions/probas from the MajorityRule/SoftVoting classifier can be used as input for a new classifier.
@rasbt In stacking (see ESL), the probabilities of the first-stage classifiers are obtained as out-of-sample estimates using cross-validation, so it is a bit more complex.
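A toy sketch of that out-of-fold idea (two hand-rolled folds and a dummy first-stage "classifier"; this is only an illustration, not ESL's exact procedure): each sample's meta-feature comes from a model that never saw that sample.

```python
import numpy as np

y = np.array([0, 0, 1, 1, 1, 0, 1, 1])
X = np.arange(len(y)).reshape(-1, 1)   # dummy features

def toy_fit_predict(X_tr, y_tr, X_te):
    # stand-in for a real first-stage classifier: always predicts the
    # majority class of its training fold
    return np.full(len(X_te), np.argmax(np.bincount(y_tr)))

# out-of-fold predictions: fold A is predicted by a model fitted on
# fold B and vice versa, so the meta-features are unbiased
half = len(y) // 2
meta = np.empty(len(y), dtype=int)
meta[:half] = toy_fit_predict(X[half:], y[half:], X[:half])
meta[half:] = toy_fit_predict(X[:half], y[:half], X[half:])

# a second-stage estimator would now be fit on meta (optionally
# stacked alongside the original features) against y
print(meta)  # -> [1 1 1 1 0 0 0 0]
```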
I wouldn't split this up into two classes, but I would add a string option for choosing majority or soft voting. Maybe call it …
With your current interface it would also be possible to just add a …
@amueller Good idea! I added the … About the …
self.voting = voting

if weights and len(weights) != len(clfs):
    raise ValueError('Number of classifiers and weights must be equal')
maybe also give the numbers here.
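Something along these lines, i.e. interpolating both lengths into the message (a sketch, not the committed wording):

```python
def check_weights(clfs, weights):
    """Raise a ValueError that reports the mismatching lengths."""
    if weights and len(weights) != len(clfs):
        raise ValueError('Number of classifiers and weights must be '
                         'equal; got %d weights, %d clfs'
                         % (len(weights), len(clfs)))

try:
    check_weights(['clf_a', 'clf_b', 'clf_c'], [1, 2])
except ValueError as exc:
    print(exc)  # -> ... got 2 weights, 3 clfs
```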
I don't see where the weights are used in the prediction currently. If you let the user specify both the weights and the voting, then you need to make sure that the user doesn't shoot herself in the foot easily. So if … I don't think you need to add an additional keyword … Something that might be tricky is that …
Parameters
----------
clfs : array-like, shape = [n_classifiers]
    A list of scikit-learn classifier objects.
Same comment about redundancy.
You should add the class in …
@amueller Thanks for the comments, I think/hope I've addressed all of your points except for the "predict_proba should be consistent" part.
The weights are being used now for the "hard" voting in the most recent update. I didn't implement it earlier because I really wouldn't want to use something like this but prefer the weighted avg. probabilities instead. But you are right about the "shooting in the foot" ;)
Here, the challenge was that not all classifiers have the …
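For reference, the weighted hard vote can lean on np.bincount's weights argument, so it stays a one-liner per sample. A toy sketch (assumed shapes, not the PR's exact code):

```python
import numpy as np

# class labels predicted by three classifiers for a single sample
votes = np.array([1, 0, 0])
weights = np.array([3.0, 1.0, 1.0])   # first classifier counts triple

unweighted = np.argmax(np.bincount(votes))                 # [2, 1] -> 0
weighted = np.argmax(np.bincount(votes, weights=weights))  # [2, 3] -> 1

print(unweighted, weighted)  # -> 0 1
```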
Some tests and doctests are missing imports; see Travis.
Have you checked what happens in the multi-label case? If you don't want to implement that for the moment, please make sure an appropriate error message is raised.
Returns
-------
If not `weights=None`:
this is not correct any more, right?
Also, I don't think this is valid numpy doc format ;)
Lastly, I think you should flatten the array to be (n_samples, *) and not have ndim=3, so that it can be used directly in a pipeline.
Your implementation currently assumes …
""" | ||
def __init__(self, clfs, voting='hard', weights=None): | ||
self.clfs = clfs |
Use estimators instead of clfs for consistency with the rest of the scikit.
I would rather use the same technique as for SVC … I actually don't agree with @larsmans's inline comment: introducing two estimators just for that sounds kind of silly.
I still think that's the way to keep the codebase readable and maintainable. The … I do prefer it to a …
Okay, so we ignore the … part for now? Instead of …
Seems legit.
Yeah, I don't think we need a probability parameter, and I agree with using the …
Have to admit that I don't really see through this "property" trick. How would the …
The code inside the @property should raise an AttributeError. A property triggers a function evaluation, but looks like an attribute access. So trying to access the …
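A minimal sketch of the trick (hypothetical class and names, not the PR's code): because hasattr swallows the AttributeError, a hard-voting instance appears not to have predict_proba at all.

```python
class VotingDemo:
    def __init__(self, voting='hard'):
        self.voting = voting

    def _predict_proba(self, X):
        # placeholder for the real averaged-probability computation
        return [[0.5, 0.5] for _ in X]

    @property
    def predict_proba(self):
        # accessing the attribute runs this code; raising here makes
        # the attribute look absent when it cannot be meaningful
        if self.voting == 'hard':
            raise AttributeError("predict_proba is not available when"
                                 " voting='hard'")
        return self._predict_proba

print(hasattr(VotingDemo('hard'), 'predict_proba'))  # -> False
print(hasattr(VotingDemo('soft'), 'predict_proba'))  # -> True
```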
Thanks. I hope the most recent commit implements what you were looking for.
Before this pull request collects dust again, I was wondering if everything seems to be fine now!?
+1. Thanks for the ping.
avg : array-like, shape = [n_samples, n_classes]
    Weighted average probability for each class per sample.
"""
self._check_proba()
You could inline this (the function is only called once), but... well, however you like.
Hm, yes, I also initially thought that this could be a little bit of overkill ... but maybe it's slightly better this way in case there will be a more general check_proba function in some distant future.
cough but really I don't have a strong opinion ;)
Okay, okay, that's strong enough :)
I would have some cosmetic changes to do, but nothing significant. +1 for merge before it stalls.
thanks @rasbt!
@glouppe what changes? Do you want to open an issue or open a PR?
Just saw that it finally got merged! :) Thank you all for the great help during the implementation. Since I am a big fan and frequent user of scikit-learn, it's really nice to contribute, and I really learned a few great lessons here!
Thanks for your contribution! @amueller I did make some changes already in a following commit. Only …
@rasbt thanks for the contribution and your patience!
It looks to me like the intro to ensembles (top of http://scikit-learn.org/dev/modules/ensemble.html) should be updated.
@jnothman I agree. Although the VotingClassifier approach is different from bagging, I'd still put the former into the "averaging methods" category.
This is an implementation of an estimator to combine scikit-learn classification estimators into an ensemble classifier for majority rule class label prediction or classification based on weighted probability averages.
I added a section at the bottom of the ensemble.rst documentation with a more detailed description and examples.
I'd be happy to make improvements or adjustments if necessary. Please let me know!
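For readers arriving from the merged code: the estimator shipped as VotingClassifier in sklearn.ensemble. A hedged usage sketch (toy data; the estimator choices and weights are illustrative only):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

eclf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(random_state=0))],
    voting='soft',      # argmax of the weighted average of probas
    weights=[2, 1])     # LogisticRegression counts double
eclf.fit(X, y)

print(eclf.predict(X[:3]))  # first three iris samples are class 0
```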