MRG: AdaBoost for regression and multi-class classification #522
Conversation
Went through the diff. Looks good.
sum(weight * clf.feature_importances_ for \
    weight, clf in zip(self.boost_weights_, self.estimators_)) \
    / norm
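In other words, the property is a boost-weight-weighted average of the per-estimator importances. A tiny worked sketch with made-up numbers, assuming norm is the sum of the boost weights (which is how I read the snippet):

import numpy as np

# Two fitted estimators over 3 features, with made-up importances and weights.
feature_importances = [np.array([0.5, 0.3, 0.2]),
                       np.array([0.1, 0.6, 0.3])]
boost_weights = [2.0, 1.0]

norm = sum(boost_weights)
ensemble_importances = sum(w * imp for w, imp in
                           zip(boost_weights, feature_importances)) / norm
# -> array([ 0.36666667,  0.4       ,  0.23333333])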
Should this return self here to be consistent with the docstring?
Actually I should fix the docstring. fit returns self as it should, but fit_generator is a generator which yields self.
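For context, a minimal sketch of the pattern being described (the class and attribute names here are illustrative, not the PR's actual implementation):

class BoostingSketch(object):
    """Illustrative only: fit() returns self, fit_generator() yields self."""

    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators

    def fit_generator(self, X, y):
        # Yield self after each boosting iteration so callers can inspect
        # the partially built ensemble (e.g. for early stopping).
        self.estimators_ = []
        for i in range(self.n_estimators):
            # ... fit the i-th weak learner and update sample weights here ...
            self.estimators_.append(object())  # placeholder for a fitted estimator
            yield self

    def fit(self, X, y):
        # Exhaust the generator and return self, as the docstring should state.
        for _ in self.fit_generator(X, y):
            pass
        return self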
I like the examples and documentation. Nice work!
@ndawe Thanks for the contribution - sample weights are a crucial feature of the tree module. Again, thanks for working on this feature - I really appreciate that!
This is great indeed! I will definitely take time as well to review your code. Is it still a work in progress ("WIP") or do you consider your code ready to be merged ("MRG")? (In any case, would you rename your PR with a "WIP" or "MRG" prefix? Thanks!)
else:
    y.append(3)

y = np.array(y)
Is this dataset a standard dataset used in the literature? If so, I think it would be a good idea to move it to the sklearn.datasets.samples_generator package and add a link to the main reference paper in the docstring.
This dataset is used in the paper introducing the SAMME [1] modification of AdaBoost, but I don't know how standard it is in the literature. Being a simple dataset that easily accommodates any number of classes, it might be a good one to have in sklearn.datasets.samples_generator.
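For reference, a hedged sketch of how such a generator might look, following the construction described in the SAMME paper [2] (standard-normal features, classes defined by nested chi-squared quantile shells of the squared norm); the function name and exact API here are just a suggestion:

import numpy as np
from scipy.stats import chi2


def make_gaussian_quantiles_sketch(n_samples=1000, n_features=10,
                                   n_classes=3, random_state=None):
    """Toy multi-class dataset: samples drawn from a standard multivariate
    normal and labelled by which chi-squared quantile shell their squared
    norm falls into, giving roughly balanced, concentrically nested classes."""
    rng = np.random.RandomState(random_state)
    X = rng.randn(n_samples, n_features)
    # Interior quantile boundaries of the chi2(n_features) distribution.
    cuts = chi2.ppf(np.linspace(0, 1, n_classes + 1)[1:-1], df=n_features)
    y = np.searchsorted(cuts, (X ** 2).sum(axis=1))
    return X, y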
+1 then
When I run the random forest tests on this branch, I get the following crash in the joblib.Parallel implementation of the random forest, with 2 backtraces:
So it seems that the modification to the …
Test coverage of the boost module could probably be improved a bit:
By reading the missing lines above, I think all of them should be covered by the test suites (including testing for expected exceptions using …).
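As an example of what an expected-exception test could look like (a sketch only: the class name and import path are taken from present-day scikit-learn, and the particular invalid parameter is an assumption rather than something from the actual test suite):

import numpy as np
from nose.tools import assert_raises

from sklearn.ensemble import AdaBoostClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 1, 0, 1])

# A hypothetical check: an obviously invalid parameter should raise
# ValueError at fit time.
clf = AdaBoostClassifier(n_estimators=-1)
assert_raises(ValueError, clf.fit, X, y)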
pl.plot(xrange(10), tree.feature_importances_[indices], "r")
pl.plot(xrange(10), importances[indices], "b")
pl.show()
I think that this example should be combined with the existing plot_forest_importances.py rather than introducing a new example. You can use the multi-plot support of the sphinx / examples integration: calling plot on several figures will generate a gallery of plots that are all inlined in the documentation.
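A quick sketch of that multi-figure pattern, using pylab as the existing examples do (the importance arrays below are placeholders):

import numpy as np
import pylab as pl

indices = np.arange(10)
forest_importances = np.random.rand(10)    # placeholder data
boosting_importances = np.random.rand(10)  # placeholder data

# Each pl.figure() starts a new figure; the sphinx examples gallery
# renders every figure produced by the script as its own inlined plot.
pl.figure()
pl.title("Forest feature importances")
pl.plot(indices, forest_importances, "b")

pl.figure()
pl.title("Boosted trees feature importances")
pl.plot(indices, boosting_importances, "r")

pl.show()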
I've removed the boosting example here since it really isn't that impressive. AdaBoost typically yields ~2 boosts on this dataset and the resulting plot isn't very interesting.
I think this PR should be marked as WIP as it lacks some narrative documentation giving an introduction to the concept of boosting and the AdaBoost algorithm in particular, with links to the reference papers and examples.
Also, the "SAMME" modification for multiclass problems should be introduced in the narrative documentation as well, and if SAMME is an acronym it should be expanded at least once to increase googlability (and for readability).
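For the record, SAMME stands for Stagewise Additive Modeling using a Multi-class Exponential loss function [2]. If it helps the narrative docs: its only change to the discrete AdaBoost estimator weight is the extra log(K - 1) term for K classes,

\alpha^{(m)} = \log\frac{1 - \mathrm{err}^{(m)}}{\mathrm{err}^{(m)}} + \log(K - 1)

which reduces to standard AdaBoost when K = 2.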
As others said, thank you @ndawe for this very interesting contribution. As a side note, I really appreciate the fact that this PR respects the scikit API, variable naming and coding style conventions and that it's pep8 and pyflakes clean :)
# TODO request that classifiers return classification
# of training sets when fitting
# which would make the following line unnecessary
p = estimator.predict(X)
You could test something like:

if hasattr(estimator, 'fit_predict'):
    # optim for estimators that are able to save redundant computations
    # when calling fit + predict on the same input X
    p = estimator.fit_predict(X, y, sample_weight=sample_weight, **kwargs)
else:
    p = estimator.fit(X, y, sample_weight=sample_weight, **kwargs).predict(X)
Thanks, I've added this check. Will also consider adding fit_predict to DecisionTreeClassifier.
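A rough, purely hypothetical sketch of that addition (DecisionTreeClassifier does not have fit_predict; a real version would return the training-set predictions cached while building the tree rather than calling predict again):

from sklearn.tree import DecisionTreeClassifier


class DecisionTreeClassifierWithFitPredict(DecisionTreeClassifier):
    """Hypothetical subclass sketching the fit_predict interface."""

    def fit_predict(self, X, y, sample_weight=None):
        # Naive fallback: fit, then predict on the same X. A proper
        # implementation would avoid this second pass over the tree.
        self.fit(X, y, sample_weight=sample_weight)
        return self.predict(X)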
I confirm that the crash is not present in the current master, so the issue is somehow related to the introduction of …
Line 211 in …
@ndawe indeed, running a classifier on a regression task is probably a bad idea :) and it sounds like the cause of the issue. I wonder if this crash should still be considered a bug or not. @glouppe, @pprett any opinion? I will fix the forest test in master as it is a bug to test a classifier on a regression dataset anyway.
Done. You can either merge master into your branch or rebase.
BTW @ndawe don't you think it would make sense to be able to pass multiple, unrelated base learner instances instead of a single instance (e.g. an SVM with Gaussian kernels and a decision tree)?
@ogrisel you mean boosting an ensemble? Boosting a random forest should already be possible, but we could introduce additional classes …
I don't think we need to introduce a new Ensemble class, I was just thinking of replacing the …
@ogrisel This is not standard and, imho, I don't think we should overcomplicate the interface to allow that special case. At least for now.
@ogrisel I haven't seen boosting used in that way, but that's not to say it hasn't been done (at least I am not aware of it) or couldn't produce nice results. I agree with @glouppe that we should focus on getting standard boosting solid for now. Regarding my previous comment, I do feel generic …
Alright, if this is not the traditional way to use boosting then I am happy not having it implemented in the default AdaBoost class of scikit-learn. However, I thought that boosting was often used by the winning teams of machine learning challenges such as the Netflix prize so as to combine unrelated classifiers (e.g. RBMs/DBNs + random forests + kernel SVMs + whatever) into a compound classifier better than any of them independently.
Also, about the …
@ogrisel I haven't read the whole thread, but I'm not aware of any boosting approach that uses a heterogeneous ensemble (= different classes of base learners). The winners of the Netflix prize (BellKor’s Pragmatic Chaos) you are referring to used a stacking approach: they used the predictions of a large number of level-0 models (RBM, kNN, SVD, ...) to create a new feature representation and then trained a level-1 model on these features to generate the final prediction. BTW: they used Gradient Tree Boosting ;-)
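To illustrate the distinction, a bare-bones stacking sketch (the particular level-0 and level-1 models here are arbitrary choices; real setups use cross-validated level-0 predictions to avoid leakage):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, random_state=0)
X_l0, X_l1 = X[:400], X[400:]
y_l0, y_l1 = y[:400], y[400:]

# Level 0: heterogeneous models fit on one split...
level0 = [RandomForestClassifier(random_state=0), SVC(probability=True)]
for model in level0:
    model.fit(X_l0, y_l0)

# ...whose predicted probabilities on the other split become new features.
Z = np.column_stack([m.predict_proba(X_l1)[:, 1] for m in level0])

# Level 1: a simple model trained on the level-0 predictions.
stacker = LogisticRegression().fit(Z, y_l1)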
Disclaimer: This is not related to boosting. @glouppe @bdholt1 @ndawe @amueller We recently discussed memory issues of …
FIX: some of Gael comments
@GaelVaroquaux I think we addressed all of your comments. However, I still don't see the commits you talked about at https://github.com/GaelVaroquaux/scikit-learn/commits/treeweights. Am I missing something?
Sorry, I had only one. I just committed and pushed all the local changes.
+1 on that! It's a nice feature-set landing in the scikit.
Thanks @glouppe for your help and everyone else for the reviews! @GaelVaroquaux's commits have been cherry-picked. I will add AdaBoost to the What's New file.
uhh... when I see the poor performance of GBRT I feel tempted to revert my +1 on this PR ;-) PS: +1 for doing that in a separate PR - I probably need another year to tune the hell out of GBRT for this example...
@amueller or @GaelVaroquaux Feel free to do your rebase-fu :)
Will do in half an hour. Thanks a lot for the great work and the patience!
I am on it actually. I just started.
OK, I am getting nasty conflicts in grid_search.py (which should never …
Feel free. The rebase trouble is because this is the branch of the original PR and stuff got reverted, not rebased...
Yup! Messy history yields more messy history.
Merging, running the tests and pushing.
Merged and pushed. 🍻
Phew! Yay!! Thanks everyone! (PS: I'll edit the what's new file myself.)
Congratz !!! :-)
Awesome job, guys!
Wow... Just went out for lunch and all this happened. Thanks for merging, and sorry for the messy history... 🍻
Great - thanks a lot guys!
@pprett I just used default settings for gradient boosting (both boosting stumps though). So maybe it's not a fair comparison. Also, the difference could simply be a difference in the meaning of a given …
Sorry to ask this after the merge, but what was the motivation for naming the module …
Yes, I would also like to implement other weight boosting algorithms such as BrownBoost, LogitBoost, GentleBoost, etc.
This PR adds:
Examples are provided:
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.314
[2] http://www.stanford.edu/~hastie/Papers/samme.pdf