
MRG: AdaBoost for regression and multi-class classification #522


Merged: 5 commits into scikit-learn:master on Feb 4, 2013

Conversation

@ndawe (Member) commented Jan 3, 2012

This PR adds:

  • a new ensemble.weight_boosting module with AdaBoostRegressor (using AdaBoost.R2 [1]) and AdaBoostClassifier (using the multi-class SAMME algorithm [2])
  • a new "Gaussian quantiles" dataset in datasets.samples_generator as used in [2]

Examples are provided:

  • hastie
  • twoclass
  • multiclass
  • regression

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.314
[2] http://www.stanford.edu/~hastie/Papers/samme.pdf
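
For a quick feel of the API, a minimal usage sketch (assuming the estimator and generator names as they land in sklearn; exact signatures in this branch may differ slightly):

from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# boost decision stumps with the multi-class SAMME algorithm
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3,
                               random_state=0)
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100)
clf.fit(X, y)
print(clf.score(X, y))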

@satra (Member) commented Jan 3, 2012

Went through the diff. Looks good.

sum(weight * clf.feature_importances_
    for weight, clf in zip(self.boost_weights_, self.estimators_)) / norm

Member: Should this return self here, to be consistent with the docstring?

@ndawe (Member Author): Actually, I should fix the docstring: fit returns self as it should, but fit_generator is a generator that yields self.
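
Roughly, the pattern in question (a minimal hypothetical sketch, not the actual code in this branch):

class Booster(object):
    # toy illustration of the fit / fit_generator split; a real
    # implementation would train a weak learner per iteration
    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators
        self.n_boosts_ = 0

    def fit_generator(self, X, y):
        # yields self after each boosting iteration so callers can
        # inspect the partially fitted ensemble
        for _ in range(self.n_estimators):
            self.n_boosts_ += 1  # stands in for one boosting round
            yield self

    def fit(self, X, y):
        # exhausts the generator and returns self, per convention
        for _ in self.fit_generator(X, y):
            pass
        return self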

@jakevdp (Member) commented Jan 3, 2012

I like the examples and documentation. Nice work!

@pprett (Member) commented Jan 3, 2012

@ndawe Thanks for the contribution - sample weights are a crucial feature of the tree module.
I'm going to make a thorough review in the coming days - I just skimmed over the code and noticed one issue: when the sample mask is too sparse (below min_density) the data structures get updated (= fancy indexed via the sample_mask - see line 322 in tree.py). This should also apply to sample_weight, because otherwise the weights and labels will get out of sync.
While we check the code I'd propose to add assert statements liberally to check that the invariants hold (e.g. sample_weight.shape[0] == y.shape[0]).
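
To illustrate the invariant, a standalone sketch (min_density and the fancy-indexing step are paraphrased from tree.py, not copied):

import numpy as np

rng = np.random.RandomState(0)
n_samples, min_density = 100, 0.1
X = rng.rand(n_samples, 4)
y = rng.randint(2, size=n_samples)
sample_weight = np.ones(n_samples)
sample_mask = rng.rand(n_samples) < 0.05  # a very sparse mask

# when the mask density drops below min_density, the tree code
# fancy-indexes X and y down to the masked samples; sample_weight
# must follow, or weights and labels go out of sync
if sample_mask.mean() < min_density:
    X = X[sample_mask]
    y = y[sample_mask]
    sample_weight = sample_weight[sample_mask]

# the invariant to assert throughout the tree code
assert sample_weight.shape[0] == y.shape[0]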

Again, thanks for working on this feature - I really appreciate that!

@glouppe (Contributor) commented Jan 3, 2012

This is great indeed! I will definitely take time as well to review your code.

Is this still a work in progress ("WIP") or do you consider your code ready to be merged ("MRG")? (In any case, would you rename your PR with a "WIP" or "MRG" prefix? Thanks!)

else:
    y.append(3)

y = np.array(y)
Member: Is this dataset a standard dataset used in the literature? If so, I think it would be a good idea to move it to the sklearn.datasets.samples_generator package and add a link to the main reference paper in the docstring.

@ndawe (Member Author): This dataset is used in the paper introducing the SAMME [1] modification of AdaBoost, but I don't know how standard it is in the literature. Being a simple dataset that easily accommodates any number of classes, it might be a good one to have in sklearn.datasets.samples_generator.

[1] www.stanford.edu/~hastie/Papers/samme.pdf
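
For reference, a sketch of the construction (assuming classes separated by quantiles of the squared distance from the origin, as in the paper; the eventual generator in sklearn.datasets may differ in details):

import numpy as np

def make_gaussian_quantiles_sketch(n_samples=1000, n_features=2,
                                   n_classes=3, random_state=None):
    # draw a standard multivariate Gaussian and label each sample by
    # which quantile of the squared distance from the origin it falls in
    rng = np.random.RandomState(random_state)
    X = rng.randn(n_samples, n_features)
    d = np.sum(X ** 2, axis=1)
    cuts = np.percentile(d, np.linspace(0, 100, n_classes + 1))
    y = np.searchsorted(cuts[1:-1], d)
    return X, y

X, y = make_gaussian_quantiles_sketch(n_samples=300, n_classes=3,
                                      random_state=0)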

Member: +1 then

@ogrisel (Member) commented Jan 3, 2012

When I run the tests of the random forest on this branch I get the following crash in the joblib.Parallel implementation of the random forest with 2 backtraces:

nosetests -v sklearn/ensemble/tests/test_forest.py
Check classification on a toy dataset. ... ok
Check consistency on dataset iris. ... ok
Check consistency on dataset boston house prices. ... ok
Predict probabilities. ... Warning: divide by zero encountered in log
Warning: divide by zero encountered in log
ok
Check variable importances. ... ok
Check that base trees can be grid-searched. ... ok
Check parallel computations. ...
 *** glibc detected *** /usr/bin/python: malloc(): smallbin double linked list corrupted: 0x000000000418be30 ***
======= Backtrace: =========
[...] the first back trace has been redacted out as not scikit-learn specific
*** glibc detected *** /usr/bin/python: corrupted double-linked list: 0x0000000004108810 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76bb6)[0x7f7d3a387bb6]
/lib/x86_64-linux-gnu/libc.so.6(+0x7a931)[0x7f7d3a38b931]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x6e)[0x7f7d3a38d31e]
/usr/lib/pymodules/python2.7/numpy/core/multiarray.so(+0x2bc0a)[0x7f7d37819c0a]
/home/ogrisel/coding/scikit-learn/sklearn/tree/_tree.so(+0x7b27)[0x7f7d2f284b27]
[...]

So it seems that the modification to the _tree.pyx compiled extensions triggers a multiprocessing bug. Can anybody else reproduce this?

@ogrisel (Member) commented Jan 3, 2012

Test coverage of the boost module could probably be improved a bit:

nosetests --cover-package=sklearn.ensemble.boost --with-coverage sklearn/ensemble/tests/test_boost.py
..Warning: divide by zero encountered in log
...
Name                     Stmts   Miss  Cover   Missing
------------------------------------------------------
sklearn.ensemble.boost      81     12    85%   74, 81, 91, 150, 164, 179-180, 197-198, 247-249
----------------------------------------------------------------------
Ran 5 tests in 0.093s

OK

Reading the missing lines above, I think all of them should be covered by the test suite (including testing for expected exceptions using assert_raises).
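
For instance, covering the parameter-validation branches could look like this (hypothetical test; the actual error conditions in boost.py may differ):

import numpy as np
from nose.tools import assert_raises
from sklearn.ensemble import AdaBoostClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# an invalid n_estimators should raise ValueError at fit time
assert_raises(ValueError, AdaBoostClassifier(n_estimators=-1).fit, X, y)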

pl.plot(xrange(10), tree.feature_importances_[indices], "r")

pl.plot(xrange(10), importances[indices], "b")
pl.show()
Member: I think that this example should be combined with the existing plot_forest_importances.py rather than introducing a new example. You can use the multi-plot support of the sphinx / examples integration: calling plot on several figures will generate a gallery of plots that are all inlined in the documentation.

@ndawe (Member Author): I've removed the boosting example here since it really isn't that impressive. AdaBoost typically stops after ~2 boosting iterations on this dataset and the resulting plot isn't very interesting.

@ogrisel (Member) commented Jan 3, 2012

I think this PR should be marked as WIP as it lacks narrative documentation giving an introduction to the concept of boosting and the AdaBoost algorithm in particular, with links to the reference papers and examples.

@ogrisel (Member) commented Jan 3, 2012

Also the "SAMME" modification for multiclass problems should be introduced in the narrative documentation as well and if SAMME is an acronym it should be expanded at least once to increase googlability (and for readability).

@ogrisel (Member) commented Jan 3, 2012

As others said, thank you @ndawe for this very interesting contribution. As a side note, I really appreciate the fact that this PR respects the scikit API, variable naming and coding style conventions, and that it's pep8 and pyflakes clean :)

# TODO request that classifiers return classification
# of training sets when fitting
# which would make the following line unnecessary
p = estimator.predict(X)
Member: You could test something like:

if hasattr(estimator, 'fit_predict'):
    # optim for estimators that are able to save redundant computations
    # when calling fit + predict on the same input X
    p = estimator.fit_predict(X, y, sample_weight=sample_weight, **kwargs)
else:
    p = estimator.fit(X, y, sample_weight=sample_weight, **kwargs).predict(X)

@ndawe (Member Author): Thanks, I've added this check. Will also consider adding fit_predict to DecisionTreeClassifier.

@ndawe (Member Author) commented Jan 3, 2012

@ogrisel I can reproduce the same crash. I rebased my branch on top of recent changes from @glouppe and regenerated/compiled _tree.c and still get the crash. Will investigate.

@ogrisel (Member) commented Jan 4, 2012

I confirm that the crash is not present in the current master so the issue is somehow related to the introduction of sample_weight in _tree.pyx. Valgrind or gdb are probably needed to understand what's happening.

@ndawe (Member Author) commented Jan 4, 2012

Line 211 in test_forest.py uses a RandomForestClassifier (and again on line 225) instead of a RandomForestRegressor on the Boston dataset. np.unique then gives 229 classes... When changed to RandomForestRegressor, assert_array_equal(y1, y2) on line 222 fails due to differences on the order of 1e-15, but it doesn't crash.

@ogrisel (Member) commented Jan 4, 2012

@ndawe indeed, running a classifier on a regression task is probably a bad idea :) and it sounds like the cause of the issue. I wonder if this crash should still be considered a bug or not. @glouppe, @pprett any opinion?

I will fix the forest test in master as it is a bug to test a classifier on a regression dataset anyway.

@ogrisel (Member) commented Jan 4, 2012

Done. You can either merge master into your branch or rebase.

@ogrisel (Member) commented Jan 4, 2012

BTW @ndawe don't you think it would make sense to be able to pass multiple, unrelated base learner instances instead of a single instance (e.g. an SVM with gaussian kernels and a decision tree)?

@ndawe (Member Author) commented Jan 4, 2012

@ogrisel you mean boosting an ensemble? Boosting a random forest should already be possible but we could introduce additional classes ClassifierEnsemble and RegressorEnsemble which could hold an arbitrary list of classifiers/regressors (no use of base_estimator or _make_estimator from BaseEnsemble). Prediction would be the (possibly weighted) prediction from each contained classifier/regressor and the entire ensemble could be boosted.
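
Something like the following, perhaps (a hypothetical sketch; it assumes all members expose predict_proba and that the classes are the integers 0..k-1):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    # hypothetical container for an arbitrary list of classifiers;
    # prediction averages the members' class probabilities
    def __init__(self, classifiers):
        self.classifiers = classifiers

    def fit(self, X, y):
        for clf in self.classifiers:
            clf.fit(X, y)
        return self

    def predict_proba(self, X):
        # unweighted average of the members' probability estimates
        return np.mean([clf.predict_proba(X)
                        for clf in self.classifiers], axis=0)

    def predict(self, X):
        # assumes integer classes 0..k-1 for simplicity
        return np.argmax(self.predict_proba(X), axis=1)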

@ogrisel (Member) commented Jan 4, 2012

I don't think we need to introduce a new Ensemble class. I was just thinking of replacing the base_estimator constructor param, which expects a single instance, by a base_estimators param that expects a list of classifiers, and changing the _make_estimator method to pick one of them sequentially (or randomly) whenever it is called.
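
Something along these lines (a sketch; the real _make_estimator does more bookkeeping):

import itertools
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# hypothetical: cycle through several base estimators instead of
# cloning a single base_estimator each time _make_estimator is called
base_estimators = itertools.cycle([DecisionTreeClassifier(max_depth=1),
                                   SVC(kernel='rbf')])
estimators_ = [clone(next(base_estimators)) for _ in range(4)]
# -> tree, svm, tree, svm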

@mblondel (Member) commented Jan 4, 2012

base_estimator could accept either a single estimator instance or a list of them (in which case n_estimators would be ignored).

@glouppe (Contributor) commented Jan 4, 2012

@ogrisel This is not standard and, imho, I don't think we should overcomplicate the interface to allow that special case. At least for now.

@ndawe (Member Author) commented Jan 4, 2012

@ogrisel I haven't seen boosting used in that way, but that's not to say it hasn't been done or couldn't produce nice results. I agree with @glouppe that for now we should focus on getting the standard boosting solid. Regarding my previous comment, I do feel generic EnsembleClassifier and EnsembleRegressor classes would be convenient to have. Then you could group together arbitrary classifiers/regressors and boost them, etc. BoostedClassifier only requires that the base_estimator inherits from ClassifierMixin and is completely oblivious to whether it's boosting one classifier or a classifier that is a conglomerate of several. I would then suggest that BaseEnsemble be stripped down to the bare essentials (basically the API of a list) and that _make_estimator, base_estimator, and n_estimators be moved down to a special EnsembleGenerator class used by random forests and boosting.

@ogrisel (Member) commented Jan 5, 2012

Alright, if this is not the traditional way to use boosting then I am happy not having it implemented in the default AdaBoost class of scikit-learn. However, I thought that boosting was often used by the winning teams of machine learning challenges such as the Netflix prize to combine unrelated classifiers (e.g. RBMs/DBNs + random forests + kernel SVMs + whatever) into a compound classifier better than any of them independently.

@ogrisel (Member) commented Jan 5, 2012

Also, about the EnsembleClassifier / EnsembleRegressor / EnsembleGenerator design: it sounds a bit too nested to me, especially if it does not provide more API than the current, flatter BaseEnsemble base class does. However, the best way to discuss this would be by commenting on an actual working proof-of-concept implementation. Let's postpone this discussion to another pull request and finish the review of this one first, as I think it's soon mergeable (as soon as the ensemble chapter in the narrative doc is extended to introduce boosting).

@pprett (Member) commented Jan 5, 2012

@ogrisel I haven't read the whole thread, but I'm not aware of any boosting approach that uses a heterogeneous ensemble (= different classes of base learners). The winners of the Netflix prize (BellKor's Pragmatic Chaos) you are referring to used a stacking approach: they used the predictions of a large number of level-0 models (RBM, kNN, SVD, ...) to create a new feature representation and then trained a level-1 model on these features to generate the final prediction.

BTW: they used Gradient Tree Boosting ;-)
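
In toy form, the stacking idea looks like this (a sketch only; real stacking uses cross-validated level-0 predictions to avoid leakage):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# level-0 model predictions become a new feature representation...
level0 = [RandomForestClassifier(random_state=0), SVC(probability=True)]
Z = np.column_stack([m.fit(X, y).predict_proba(X)[:, 1] for m in level0])

# ...on which a level-1 model produces the final prediction
level1 = LogisticRegression().fit(Z, y)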

@pprett (Member) commented Jan 5, 2012

Disclaimer: This is not related to boosting.

@glouppe @bdholt1 @ndawe @amueller We recently discussed memory issues of RandomForest on the mailing list which might be due to sampling with replacement - I wonder if we could implement memory-efficient sampling with replacement via sample weights? What do you think? Is it worth the effort, given that we will need the copy as a fallback if the base estimator does not support sample weights...
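
Concretely, the idea would be something like this (a sketch):

import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000

# instead of materializing a bootstrap copy of X, draw indices with
# replacement and encode the resample as per-sample counts passed
# through sample_weight
indices = rng.randint(0, n_samples, n_samples)
sample_weight = np.bincount(indices, minlength=n_samples).astype(np.float64)

# weights sum to n_samples, and zero-weight rows are simply ignored
assert sample_weight.sum() == n_samples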

@glouppe (Contributor) commented Feb 3, 2013

@GaelVaroquaux I think we addressed all of your comments. However, I still don't see the commits you talked about at https://github.com/GaelVaroquaux/scikit-learn/commits/treeweights. Am I missing something?

@GaelVaroquaux (Member):

> @GaelVaroquaux I think we addressed all of your comments. However, I still don't see the commits you talked about at https://github.com/GaelVaroquaux/scikit-learn/commits/treeweights. Am I missing something?

Sorry, I had only one. I just committed and pushed all the local changes that I had on the disk.

@glouppe (Contributor) commented Feb 4, 2013

Okay then, @ndawe could you cherry-pick d8d434e and 2a6523a? Also, I think it is time to edit the What's New file. I'll let you do it :)

It looks like we will finally be done with this PR. I must say that I am more than happy with the result! Good job @ndawe :)

@GaelVaroquaux (Member):

> It looks like we will finally be done with this PR. I must say that I am more than happy with the result!

+1 on that! It's a nice feature-set landing in the scikit.

@ndawe (Member Author) commented Feb 4, 2013

Thanks @glouppe for your help and everyone else for the reviews! @GaelVaroquaux's commits have been cherry-picked. I will add AdaBoost to the What's New file.

@ndawe (Member Author) commented Feb 4, 2013

I will submit a separate PR later comparing gradient boosting and AdaBoost, if that's OK. I had a quick first look:

[two comparison plots: AdaBoost vs. gradient boosting]

(BTW I think this must be the longest PR of all time... 😉 )

@pprett (Member) commented Feb 4, 2013

uhh... when I see the poor performance of GBRT I feel tempted to revert my +1 on this PR ;-)

PS: +1 for doing that in a separate PR - I probably need another year to tune the hell out of GBRT for this example...

@glouppe (Contributor) commented Feb 4, 2013

@amueller or @GaelVaroquaux Feel free to do your rebase-fu :)

@amueller (Member) commented Feb 4, 2013

will do in half an hour. thanks a lot for the great work and the patience!


@GaelVaroquaux (Member):

> will do in half an hour. thanks a lot for the great work and the patience!

I am on it actually. I just started.

@GaelVaroquaux (Member):

> I am on it actually. I just started.

OK, I am getting nasty conflicts in grid_search.py (which should never have been touched by this PR :} ). I am in favor of giving up on the rebase and merging. @amueller, WDYT?

@amueller (Member) commented Feb 4, 2013

Feel free. The rebase trouble is because this is the branch of the original PR and stuff got reverted, not rebased....
Go for it :)

@GaelVaroquaux (Member):

> Feel free. The rebase trouble is because this is the branch of the original PR and stuff got reverted, not rebased....

Yup! Messy history yields more messy history.

> Go for it :)

Merging, running the tests and pushing.

@GaelVaroquaux merged commit 1f95b18 into scikit-learn:master on Feb 4, 2013
@GaelVaroquaux (Member):

> Merging, running the tests and pushing.

Merged and pushed. 🍻

@glouppe (Contributor) commented Feb 4, 2013

phew

Yay!! Thanks everyone!

(PS: I'll edit the What's New file myself.)

@amueller (Member) commented Feb 4, 2013

wohoo! Grats @glouppe and @ndawe

@arjoly (Member) commented Feb 4, 2013

Congratz !!! :-)

@agramfort (Member): awesome job guys!

@ndawe (Member Author) commented Feb 4, 2013

Wow... Just went out for lunch and all this happened. Thanks for merging, and sorry for the messy history....

🍻

@pprett (Member) commented Feb 4, 2013

great - thanks a lot guys!


@ndawe (Member Author) commented Feb 4, 2013

@pprett I just used the default settings for gradient boosting (both boosting stumps, though), so maybe it's not a fair comparison. Also, the difference could simply be that a given learning_rate means something different in each algorithm.
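
The kind of setup meant here (a sketch; parameter names assumed from the two classes, and the exact script may have differed):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# both boost depth-1 stumps, but the two learning_rate parameters scale
# different quantities, so identical values are not directly comparable
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=1.0)
gbrt = GradientBoostingClassifier(max_depth=1, n_estimators=100,
                                  learning_rate=0.1)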

@mblondel (Member): Sorry to ask this after the merge, but what was the motivation for naming the module weight_boosting? Do you have plans to add other algorithms to it? If not, I think adaboost would be more explicit.

@ndawe (Member Author) commented Feb 10, 2013

Yes, I would also like to implement other weight boosting algorithms such as BrownBoost, LogitBoost, GentleBoost, etc.
