
MRG: AdaBoost for regression and multi-class classification #522


Merged: 5 commits into scikit-learn:master on Feb 4, 2013

Conversation

@ndawe (Member) commented Jan 3, 2012

This PR adds:

  • a new ensemble.weight_boosting module with AdaBoostRegressor (using AdaBoost.R2 [1]) and AdaBoostClassifier (using the multi-class SAMME algorithm [2])
  • a new "Gaussian quantiles" dataset in datasets.samples_generator as used in [2]

Examples are provided:

  • hastie
  • twoclass
  • multiclass
  • regression

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.314
[2] http://www.stanford.edu/~hastie/Papers/samme.pdf
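
For a quick feel of the API, a minimal usage sketch (assuming the estimator and generator names as they land in sklearn; exact signatures in this branch may differ slightly):

from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# boost decision stumps with the multi-class SAMME algorithm
X, y = make_gaussian_quantiles(n_samples=1000, n_features=2, n_classes=3,
                               random_state=0)
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100)
clf.fit(X, y)
print(clf.score(X, y))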

@satra (Member) commented Jan 3, 2012

Went through the diff. Looks good.

sum(weight * clf.feature_importances_
    for weight, clf in zip(self.boost_weights_, self.estimators_)) / norm

Member: Should this return self here, to be consistent with the docstring?

@ndawe (Member Author): Actually, I should fix the docstring: fit returns self as it should, but fit_generator is a generator that yields self.
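
Roughly, the pattern in question (a minimal hypothetical sketch, not the actual code in this branch):

class Booster(object):
    # toy illustration of the fit / fit_generator split; a real
    # implementation would train a weak learner per iteration
    def __init__(self, n_estimators=10):
        self.n_estimators = n_estimators
        self.n_boosts_ = 0

    def fit_generator(self, X, y):
        # yields self after each boosting iteration so callers can
        # inspect the partially fitted ensemble
        for _ in range(self.n_estimators):
            self.n_boosts_ += 1  # stands in for one boosting round
            yield self

    def fit(self, X, y):
        # exhausts the generator and returns self, per convention
        for _ in self.fit_generator(X, y):
            pass
        return self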

@jakevdp (Member) commented Jan 3, 2012

I like the examples and documentation. Nice work!

@pprett (Member) commented Jan 3, 2012

@ndawe Thanks for the contribution - sample weights are a crucial feature of the tree module.
I'm going to make a thorough review in the coming days - I just skimmed over the code and noticed one issue: when the sample mask is too sparse (below min_density) the data structures get updated (= fancy indexed via the sample_mask - see line 322 in tree.py). This should also apply to sample_weight, because otherwise the weights and labels will get out of sync.
While we check the code I'd propose to add assert statements liberally to check that the invariants hold (e.g. sample_weight.shape[0] == y.shape[0]).
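
To illustrate the invariant, a standalone sketch (min_density and the fancy-indexing step are paraphrased from tree.py, not copied):

import numpy as np

rng = np.random.RandomState(0)
n_samples, min_density = 100, 0.1
X = rng.rand(n_samples, 4)
y = rng.randint(2, size=n_samples)
sample_weight = np.ones(n_samples)
sample_mask = rng.rand(n_samples) < 0.05  # a very sparse mask

# when the mask density drops below min_density, the tree code
# fancy-indexes X and y down to the masked samples; sample_weight
# must follow, or weights and labels go out of sync
if sample_mask.mean() < min_density:
    X = X[sample_mask]
    y = y[sample_mask]
    sample_weight = sample_weight[sample_mask]

# the invariant to assert throughout the tree code
assert sample_weight.shape[0] == y.shape[0]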

Again, thanks for working on this feature - I really appreciate that!

@glouppe (Contributor) commented Jan 3, 2012

This is great indeed! I will definitely take time as well to review your code.

Is this still a work in progress ("WIP") or do you consider your code ready to be merged ("MRG")? (In any case, would you rename your PR with a "WIP" or "MRG" prefix? Thanks!)

else:
    y.append(3)

y = np.array(y)
Member: Is this dataset a standard dataset used in the literature? If so, I think it would be a good idea to move it to the sklearn.datasets.samples_generator package and add a link to the main reference paper in the docstring.

@ndawe (Member Author): This dataset is used in the paper introducing the SAMME [1] modification of AdaBoost, but I don't know how standard it is in the literature. Being a simple dataset that easily accommodates any number of classes, it might be a good one to have in sklearn.datasets.samples_generator.

[1] www.stanford.edu/~hastie/Papers/samme.pdf
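
For reference, a sketch of the construction (assuming classes separated by quantiles of the squared distance from the origin, as in the paper; the eventual generator in sklearn.datasets may differ in details):

import numpy as np

def make_gaussian_quantiles_sketch(n_samples=1000, n_features=2,
                                   n_classes=3, random_state=None):
    # draw a standard multivariate Gaussian and label each sample by
    # which quantile of the squared distance from the origin it falls in
    rng = np.random.RandomState(random_state)
    X = rng.randn(n_samples, n_features)
    d = np.sum(X ** 2, axis=1)
    cuts = np.percentile(d, np.linspace(0, 100, n_classes + 1))
    y = np.searchsorted(cuts[1:-1], d)
    return X, y

X, y = make_gaussian_quantiles_sketch(n_samples=300, n_classes=3,
                                      random_state=0)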

Member: +1 then

@ogrisel (Member) commented Jan 3, 2012

When I run the tests of the random forest on this branch I get the following crash in the joblib.Parallel implementation of the random forest with 2 backtraces:

nosetests -v sklearn/ensemble/tests/test_forest.py
Check classification on a toy dataset. ... ok
Check consistency on dataset iris. ... ok
Check consistency on dataset boston house prices. ... ok
Predict probabilities. ... Warning: divide by zero encountered in log
Warning: divide by zero encountered in log
ok
Check variable importances. ... ok
Check that base trees can be grid-searched. ... ok
Check parallel computations. ...
 *** glibc detected *** /usr/bin/python: malloc(): smallbin double linked list corrupted: 0x000000000418be30 ***
======= Backtrace: =========
[...] the first back trace has been redacted out as not scikit-learn specific
*** glibc detected *** /usr/bin/python: corrupted double-linked list: 0x0000000004108810 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x76bb6)[0x7f7d3a387bb6]
/lib/x86_64-linux-gnu/libc.so.6(+0x7a931)[0x7f7d3a38b931]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x6e)[0x7f7d3a38d31e]
/usr/lib/pymodules/python2.7/numpy/core/multiarray.so(+0x2bc0a)[0x7f7d37819c0a]
/home/ogrisel/coding/scikit-learn/sklearn/tree/_tree.so(+0x7b27)[0x7f7d2f284b27]
[...]

So it seems that the modification to the _tree.pyx compiled extensions triggers a multiprocessing bug. Can anybody else reproduce this?

@ogrisel (Member) commented Jan 3, 2012

Test coverage of the boost module could probably be improved a bit:

nosetests --cover-package=sklearn.ensemble.boost --with-coverage sklearn/ensemble/tests/test_boost.py
..Warning: divide by zero encountered in log
...
Name                     Stmts   Miss  Cover   Missing
------------------------------------------------------
sklearn.ensemble.boost      81     12    85%   74, 81, 91, 150, 164, 179-180, 197-198, 247-249
----------------------------------------------------------------------
Ran 5 tests in 0.093s

OK

Reading the missing lines above, I think all of them should be covered by the test suite (including testing for expected exceptions using assert_raises).
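
For instance, covering the parameter-validation branches could look like this (hypothetical test; the actual error conditions in boost.py may differ):

import numpy as np
from nose.tools import assert_raises
from sklearn.ensemble import AdaBoostClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# an invalid n_estimators should raise ValueError at fit time
assert_raises(ValueError, AdaBoostClassifier(n_estimators=-1).fit, X, y)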

pl.plot(xrange(10), tree.feature_importances_[indices], "r")

pl.plot(xrange(10), importances[indices], "b")
pl.show()
Member: I think that this example should be combined with the existing plot_forest_importances.py rather than introducing a new example. You can use the multi-plot support of the sphinx / examples integration: calling plot on several figures will generate a gallery of plots that are all inlined in the documentation.

@ndawe (Member Author): I've removed the boosting example here since it really isn't that impressive. AdaBoost typically stops after ~2 boosting iterations on this dataset and the resulting plot isn't very interesting.

@ogrisel (Member) commented Jan 3, 2012

I think this PR should be marked as WIP as it lacks narrative documentation giving an introduction to the concept of boosting and the AdaBoost algorithm in particular, with links to the reference papers and examples.

@ogrisel (Member) commented Jan 3, 2012

Also the "SAMME" modification for multiclass problems should be introduced in the narrative documentation as well and if SAMME is an acronym it should be expanded at least once to increase googlability (and for readability).

@ogrisel (Member) commented Jan 3, 2012

As others said, thank you @ndawe for this very interesting contribution. As a side note, I really appreciate the fact that this PR respects the scikit API, variable naming and coding style conventions, and that it's pep8 and pyflakes clean :)

# TODO request that classifiers return classification
# of training sets when fitting
# which would make the following line unnecessary
p = estimator.predict(X)
Member: You could test something like:

if hasattr(estimator, 'fit_predict'):
    # optim for estimators that are able to save redundant computations
    # when calling fit + predict on the same input X
    p = estimator.fit_predict(X, y, sample_weight=sample_weight, **kwargs)
else:
    p = estimator.fit(X, y, sample_weight=sample_weight, **kwargs).predict(X)

@ndawe (Member Author): Thanks, I've added this check. Will also consider adding fit_predict to DecisionTreeClassifier.

@ndawe (Member Author) commented Jan 3, 2012

@ogrisel I can reproduce the same crash. I rebased my branch on top of recent changes from @glouppe and regenerated/compiled _tree.c and still get the crash. Will investigate.

@ogrisel (Member) commented Jan 4, 2012

I confirm that the crash is not present in the current master so the issue is somehow related to the introduction of sample_weight in _tree.pyx. Valgrind or gdb are probably needed to understand what's happening.

@ndawe (Member Author) commented Jan 4, 2012

Line 211 in test_forest.py uses a RandomForestClassifier (and again on line 225) instead of a RandomForestRegressor on the Boston dataset. np.unique then gives 229 classes... When changed to RandomForestRegressor, assert_array_equal(y1, y2) on line 222 fails due to differences on the order of 1e-15, but it doesn't crash.

@ogrisel (Member) commented Jan 4, 2012

@ndawe indeed, running a classifier on a regression task is probably a bad idea :) and it sounds like the cause of the issue. I wonder if this crash should still be considered a bug or not. @glouppe, @pprett any opinion?

I will fix the forest test in master as it is a bug to test a classifier on a regression dataset anyway.

@ogrisel (Member) commented Jan 4, 2012

Done. You can either merge master into your branch or rebase.

@ogrisel (Member) commented Jan 4, 2012

BTW @ndawe don't you think it would make sense to be able to pass multiple, unrelated base learner instances instead of a single instance (e.g. an SVM with gaussian kernels and a decision tree)?

@ndawe (Member Author) commented Jan 4, 2012

@ogrisel you mean boosting an ensemble? Boosting a random forest should already be possible but we could introduce additional classes ClassifierEnsemble and RegressorEnsemble which could hold an arbitrary list of classifiers/regressors (no use of base_estimator or _make_estimator from BaseEnsemble). Prediction would be the (possibly weighted) prediction from each contained classifier/regressor and the entire ensemble could be boosted.
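
Something like the following, perhaps (a hypothetical sketch; it assumes all members expose predict_proba and that the classes are the integers 0..k-1):

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    # hypothetical container for an arbitrary list of classifiers;
    # prediction averages the members' class probabilities
    def __init__(self, classifiers):
        self.classifiers = classifiers

    def fit(self, X, y):
        for clf in self.classifiers:
            clf.fit(X, y)
        return self

    def predict_proba(self, X):
        # unweighted average of the members' probability estimates
        return np.mean([clf.predict_proba(X)
                        for clf in self.classifiers], axis=0)

    def predict(self, X):
        # assumes integer classes 0..k-1 for simplicity
        return np.argmax(self.predict_proba(X), axis=1)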

@ogrisel (Member) commented Jan 4, 2012

I don't think we need to introduce a new Ensemble class. I was just thinking of replacing the base_estimator constructor param, which expects a single instance, by a base_estimators param that expects a list of classifiers, and changing the _make_estimator method to pick one of them sequentially (or randomly) whenever it is called.
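
Something along these lines (a sketch; the real _make_estimator does more bookkeeping):

import itertools
from sklearn.base import clone
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# hypothetical: cycle through several base estimators instead of
# cloning a single base_estimator each time _make_estimator is called
base_estimators = itertools.cycle([DecisionTreeClassifier(max_depth=1),
                                   SVC(kernel='rbf')])
estimators_ = [clone(next(base_estimators)) for _ in range(4)]
# -> tree, svm, tree, svm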

@mblondel (Member) commented Jan 4, 2012

base_estimator could accept either a single estimator instance or a list of them (in which case n_estimators would be ignored).

@glouppe (Contributor) commented Jan 4, 2012

@ogrisel This is not standard and, imho, I don't think we should overcomplicate the interface to allow that special case. At least for now.

@ndawe (Member Author) commented Jan 4, 2012

@ogrisel I haven't seen boosting used in that way, but that's not to say it hasn't been done or couldn't produce nice results. I agree with @glouppe that for now we should focus on getting the standard boosting solid. Regarding my previous comment, I do feel generic EnsembleClassifier and EnsembleRegressor classes would be convenient to have. Then you could group together arbitrary classifiers/regressors and boost them, etc. BoostedClassifier only requires that the base_estimator inherits from ClassifierMixin and is completely oblivious to whether it's boosting one classifier or a classifier that is a conglomerate of several. I would then suggest that BaseEnsemble be stripped down to the bare essentials (basically the API of a list) and that _make_estimator, base_estimator, and n_estimators be moved down to a special EnsembleGenerator class used by random forests and boosting.

@ogrisel (Member) commented Jan 5, 2012

Alright, if this is not the traditional way to use boosting then I am happy not having it implemented in the default AdaBoost class of scikit-learn. However, I thought that boosting was often used by the winning teams of machine learning challenges such as the Netflix prize to combine unrelated classifiers (e.g. RBMs/DBNs + random forests + kernel SVMs + whatever) into a compound classifier better than any of them independently.

@ogrisel (Member) commented Jan 5, 2012

Also, about the EnsembleClassifier / EnsembleRegressor / EnsembleGenerator design: it sounds a bit too nested to me, especially if it does not provide more API than the current, flatter BaseEnsemble base class does. However, the best way to discuss this would be by commenting on an actual working proof-of-concept implementation. Let's postpone this discussion to another pull request and finish the review of this one first, as I think it's soon mergeable (as soon as the ensemble chapter in the narrative doc is extended to introduce boosting).

@pprett (Member) commented Jan 5, 2012

@ogrisel I haven't read the whole thread, but I'm not aware of any boosting approach that uses a heterogeneous ensemble (= different classes of base learners). The winners of the Netflix prize (BellKor's Pragmatic Chaos) you are referring to used a stacking approach: they used the predictions of a large number of level-0 models (RBM, kNN, SVD, ...) to create a new feature representation and then trained a level-1 model on these features to generate the final prediction.

BTW: they used Gradient Tree Boosting ;-)
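
In toy form, the stacking idea looks like this (a sketch only; real stacking uses cross-validated level-0 predictions to avoid leakage):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# level-0 model predictions become a new feature representation...
level0 = [RandomForestClassifier(random_state=0), SVC(probability=True)]
Z = np.column_stack([m.fit(X, y).predict_proba(X)[:, 1] for m in level0])

# ...on which a level-1 model produces the final prediction
level1 = LogisticRegression().fit(Z, y)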

@pprett (Member) commented Jan 5, 2012

Disclaimer: This is not related to boosting.

@glouppe @bdholt1 @ndawe @amueller We recently discussed memory issues of RandomForest on the mailing list which might be due to sampling with replacement - I wonder if we could implement memory-efficient sampling with replacement via sample weights? What do you think? Is it worth the effort, given that we will need the copy as a fallback if the base estimator does not support sample weights...
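
Concretely, the idea would be something like this (a sketch):

import numpy as np

rng = np.random.RandomState(0)
n_samples = 1000

# instead of materializing a bootstrap copy of X, draw indices with
# replacement and encode the resample as per-sample counts passed
# through sample_weight
indices = rng.randint(0, n_samples, n_samples)
sample_weight = np.bincount(indices, minlength=n_samples).astype(np.float64)

# weights sum to n_samples, and zero-weight rows are simply ignored
assert sample_weight.sum() == n_samples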

@glouppe (Contributor) commented Feb 3, 2013

@GaelVaroquaux I think we addressed all of your comments. However, I still don't see the commits you talked about at https://github.com/GaelVaroquaux/scikit-learn/commits/treeweights. Am I missing something?

@GaelVaroquaux (Member):

> @GaelVaroquaux I think we addressed all of your comments. However, I still don't see the commits you talked about at https://github.com/GaelVaroquaux/scikit-learn/commits/treeweights. Am I missing something?

Sorry, I had only one. I just committed and pushed all the local changes that I had on the disk.

@glouppe (Contributor) commented Feb 4, 2013

Okay then, @ndawe could you cherry-pick d8d434e and 2a6523a? Also, I think it is time to edit the What's New file. I'll let you do it :)

It looks like we will finally be done with this PR. I must say that I am more than happy with the result! Good job @ndawe :)

@GaelVaroquaux (Member):

> It looks like we will finally be done with this PR. I must say that I am more than happy with the result!

+1 on that! It's a nice feature-set landing in the scikit.

@ndawe (Member Author) commented Feb 4, 2013

Thanks @glouppe for your help and everyone else for the reviews! @GaelVaroquaux's commits have been cherry-picked. I will add AdaBoost to the What's New file.

@ndawe (Member Author) commented Feb 4, 2013

I will submit a separate PR later comparing gradient boosting and AdaBoost, if that's OK. I had a quick first look:

[two comparison plots: AdaBoost vs. gradient boosting]

(BTW I think this must be the longest PR of all time... 😉 )

@pprett (Member) commented Feb 4, 2013

uhh... when I see the poor performance of GBRT I feel tempted to revert my +1 on this PR ;-)

PS: +1 for doing that in a separate PR - I probably need another year to tune the hell out of GBRT for this example...

@glouppe (Contributor) commented Feb 4, 2013

@amueller or @GaelVaroquaux Feel free to do your rebase-fu :)

@amueller (Member) commented Feb 4, 2013

will do in half an hour. thanks a lot for the great work and the patience!


@GaelVaroquaux (Member):

> will do in half an hour. thanks a lot for the great work and the patience!

I am on it actually. I just started.

@GaelVaroquaux (Member):

> I am on it actually. I just started.

OK, I am getting nasty conflicts in grid_search.py (which should never have been touched by this PR :} ). I am in favor of giving up on the rebase and merging. @amueller, WDYT?

@amueller (Member) commented Feb 4, 2013

Feel free. The rebase trouble is because this is the branch of the original PR and stuff got reverted, not rebased....
Go for it :)

@GaelVaroquaux (Member):

> Feel free. The rebase trouble is because this is the branch of the original PR and stuff got reverted, not rebased....

Yup! Messy history yields more messy history.

> Go for it :)

Merging, running the tests and pushing.

@GaelVaroquaux merged commit 1f95b18 into scikit-learn:master on Feb 4, 2013
@GaelVaroquaux (Member):

> Merging, running the tests and pushing.

Merged and pushed. 🍻

@glouppe (Contributor) commented Feb 4, 2013

phew

Yay!! Thanks everyone!

(PS: I'll edit the What's New file myself.)

@amueller (Member) commented Feb 4, 2013

wohoo! Grats @glouppe and @ndawe

@arjoly (Member) commented Feb 4, 2013

Congratz !!! :-)

@agramfort (Member): awesome job guys!

@ndawe (Member Author) commented Feb 4, 2013

Wow... Just went out for lunch and all this happened. Thanks for merging, and sorry for the messy history....

🍻

@pprett (Member) commented Feb 4, 2013

great - thanks a lot guys!


@ndawe (Member Author) commented Feb 4, 2013

@pprett I just used the default settings for gradient boosting (both boosting stumps, though), so maybe it's not a fair comparison. Also, the difference could simply be that a given learning_rate means something different in each algorithm.
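
The kind of setup meant here (a sketch; parameter names assumed from the two classes, and the exact script may have differed):

from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

# both boost depth-1 stumps, but the two learning_rate parameters scale
# different quantities, so identical values are not directly comparable
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=1.0)
gbrt = GradientBoostingClassifier(max_depth=1, n_estimators=100,
                                  learning_rate=0.1)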

@mblondel (Member): Sorry to ask this after the merge, but what was the motivation for naming the module weight_boosting? Do you have plans to add other algorithms to it? If not, I think adaboost would be more explicit.

@ndawe (Member Author) commented Feb 10, 2013

Yes, I would also like to implement other weight boosting algorithms such as BrownBoost, LogitBoost, GentleBoost, etc.
