
n_jobs support in GradientBoostingClassifier #3628

Closed

mblondel opened this issue Sep 3, 2014 · 39 comments
Labels
Enhancement, help wanted, Moderate

Comments

@mblondel
Member

mblondel commented Sep 3, 2014

The following loop is embarrassingly parallel:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L552

Edit: I (@ogrisel) removed the Easy tag and put a Moderate tag instead. Based on the discussion below, the most beneficial way to add n_jobs support for GBRT would be deep inside the Cython tree code (to benefit GB regression and AdaBoost models as well, instead of just GB classification).

@pprett
Member

pprett commented Sep 3, 2014

we could use joblib w/ the threading backend here, but we need to ensure that each thread has a separate splitter and criterion. cc @glouppe

@glouppe
Contributor

glouppe commented Sep 4, 2014

Indeed, we should pay attention to this. The safest way is to instantiate as many splitter and criterion objects as loss.K and then use something like splitters[k], criteria[k] when instantiating the decision trees to be fit.
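
For illustration, the per-class loop could look something like this (a rough sketch with made-up helper names, not the actual gradient_boosting.py code). Constructing a fresh tree per class means each thread implicitly gets its own splitter and criterion:

```python
# Rough sketch only: parallelize the K per-class tree fits of one boosting
# stage with joblib's threading backend. Each call constructs a fresh
# DecisionTreeRegressor, so no splitter/criterion object is shared across
# threads. `fit_one_class` / `fit_stage_parallel` are made-up names.
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

def fit_one_class(X, residual_k, sample_weight, max_depth=3, random_state=None):
    tree = DecisionTreeRegressor(criterion="friedman_mse",
                                 max_depth=max_depth,
                                 random_state=random_state)
    tree.fit(X, residual_k, sample_weight=sample_weight)
    return tree

def fit_stage_parallel(X, residuals, sample_weight, n_jobs=2):
    # residuals has shape (n_samples, n_classes): one tree per class.
    n_classes = residuals.shape[1]
    return Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(fit_one_class)(X, residuals[:, k], sample_weight)
        for k in range(n_classes)
    )
```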

@larsmans
Member

larsmans commented Sep 4, 2014

Does the new joblib backend allow sharing a thread pool across calls?

@mblondel
Member Author

mblondel commented Sep 4, 2014

@glouppe Creating one splitter per class would generate a lot of redundant work in the case of PresortBestSplitter because of X_argsorted. It would be nice if X_argsorted could be passed as an optional argument to the build method of a builder, which would then pass it to the splitter if needed. In the meantime, perhaps it's better to just use BestSplitter.
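
To make the redundancy concrete: the presort could be computed once per boosting run and reused for all K trees. Plain numpy below; the idea of passing it down to the builder is the hypothetical API mentioned above and does not exist today:

```python
# Illustration only: the per-tree argsort that PresortBestSplitter recomputes
# could be done once per boosting run and reused for all K class-wise trees.
# A hypothetical builder.build(X, y, X_argsorted=...) argument would be needed
# to actually pass it down; no such argument exists in the current code.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)

# One argsort per feature, computed a single time.
X_argsorted = np.asfortranarray(np.argsort(X, axis=0).astype(np.int32))
```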

@ogrisel
Member

ogrisel commented Sep 22, 2014

Does the new joblib backend allow sharing a thread pool across calls?

At this time this is not the case: the thread pool is created at each call.

@ogrisel
Member

ogrisel commented Sep 22, 2014

In the meantime, perhaps it's better to just use BestSplitter.

That would need to be benchmarked.

@ogrisel
Member

ogrisel commented Sep 22, 2014

I think it might be more generally useful to do threading for finding the best feature to split on inside the _tree.pyx code, reusing a fixed thread pool.

This would benefit both GB classification and regression, and also AdaBoost with trees. The refactoring of the Cython code to do so does not seem trivial, though.
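
To sketch the idea at the Python level (the real change would live inside the Cython splitter with a reusable thread pool; none of the names below are the actual tree API): evaluate a candidate split per feature in parallel, then reduce to the single best one.

```python
# Illustrative sketch of "parallel over features, then reduce to the best
# split". The real implementation would live in the Cython splitter with a
# fixed thread pool; nothing here is scikit-learn's actual API.
import numpy as np
from joblib import Parallel, delayed

def best_split_for_feature(x_col, y):
    # Brute-force best variance-reduction split on a single feature.
    order = np.argsort(x_col)
    xs, ys = x_col[order], y[order]
    best_score, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        score = left.var() * len(left) + right.var() * len(right)
        if score < best_score:
            best_score, best_threshold = score, (xs[i] + xs[i - 1]) / 2.0
    return best_score, best_threshold

def best_split(X, y, n_jobs=4):
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(best_split_for_feature)(X[:, j], y) for j in range(X.shape[1])
    )
    j = int(np.argmin([score for score, _ in results]))
    return j, results[j][1]  # (best feature index, threshold)
```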

@mblondel
Member Author

For RF, in the single machine case, the search for the best feature to split on should probably not be parallelized, since parallelization is already done for creating trees. Two levels of parallelization are likely to lead to hangs. For the multiple machine case, this would be useful though. For GBRT, this is of course not a problem since learning is sequential.

@pprett
Member

pprett commented Sep 22, 2014

@ogrisel I strongly agree and would love to do that but I see some issues currently:

  1. we need to make the criterion objects thread-local (and they have to be copied from the initialized prototype)
  2. the feature value buffer Xf must be thread-local

@nachi11

nachi11 commented Nov 19, 2014

I would like to contribute to scikit-learn. Is this issue open to take up?

@pprett
Member

pprett commented Nov 19, 2014

@nachi11 which one? Parallelizing the multi-class case, or parallelizing the induction of a single tree by computing the best split for each feature in parallel?

Both would be cool but the latter would be the path to fame & glory & a crate of beer

@nachi11

nachi11 commented Nov 20, 2014

I will try the easier one first: I'll work on parallelizing the multi-class case. I have the environment configured. How do I start?

@pprett
Member

pprett commented Nov 20, 2014

@nachi11 please read this guide carefully: http://scikit-learn.org/stable/developers/#contributing
Once you have something working, create a pull request on GitHub that describes your contributions and ping me - thanks!

@ogrisel
Member

ogrisel commented Nov 20, 2014

Honestly I am not sure that working on n_jobs for the multiclass case of GBRT is interesting. The other (more complex) parallelization option is more useful, although it requires a much deeper understanding of the Cython code base of the trees and of how they are used by higher-level ensembles like gradient boosting, AdaBoost, random forests and extra trees.

@pprett
Member

pprett commented Nov 20, 2014

@ogrisel @nachi11 I agree that parallelization of the multi-class case is not too interesting but still a nice feature; however, once we have parallelization over features it will likely become irrelevant.

@nachi11 there might be more urgent issues to work on in sklearn

@nachi11

nachi11 commented Nov 21, 2014

@pprett Could you please suggest some urgent issues that a beginner could solve?

@ogrisel added the Moderate label and removed the Easy label Nov 21, 2014
@ogrisel
Member

ogrisel commented Nov 21, 2014

@nachi11 related to trees, #3832 should be easy.

@nachi11

nachi11 commented Nov 22, 2014

@ogrisel Thanks I'll try to solve it and get back.

@angadgill

Is there active development on this issue?

@amueller
Member

@angadgill no. See #3628 (comment) ;)

@angadgill

Ah, I see. Thanks for pointing me to it. I'm considering taking this up as a research project. Any useful pointers?

@JonahJ

JonahJ commented Jun 15, 2016

+1

@amueller
Member

I think this is open to people who want to try the "multiclass" parallelization, though I agree with @ogrisel that that might not be that useful. We can be convinced by benchmarks, though.

It is also still open for a brave soul to parallelize the tree at the Cython level.

@amanp10
Contributor

amanp10 commented Mar 21, 2017

Regarding the first task, multiclass parallelization: has it been done, or is someone working on it?
I am planning to take up both tasks as my GSoC 2017 project.

@jmschrei
Member

Parallelized decision tree building is still an unsolved issue, in part due to the complexity of the underlying code.

@andreh7
Contributor

andreh7 commented Apr 11, 2017

For the parallelized tree building, is it correct that essentially the loop over features starting here

while (f_i > n_total_constants and # Stop early if remaining features
(ending at line 490) needs to be parallelized?

@jmschrei
Member

Yes.

@andreh7
Contributor

andreh7 commented Apr 23, 2017

While working on #8779 I noticed the following related to sklearn/tree/_splitter.pyx
(I apologize if these are obvious):

  1. There seems to be quite a bit of overlap between the node_split(..) methods
    of the inheriting classes (BestSplitter, RandomSplitter, BestSparseSplitter
    and RandomSparseSplitter). This could probably be reduced if these classes
    just implemented a method which looks at a single feature only and returns
    either the best split or an indication that the feature was constant (I think
    this is what @glouppe suggested on September 4, 2014; see the rough sketch at
    the end of this comment). node_split(..) currently returns 0 if there was no
    error; a method looking at only one feature could return +1 to indicate that
    the feature was constant at this node. One (minor, in my opinion) drawback is
    that each thread allocates its own memory for Xf, samples, sample_mask etc.,
    but this is probably unavoidable.
  2. Despite the similarity, some things seem to have diverged. Looking
    for example at the differences between RandomSplitter and BestSplitter,
    the variable feature_offset in BestSplitter seems to be the same as
    the variable feature_stride in RandomSplitter. Both introduce a local variable
    X_feature_stride, but BestSplitter then uses self.X_feature_stride again
    instead of the local variable. current_feature_value is defined in both
    RandomSparseSplitter and BestSparseSplitter but in different places. How
    difficult would it be to move these four splitter classes to individual
    files (this would make them easier to compare)?
  3. The outermost loop in the node_split(..) methods, which loops over features,
    picks not-yet-explored features at random. Is there any reason why these have
    to be picked at random? In the end one has to consider all non-constant features
    anyway.
  4. The node_split(..) methods move features which are constant within this node
    towards the beginning of the features array and those which aren't towards the end.
    This allows skipping features which are already constant at the current node
    when investigating its child nodes later. Modifying the features array is clearly
    something which should not be done from multiple threads, but rather by the
    function spawning them (see also the first point).
  5. How much does this issue here overlap with [WIP] Alternative architecture for regression tree #8458 / [RFC] Gradient Boosting improvement #8231 and [RFC] Tree module improvements #5212 ?
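
For points 1 and 4, here is a rough sketch of what the per-feature method plus the single-threaded bookkeeping could look like (illustrative names only, none of this is the _splitter.pyx API):

```python
# Sketch of points 1 and 4: each worker only evaluates a single feature and
# reports either a split or a "constant at this node" flag; the caller then
# reorders the features array single-threaded, so no shared state is mutated
# from multiple threads. None of these names exist in _splitter.pyx.
import numpy as np
from joblib import Parallel, delayed

FEATURE_SPLIT_FOUND, FEATURE_CONSTANT = 0, 1

def split_one_feature(x_col, y):
    if np.ptp(x_col) < 1e-12:
        return FEATURE_CONSTANT, None
    threshold = np.median(x_col)  # stand-in for the real best-split search
    return FEATURE_SPLIT_FOUND, threshold

def node_split(X, y, features, n_constant, n_jobs=4):
    candidates = features[n_constant:]
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(split_one_feature)(X[:, f], y) for f in candidates
    )
    # Single-threaded bookkeeping: move newly constant features to the front.
    constant = [f for f, (flag, _) in zip(candidates, results) if flag == FEATURE_CONSTANT]
    non_constant = [f for f, (flag, _) in zip(candidates, results) if flag == FEATURE_SPLIT_FOUND]
    new_features = np.concatenate([features[:n_constant], constant, non_constant])
    return new_features.astype(features.dtype), n_constant + len(constant), results
```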

@glemaitre
Member

glemaitre commented Apr 23, 2017

Regarding 5., you have a large overlap.

We are implementing the same architecture as in XGBoost for the exact approach. Currently, @raghavrv is cythonizing the implementation. It should be almost ready in a couple of days, so that it can be benchmarked. With this implementation, it should be relatively easy to parallelize the gradient boosting, when finding the split for each feature, using joblib. It comes at the cost of having another tree code base to maintain, so we need to be sure of the benefit before duplicating the code.

In the meanwhile, I started to work on an approximate approach using histograms, which is much faster with a large number of samples. This is a feature used in LightGBM, XGBoost, and FastBDT. We first needed a quantile transformer, which took more time than expected since it can be used for pre-processing as well. It is almost ready to be merged, and I will be able to focus on the histogram computation from now on.
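
For concreteness, the binning idea boils down to something like this (plain numpy, not the implementation that will land in scikit-learn): map each feature to at most 256 quantile-derived bins so the binned data fits in a uint8.

```python
# Plain-numpy illustration of quantile binning for the histogram approach:
# each feature is mapped to at most 256 quantile-derived bins, so the binned
# matrix fits in uint8 and histograms can be built very cheaply.
import numpy as np

def quantile_bin(X, n_bins=256):
    X = np.asarray(X, dtype=np.float64)
    binned = np.empty(X.shape, dtype=np.uint8)
    bin_edges = []
    for j in range(X.shape[1]):
        # Interior quantiles define the bin boundaries for this feature.
        edges = np.percentile(X[:, j], np.linspace(0, 100, n_bins + 1)[1:-1])
        bin_edges.append(edges)
        binned[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return binned, bin_edges
```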

@MechCoder I passed by scikit-garden and saw an implementation using quantiles. Do you have anything going on that could be relevant for the current matter? You might be interested in the QuantileTransformer as well.

@MechCoder
Member

Thanks for the ping and your great work on the quantile transformer. Sorry if this is dumb, but do you know why the quantile transformation helps in the case of a decision tree, i.e. the split computation is independent of the feature scaling, no?

Regarding scikit-garden: the quantiles are returned at predict time as an estimate of P(Y|X) at different percentiles; they are not used as a preprocessing or scaling technique.

@glemaitre
Member

Thanks for the ping and your great work on the quantile transformer. Sorry if this is dumb, but do you know why the quantile transformation helps in the case of a decision tree, i.e. the split computation is independent of the feature scaling, no?

It will help in the approximation mode. The quantile transformer is used for binning the data that is later used to build the histograms. In addition, using something that fits in an int8 could allow using the CPU more efficiently.

@MechCoder
Member

MechCoder commented May 4, 2017 via email

@cmarmo
Contributor

cmarmo commented Jun 4, 2020

@glemaitre, @NicolasHug , the gradient_boosting.py file is no longer there: as far as I can understand only the cythonized version of the algorithm is now in scikit-learn. Might we close this one and open a new one, if necessary, about the new implementation? Thanks!

@NicolasHug
Member

yeah I think this was about parallelizing the k trees of a single iteration in a multiclass setting but this is irrelevant now that we have the histogram-based GBDT. Thanks for the ping @cmarmo

@glemaitre
Member

Parallelizing over the features in the decision tree would still be a thing to try.

@cmarmo
Contributor

cmarmo commented Jun 4, 2020

Parallelizing over the features in the decision tree would still be a thing to try.

No doubt about that, but the reference in the description is now a 404 link and a lot of things have changed in scikit-learn. As this is a "help wanted" issue, maybe it is worth opening a new issue with updated references and description.

@NicolasHug
Member

parallelizing over the features in the decision tree would still be a thing to try.

just curious, in what cases do you think this would be useful? For forests, we parallelize with joblib at the tree level so that would not be needed.
For GB, we have the histogram-based estimators, and for single trees I would assume these are fast enough since they're single trees (and people would rather use forests / GBDT in practice)
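
For reference, the histogram-based estimators already parallelize the per-feature work with OpenMP under the hood, e.g. (on older scikit-learn versions the experimental enable import is still required first):

```python
# The histogram-based GBDT already parallelizes its per-feature histogram and
# split-finding work with OpenMP threads internally (controlled via OMP_NUM_THREADS).
# On older scikit-learn versions, uncomment the experimental enable import first:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = HistGradientBoostingClassifier(max_iter=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```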

@glemaitre
Member

just curious, in what cases do you think this would be useful? For forests, we parallelize with joblib at the tree level so that would not be needed.

This level of parallelization would be a prange with OpenMP, so it would give 2 levels of parallelization.
A potential use case is to use dask as a backend for the outer layer and use the underlying resources for the second layer.
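
Roughly, the outer layer could look like this (requires dask.distributed; the inner OpenMP prange inside the tree code does not exist today, so only the outer level is real here):

```python
# Two-level sketch: joblib's dask backend distributes the per-tree fits across
# workers (outer level); a hypothetical OpenMP prange inside the tree code
# would then use the cores available on each worker (inner level, not
# implemented in scikit-learn today).
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client(n_workers=2, threads_per_worker=4)  # local cluster, illustration only
X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

with parallel_backend("dask"):
    RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
```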

For GB, we have the histogram-based estimators, and for single trees I would assume these are fast enough since they're single trees (and people would rather use forests / GBDT in practice)

I agree that HGBDT is the algorithm to use. However, the improvement would impact all three classes, so if the GBDT can be sped up for "free" by having a prange, I don't think this is that bad.
That said, I agree that this is not the most urgent priority, as it would require some refactoring and simplification of the tree code, as @adrinjalali spotted in the past.

@adrinjalali
Member

Yeah I think this should rather happen after the refactoring of the tree code, and I think once we get to doing that, we would also add the parallelization at the same time. So this can stay closed, but I don't think we'll get rid of the tree code or the old ensembles.
