
n_jobs support in GradientBoostingClassifier #3628

Closed

mblondel opened this issue Sep 3, 2014 · 39 comments
Labels
Enhancement, help wanted, Moderate

Comments

@mblondel
Member

mblondel commented Sep 3, 2014

The following loop is embarrassingly parallel:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L552

Edit: I (@ogrisel) removed the Easy tag and put a Moderate tag instead. Based on the discussion below, the most beneficial way to add n_jobs support for GBRT would be deep inside the Cython tree code (to benefit GB regression and AdaBoost models as well, instead of just GB classification).

@pprett
Member

pprett commented Sep 3, 2014

we could use joblib w/ the threading backend here, but we need to ensure that each thread has a separate splitter and criterion. cc @glouppe

@glouppe
Contributor

glouppe commented Sep 4, 2014

Indeed, we should pay attention to this. The safest way is to instantiate as many splitter and criterion objects as loss.K and then use something like splitters[k], criteria[k] when instantiating the decision trees to be fit.
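
For illustration, the per-class loop could look something like this (a rough sketch with made-up helper names, not the actual gradient_boosting.py code). Constructing a fresh tree per class means each thread implicitly gets its own splitter and criterion:

```python
# Rough sketch only: parallelize the K per-class tree fits of one boosting
# stage with joblib's threading backend. Each call constructs a fresh
# DecisionTreeRegressor, so no splitter/criterion object is shared across
# threads. `fit_one_class` / `fit_stage_parallel` are made-up names.
from joblib import Parallel, delayed
from sklearn.tree import DecisionTreeRegressor

def fit_one_class(X, residual_k, sample_weight, max_depth=3, random_state=None):
    tree = DecisionTreeRegressor(criterion="friedman_mse",
                                 max_depth=max_depth,
                                 random_state=random_state)
    tree.fit(X, residual_k, sample_weight=sample_weight)
    return tree

def fit_stage_parallel(X, residuals, sample_weight, n_jobs=2):
    # residuals has shape (n_samples, n_classes): one tree per class.
    n_classes = residuals.shape[1]
    return Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(fit_one_class)(X, residuals[:, k], sample_weight)
        for k in range(n_classes)
    )
```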

@larsmans
Member

larsmans commented Sep 4, 2014

Does the new joblib backend allow sharing a thread pool across calls?

@mblondel
Member Author

mblondel commented Sep 4, 2014

@glouppe Creating one splitter per class would generate a lot of redundant work in the case of PresortBestSplitter because of X_argsorted. It would be nice if X_argsorted could be passed as an optional argument to the build method of a builder, which would then pass it to the splitter if needed. In the meantime, perhaps it's better to just use BestSplitter.
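
To make the redundancy concrete: the presort could be computed once per boosting run and reused for all K trees. Plain numpy below; the idea of passing it down to the builder is the hypothetical API mentioned above and does not exist today:

```python
# Illustration only: the per-tree argsort that PresortBestSplitter recomputes
# could be done once per boosting run and reused for all K class-wise trees.
# A hypothetical builder.build(X, y, X_argsorted=...) argument would be needed
# to actually pass it down; no such argument exists in the current code.
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(1000, 20)

# One argsort per feature, computed a single time.
X_argsorted = np.asfortranarray(np.argsort(X, axis=0).astype(np.int32))
```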

@ogrisel
Member

ogrisel commented Sep 22, 2014

Does the new joblib backend allow sharing a thread pool across calls?

At this time this is not the case: the thread pool is created at each call.

@ogrisel
Member

ogrisel commented Sep 22, 2014

In the meantime, perhaps it's better to just use BestSplitter.

That would need to be benchmarked.

@ogrisel
Member

ogrisel commented Sep 22, 2014

I think it might be more generally useful to do threading for finding the best feature to split on inside the _tree.pyx code, reusing a fixed thread pool.

This would benefit both GB classification and regression, and also AdaBoost with trees. The refactoring of the Cython code to do so does not seem trivial, though.
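
To sketch the idea at the Python level (the real change would live inside the Cython splitter with a reusable thread pool; none of the names below are the actual tree API): evaluate a candidate split per feature in parallel, then reduce to the single best one.

```python
# Illustrative sketch of "parallel over features, then reduce to the best
# split". The real implementation would live in the Cython splitter with a
# fixed thread pool; nothing here is scikit-learn's actual API.
import numpy as np
from joblib import Parallel, delayed

def best_split_for_feature(x_col, y):
    # Brute-force best variance-reduction split on a single feature.
    order = np.argsort(x_col)
    xs, ys = x_col[order], y[order]
    best_score, best_threshold = np.inf, None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        score = left.var() * len(left) + right.var() * len(right)
        if score < best_score:
            best_score, best_threshold = score, (xs[i] + xs[i - 1]) / 2.0
    return best_score, best_threshold

def best_split(X, y, n_jobs=4):
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(best_split_for_feature)(X[:, j], y) for j in range(X.shape[1])
    )
    j = int(np.argmin([score for score, _ in results]))
    return j, results[j][1]  # (best feature index, threshold)
```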

@mblondel
Member Author

For RF, in the single machine case, the search for the best feature to split on should probably not be parallelized, since parallelization is already done for creating trees. Two levels of parallelization are likely to lead to hangs. For the multiple machine case, this would be useful though. For GBRT, this is of course not a problem since learning is sequential.

@pprett
Member

pprett commented Sep 22, 2014

@ogrisel I strongly agree and would love to do that but I see some issues currently:

  1. we need to make the criterion objects thread-local (and they have to be copied from the initialized prototype)
  2. the feature value buffer Xf must be thread-local

@nachi11

nachi11 commented Nov 19, 2014

I would like to contribute to scikit-learn. Is this issue open to take up?

@pprett
Member

pprett commented Nov 19, 2014

@nachi11 which one? Parallelizing the multi-class case, or parallelizing the induction of a single tree by computing the best split for each feature in parallel?

Both would be cool but the latter would be the path to fame & glory & a crate of beer

@nachi11

nachi11 commented Nov 20, 2014

I will try the easier one first: I'll work on parallelizing the multi-class case. I have the environment configured. How do I start?

@pprett
Member

pprett commented Nov 20, 2014

@nachi11 please read this guide carefully: http://scikit-learn.org/stable/developers/#contributing
Once you have something working, create a pull request on GitHub that describes your contributions and ping me - thanks!

@ogrisel
Member

ogrisel commented Nov 20, 2014

Honestly I am not sure that working on n_jobs for the multiclass case of GBRT is interesting. The other (more complex) parallelization option is more useful, although it requires a much deeper understanding of the Cython code base of the trees and of how they are used by higher-level ensembles like gradient boosting, AdaBoost, random forests and extra trees.

@pprett
Member

pprett commented Nov 20, 2014

@ogrisel @nachi11 I agree that parallelization of the multi-class case is not too interesting but still a nice feature; however, once we have parallelization over features it will likely become irrelevant.

@nachi11 there might be more urgent issues to work on in sklearn

@nachi11

nachi11 commented Nov 21, 2014

@pprett Could you please suggest some urgent issues that a beginner could solve?

@ogrisel added the Moderate label and removed the Easy label Nov 21, 2014
@ogrisel
Member

ogrisel commented Nov 21, 2014

@nachi11 related to trees, #3832 should be easy.

@nachi11

nachi11 commented Nov 22, 2014

@ogrisel Thanks I'll try to solve it and get back.

@angadgill

Is there active development on this issue?

@amueller
Member

@angadgill no. See #3628 (comment) ;)

@angadgill

Ah, I see. Thanks for pointing me to it. I'm considering taking this up as a research project. Any useful pointers?

@JonahJ

JonahJ commented Jun 15, 2016

+1

@amueller
Member

I think this is open to people who want to try the "multiclass" parallelization, though I agree with @ogrisel that that might not be that useful. We can be convinced by benchmarks, though.

It is also still open for a brave soul to parallelize the tree at the Cython level.

@amanp10
Contributor

amanp10 commented Mar 21, 2017

Regarding the first task, multiclass parallelization: has it been done, or is someone working on it?
I am planning to take up both tasks as my GSoC 2017 project.

@jmschrei
Member

Parallelized decision tree building is still an unsolved issue, in part due to the complexity of the underlying code.

@andreh7
Contributor

andreh7 commented Apr 11, 2017

For the parallelized tree building, is it correct that essentially the loop over features starting here

while (f_i > n_total_constants and # Stop early if remaining features
(ending at line 490) needs to be parallelized?

@jmschrei
Member

Yes.

@andreh7
Contributor

andreh7 commented Apr 23, 2017

While working on #8779 I noticed the following related to sklearn/tree/_splitter.pyx
(I apologize if these are obvious):

  1. There seems to be quite a bit of overlap between the node_split(..) methods
    of the inheriting classes (BestSplitter, RandomSplitter, BestSparseSplitter
    and RandomSparseSplitter). This could probably be reduced if these classes
    just implemented a method which looks at a single feature only and returns
    either the best split or an indication that the feature was constant (I think
    this is what @glouppe suggested on September 4, 2014; see the rough sketch at
    the end of this comment). node_split(..) currently returns 0 if there was no
    error; a method looking at only one feature could return +1 to indicate that
    the feature was constant at this node. One (minor, in my opinion) drawback is
    that each thread allocates its own memory for Xf, samples, sample_mask etc.,
    but this is probably unavoidable.
  2. Despite the similarity, some things seem to have diverged. Looking
    for example at the differences between RandomSplitter and BestSplitter,
    the variable feature_offset in BestSplitter seems to be the same as
    the variable feature_stride in RandomSplitter. Both introduce a local variable
    X_feature_stride, but BestSplitter then uses self.X_feature_stride again
    instead of the local variable. current_feature_value is defined in both
    RandomSparseSplitter and BestSparseSplitter but in different places. How
    difficult would it be to move these four splitter classes to individual
    files (this would make them easier to compare)?
  3. The outermost loop in the node_split(..) methods, which loops over features,
    picks not-yet-explored features at random. Is there any reason why these have
    to be picked at random? In the end one has to consider all non-constant features
    anyway.
  4. The node_split(..) methods move features which are constant within this node
    towards the beginning of the features array and those which aren't towards the end.
    This allows skipping features which are already constant at the current node
    when investigating its child nodes later. Modifying the features array is clearly
    something which should not be done from multiple threads, but rather by the
    function spawning them (see also the first point).
  5. How much does this issue here overlap with [WIP] Alternative architecture for regression tree #8458 / [RFC] Gradient Boosting improvement #8231 and [RFC] Tree module improvements #5212 ?
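
For points 1 and 4, here is a rough sketch of what the per-feature method plus the single-threaded bookkeeping could look like (illustrative names only, none of this is the _splitter.pyx API):

```python
# Sketch of points 1 and 4: each worker only evaluates a single feature and
# reports either a split or a "constant at this node" flag; the caller then
# reorders the features array single-threaded, so no shared state is mutated
# from multiple threads. None of these names exist in _splitter.pyx.
import numpy as np
from joblib import Parallel, delayed

FEATURE_SPLIT_FOUND, FEATURE_CONSTANT = 0, 1

def split_one_feature(x_col, y):
    if np.ptp(x_col) < 1e-12:
        return FEATURE_CONSTANT, None
    threshold = np.median(x_col)  # stand-in for the real best-split search
    return FEATURE_SPLIT_FOUND, threshold

def node_split(X, y, features, n_constant, n_jobs=4):
    candidates = features[n_constant:]
    results = Parallel(n_jobs=n_jobs, backend="threading")(
        delayed(split_one_feature)(X[:, f], y) for f in candidates
    )
    # Single-threaded bookkeeping: move newly constant features to the front.
    constant = [f for f, (flag, _) in zip(candidates, results) if flag == FEATURE_CONSTANT]
    non_constant = [f for f, (flag, _) in zip(candidates, results) if flag == FEATURE_SPLIT_FOUND]
    new_features = np.concatenate([features[:n_constant], constant, non_constant])
    return new_features.astype(features.dtype), n_constant + len(constant), results
```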

@glemaitre
Member

glemaitre commented Apr 23, 2017

Regarding 5., you have a large overlap.

We are implementing the same architecture as in XGBoost for the exact approach. Currently, @raghavrv is cythonizing the implementation. It should be almost ready in a couple of days, so that it can be benchmarked. With this implementation, it should be relatively easy to parallelize the gradient boosting, when finding the split for each feature, using joblib. It comes at the cost of having another tree code base to maintain, so we need to be sure of the benefit before duplicating the code.

In the meanwhile, I started to work on an approximate approach using histograms, which is much faster with a large number of samples. This is a feature used in LightGBM, XGBoost, and FastBDT. We first needed a quantile transformer, which took more time than expected since it can be used for pre-processing as well. It is almost ready to be merged, and I will be able to focus on the histogram computation from now on.
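
For concreteness, the binning idea boils down to something like this (plain numpy, not the implementation that will land in scikit-learn): map each feature to at most 256 quantile-derived bins so the binned data fits in a uint8.

```python
# Plain-numpy illustration of quantile binning for the histogram approach:
# each feature is mapped to at most 256 quantile-derived bins, so the binned
# matrix fits in uint8 and histograms can be built very cheaply.
import numpy as np

def quantile_bin(X, n_bins=256):
    X = np.asarray(X, dtype=np.float64)
    binned = np.empty(X.shape, dtype=np.uint8)
    bin_edges = []
    for j in range(X.shape[1]):
        # Interior quantiles define the bin boundaries for this feature.
        edges = np.percentile(X[:, j], np.linspace(0, 100, n_bins + 1)[1:-1])
        bin_edges.append(edges)
        binned[:, j] = np.searchsorted(edges, X[:, j], side="right")
    return binned, bin_edges
```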

@MechCoder I passed by scikit-garden and saw an implementation using quantiles. Do you have anything going on that could be relevant for the current matter? You might be interested in the QuantileTransformer as well.

@MechCoder
Member

Thanks for the ping and your great work on the quantile transformer. Sorry if this is dumb, but do you know why the quantile transformation helps in the case of a decision tree, i.e. the split computation is independent of the feature scaling, no?

Regarding scikit-garden: the quantiles are returned at predict time as an estimate of P(Y|X) at different percentiles; they are not used as a preprocessing or scaling technique.

@glemaitre
Member

Thanks for the ping and your great work on the quantile transformer. Sorry if this is dumb, but do you know why the quantile transformation helps in the case of a decision tree, i.e. the split computation is independent of the feature scaling, no?

It will help in the approximation mode. The quantile transformer is used for binning the data that is later used to build the histograms. In addition, using something that fits in an int8 could allow using the CPU more efficiently.

@MechCoder
Member

MechCoder commented May 4, 2017 via email

@cmarmo
Contributor

cmarmo commented Jun 4, 2020

@glemaitre, @NicolasHug , the gradient_boosting.py file is no longer there: as far as I can understand only the cythonized version of the algorithm is now in scikit-learn. Might we close this one and open a new one, if necessary, about the new implementation? Thanks!

@NicolasHug
Member

yeah I think this was about parallelizing the k trees of a single iteration in a multiclass setting but this is irrelevant now that we have the histogram-based GBDT. Thanks for the ping @cmarmo

@glemaitre
Member

Parallelizing over the features in the decision tree would still be a thing to try.

@cmarmo
Contributor

cmarmo commented Jun 4, 2020

Parallelizing over the features in the decision tree would still be a thing to try.

No doubt about that, but the reference in the description is now a 404 link and a lot of things have changed in scikit-learn. As this is a "help wanted" issue, maybe it is worth opening a new issue with updated references and description.

@NicolasHug
Member

parallelizing over the features in the decision tree would still be a thing to try.

just curious, in what cases do you think this would be useful? For forests, we parallelize with joblib at the tree level so that would not be needed.
For GB, we have the histogram-based estimators, and for single trees I would assume these are fast enough since they're single trees (and people would rather use forests / GBDT in practice)
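
For reference, the histogram-based estimators already parallelize the per-feature work with OpenMP under the hood, e.g. (on older scikit-learn versions the experimental enable import is still required first):

```python
# The histogram-based GBDT already parallelizes its per-feature histogram and
# split-finding work with OpenMP threads internally (controlled via OMP_NUM_THREADS).
# On older scikit-learn versions, uncomment the experimental enable import first:
# from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = HistGradientBoostingClassifier(max_iter=100, random_state=0).fit(X, y)
print(clf.score(X, y))
```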

@glemaitre
Member

just curious, in what cases do you think this would be useful? For forests, we parallelize with joblib at the tree level so that would not be needed.

This level of parallelization would be a prange with OpenMP, so it would give 2 levels of parallelization.
A potential use case is to use dask as a backend for the outer layer and use the underlying resources for the second layer.
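
Roughly, the outer layer could look like this (requires dask.distributed; the inner OpenMP prange inside the tree code does not exist today, so only the outer level is real here):

```python
# Two-level sketch: joblib's dask backend distributes the per-tree fits across
# workers (outer level); a hypothetical OpenMP prange inside the tree code
# would then use the cores available on each worker (inner level, not
# implemented in scikit-learn today).
from dask.distributed import Client
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client(n_workers=2, threads_per_worker=4)  # local cluster, illustration only
X, y = make_classification(n_samples=10000, n_features=50, random_state=0)

with parallel_backend("dask"):
    RandomForestClassifier(n_estimators=100, n_jobs=-1).fit(X, y)
```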

For GB, we have the histogram-based estimators, and for single trees I would assume these are fast enough since they're single trees (and people would rather use forests / GBDT in practice)

I agree that HGBDT is the algorithm to use. However, the improvement would impact all three classes, so if the GBDT can be sped up for "free" by having a prange, I don't think this is that bad.
That said, I agree that this is not the most urgent priority, as it would require some refactoring and simplification of the tree code, as @adrinjalali spotted in the past.

@adrinjalali
Member

Yeah I think this should rather happen after the refactoring of the tree code, and I think once we get to doing that, we would also add the parallelization at the same time. So this can stay closed, but I don't think we'll get rid of the tree code or the old ensembles.
