Multicore scalability of the Histogram-based GBDT #14306
I think the numa-aware optimisation is low priority compared to the
Thanks a lot for the report, I'll look into it. A very easy partial fix is to replace
The arrays can be initialized to zero (e.g. using memset to avoid the GIL) in parallel in the prange threads using a
@ogrisel @NicolasHug I have prepared a replacement of np.zeros in this PR: #14380
Yes of course. Though unless we have tons of features, since max_bins is capped to 255 I don't think we need to do that in parallel. I think LightGBM has a heuristic rule for that. On top of initialization, we can start looking at memory allocation. LightGBM uses an LRU cache for the histograms and we could too. A simple one would be to re-use the histograms of any parent whose children have been split (split = split_info is computed).
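Not scikit-learn's actual code, just a minimal sketch of what such a re-use scheme could look like; the class and method names are hypothetical and the histograms are simplified to plain (n_features, n_bins) float arrays. The real implementation would live in Cython and zero the recycled buffers inside the parallel region.

```python
import numpy as np

class HistogramPool:
    """Hypothetical sketch: recycle histogram buffers instead of allocating
    a fresh (n_features, n_bins) array for every node."""

    def __init__(self, n_features, n_bins):
        self.shape = (n_features, n_bins)
        self._free = []  # buffers released by nodes that were already split

    def get(self):
        # Re-use a released buffer if one is available, otherwise allocate.
        if self._free:
            hist = self._free.pop()
            hist[:] = 0.0  # must be re-zeroed before accumulation
        else:
            hist = np.zeros(self.shape, dtype=np.float64)
        return hist

    def release(self, hist):
        # To be called once a node's split_info has been computed and its
        # histograms are no longer needed, so children can re-use the memory.
        self._free.append(hist)
```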
@SmirnovEgorRu may I ask how you came to this conclusion? What tools did you use?
Did you also make sure to initialize the arrays to zero when
For reference, here are the times (in seconds) I observed for scaling on a 12 physical core CPU:
Versions:
@thomasjpfan the current master of XGBoost contains additional optimizations, giving 2-3x speed-ups and more vs. the 0.9 version depending on the number of cores and algorithm parameters; see the PR with the optimizations.
@SmirnovEgorRu I guess the main concern in this issue is the scalability of our implementation w.r.t. the number of threads, not how fast it is compared to XGBoost.
@NicolasHug I mean that it is possible to look at the optimizations which have been done in XGBoost recently and move some of them into scikit-learn.
oh ok, thanks for the heads up. Is there a reasonably detailed summary of the optimizations that you implemented? Or is reading the code our only hope here ^^ ?
@NicolasHug some information is available in the PRs (dmlc/xgboost#4529, dmlc/xgboost#4433, dmlc/xgboost#4278, dmlc/xgboost#4310, dmlc/xgboost#3957) and here: https://github.com/SmirnovEgorRu/xgboost-fast-hist-perf-lab
Thanks, definitely ping us when the blog post is ready please :)
@SmirnovEgorRu On dmlc/xgboost@53d4272 here is the same benchmark:
@thomasjpfan why do you hate me? (also I find plots easier to read than comparing lists that are far apart and have 6 decimals)
What is GBDT?
Ok, I see, GBDT is GBM.
@Sandy4321 GBDT stands for Gradient Boosted Decision Trees, which are basically GBM (Gradient Boosting Machines) with a Decision Tree as the weak learner.
@NicolasHug @thomasjpfan can you run lscpu to check whether this is a single-socket processor with 12 cores or 2 sockets with 6 cores each (and therefore 2 NUMA nodes)? Based on your scalability curves I would vote for a single-socket CPU but I would like to have confirmation.
I see, thanks
@ogrisel Single socket
@ogrisel Have you experimented with
I ran some benchmarks on a machine with 2 Intel Xeon E5-2620 CPUs @ 2.10GHz (8 cores/16 threads each), so 2 NUMA nodes. TL;DR: the scikit-learn implementation scales just as well (or as badly) as LightGBM.
`python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-leaf-nodes 255 --n-trees 200 --lightgbm`
A few observations:
I did observe the CPUs only being used at around 90% for the sklearn parts. This could explain why LightGBM is faster. Would appreciate your thoughts @SmirnovEgorRu @ogrisel @thomasjpfan :)
@NicolasHug, the blog post I mentioned above is now available here: https://software.intel.com/en-us/download/parallel-universe-magazine-issue-38-august-2019
Very interesting, thanks! If I understand correctly (please correct me if I'm wrong!), it seems that the main difference between our implementations is the histogram computation. In scikit-learn each feature histogram is built independently (one thread per feature). For each feature, we loop through all the samples at the current node. In contrast, XGBoost does not compute each feature histogram independently. Instead it loops over the samples and computes "partial histograms" (which I assume are of size n_features * n_bins) that are then aggregated together in a typical map/reduce procedure. The multi-threading is sample-wise, not feature-wise like in sklearn. For that reason it makes sense for XGBoost to use a C-aligned bin matrix, while we are using an F-aligned matrix. CC @ogrisel @adrinjalali who might be interested in this sort of detail
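To make the difference concrete, here is a rough, purely illustrative NumPy sketch of the two strategies. Sequential loops stand in for the parallel dimension, and gradients/hessians are collapsed into a single per-sample value; the real implementations are in Cython and C++, so this is only a sketch of the idea, not either library's actual code.

```python
import numpy as np

def build_histograms_feature_parallel(X_binned, gradients, n_bins):
    """sklearn-style: one histogram per feature, parallelized over features.
    X_binned is F-ordered so each thread streams one contiguous column."""
    n_samples, n_features = X_binned.shape
    hist = np.zeros((n_features, n_bins))
    for f in range(n_features):          # <- this loop is the parallel one
        np.add.at(hist[f], X_binned[:, f], gradients)
    return hist

def build_histograms_sample_parallel(X_binned, gradients, n_bins, n_threads=4):
    """XGBoost-style: each thread accumulates a partial histogram over a chunk
    of samples (rows), then the partials are reduced (map/reduce).
    X_binned is C-ordered so each thread streams contiguous rows."""
    n_samples, n_features = X_binned.shape
    partials = np.zeros((n_threads, n_features, n_bins))
    for t, chunk in enumerate(np.array_split(np.arange(n_samples), n_threads)):
        for i in chunk:                  # <- chunks run on separate threads
            for f in range(n_features):
                partials[t, f, X_binned[i, f]] += gradients[i]
    return partials.sum(axis=0)          # reduction of the partial histograms
```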
Thanks. Indeed this is very interesting. /cc @jeremiedbb @glemaitre who might also be interested in this.
@NicolasHug in your benchmarks, how many physical cores (without hyperthreads) do you have on the host machine? If it's 16, then it's normal to reach a plateau at 16 openmp threads. But indeed the problem is not as bad as I remember. Still it would be interesting to re-benchmark sklearn on Higgs vs the latest xgboost master on a machine with many cores. Edit: indeed you have 16 physical cores. I had not read the beginning of your comment.
@ogrisel my knowledge of CPUs is quite limited. Could you explain why it's normal to plateau at 16?
Hyperthreaded cores are fake cores and it's quite often the case that they do not provide any speed-up for compute-intensive algorithms. So I would only run benchmarks with
For ref, the overhead induced by using
Some scalability issues w.r.t. the number of threads remain though -- we're still not as fast as LightGBM.
Going back to #14306 (comment), from microsoft/LightGBM#2791 (comment): like XGBoost, LightGBM 3.0 has also implemented sample-wise parallelism to build histograms, instead of feature-wise parallelism. This seems to be preferable for small numbers of threads and features.
Based on #18382, the scalability profile of scikit-learn is not too bad on large problems. On small problems it's bad when n_threads > n_physical_cores. As hyperthreads are never useful, it would be good to improve joblib / loky to detect the number of physical cores and use that as the default number of openmp threads in HGBRT. Work on this is being tracked here: joblib/loky#223.
Besides this over-subscription problem when n_threads > n_physical_cores, the scalability profile of scikit-learn is not too bad. As @NicolasHug analyzed in #18382 (comment), we might also want to review the openmp pragmas used by LightGBM to see if there are things we missed to improve speed in critical parallel sections on smaller problems.
Finally, LightGBM is indeed faster even in sequential mode. This might be related to the LightGBM 3.0 perf improvements (#14306 (comment)) but this is unrelated to multicore scalability.
I just merged #26082 to use only as many threads as physical cores by default. This should improve things a bit (at least avoid some catastrophic slow-downs when fitting many HGBDT models on small datasets on machines with many cores and Hyper-Threading / SMT enabled) but this is not a complete fix.
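For anyone who cannot upgrade yet, a possible workaround (just a sketch of the idea, not the change that was merged) is to cap the OpenMP thread pool to the number of physical cores explicitly, e.g. with joblib and threadpoolctl; `model`, `X_train` and `y_train` are assumed to be defined elsewhere.

```python
from joblib import cpu_count
from threadpoolctl import threadpool_limits

# Number of physical cores, hyperthreads excluded (needs a recent joblib).
n_physical_cores = cpu_count(only_physical_cores=True)

# Cap the OpenMP thread pool used internally by HistGradientBoosting*
# for the duration of the fit.
with threadpool_limits(limits=n_physical_cores, user_api="openmp"):
    model.fit(X_train, y_train)
```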
I observed that our OpenMP-based GBDT does not scale well beyond 4 or 8 threads.
@SmirnovEgorRu, who already contributed scalability improvements to xgboost, sent me the following analysis by email, reproduced here with his permission:
I had an initial look at the histogram building functions and observed a few things:
Scikit-learn's code spends too much time in the np.zeros() function. It is used to create histograms and fill them with zeros. Theoretically it is a cheap operation, but it is sequential and does not use native code. I replaced the usage of np.zeros() by np.empty() and scaling improved; with the maximum number of threads the performance improvement is around 30% (see the sketch after this analysis).
For nodes low in the decision tree, the number of rows can be very small (< 1000). In this case the creation of a parallel loop brings more overhead than value. A similar issue was covered here: hcho3/xgboost-fast-hist-perf-lab#15; to understand the background of the issue you can read the readme in the repository. I hope you will find something interesting there.
The link above also discusses the impact of float <-> double conversions on performance. I have looked at the assembly of scikit-learn's histogram building and observed such conversions too. I haven't tested it, but removing these conversions could help.
If we exclude np.zeros(), building of histograms scales okay as long as we use only one processor. On multi-socket systems (NUMA) it scales very poorly.
Here I can recommend the following:
Split the bin matrix into <number of NUMA nodes> parts by rows and allocate the memory of each part on its NUMA node separately.
Pin threads to NUMA nodes. Each thread should use only the part of the bin matrix located on its pinned NUMA node.
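For illustration of the first observation above (np.zeros vs np.empty), here is a tiny Python sketch of the idea; the real fix lives in Cython, where the zero-filling would be a memset inside an existing prange loop, so treat this only as an outline.

```python
import numpy as np

n_features, n_bins = 100, 256

# Before: allocation AND zero-filling happen sequentially, under the GIL.
hist = np.zeros((n_features, n_bins), dtype=np.float64)

# After: allocate without filling...
hist = np.empty((n_features, n_bins), dtype=np.float64)
# ...and zero each feature's row inside the per-feature loop, which is
# already parallel in the real code (prange releasing the GIL), so the
# initialization cost is spread across threads instead of being serial.
for f in range(n_features):  # stands in for prange(n_features)
    hist[f, :] = 0.0
```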
/cc @NicolasHug @jeremiedbb @adrinjalali @glemaitre @pierreglaser