Multicore scalability of the Histogram-based GBDT #14306

ogrisel opened this issue Jul 12, 2019 · 35 comments

@ogrisel (Member) commented Jul 12, 2019

I observed that our OpenMP based GBDT does not scale well beyond 4 or 8 threads.

@SmirnovEgorRu, who already contributed scalability improvements to XGBoost, sent me the following analysis by email, reproduced here with his permission:

I had an initial look at the histogram building functions and observed a few things:

Scikit-learn’s code spends too much time in np.zeros(). It is used to create histograms and fill them with zeros. In theory this is a cheap operation, but it is sequential and not implemented in native code. I replaced np.zeros() with np.empty() and scaling improved; at the maximum number of threads the performance improvement was around 30%.

For deep decision-tree nodes, the number of rows can be very small (< 1000). In this case creating a parallel loop brings more overhead than benefit. A similar issue was covered in hcho3/xgboost-fast-hist-perf-lab#15; to understand the background of the issue you can read the README in that repository. I hope you will find something interesting for you there.

The link above also discusses the performance impact of float <-> double conversions. I looked at the assembly of scikit-learn’s histogram building and observed such conversions too. I haven’t tested it, but removing these conversions could help.

Excluding the np.zeros() issue, histogram building scales okay as long as only one processor is used. On multi-socket (NUMA) systems it scales very poorly.

Here I can recommend:

  • Split the bin matrix into <number of NUMA nodes> parts by rows and allocate memory on each NUMA node separately.

  • Pin threads to NUMA nodes. Each thread should only use the part of the bin matrix allocated on its own NUMA node.
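The np.zeros / np.empty point above can be illustrated with a small sketch (illustrative names, not scikit-learn's actual internals): np.empty skips the sequential zero-fill, but the buffer then holds garbage and must be cleared explicitly before accumulating, otherwise the histogram is wrong.

```python
import numpy as np

def build_histogram(binned_feature, gradients, n_bins=256):
    # np.empty avoids the sequential zero-fill done by np.zeros,
    # but the buffer contains garbage and MUST be cleared before
    # accumulation (in real code this fill can happen in parallel).
    hist = np.empty(n_bins, dtype=np.float64)
    hist[:] = 0.0  # explicit, memset-like zero-fill
    np.add.at(hist, binned_feature, gradients)  # unbuffered accumulation
    return hist

rng = np.random.default_rng(0)
bins = rng.integers(0, 256, size=1000)
grads = rng.normal(size=1000)
expected = np.zeros(256)
np.add.at(expected, bins, grads)
assert np.allclose(build_histogram(bins, grads), expected)
```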

/cc @NicolasHug @jeremiedbb @adrinjalali @glemaitre @pierreglaser

@ogrisel (Member, Author) commented Jul 12, 2019

I think the NUMA-aware optimization is lower priority than the np.zeros and small-node issues, since the NUMA problem likely affects a significantly smaller number of users.

@NicolasHug (Member)

Thanks a lot for the report, I'll look into it.

A very easy partial fix is to replace zeros by empty in compute_histograms_subtraction. That wouldn't work for compute_histograms_brute though, since there we still need to initialize the arrays to zero.
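The subtraction trick referenced here is safe with an uninitialized buffer because every bin is overwritten: the child histogram is the parent's histogram minus the sibling's. A simplified sketch (not the actual scikit-learn code):

```python
import numpy as np

def histograms_subtraction(hist_parent, hist_sibling):
    # Every entry of `out` is written by np.subtract, so an
    # uninitialized np.empty buffer needs no zero-fill here.
    out = np.empty_like(hist_parent)
    np.subtract(hist_parent, hist_sibling, out=out)
    return out

parent = np.arange(10, dtype=np.float64)
sibling = np.ones(10)
assert np.allclose(histograms_subtraction(parent, sibling), parent - 1.0)
```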

@ogrisel (Member, Author) commented Jul 15, 2019

we still need to initialize the arrays to zero.

The arrays can be initialized to zero in parallel (e.g. using memset, which does not need the GIL) inside the prange threads using a with nogil, parallel(): context manager: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html#cython.parallel.parallel
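In plain Python the same chunk-wise idea can be sketched as follows (an illustrative sketch only; the real code would use memset inside a Cython prange). Each worker zeroes its own contiguous slice, and NumPy releases the GIL during the fill:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_zero_fill(arr, n_threads=4):
    # Split the row range into one contiguous chunk per thread;
    # each thread memsets its own slice, mirroring a nogil/prange loop.
    chunks = np.array_split(np.arange(arr.shape[0]), n_threads)
    def fill(idx):
        arr[idx[0]:idx[-1] + 1] = 0
    with ThreadPoolExecutor(n_threads) as ex:
        list(ex.map(fill, chunks))

hist = np.empty(1_000_000)
parallel_zero_fill(hist)
assert not hist.any()  # fully zeroed
```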

@SmirnovEgorRu

@ogrisel @NicolasHug I have prepared a PR replacing np.zeros: #14380

@NicolasHug (Member)

The arrays can be initialized to zero in parallel (e.g. using memset)

Yes of course. Though unless we have tons of features, since max_bins is capped at 255, I don't think we need to do that in parallel. I think LightGBM has a heuristic rule for that.


On top of initialization, we can start looking at memory allocation. LightGBM uses an LRU cache for the histograms and we could too. A simple version would be to re-use the histograms of any parent whose children have both been split (split = split_info is computed).
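A minimal sketch of this re-use idea, as a plain free-list pool rather than a full LRU cache (all names here are hypothetical, not scikit-learn or LightGBM APIs):

```python
import numpy as np

class HistogramPool:
    """Recycle histogram buffers instead of reallocating one per node."""

    def __init__(self, n_features, n_bins=256):
        self.shape = (n_features, n_bins)
        self.free = []  # buffers whose owning node has been fully split

    def acquire(self):
        # Reuse a released buffer when one is available, else allocate.
        return self.free.pop() if self.free else np.empty(self.shape)

    def release(self, hist):
        # Called once both children of a node have their split_info,
        # i.e. the parent's histogram is no longer needed.
        self.free.append(hist)

pool = HistogramPool(n_features=28)
h1 = pool.acquire()
pool.release(h1)
h2 = pool.acquire()  # recycles the same buffer, no new allocation
assert h2 is h1
```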

@NicolasHug (Member)

Scikit-learn’s code spends too much time in np.zeros()

@SmirnovEgorRu may I ask how you came to this conclusion? What tools did you use?

I replaced np.zeros() with np.empty() and scaling improved; at the maximum number of threads the performance improvement was around 30%.

Did you also make sure to initialize the arrays to zero when compute_histograms_brute is called? Otherwise the faster training time might just be due to the histogram values being essentially random and the split points being wrong.

@thomasjpfan (Member) commented Aug 15, 2019

For reference here are the times (in seconds) I observed for scaling on a 12 physical core cpu:

OMP_NUM_THREADS=$i python benchmarks/bench_hist_gradient_boosting_higgsboson.py \
    --n-leaf-nodes 255 --n-trees 100
model     threads  mean (s)    std
sklearn   1        196.556262  0.671263
sklearn   2        114.000818  0.401527
sklearn   3        80.522343   0.366852
sklearn   4        67.179532   2.978047
sklearn   6        56.606792   1.212810
sklearn   8        50.620296   1.627436
sklearn   10       46.861531   1.008298
sklearn   12       45.162246   1.421321
lightgbm  1        219.747069  0.904041
lightgbm  2        110.890414  0.073561
lightgbm  3        80.927853   0.354567
lightgbm  4        64.754146   0.960912
lightgbm  6        53.909099   1.623050
lightgbm  8        49.128008   1.366232
lightgbm  10       42.063184   0.850042
lightgbm  12       41.903658   1.451349
xgboost   1        243.749912  1.228368
xgboost   2        140.411577  8.560087
xgboost   3        108.580659  3.343375
xgboost   4        90.026446   0.600769
xgboost   6        70.359140   1.018627
xgboost   8        58.891632   0.557022
xgboost   10       51.627672   0.443587
xgboost   12       48.580013   0.293499

Versions:

lightgbm==2.2.1
xgboost==0.90
scikit-learn==0.21.2

Here is the raw data (in tidy format).

@amueller (Member)

[image: benchmark plot]

I hate tidy data

@amueller (Member) commented Aug 15, 2019

[image: speedup plot]

y is 200/mean, so it's a speedup over sklearn with a single core (approximately)

@SmirnovEgorRu

@thomasjpfan the current master of XGBoost contains additional optimizations, 2-3x faster and more vs. the 0.90 version depending on the number of cores and the algorithm parameters (see the PR with the optimizations).
Could you please redo your measurements with the latest version?

@NicolasHug (Member)

@SmirnovEgorRu I guess the main concern in this issue is the scalability of our implementation w.r.t the number of threads, not how fast it is compared to XGBoost.

@SmirnovEgorRu

@NicolasHug I mean that it is possible to look at the optimizations that have been done in XGBoost recently and port some of them to scikit-learn.

@NicolasHug (Member)

oh ok thanks for the heads up. Is there a reasonably detailed summary of the optimizations that you implemented? Or is reading the code our only hope here ^^ ?

@SmirnovEgorRu

@NicolasHug some information is available in PRs ( dmlc/xgboost#4529, dmlc/xgboost#4433, dmlc/xgboost#4278, dmlc/xgboost#4310, dmlc/xgboost#3957) and here https://github.com/SmirnovEgorRu/xgboost-fast-hist-perf-lab
Also, I'm planning to publish a blog post about this later.

@NicolasHug (Member)

Thanks, definitely ping us when the blog post is ready please :)

@thomasjpfan (Member)

@SmirnovEgorRu On dmlc/xgboost@53d4272 here is the same benchmark:

model    threads  mean (s)    std
xgboost  1        198.067337  1.851916
xgboost  2        106.992374  2.691139
xgboost  3        75.338827   1.414281
xgboost  4        65.966215   0.719587
xgboost  6        49.462472   0.591497
xgboost  8        42.244526   0.211436
xgboost  10       39.297977   0.313975
xgboost  12       36.551886   0.111309

Data that is still tidy

@amueller (Member)

@thomasjpfan why do you hate me? (also I find plots easier to read than comparing lists that are far apart and have 6 decimals)

@Sandy4321

What is GBDT, and what is histogram-based GBDT? Thanks.

@Sandy4321

OK, I see: GBDT is GBM. All clear now.

@ogrisel (Member, Author) commented Aug 29, 2019

@Sandy4321 GBDT stands for Gradient Boosted Decision Trees, which is basically GBM (Gradient Boosting Machines) with a decision tree as the weak learner.

@ogrisel (Member, Author) commented Aug 29, 2019

@NicolasHug @thomasjpfan can you run lscpu to check whether this is a single-socket processor with 12 cores or 2 sockets with 6 cores each (and therefore 2 NUMA nodes)?

Based on your scalability curves I would vote for a single-socket CPU, but I would like confirmation.

@Sandy4321

GBDT stands for Gradient Boosted Decision Trees, which is basically GBM (Gradient Boosting Machines) with a decision tree as the weak learner.

I see thanks
and may you share link to read about algorithm of Histogram-based GBDT
my guess it is Histogram-based split?

@thomasjpfan (Member)

@ogrisel Single socket

@thomasjpfan (Member)

@ogrisel Have you experimented with OMP_PROC_BIND?: https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html

@NicolasHug (Member) commented Nov 18, 2019

I ran some benchmarks on a machine with 2 Intel Xeon E5-2620 CPUs @ 2.10GHz (8 cores/16 threads each), so 2 NUMA nodes.

TL;DR: The scikit-learn implementation scales just as well (or rather, as badly) as LightGBM.

python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-leaf-nodes 255 --n-trees 200 --lightgbm

[image: benchmark plot]

threads  hist building  split finding  splitting  predicting  total sklearn  total lightgbm
4        81.71          1.844          15.025     6.989       167.46         154.8
8        48.849         1.872          9.978      5.08        105.962        96.558
12       39.798         1.801          8.478      4.307       88.005         76.316
16       33.573         1.823          8.413      4.336       77.661         64.42
20       42.688         2.458          9.994      4.415       90.873         71.924
24       37.815         2.431          9.299      3.83        83.259         68.586
28       32.452         2.716          9.8        3.707       76.601         65.821
32       32.919         3.458          10.48      3.542       78.742         68.538

A few observations:

  • LightGBM is faster
  • Both sklearn and lightgbm plateau after 16 threads. They both follow the same variations. The scalability of both libraries is comparable.
  • The variations are dictated by the histogram computation. This makes sense, since this is clearly the bottleneck.
  • Split finding gets slower as the number of threads increases. This might be because there are not many features in the Higgs boson dataset? It's marginal anyway.

I did observe the CPUs only being used at around 90% for the sklearn parts. This could explain why lightgbm is faster.

Would appreciate your thoughts @SmirnovEgorRu @ogrisel @thomasjpfan :)

@SmirnovEgorRu

@NicolasHug, the blog post I mentioned above is now available: https://software.intel.com/en-us/download/parallel-universe-magazine-issue-38-august-2019
I hope you will find something interesting :)

@NicolasHug (Member)

Very interesting, thanks!

If I understand correctly (please correct me if I'm wrong!), it seems that the main difference between our implementations is the histogram computation.

In scikit-learn each feature histogram is built independently (one thread per feature). For each feature, we loop through all the samples at the current node.

In contrast, XGBoost does not compute each feature histogram independently. Instead it loops over the samples and computes "partial histograms" (which I assume are of size n_features * n_bins) that are then aggregated together in a typical map/reduce procedure. The multi-threading is sample-wise, not feature-wise as in sklearn.

For that reason it makes sense for XGBoost to use a row-major (C-contiguous) bin matrix, while we are using a column-major (F-contiguous) one.
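The two strategies described above can be sketched as follows (single-gradient histograms for brevity; all names are illustrative, not either library's actual code). Feature-wise: one thread per feature column. Sample-wise: each "thread" accumulates a partial (n_features, n_bins) histogram over its block of rows, and the partials are summed at the end:

```python
import numpy as np

def hist_feature_wise(X_binned, grads, n_bins=256):
    # One independent histogram per feature; in the real code each
    # loop iteration would be handled by a separate thread.
    n_features = X_binned.shape[1]
    hists = np.zeros((n_features, n_bins))
    for f in range(n_features):
        np.add.at(hists[f], X_binned[:, f], grads)
    return hists

def hist_sample_wise(X_binned, grads, n_bins=256, n_threads=4):
    # Each "thread" builds a partial histogram over its row block,
    # then the partials are reduced (map/reduce over samples).
    n_features = X_binned.shape[1]
    partials = np.zeros((n_threads, n_features, n_bins))
    row_blocks = np.array_split(np.arange(len(grads)), n_threads)
    for t, rows in enumerate(row_blocks):
        for f in range(n_features):
            np.add.at(partials[t, f], X_binned[rows, f], grads[rows])
    return partials.sum(axis=0)

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(500, 3))
g = rng.normal(size=500)
# Both strategies produce identical histograms.
assert np.allclose(hist_feature_wise(X, g), hist_sample_wise(X, g))
```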

CC @ogrisel @adrinjalali who might be interested in that sort of details

@ogrisel (Member, Author) commented Nov 23, 2019

Thanks. Indeed this is very interesting. /cc @jeremiedbb @glemaitre who might also be interested in this.

@ogrisel (Member, Author) commented Nov 23, 2019

@NicolasHug in your benchmarks, how many physical cores (without hyperthreads) do you have on the host machine? If it's 16, then it's normal to reach a plateau at 16 openmp threads. But indeed the problem is not as bad as I remember. Still it would be interesting to re-benchmark sklearn on Higgs vs the latest xgboost master on a machine with many cores.

Edit: indeed you have 16 physical cores. I had not read the beginning of your comment.

@NicolasHug (Member)

@ogrisel my knowledge of CPUs is quite limited. Could you explain why it's normal to plateau at 16?

@ogrisel (Member, Author) commented Jan 3, 2020

Hyperthreaded cores are fake cores, and quite often they do not provide any speed-up for compute-intensive algorithms. So I would only run benchmarks with n_threads <= n_physical_cores instead of n_threads <= n_logical_cores. It does not hurt to measure beyond the physical core limit, but I would not expect miracles from hyperthreads.

@NicolasHug (Member)

For ref, the overhead induced by using np.zeros to initialize histograms was dealt with in #18341. Benchmarks above are (likely) outdated.

Some scalability issues w.r.t number of threads remain though -- we're still not as fast as LightGBM.

@NicolasHug (Member)

Going back to #14306 (comment), from microsoft/LightGBM#2791 (comment): like XGBoost, LightGBM 3.0 has also implemented sample-wise parallelism to build histograms, instead of feature-wise parallelism. This seems to be preferable for small numbers of threads and features.

@ogrisel (Member, Author) commented Sep 15, 2020

Based on #18382, the scalability profile of scikit-learn is not too bad on large problems; on small problems it is bad when n_threads > n_physical_cores. As hyperthreads are never useful here, it would be good to improve joblib / loky to detect the number of physical cores and use that as the default number of OpenMP threads in HGBRT. Work on this is being tracked in joblib/loky#223.

As @NicolasHug analyzed in #18382 (comment), we might also want to review the OpenMP pragmas used by LightGBM to see if there are things we missed to improve speed in critical parallel sections on smaller problems.

Finally, LightGBM is indeed faster even in sequential mode. This might be related to the LightGBM 3.0 perf improvements (#14306 (comment)), but it is unrelated to multicore scalability.

@ogrisel (Member, Author) commented Apr 6, 2023

I just merged #26082 to use only as many threads as physical cores by default. This should improve things a bit (at least avoid some catastrophic slow-downs when fitting many HGBDT models on small datasets on machines with many cores and Hyper-Threading / SMT enabled), but it is not a complete fix.
