Multicore scalability of the Histogram-based GBDT #14306

ogrisel opened this issue Jul 12, 2019 · 35 comments

@ogrisel (Member) commented Jul 12, 2019

I observed that our OpenMP based GBDT does not scale well beyond 4 or 8 threads.

@SmirnovEgorRu, who already contributed scalability improvements to XGBoost, sent me the following analysis by email, reproduced here with his permission:

I had an initial look at the histogram building functions and observed a few things:

Scikit-learn’s code spends too much time in np.zeros(). It is used to create histograms and fill them with zeros. In theory this is a cheap operation, but it is sequential and not implemented in native code. I replaced np.zeros() with np.empty() and scaling improved; at the maximum number of threads the performance improvement was around 30%.

For deep decision-tree nodes, the number of rows can be very small (< 1000). In this case creating a parallel loop brings more overhead than benefit. A similar issue was covered in hcho3/xgboost-fast-hist-perf-lab#15; to understand the background of the issue you can read the README in that repository. I hope you will find something interesting for you there.

The link above also discusses the performance impact of float <-> double conversions. I looked at the assembly of scikit-learn’s histogram building and observed such conversions too. I haven’t tested it, but removing these conversions could help.

Excluding the np.zeros() issue, histogram building scales okay as long as only one processor is used. On multi-socket (NUMA) systems it scales very poorly.

Here I can recommend:

  • Split the bin matrix into <number of NUMA nodes> parts by rows and allocate memory on each NUMA node separately.

  • Pin threads to NUMA nodes. Each thread should only use the part of the bin matrix allocated on its own NUMA node.
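The np.zeros / np.empty point above can be illustrated with a small sketch (illustrative names, not scikit-learn's actual internals): np.empty skips the sequential zero-fill, but the buffer then holds garbage and must be cleared explicitly before accumulating, otherwise the histogram is wrong.

```python
import numpy as np

def build_histogram(binned_feature, gradients, n_bins=256):
    # np.empty avoids the sequential zero-fill done by np.zeros,
    # but the buffer contains garbage and MUST be cleared before
    # accumulation (in real code this fill can happen in parallel).
    hist = np.empty(n_bins, dtype=np.float64)
    hist[:] = 0.0  # explicit, memset-like zero-fill
    np.add.at(hist, binned_feature, gradients)  # unbuffered accumulation
    return hist

rng = np.random.default_rng(0)
bins = rng.integers(0, 256, size=1000)
grads = rng.normal(size=1000)
expected = np.zeros(256)
np.add.at(expected, bins, grads)
assert np.allclose(build_histogram(bins, grads), expected)
```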

/cc @NicolasHug @jeremiedbb @adrinjalali @glemaitre @pierreglaser

@ogrisel (Member, Author) commented Jul 12, 2019

I think the NUMA-aware optimization is lower priority than the np.zeros and small-node issues, since the NUMA problem likely affects a significantly smaller number of users.

@NicolasHug (Member)

Thanks a lot for the report, I'll look into it.

A very easy partial fix is to replace zeros by empty in compute_histograms_subtraction. That wouldn't work for compute_histograms_brute though, since there we still need to initialize the arrays to zero.
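The subtraction trick referenced here is safe with an uninitialized buffer because every bin is overwritten: the child histogram is the parent's histogram minus the sibling's. A simplified sketch (not the actual scikit-learn code):

```python
import numpy as np

def histograms_subtraction(hist_parent, hist_sibling):
    # Every entry of `out` is written by np.subtract, so an
    # uninitialized np.empty buffer needs no zero-fill here.
    out = np.empty_like(hist_parent)
    np.subtract(hist_parent, hist_sibling, out=out)
    return out

parent = np.arange(10, dtype=np.float64)
sibling = np.ones(10)
assert np.allclose(histograms_subtraction(parent, sibling), parent - 1.0)
```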

@ogrisel (Member, Author) commented Jul 15, 2019

we still need to initialize the arrays to zero.

The arrays can be initialized to zero in parallel (e.g. using memset, which does not need the GIL) inside the prange threads using a with nogil, parallel(): context manager: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html#cython.parallel.parallel
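In plain Python the same chunk-wise idea can be sketched as follows (an illustrative sketch only; the real code would use memset inside a Cython prange). Each worker zeroes its own contiguous slice, and NumPy releases the GIL during the fill:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_zero_fill(arr, n_threads=4):
    # Split the row range into one contiguous chunk per thread;
    # each thread memsets its own slice, mirroring a nogil/prange loop.
    chunks = np.array_split(np.arange(arr.shape[0]), n_threads)
    def fill(idx):
        arr[idx[0]:idx[-1] + 1] = 0
    with ThreadPoolExecutor(n_threads) as ex:
        list(ex.map(fill, chunks))

hist = np.empty(1_000_000)
parallel_zero_fill(hist)
assert not hist.any()  # fully zeroed
```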

@SmirnovEgorRu

@ogrisel @NicolasHug I have prepared a PR replacing np.zeros: #14380

@NicolasHug (Member)

The arrays can be initialized to zero in parallel (e.g. using memset)

Yes of course. Though unless we have tons of features, since max_bins is capped at 255, I don't think we need to do that in parallel. I think LightGBM has a heuristic rule for that.


On top of initialization, we can start looking at memory allocation. LightGBM uses an LRU cache for the histograms and we could too. A simple version would be to re-use the histograms of any parent whose children have both been split (split = split_info is computed).
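A minimal sketch of this re-use idea, as a plain free-list pool rather than a full LRU cache (all names here are hypothetical, not scikit-learn or LightGBM APIs):

```python
import numpy as np

class HistogramPool:
    """Recycle histogram buffers instead of reallocating one per node."""

    def __init__(self, n_features, n_bins=256):
        self.shape = (n_features, n_bins)
        self.free = []  # buffers whose owning node has been fully split

    def acquire(self):
        # Reuse a released buffer when one is available, else allocate.
        return self.free.pop() if self.free else np.empty(self.shape)

    def release(self, hist):
        # Called once both children of a node have their split_info,
        # i.e. the parent's histogram is no longer needed.
        self.free.append(hist)

pool = HistogramPool(n_features=28)
h1 = pool.acquire()
pool.release(h1)
h2 = pool.acquire()  # recycles the same buffer, no new allocation
assert h2 is h1
```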

@NicolasHug (Member)

Scikit-learn’s code spends too much time in np.zeros()

@SmirnovEgorRu may I ask how you came to this conclusion? What tools did you use?

I replaced np.zeros() with np.empty() and scaling improved; at the maximum number of threads the performance improvement was around 30%.

Did you also make sure to initialize the arrays to zero when compute_histograms_brute is called? Otherwise the faster training time might just be due to the histogram values being essentially random and the split points being wrong.

@thomasjpfan (Member) commented Aug 15, 2019

For reference here are the times (in seconds) I observed for scaling on a 12 physical core cpu:

OMP_NUM_THREADS=$i python benchmarks/bench_hist_gradient_boosting_higgsboson.py \
    --n-leaf-nodes 255 --n-trees 100
model     threads  mean (s)    std
sklearn   1        196.556262  0.671263
sklearn   2        114.000818  0.401527
sklearn   3        80.522343   0.366852
sklearn   4        67.179532   2.978047
sklearn   6        56.606792   1.212810
sklearn   8        50.620296   1.627436
sklearn   10       46.861531   1.008298
sklearn   12       45.162246   1.421321
lightgbm  1        219.747069  0.904041
lightgbm  2        110.890414  0.073561
lightgbm  3        80.927853   0.354567
lightgbm  4        64.754146   0.960912
lightgbm  6        53.909099   1.623050
lightgbm  8        49.128008   1.366232
lightgbm  10       42.063184   0.850042
lightgbm  12       41.903658   1.451349
xgboost   1        243.749912  1.228368
xgboost   2        140.411577  8.560087
xgboost   3        108.580659  3.343375
xgboost   4        90.026446   0.600769
xgboost   6        70.359140   1.018627
xgboost   8        58.891632   0.557022
xgboost   10       51.627672   0.443587
xgboost   12       48.580013   0.293499

Versions:

lightgbm==2.2.1
xgboost==0.90
scikit-learn==0.21.2

Here is the raw data (in tidy format).

@amueller (Member)

[image: benchmark plot]

I hate tidy data

@amueller (Member) commented Aug 15, 2019

[image: speedup plot]

y is 200/mean, so it's a speedup over sklearn with a single core (approximately)

@SmirnovEgorRu

@thomasjpfan the current master of XGBoost contains additional optimizations, 2-3x faster and more vs. the 0.90 version depending on the number of cores and the algorithm parameters (see the PR with the optimizations).
Could you please redo your measurements with the latest version?

@NicolasHug (Member)

@SmirnovEgorRu I guess the main concern in this issue is the scalability of our implementation w.r.t the number of threads, not how fast it is compared to XGBoost.

@SmirnovEgorRu

@NicolasHug I mean that it is possible to look at the optimizations that have been done in XGBoost recently and port some of them to scikit-learn.

@NicolasHug (Member)

oh ok thanks for the heads up. Is there a reasonably detailed summary of the optimizations that you implemented? Or is reading the code our only hope here ^^ ?

@SmirnovEgorRu

@NicolasHug some information is available in PRs ( dmlc/xgboost#4529, dmlc/xgboost#4433, dmlc/xgboost#4278, dmlc/xgboost#4310, dmlc/xgboost#3957) and here https://github.com/SmirnovEgorRu/xgboost-fast-hist-perf-lab
Also, I'm planning to publish a blog post about this later.

@NicolasHug (Member)

Thanks, definitely ping us when the blog post is ready please :)

@thomasjpfan (Member)

@SmirnovEgorRu On dmlc/xgboost@53d4272 here is the same benchmark:

model    threads  mean (s)    std
xgboost  1        198.067337  1.851916
xgboost  2        106.992374  2.691139
xgboost  3        75.338827   1.414281
xgboost  4        65.966215   0.719587
xgboost  6        49.462472   0.591497
xgboost  8        42.244526   0.211436
xgboost  10       39.297977   0.313975
xgboost  12       36.551886   0.111309

Data that is still tidy

@amueller (Member)

@thomasjpfan why do you hate me? (also I find plots easier to read than comparing lists that are far apart and have 6 decimals)

@Sandy4321

What is GBDT, and what is histogram-based GBDT? Thanks.

@Sandy4321

OK, I see: GBDT is GBM. All clear now.

@ogrisel (Member, Author) commented Aug 29, 2019

@Sandy4321 GBDT stands for Gradient Boosted Decision Trees, which is basically GBM (Gradient Boosting Machines) with a decision tree as the weak learner.

@ogrisel (Member, Author) commented Aug 29, 2019

@NicolasHug @thomasjpfan can you run lscpu to check whether this is a single-socket processor with 12 cores or 2 sockets with 6 cores each (and therefore 2 NUMA nodes)?

Based on your scalability curves I would vote for a single-socket CPU, but I would like confirmation.

@Sandy4321

GBDT stands for Gradient Boosted Decision Trees, which is basically GBM (Gradient Boosting Machines) with a decision tree as the weak learner.

I see thanks
and may you share link to read about algorithm of Histogram-based GBDT
my guess it is Histogram-based split?

@thomasjpfan (Member)

@ogrisel Single socket

@thomasjpfan (Member)

@ogrisel Have you experimented with OMP_PROC_BIND?: https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fPROC_005fBIND.html

@NicolasHug (Member) commented Nov 18, 2019

I ran some benchmarks on a machine with 2 Intel Xeon E5-2620 CPUs @ 2.10GHz (8 cores/16 threads each), so 2 NUMA nodes.

TL;DR: The scikit-learn implementation scales just as well (or rather, as badly) as LightGBM.

python benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-leaf-nodes 255 --n-trees 200 --lightgbm

[image: benchmark plot]

threads  hist building  split finding  splitting  predicting  total sklearn  total lightgbm
4        81.71          1.844          15.025     6.989       167.46         154.8
8        48.849         1.872          9.978      5.08        105.962        96.558
12       39.798         1.801          8.478      4.307       88.005         76.316
16       33.573         1.823          8.413      4.336       77.661         64.42
20       42.688         2.458          9.994      4.415       90.873         71.924
24       37.815         2.431          9.299      3.83        83.259         68.586
28       32.452         2.716          9.8        3.707       76.601         65.821
32       32.919         3.458          10.48      3.542       78.742         68.538

A few observations:

  • LightGBM is faster
  • Both sklearn and lightgbm plateau after 16 threads. They both follow the same variations. The scalability of both libraries is comparable.
  • The variations are dictated by the histogram computation. This makes sense, since this is clearly the bottleneck.
  • Split finding gets slower as the number of threads increases. This might be because there are not many features in the Higgs boson dataset? It's marginal anyway.

I did observe the CPUs only being used at around 90% for the sklearn parts. This could explain why lightgbm is faster.

Would appreciate your thoughts @SmirnovEgorRu @ogrisel @thomasjpfan :)

@SmirnovEgorRu

@NicolasHug, the blog post I mentioned above is now available: https://software.intel.com/en-us/download/parallel-universe-magazine-issue-38-august-2019
I hope you will find something interesting :)

@NicolasHug (Member)

Very interesting, thanks!

If I understand correctly (please correct me if I'm wrong!), it seems that the main difference between our implementations is the histogram computation.

In scikit-learn each feature histogram is built independently (one thread per feature). For each feature, we loop through all the samples at the current node.

In contrast, XGBoost does not compute each feature histogram independently. Instead it loops over the samples and computes "partial histograms" (which I assume are of size n_features * n_bins) that are then aggregated together in a typical map/reduce procedure. The multi-threading is sample-wise, not feature-wise as in sklearn.

For that reason it makes sense for XGBoost to use a row-major (C-contiguous) bin matrix, while we are using a column-major (F-contiguous) one.
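The two strategies described above can be sketched as follows (single-gradient histograms for brevity; all names are illustrative, not either library's actual code). Feature-wise: one thread per feature column. Sample-wise: each "thread" accumulates a partial (n_features, n_bins) histogram over its block of rows, and the partials are summed at the end:

```python
import numpy as np

def hist_feature_wise(X_binned, grads, n_bins=256):
    # One independent histogram per feature; in the real code each
    # loop iteration would be handled by a separate thread.
    n_features = X_binned.shape[1]
    hists = np.zeros((n_features, n_bins))
    for f in range(n_features):
        np.add.at(hists[f], X_binned[:, f], grads)
    return hists

def hist_sample_wise(X_binned, grads, n_bins=256, n_threads=4):
    # Each "thread" builds a partial histogram over its row block,
    # then the partials are reduced (map/reduce over samples).
    n_features = X_binned.shape[1]
    partials = np.zeros((n_threads, n_features, n_bins))
    row_blocks = np.array_split(np.arange(len(grads)), n_threads)
    for t, rows in enumerate(row_blocks):
        for f in range(n_features):
            np.add.at(partials[t, f], X_binned[rows, f], grads[rows])
    return partials.sum(axis=0)

rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(500, 3))
g = rng.normal(size=500)
# Both strategies produce identical histograms.
assert np.allclose(hist_feature_wise(X, g), hist_sample_wise(X, g))
```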

CC @ogrisel @adrinjalali who might be interested in that sort of details

@ogrisel (Member, Author) commented Nov 23, 2019

Thanks. Indeed this is very interesting. /cc @jeremiedbb @glemaitre who might also be interested in this.

@ogrisel (Member, Author) commented Nov 23, 2019

@NicolasHug in your benchmarks, how many physical cores (without hyperthreads) do you have on the host machine? If it's 16, then it's normal to reach a plateau at 16 openmp threads. But indeed the problem is not as bad as I remember. Still it would be interesting to re-benchmark sklearn on Higgs vs the latest xgboost master on a machine with many cores.

Edit: indeed you have 16 physical cores. I had not read the beginning of your comment.

@NicolasHug (Member)

@ogrisel my knowledge of CPUs is quite limited. Could you explain why it's normal to plateau at 16?

@ogrisel (Member, Author) commented Jan 3, 2020

Hyperthreaded cores are fake cores, and quite often they do not provide any speed-up for compute-intensive algorithms. So I would only run benchmarks with n_threads <= n_physical_cores instead of n_threads <= n_logical_cores. It does not hurt to measure beyond the physical core limit, but I would not expect miracles from hyperthreads.

@NicolasHug (Member)

For ref, the overhead induced by using np.zeros to initialize histograms was dealt with in #18341. Benchmarks above are (likely) outdated.

Some scalability issues w.r.t number of threads remain though -- we're still not as fast as LightGBM.

@NicolasHug (Member)

Going back to #14306 (comment), from microsoft/LightGBM#2791 (comment): like XGBoost, LightGBM 3.0 has also implemented sample-wise parallelism to build histograms, instead of feature-wise parallelism. This seems to be preferable for small numbers of threads and features.

@ogrisel (Member, Author) commented Sep 15, 2020

Based on #18382, the scalability profile of scikit-learn is not too bad on large problems; on small problems it is bad when n_threads > n_physical_cores. As hyperthreads are never useful here, it would be good to improve joblib / loky to detect the number of physical cores and use that as the default number of OpenMP threads in HGBRT. Work on this is being tracked in joblib/loky#223.

As @NicolasHug analyzed in #18382 (comment), we might also want to review the OpenMP pragmas used by LightGBM to see if there are things we missed to improve speed in critical parallel sections on smaller problems.

Finally, LightGBM is indeed faster even in sequential mode. This might be related to the LightGBM 3.0 perf improvements (#14306 (comment)), but it is unrelated to multicore scalability.

@ogrisel (Member, Author) commented Apr 6, 2023

I just merged #26082 to use only as many threads as physical cores by default. This should improve things a bit (at least avoid some catastrophic slow-downs when fitting many HGBDT models on small datasets on machines with many cores and Hyper-Threading / SMT enabled), but it is not a complete fix.
