
BENCH threading scalability of Hist Gradient Boosting #18382


Merged

merged 1 commit into scikit-learn:master on Sep 15, 2020

Conversation

ogrisel
Member

@ogrisel ogrisel commented Sep 11, 2020

I adapted the benchmark script to make a version that helps us understand the performance profile w.r.t. the number of threads, to tackle #14306.

I will post some results below in comments.

ping @NicolasHug @amueller @szilard @SmirnovEgorRu @jeremiedbb you might be interested in some of the results below.
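
For reference, a minimal sketch of the kind of measurement involved (not the actual benchmark script; it assumes threadpoolctl is installed to cap the OpenMP thread pool used by the estimator):

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from threadpoolctl import threadpool_limits

X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

for n_threads in [1, 2, 4, 8, 16, 32, 64]:
    # Cap the number of OpenMP threads used by the Cython/OpenMP code paths.
    with threadpool_limits(limits=n_threads, user_api="openmp"):
        clf = HistGradientBoostingClassifier(max_iter=10, random_state=0)
        tic = perf_counter()
        clf.fit(X, y)
        fit_duration = perf_counter() - tic
        tic = perf_counter()
        clf.predict(X)
        predict_duration = perf_counter() - tic
    print(f"{n_threads:>2} threads: fit {fit_duration:.3f}s, "
          f"predict {predict_duration:.3f}s")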

@ogrisel ogrisel changed the title BENCH threading scalabiity of Hist Gradient Boosting BENCH threading scalability of Hist Gradient Boosting Sep 11, 2020
@ogrisel
Member Author

ogrisel commented Sep 11, 2020

First, the effective hyper-parameters I used for the 4 libraries:

  • scikit-learn
{'early_stopping': False,
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'binary_crossentropy',
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 10,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': 0,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}
  • lightgbm
{'boost_from_average': True,
 'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'enable_bundle': False,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_depth': None,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_data_in_bin': 1,
 'min_split_gain': 0,
 'min_sum_hessian_in_leaf': 0.001,
 'n_estimators': 10,
 'n_jobs': -1,
 'num_leaves': 31,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'verbosity': -10}
  • xgboost
{'base_score': None,
 'booster': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': 'lossguide',
 'importance_type': 'gain',
 'interaction_constraints': None,
 'lambda': 0.0,
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_delta_step': None,
 'max_depth': 0,
 'max_leaves': 31,
 'min_child_weight': 0.001,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 10,
 'n_jobs': -1,
 'num_parallel_tree': None,
 'objective': 'reg:logistic',
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'scale_pos_weight': None,
 'silent': True,
 'subsample': None,
 'tree_method': 'hist',
 'validate_parameters': None,
 'verbosity': 0}
  • catboost
{'feature_border_type': 'Median',
 'iterations': 10,
 'leaf_estimation_method': 'Newton',
 'learning_rate': 0.1,
 'loss_function': 'Logloss',
 'max_bin': 255,
 'reg_lambda': 0.0,
 'verbose': False}
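
For reproducibility, here is a hedged sketch of how the key parameters above map onto each library's scikit-learn-style estimator constructor (abbreviated to the main parameters; it assumes the four Python packages are installed):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

sklearn_est = HistGradientBoostingClassifier(
    max_iter=10, learning_rate=0.1, max_leaf_nodes=31, max_bins=255,
    early_stopping=False, random_state=0)
lightgbm_est = LGBMClassifier(
    n_estimators=10, learning_rate=0.1, num_leaves=31, max_bin=255,
    objective='binary', n_jobs=-1)
xgboost_est = XGBClassifier(
    n_estimators=10, learning_rate=0.1, tree_method='hist',
    grow_policy='lossguide', max_leaves=31, max_bin=255, n_jobs=-1)
catboost_est = CatBoostClassifier(
    iterations=10, learning_rate=0.1, max_bin=255,
    loss_function='Logloss', verbose=False)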

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Run A:

Tall and narrow data (not enough features for big parallelism).

  • n_samples: 100k
  • n_features: 10
  • host: 36 physical cores, 72 hyper threads

[figure: hgbrt_threading_n_samples_100_000_n_features_10]

All libraries are quite fast; lightgbm is the fastest (as always), but scikit-learn suffers the most from over-subscription, especially when n_threads > n_physical_cores.

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Run B:

Short(er) and wide(r) data (enough features but small dataset).

  • n_samples: 10k
  • n_features: 100
  • host: 36 physical cores, 72 hyper threads

[figure: hgbrt_threading_n_samples_10_000_n_features_100]

Similar conclusion. Over-subscription is kind of catastrophic for scikit-learn on such a small problem.

Note that lightgbm and scikit-learn are significantly faster to predict than the others.

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Here is the lscpu output for the benchmark machine I used:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:            4
CPU MHz:             1000.048
BogoMIPS:            4600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Run C:

Big and wide enough problem to benefit from many cores:

  • n_samples: 1M
  • n_features: 100
  • host: 36 physical cores, 72 hyper threads

[figure: hgbrt_threading_n_samples_1_000_000_n_features_100]

  • scikit-learn and lightgbm scale similarly but lightgbm is uniformly faster.
  • scikit-learn still suffers from oversubscription when n_threads > n_physical_cores but much less than before.
  • xgboost thrashed more this time with n_threads > n_physical_cores.

Note that lightgbm and scikit-learn are significantly faster to predict than the others.

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Some conclusions for scikit-learn:

  • scalability is good enough, except when n_threads > n_physical_cores: we need to make loky able to detect whether SMT / HyperThreading is enabled and use the number of physical cores as the default number of threads, see "Possibility to ignore hyperthreading in loky.cpu_count" joblib/loky#223.
  • we might also want to cap the default number of threads to the number of features in the data (see the sketch after this list).
  • we might want to have a look at the chunked row-wise histogram computation of LightGBM 3 that could explain the better speed of lightgbm when the number of features is small.
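
A hypothetical sketch of that capping heuristic (using psutil here to count physical cores; the actual proposal is for loky/joblib to expose this):

import os

import psutil


def default_n_threads(n_features):
    # Prefer physical cores over SMT / HyperThreading siblings; fall back to
    # os.cpu_count() if psutil cannot determine the physical core count.
    n_physical_cores = psutil.cpu_count(logical=False) or os.cpu_count()
    # Hypothetical cap: no more threads than features. Note the caveat in the
    # review below: several code paths are parallelized over samples instead.
    return max(1, min(n_physical_cores, n_features))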

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

BTW, this script should return dicts for each run, wrap the results in a pandas df, and make it possible to store it to disk for later analysis, but I was too lazy to refactor :P
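
A minimal sketch of that refactoring (hypothetical column names), collecting one dict per run into a pandas DataFrame stored on disk:

import pandas as pd

records = []
for lib in ["sklearn", "lightgbm", "xgboost", "catboost"]:
    for n_threads in [1, 2, 4, 8, 16, 32, 64]:
        records.append({
            "library": lib,
            "n_threads": n_threads,
            "fit_duration": 0.0,      # placeholder: measured fit time (s)
            "predict_duration": 0.0,  # placeholder: measured predict time (s)
        })

results = pd.DataFrame(records)
results.to_csv("hgbt_threading_benchmark.csv", index=False)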

@NicolasHug
Member

Thanks @ogrisel for the benchmarks

we might also want to cap the default number of threads to the number of features in the data

I just want to note that not everything is parallelized over features: only bin_mapper.transform(), the histograms, and the split finding.

In contrast, the predictors are parallelized over samples, the split_indices() partition procedure dispatches omp_get_max_threads() threads over an array of size n_samples_at_node, and _update_raw_predictions() is parallelized over tree leaves. So capping the number of threads to the number of features might not always be a good thing for these procedures.

I think a good first step would be to look at each prange and compare against the corresponding LightGBM OMP pragma directives. I remember seeing some stuff being done (e.g. not parallelizing if n_samples < 1024, stuff like that). For example, I feel like split_indices() might be sensitive to over-subscription if n_samples_at_node is small.
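
To illustrate the kind of guard meant here, a hypothetical sketch in plain Python (not the actual Cython prange code; the 1024 threshold is only the figure recalled above):

MIN_SAMPLES_PER_THREAD = 1024  # assumed threshold, for illustration only


def effective_n_threads(n_samples_at_node, max_threads):
    # Run sequentially when the parallel overhead would dominate the work.
    if n_samples_at_node < MIN_SAMPLES_PER_THREAD:
        return 1
    # Otherwise, do not dispatch more threads than the work can keep busy.
    return min(max_threads, n_samples_at_node // MIN_SAMPLES_PER_THREAD)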

Member

@NicolasHug NicolasHug left a comment

LGTM


def get_estimator_and_data():
    if args.problem == 'classification':
        X, y = make_classification(args.n_samples * 2,
Member

Should this just be args.n_samples?

(I don't even remember why I did that on the older benchmark)

Member Author

train / test split?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 15, 2020

Let's merge; it's just a benchmark script for maintainers that cannot break the API, and we can always improve / refactor later if needed.

@ogrisel ogrisel merged commit 68fb4db into scikit-learn:master Sep 15, 2020
@ogrisel ogrisel deleted the bench-hgbrt-threading branch September 15, 2020 14:55
@SmirnovEgorRu

A few comments:

'n_estimators': 10

Usually I see a few hundred or even thousands of estimators for gradient boosting. Is 10 typical?
At least XGBoost has a large constant cost for the numpy -> DMatrix conversion (I suppose that for 10 iterations it will dominate, rather than building the trees themselves). In XGBoost <= 1.2 DMatrix creation is purely single-threaded. It is partially fixed in master now (dmlc/xgboost#5877), but the issue is still present.
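
For reference, a minimal sketch of how the DMatrix conversion overhead can be timed separately from the boosting rounds (assuming the xgboost native API; this is not the benchmark script):

from time import perf_counter

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(100_000, 100)
y = (rng.rand(100_000) > 0.5).astype(np.float64)

tic = perf_counter()
dtrain = xgb.DMatrix(X, label=y)  # numpy -> DMatrix conversion
print(f"DMatrix conversion: {perf_counter() - tic:.3f}s")

tic = perf_counter()
xgb.train({"tree_method": "hist", "objective": "reg:logistic"},
          dtrain, num_boost_round=10)
print(f"Boosting (10 rounds): {perf_counter() - tic:.3f}s")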

'grow_policy': 'lossguide'

Using lossguide for XGBoost is okay, but it would be nice to also test depthwise with an appropriate 'max_depth'; it has larger opportunities for threading.

What version of XGBoost did you use?

@SmirnovEgorRu

Here is a good additional analysis of scaling XGB and LGBM: dmlc/xgboost#3810 (comment)

@ogrisel
Member Author

ogrisel commented Sep 18, 2020

Thanks for the feedback @SmirnovEgorRu. I launched the benchmarks again with 100 trees on a different machine (because the previous one is being used by colleagues at the moment).

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping:              4
CPU MHz:               2200.257
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4399.87
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
scikit-learn
{'early_stopping': False,
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'binary_crossentropy',
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 100,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': 0,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}
lightgbm
{'boost_from_average': True,
 'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'enable_bundle': False,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_depth': None,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_data_in_bin': 1,
 'min_split_gain': 0,
 'min_sum_hessian_in_leaf': 0.001,
 'n_estimators': 100,
 'n_jobs': -1,
 'num_leaves': 31,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'verbosity': -10}
xgboost
{'base_score': None,
 'booster': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': 'lossguide',
 'importance_type': 'gain',
 'interaction_constraints': None,
 'lambda': 0.0,
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_delta_step': None,
 'max_depth': 0,
 'max_leaves': 31,
 'min_child_weight': 0.001,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': -1,
 'num_parallel_tree': None,
 'objective': 'reg:logistic',
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'scale_pos_weight': None,
 'silent': True,
 'subsample': None,
 'tree_method': 'hist',
 'validate_parameters': None,
 'verbosity': 0}
catboost
{'feature_border_type': 'Median',
 'iterations': 100,
 'leaf_estimation_method': 'Newton',
 'learning_rate': 0.1,
 'loss_function': 'Logloss',
 'max_bin': 255,
 'reg_lambda': 0.0,
 'verbose': False}
  • "big problem": 1M samples 100 features:

1M_100

  • "small wide" 10K 100 features:

10k_100

  • "small narrow":
    100k_10

@ogrisel
Member Author

ogrisel commented Sep 18, 2020

Conclusions:

  • xgboost is indeed more competitive with more trees.
  • scikit-learn (and others) can suffer from physical-core over-subscription (a memory-bandwidth-limited workload), although it is not as dramatic as with the previous machine (maybe different cache sizes and number of threads).
  • lightgbm is still the best :)

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020