
BENCH threading scalability of Hist Gradient Boosting #18382


Merged

merged 1 commit into scikit-learn:master on Sep 15, 2020

Conversation

ogrisel
Member

@ogrisel ogrisel commented Sep 11, 2020

I adapted the benchmark script to make a version that helps us understand the performance profile w.r.t. the number of threads, to tackle #14306.

I will post some results below in comments.

ping @NicolasHug @amueller @szilard @SmirnovEgorRu @jeremiedbb you might be interested in some of the results below.
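
For reference, a minimal sketch of the kind of measurement involved (not the actual benchmark script; it assumes threadpoolctl is installed to cap the OpenMP thread pool used by the estimator):

from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from threadpoolctl import threadpool_limits

X, y = make_classification(n_samples=100_000, n_features=10, random_state=0)

for n_threads in [1, 2, 4, 8, 16, 32, 64]:
    # Cap the number of OpenMP threads used by the Cython/OpenMP code paths.
    with threadpool_limits(limits=n_threads, user_api="openmp"):
        clf = HistGradientBoostingClassifier(max_iter=10, random_state=0)
        tic = perf_counter()
        clf.fit(X, y)
        fit_duration = perf_counter() - tic
        tic = perf_counter()
        clf.predict(X)
        predict_duration = perf_counter() - tic
    print(f"{n_threads:>2} threads: fit {fit_duration:.3f}s, "
          f"predict {predict_duration:.3f}s")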

@ogrisel ogrisel changed the title BENCH threading scalabiity of Hist Gradient Boosting BENCH threading scalability of Hist Gradient Boosting Sep 11, 2020
@ogrisel
Member Author

ogrisel commented Sep 11, 2020

First, the effective hyper-parameters I used for the 4 libraries:

  • scikit-learn
{'early_stopping': False,
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'binary_crossentropy',
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 10,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': 0,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}
  • lightgbm
{'boost_from_average': True,
 'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'enable_bundle': False,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_depth': None,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_data_in_bin': 1,
 'min_split_gain': 0,
 'min_sum_hessian_in_leaf': 0.001,
 'n_estimators': 10,
 'n_jobs': -1,
 'num_leaves': 31,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'verbosity': -10}
  • xgboost
{'base_score': None,
 'booster': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': 'lossguide',
 'importance_type': 'gain',
 'interaction_constraints': None,
 'lambda': 0.0,
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_delta_step': None,
 'max_depth': 0,
 'max_leaves': 31,
 'min_child_weight': 0.001,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 10,
 'n_jobs': -1,
 'num_parallel_tree': None,
 'objective': 'reg:logistic',
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'scale_pos_weight': None,
 'silent': True,
 'subsample': None,
 'tree_method': 'hist',
 'validate_parameters': None,
 'verbosity': 0}
  • catboost
{'feature_border_type': 'Median',
 'iterations': 10,
 'leaf_estimation_method': 'Newton',
 'learning_rate': 0.1,
 'loss_function': 'Logloss',
 'max_bin': 255,
 'reg_lambda': 0.0,
 'verbose': False}
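
For reproducibility, here is a hedged sketch of how the key parameters above map onto each library's scikit-learn-style estimator constructor (abbreviated to the main parameters; it assumes the four Python packages are installed):

from sklearn.experimental import enable_hist_gradient_boosting  # noqa
from sklearn.ensemble import HistGradientBoostingClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

sklearn_est = HistGradientBoostingClassifier(
    max_iter=10, learning_rate=0.1, max_leaf_nodes=31, max_bins=255,
    early_stopping=False, random_state=0)
lightgbm_est = LGBMClassifier(
    n_estimators=10, learning_rate=0.1, num_leaves=31, max_bin=255,
    objective='binary', n_jobs=-1)
xgboost_est = XGBClassifier(
    n_estimators=10, learning_rate=0.1, tree_method='hist',
    grow_policy='lossguide', max_leaves=31, max_bin=255, n_jobs=-1)
catboost_est = CatBoostClassifier(
    iterations=10, learning_rate=0.1, max_bin=255,
    loss_function='Logloss', verbose=False)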

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Run A:

Tall and narrow data (not enough features for big parallelism).

  • n_samples: 100k
  • n_features: 10
  • host: 36 physical cores, 72 hyper threads

[figure: hgbrt_threading_n_samples_100_000_n_features_10]

All libraries are quite fast; lightgbm is the fastest (as always), but scikit-learn suffers the most from over-subscription, especially when n_threads > n_physical_cores.

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Run B:

Short(er) and wide(r) data (enough features but small dataset).

  • n_samples: 10k
  • n_features: 100
  • host: 36 physical cores, 72 hyper threads

[figure: hgbrt_threading_n_samples_10_000_n_features_100]

Similar conclusion. Over-subscription is kind of catastrophic for scikit-learn on such a small problem.

Note that lightgbm and scikit-learn are significantly faster to predict than the others.

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Here is the lscpu output for the benchmark machine I used:

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              72
On-line CPU(s) list: 0-71
Thread(s) per core:  2
Core(s) per socket:  18
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
Stepping:            4
CPU MHz:             1000.048
BogoMIPS:            4600.00
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            25344K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke md_clear flush_l1d

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Run C:

Big and wide enough problem to benefit from many cores:

  • n_samples: 1M
  • n_features: 100
  • host: 36 physical cores, 72 hyper threads

[figure: hgbrt_threading_n_samples_1_000_000_n_features_100]

  • scikit-learn and lightgbm scale similarly but lightgbm is uniformly faster.
  • scikit-learn still suffers from oversubscription when n_threads > n_physical_cores but much less than before.
  • xgboost thrashed more this time with n_threads > n_physical_cores.

Note that lightgbm and scikit-learn are significantly faster to predict than the others.

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

Some conclusions for scikit-learn:

  • scalability is good enough, except when n_threads > n_physical_cores: we need to make loky able to detect whether SMT / HyperThreading is enabled and use the number of physical cores as the default number of threads, see "Possibility to ignore hyperthreading in loky.cpu_count" joblib/loky#223.
  • we might also want to cap the default number of threads to the number of features in the data (see the sketch after this list).
  • we might want to have a look at the chunked row-wise histogram computation of LightGBM 3 that could explain the better speed of lightgbm when the number of features is small.
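
A hypothetical sketch of that capping heuristic (using psutil here to count physical cores; the actual proposal is for loky/joblib to expose this):

import os

import psutil


def default_n_threads(n_features):
    # Prefer physical cores over SMT / HyperThreading siblings; fall back to
    # os.cpu_count() if psutil cannot determine the physical core count.
    n_physical_cores = psutil.cpu_count(logical=False) or os.cpu_count()
    # Hypothetical cap: no more threads than features. Note the caveat in the
    # review below: several code paths are parallelized over samples instead.
    return max(1, min(n_physical_cores, n_features))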

@ogrisel
Member Author

ogrisel commented Sep 11, 2020

BTW, this script should return dicts for each run, wrap the results in a pandas df, and make it possible to store it to disk for later analysis, but I was too lazy to refactor :P
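
A minimal sketch of that refactoring (hypothetical column names), collecting one dict per run into a pandas DataFrame stored on disk:

import pandas as pd

records = []
for lib in ["sklearn", "lightgbm", "xgboost", "catboost"]:
    for n_threads in [1, 2, 4, 8, 16, 32, 64]:
        records.append({
            "library": lib,
            "n_threads": n_threads,
            "fit_duration": 0.0,      # placeholder: measured fit time (s)
            "predict_duration": 0.0,  # placeholder: measured predict time (s)
        })

results = pd.DataFrame(records)
results.to_csv("hgbt_threading_benchmark.csv", index=False)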

@NicolasHug
Member

Thanks @ogrisel for the benchmarks

we might also want to cap the default number of threads to the number of features in the data

I just want to note that not everything is parallelized over features: only bin_mapper.transform(), the histograms, and the split finding.

In contrast, the predictors are parallelized over samples, the split_indices() partition procedure dispatches omp_get_max_threads() threads over an array of size n_samples_at_node, and _update_raw_predictions() is parallelized over tree leaves. So capping the number of threads to the number of features might not always be a good thing for these procedures.

I think a good first step would be to look at each prange and compare against the corresponding LightGBM OMP pragma directives. I remember seeing some stuff being done (e.g. not parallelizing if n_samples < 1024, stuff like that). For example, I feel like split_indices() might be sensitive to over-subscription if n_samples_at_node is small.
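
To illustrate the kind of guard meant here, a hypothetical sketch in plain Python (not the actual Cython prange code; the 1024 threshold is only the figure recalled above):

MIN_SAMPLES_PER_THREAD = 1024  # assumed threshold, for illustration only


def effective_n_threads(n_samples_at_node, max_threads):
    # Run sequentially when the parallel overhead would dominate the work.
    if n_samples_at_node < MIN_SAMPLES_PER_THREAD:
        return 1
    # Otherwise, do not dispatch more threads than the work can keep busy.
    return min(max_threads, n_samples_at_node // MIN_SAMPLES_PER_THREAD)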

Member

@NicolasHug NicolasHug left a comment

LGTM


def get_estimator_and_data():
    if args.problem == 'classification':
        X, y = make_classification(args.n_samples * 2,
Member

Should this just be args.n_samples?

(I don't even remember why I did that on the older benchmark)

Member Author

train / test split?

@ogrisel
Copy link
Member Author

ogrisel commented Sep 15, 2020

Let's merge; it's just a benchmark script for maintainers that cannot break the API, and we can always improve / refactor later if needed.

@ogrisel ogrisel merged commit 68fb4db into scikit-learn:master Sep 15, 2020
@ogrisel ogrisel deleted the bench-hgbrt-threading branch September 15, 2020 14:55
@SmirnovEgorRu

A few comments:

'n_estimators': 10

Usually I see a few hundred or even thousands of estimators for gradient boosting. Is 10 typical?
At least XGBoost has a large constant cost for the numpy -> DMatrix conversion (I suppose that for 10 iterations it will dominate, rather than building the trees themselves). In XGBoost <= 1.2 DMatrix creation is purely single-threaded. It is partially fixed in master now (dmlc/xgboost#5877), but the issue is still present.
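
For reference, a minimal sketch of how the DMatrix conversion overhead can be timed separately from the boosting rounds (assuming the xgboost native API; this is not the benchmark script):

from time import perf_counter

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.rand(100_000, 100)
y = (rng.rand(100_000) > 0.5).astype(np.float64)

tic = perf_counter()
dtrain = xgb.DMatrix(X, label=y)  # numpy -> DMatrix conversion
print(f"DMatrix conversion: {perf_counter() - tic:.3f}s")

tic = perf_counter()
xgb.train({"tree_method": "hist", "objective": "reg:logistic"},
          dtrain, num_boost_round=10)
print(f"Boosting (10 rounds): {perf_counter() - tic:.3f}s")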

'grow_policy': 'lossguide'

Using lossguide for XGBoost is okay, but it would be nice to also test depthwise with an appropriate 'max_depth'; it has larger opportunities for threading.

What version of XGBoost did you use?

@SmirnovEgorRu

Here is a good additional analysis of scaling XGB and LGBM: dmlc/xgboost#3810 (comment)

@ogrisel
Member Author

ogrisel commented Sep 18, 2020

Thanks for the feedback @SmirnovEgorRu. I launched the benchmarks again with 100 trees on a different machine (because the previous one is being used by colleagues at the moment).

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz
Stepping:              4
CPU MHz:               2200.257
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4399.87
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
scikit-learn
{'early_stopping': False,
 'l2_regularization': 0.0,
 'learning_rate': 0.1,
 'loss': 'binary_crossentropy',
 'max_bins': 255,
 'max_depth': None,
 'max_iter': 100,
 'max_leaf_nodes': 31,
 'min_samples_leaf': 20,
 'monotonic_cst': None,
 'n_iter_no_change': 10,
 'random_state': 0,
 'scoring': 'loss',
 'tol': 1e-07,
 'validation_fraction': 0.1,
 'verbose': 0,
 'warm_start': False}
lightgbm
{'boost_from_average': True,
 'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'enable_bundle': False,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_depth': None,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_data_in_bin': 1,
 'min_split_gain': 0,
 'min_sum_hessian_in_leaf': 0.001,
 'n_estimators': 100,
 'n_jobs': -1,
 'num_leaves': 31,
 'objective': 'binary',
 'random_state': None,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'silent': True,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0,
 'verbosity': -10}
xgboost
{'base_score': None,
 'booster': None,
 'colsample_bylevel': None,
 'colsample_bynode': None,
 'colsample_bytree': None,
 'gamma': None,
 'gpu_id': None,
 'grow_policy': 'lossguide',
 'importance_type': 'gain',
 'interaction_constraints': None,
 'lambda': 0.0,
 'learning_rate': 0.1,
 'max_bin': 255,
 'max_delta_step': None,
 'max_depth': 0,
 'max_leaves': 31,
 'min_child_weight': 0.001,
 'missing': nan,
 'monotone_constraints': None,
 'n_estimators': 100,
 'n_jobs': -1,
 'num_parallel_tree': None,
 'objective': 'reg:logistic',
 'random_state': None,
 'reg_alpha': None,
 'reg_lambda': None,
 'scale_pos_weight': None,
 'silent': True,
 'subsample': None,
 'tree_method': 'hist',
 'validate_parameters': None,
 'verbosity': 0}
catboost
{'feature_border_type': 'Median',
 'iterations': 100,
 'leaf_estimation_method': 'Newton',
 'learning_rate': 0.1,
 'loss_function': 'Logloss',
 'max_bin': 255,
 'reg_lambda': 0.0,
 'verbose': False}
  • "big problem": 1M samples 100 features:

1M_100

  • "small wide" 10K 100 features:

10k_100

  • "small narrow":
    100k_10

@ogrisel
Member Author

ogrisel commented Sep 18, 2020

Conclusions:

  • xgboost is indeed more competitive with more trees.
  • scikit-learn (and others) can suffer from physical-core over-subscription (a memory-bandwidth-limited workload), although it is not as dramatic as with the previous machine (maybe different cache sizes and number of threads).
  • lightgbm is still the best :)

jayzed82 pushed a commit to jayzed82/scikit-learn that referenced this pull request Oct 22, 2020