BENCH threading scalability of Hist Gradient Boosting #18382
First, the effective hyper-parameters I used for the 4 libraries:
scikit-learn
{'early_stopping': False,
'l2_regularization': 0.0,
'learning_rate': 0.1,
'loss': 'binary_crossentropy',
'max_bins': 255,
'max_depth': None,
'max_iter': 10,
'max_leaf_nodes': 31,
'min_samples_leaf': 20,
'monotonic_cst': None,
'n_iter_no_change': 10,
'random_state': 0,
'scoring': 'loss',
'tol': 1e-07,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}
lightgbm
{'boost_from_average': True,
'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'enable_bundle': False,
'importance_type': 'split',
'learning_rate': 0.1,
'max_bin': 255,
'max_depth': None,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_data_in_bin': 1,
'min_split_gain': 0,
'min_sum_hessian_in_leaf': 0.001,
'n_estimators': 10,
'n_jobs': -1,
'num_leaves': 31,
'objective': 'binary',
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': True,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'verbosity': -10}
xgboost
{'base_score': None,
'booster': None,
'colsample_bylevel': None,
'colsample_bynode': None,
'colsample_bytree': None,
'gamma': None,
'gpu_id': None,
'grow_policy': 'lossguide',
'importance_type': 'gain',
'interaction_constraints': None,
'lambda': 0.0,
'learning_rate': 0.1,
'max_bin': 255,
'max_delta_step': None,
'max_depth': 0,
'max_leaves': 31,
'min_child_weight': 0.001,
'missing': nan,
'monotone_constraints': None,
'n_estimators': 10,
'n_jobs': -1,
'num_parallel_tree': None,
'objective': 'reg:logistic',
'random_state': None,
'reg_alpha': None,
'reg_lambda': None,
'scale_pos_weight': None,
'silent': True,
'subsample': None,
'tree_method': 'hist',
'validate_parameters': None,
'verbosity': 0}
catboost
{'feature_border_type': 'Median',
'iterations': 10,
'leaf_estimation_method': 'Newton',
'learning_rate': 0.1,
'loss_function': 'Logloss',
'max_bin': 255,
'reg_lambda': 0.0,
'verbose': False}
Run A: tall and narrow data (not enough features for big parallelism).
All libraries are quite fast, lightgbm being the fastest (as always), but scikit-learn suffers the most from over-subscription, especially when n_threads > n_physical_cores.
Run B: short(er) and wide(r) data (enough features but a small dataset).
Similar conclusion. Over-subscription is kind of catastrophic for scikit-learn on such a small problem. Note that lightgbm and scikit-learn are significantly faster at predicting than the others.
Here is the lscpu output of the benchmark machine I used:
Run C: big and wide enough problem to benefit from many cores.
Note that lightgbm and scikit-learn are significantly faster at predicting than the others.
Some conclusions for scikit-learn:
BTW, this script should return dicts for each run, wrap the results in a pandas df, and make it possible to store them to disk for later analysis, but I was too lazy to refactor :P
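The refactoring suggested in that comment could look roughly like this; the field names are illustrative, not taken from the actual benchmark script:

```python
# Sketch of the suggested refactoring: one dict per run, collected into
# a pandas DataFrame and stored to disk for later analysis.
# Field names are hypothetical.
import pandas as pd

results = []
for n_threads in (1, 2, 4, 8):
    results.append({
        'library': 'sklearn',
        'n_threads': n_threads,
        'fit_time': 0.0,      # would come from the timed fit
        'predict_time': 0.0,  # would come from the timed predict
    })

df = pd.DataFrame(results)
df.to_csv('bench_results.csv', index=False)
```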
Thanks @ogrisel for the benchmarks
I just want to note that not everything is parallelized over features: only [...] is. In contrast, the predictors are parallelized over samples, and [...]. I think a good first step would be to look at each [...]
LGTM
def get_estimator_and_data():
    if args.problem == 'classification':
        X, y = make_classification(args.n_samples * 2,
Should this just be args.n_samples?
(I don't even remember why I did that on the older benchmark)
train / test split?
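Presumably the * 2 is there so that half of the generated samples can serve as a held-out test set while the training set still has exactly n_samples rows; a hypothetical illustration:

```python
# Hypothetical illustration of why args.n_samples * 2 might be used:
# generate twice the samples, then split 50/50 into train and test.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

n_samples = 1000  # stand-in for args.n_samples
X, y = make_classification(n_samples * 2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
```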
Let's merge: it's just a benchmark script for maintainers, it cannot break the API, and we can always improve / refactor later if needed.
A few comments:
Usually I see a few hundred or even thousands of estimators for gradient boosting. Is 10 typical?
Using lossguide for XGBoost is okay, but it would be nice to test [...] as well. What version of XGBoost did you use?
Here is a good additional analysis of XGB and LGBM scaling: dmlc/xgboost#3810 (comment)
Thanks for the feedback @SmirnovEgorRu, I launched the benchmarks again with 100 trees on a different machine (because the previous one is used by colleagues at the moment).
scikit-learn
{'early_stopping': False,
'l2_regularization': 0.0,
'learning_rate': 0.1,
'loss': 'binary_crossentropy',
'max_bins': 255,
'max_depth': None,
'max_iter': 100,
'max_leaf_nodes': 31,
'min_samples_leaf': 20,
'monotonic_cst': None,
'n_iter_no_change': 10,
'random_state': 0,
'scoring': 'loss',
'tol': 1e-07,
'validation_fraction': 0.1,
'verbose': 0,
'warm_start': False}
lightgbm
{'boost_from_average': True,
'boosting_type': 'gbdt',
'class_weight': None,
'colsample_bytree': 1.0,
'enable_bundle': False,
'importance_type': 'split',
'learning_rate': 0.1,
'max_bin': 255,
'max_depth': None,
'min_child_samples': 20,
'min_child_weight': 0.001,
'min_data_in_bin': 1,
'min_split_gain': 0,
'min_sum_hessian_in_leaf': 0.001,
'n_estimators': 100,
'n_jobs': -1,
'num_leaves': 31,
'objective': 'binary',
'random_state': None,
'reg_alpha': 0.0,
'reg_lambda': 0.0,
'silent': True,
'subsample': 1.0,
'subsample_for_bin': 200000,
'subsample_freq': 0,
'verbosity': -10}
xgboost
{'base_score': None,
'booster': None,
'colsample_bylevel': None,
'colsample_bynode': None,
'colsample_bytree': None,
'gamma': None,
'gpu_id': None,
'grow_policy': 'lossguide',
'importance_type': 'gain',
'interaction_constraints': None,
'lambda': 0.0,
'learning_rate': 0.1,
'max_bin': 255,
'max_delta_step': None,
'max_depth': 0,
'max_leaves': 31,
'min_child_weight': 0.001,
'missing': nan,
'monotone_constraints': None,
'n_estimators': 100,
'n_jobs': -1,
'num_parallel_tree': None,
'objective': 'reg:logistic',
'random_state': None,
'reg_alpha': None,
'reg_lambda': None,
'scale_pos_weight': None,
'silent': True,
'subsample': None,
'tree_method': 'hist',
'validate_parameters': None,
'verbosity': 0}
catboost
{'feature_border_type': 'Median',
'iterations': 100,
'leaf_estimation_method': 'Newton',
'learning_rate': 0.1,
'loss_function': 'Logloss',
'max_bin': 255,
'reg_lambda': 0.0,
'verbose': False}
Conclusions:
I adapted the benchmark script to make a version that helps us understand the performance profile w.r.t. the number of threads, to tackle #14306.
I will post some results below in comments.
ping @NicolasHug @amueller @szilard @SmirnovEgorRu @jeremiedbb you might be interested in some of the results below.