[MRG+2] Faster Gradient Boosting Decision Trees with binned features #12807
Conversation
So what you mean is that the fix is OK?
The fix is okay, we should include a comment on why the fix is needed.
`coverage run` is looking up its configuration file by default from
@NicolasHug I fixed the coverage thingy. Reading the coverage report, there are a bunch of things that we could improve with respect to test coverage. But this is experimental and I don't want to delay the merge further. Let's merge once my last commit is green on CI.
I cannot get coverage to ignore the setup.py files for some reason. Anyways, let's merge.
\o/ Awesome, thanks a lot everyone for the help and the reviews!!
@ogrisel @NicolasHug really nice job on this feature!!!
Congratulations on this work! This is so important. Also, I love the way that the "experimental" import was handled. Beautiful choice! It should stand as a reference for the future in similar situations.
This PR proposes a new implementation for Gradient Boosting Decision Trees. It isn't meant to be a replacement for the current sklearn implementation but rather an addition.
This addresses the second bullet point from #8231.
This is a port of pygbm (written with @ogrisel, in Numba), which itself uses many of the optimizations from LightGBM.
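For intuition, here is a minimal, hedged sketch of what training on "binned features" means: each continuous feature is mapped to a small number of integer bins (LightGBM-style histograms), so split finding can consider at most `n_bins` candidate thresholds per feature instead of sorting the raw values. This is purely illustrative and is not the PR's actual binning code (which lives in the Cython modules); the `bin_feature` helper is hypothetical.

```python
import numpy as np

def bin_feature(values, n_bins=256):
    """Map a continuous feature to integer bins using quantile-based edges."""
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]  # 255 interior quantiles
    edges = np.percentile(values, quantiles)           # 255 sorted bin edges
    return np.searchsorted(edges, values).astype(np.uint8)

rng = np.random.RandomState(0)
x = rng.normal(size=100_000)
binned = bin_feature(x)  # uint8 array, values in [0, 255]
print(binned.min(), binned.max())
```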
Algorithm details and refs
The main differences from the current sklearn implementation are:
Notes to reviewers
This is going to be a lot of work to review, so please feel free to tell me if there's anything I can do / add that could ease reviewing.
Here's a list of things that probably need to be discussed at some point or that are worth pointing out.
- The code is a port of pygbm (from numba to cython). I've ported all the tests as well, so a huge part of the code has already been carefully reviewed (or written) by @ogrisel. There are still a few non-trivial changes to pygbm's code to accommodate the numba -> cython translation.
- Like [MRG] new K-means implementation for improved performances #11950, this PR uses OpenMP parallelism with Cython.
- The code is in `sklearn/ensemble._hist_gradient_boosting` and the estimators are exposed in `sklearn.experimental` (which is created here, as a result of a discussion during the Paris sprint).
- Like in LightGBM, the targets y, gains, values, and sums of gradients / hessians are doubles, while the gradients and hessians arrays are floats to save space (14c7d47). `Y_DTYPE` and the associated C type for the targets `y` are double and not float, because with float the numerical checks (`test_loss.py`) would not pass. Maybe at some point we'll also want to allow floats, since using doubles takes twice as much space (which is not negligible, see the attributes of the `Splitter` class). See the dtype sketch after this list.
- I have only added a short note in the User Guide about the new estimators. I think that the gradient boosting section of the user guide could benefit from an in-depth rewrite. I'd be happy to do that, but in a later PR.
- Currently the parallel code uses all possible threads. Do we want to expose `n_jobs` (OpenMP-wise, not joblib of course)?
- The estimator names are currently `HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`.
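To make the double/float note above concrete, here is an illustrative sketch of the memory trade-off. Only `Y_DTYPE` is named in this PR; the `G_H_DTYPE` name below is hypothetical and only stands for whatever dtype the per-sample gradients / hessians arrays use.

```python
import numpy as np

# Y_DTYPE: targets y, gains, values, sums of gradients / hessians (doubles).
# G_H_DTYPE (hypothetical name): per-sample gradients and hessians (floats),
# kept in single precision to halve the memory footprint.
Y_DTYPE = np.float64
G_H_DTYPE = np.float32

n_samples = 10_000_000
gradients = np.empty(n_samples, dtype=G_H_DTYPE)
hessians = np.empty(n_samples, dtype=G_H_DTYPE)
# ~40 MB per array instead of ~80 MB with float64
print(gradients.nbytes / 1e6, "MB per array")
```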
API differences with current implementation:
Happy to discuss these points of course. In general I tried to match the parameter names with those of the current GBDTs.
New features:
- `validation_fraction` can also be an int, to specify the absolute size of the validation set (not just a proportion).

Changed parameters and attributes:
- The `n_estimators` parameter has been changed to `max_iter` because, unlike in the current GBDT implementations, the underlying "predictors" aren't estimators. They are private and have no `fit` method. Also, in multiclass classification we build C * max_iter predictors (see the usage sketch after this list).
- The `estimators_` attribute has been removed for the same reason.
- `train_score_` is of size `n_estimators + 1` instead of `n_estimators` because it contains the score of the 0th iteration (before the boosting process).
- `oob_improvement_` is replaced by `validation_score_`, also with size `n_estimators + 1`.
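A minimal usage sketch of the renamed parameters, assuming the estimators are importable from `sklearn.experimental` as stated above (released scikit-learn versions later switched to an explicit `enable_hist_gradient_boosting` opt-in, so the import line may differ):

```python
# Assumes the sklearn.experimental exposure described in this PR.
from sklearn.experimental import HistGradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=50_000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = HistGradientBoostingClassifier(
    max_iter=100,             # replaces n_estimators
    validation_fraction=0.1,  # may also be an int: absolute validation set size
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# Per this PR's description, train_score_ and validation_score_ contain one
# extra entry: the score of the 0th iteration, before boosting starts.
```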
Unsupported parameters and attributes:
- `subsample` (doesn't really make sense here)
- `criterion` (same)
- `min_samples_split` is not supported, but `min_samples_leaf` is.
- `sample_weight`-related parameters and attributes
- `min_impurity_decrease` is not supported (we have `min_gain_to_split` but it is not exposed in the public API)
- `warm_start`
- `max_features` (probably not needed)
- `staged_decision_function`, `staged_predict_proba`, etc.
- The `init` estimator
- `feature_importances_`
- The `loss_` attribute is not exposed.

Future improvements, for later PRs (no specific order):
- The `_in_fit` hackish attribute.

Benchmarks
Done on my laptop: Intel i5 7th Gen, 4 cores, 8 GB RAM.
TLDR:
Comparison between proposed PR and current estimators:
On binary classification only; I don't think it's really needed to do more since the performance difference is striking. Note that for larger sample sizes the current estimators simply cannot run because the sorting step never terminates. I don't provide the benchmark code; it's exactly the same as that of `benchmarks/bench_fast_gradient_boosting.py`:

[benchmark results figure]

Comparison between proposed PR and LightGBM / XGBoost:
On the Higgs-Boson dataset:
python benchmarks/bench_hist_gradient_boosting_higgsboson.py --lightgbm --xgboost --subsample 5000000 --n-trees 50
Sklearn: done in 28.787s, ROC AUC: 0.7330, ACC: 0.7346
LightGBM: done in 27.595s, ROC AUC: 0.7333, ACC: 0.7349
XGBoost: done in 41.726s, ROC AUC: 0.7335, ACC: 0.7351
Entire log:
Regression task:

python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem regression --n-samples-max 5000000 --n-trees 50
Binary classification task:
python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 2 --n-samples-max 5000000 --n-trees 50
Multiclass classification task:
python benchmarks/bench_hist_gradient_boosting.py --lightgbm --xgboost --problem classification --n-classes 3 --n-samples-max 5000000 --n-trees 50