ENH compute histograms only for allowed features in HGBT #24856


Merged
merged 5 commits into scikit-learn:main from hgbt_seed_up_allowed_features on Nov 22, 2022

Conversation

@lorentzenchr (Member) commented Nov 7, 2022

Reference Issues/PRs

Follow-up of #21020.

What does this implement/fix? Explain your changes.

This PR restricts the computation of histograms in HistGradientBoostingRegressor and HistGradientBoostingClassifier to the features that are allowed to be split on. This improves performance (reduces fit time) when interaction constraints are used.
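
As an illustration (not part of the PR), a minimal usage sketch, assuming scikit-learn >= 1.2 where the interaction_cst parameter from #21020 is available; one singleton group per feature forbids all interactions, which is the setting the benchmark's --no-interactions flag exercises:

from sklearn.datasets import make_regression
from sklearn.ensemble import HistGradientBoostingRegressor

X, y = make_regression(n_samples=10_000, n_features=28, random_state=0)

# One singleton group per feature: no two features may interact, so each
# tree node only needs histograms for the features it is allowed to split on.
est = HistGradientBoostingRegressor(
    max_iter=100,
    interaction_cst=[{i} for i in range(X.shape[1])],
)
est.fit(X, y)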

Any other comments?

@thomasjpfan (Member) left a comment

Codewise, I suspect this will improve performance. Could you run a quick benchmark to verify?

@lorentzenchr (Member, Author) commented Nov 11, 2022

Summary: This PR clearly reduces fit time with interaction constraints (about 35% on the Higgs boson benchmark) and incurs no performance penalty without them.
Note: Numbers vary a lot from run to run.

With Interaction Constraints

No interactions allowed.

MAIN (commit 85a8aa6) with interaction constraints

% python scikit-learn/benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100 --no-interactions 1
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.593 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 56.812 s, (2484 total leaves)
Time spent computing histograms: 27.538s
Time spent finding best splits:  0.160s
Time spent applying splits:      6.929s
Time spent predicting:           1.833s
fitted in 56.990s
predicted in 7.205s, ROC AUC: 0.7755, ACC: 0.7028

This PR with interaction constraints

% python scikit-learn/benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100 --no-interactions 1
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.816 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 36.839 s, (2484 total leaves)
Time spent computing histograms: 6.840s
Time spent finding best splits:  0.194s
Time spent applying splits:      7.018s
Time spent predicting:           1.828s
fitted in 37.044s
predicted in 7.586s, ROC AUC: 0.7755, ACC: 0.7028

Without interaction constraints

MAIN

% python scikit-learn/benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100                   
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.901 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 63.442 s, (3100 total leaves)
Time spent computing histograms: 29.477s
Time spent finding best splits:  0.772s
Time spent applying splits:      8.813s
Time spent predicting:           1.730s
fitted in 63.638s
predicted in 7.028s, ROC AUC: 0.8228, ACC: 0.7415

This PR

% python scikit-learn/benchmarks/bench_hist_gradient_boosting_higgsboson.py --n-trees 100                   
Training set with 8800000 records with 28 features.
Fitting a sklearn model...
Binning 1.971 GB of training data: 3.847 s
Fitting gradient boosted rounds:
...
Fit 100 trees in 58.557 s, (3100 total leaves)
Time spent computing histograms: 26.800s
Time spent finding best splits:  0.407s
Time spent applying splits:      7.132s
Time spent predicting:           1.794s
fitted in 58.794s
predicted in 6.642s, ROC AUC: 0.8228, ACC: 0.7415

@jjerphan (Member) left a comment

LGTM modulo a few suggestions.

has_interaction_cst = allowed_features is not None
if has_interaction_cst:
    n_allowed_features = allowed_features.shape[0]

with nogil:
@jjerphan (Member)

This allows reusing threads, potentially reducing OpenMP's overhead.

Suggested change
with nogil:
with nogil, parallel(num_threads=n_threads):
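
For context, a minimal self-contained sketch of the suggested pattern (illustrative names, assuming a Cython module compiled with OpenMP): the single parallel region spawns the thread pool once, and the prange loops inside it become worksharing constructs that reuse those threads, so no per-loop num_threads is needed.

from cython.parallel import parallel, prange

def zero_then_increment(double[::1] out, int n_threads):
    cdef Py_ssize_t i
    with nogil, parallel(num_threads=n_threads):
        # First worksharing loop: threads come from the enclosing region.
        for i in prange(out.shape[0], schedule='static'):
            out[i] = 0.0
        # Second loop reuses the same thread pool instead of opening a
        # new OpenMP region.
        for i in prange(out.shape[0], schedule='static'):
            out[i] = out[i] + 1.0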

@lorentzenchr (Member, Author) commented Nov 18, 2022

Could we try this out in another PR?
I did not touch those implementation parts and would like to keep it that way.

Comment on lines +170 to +172
for f_idx in prange(
    n_allowed_features, schedule='static', num_threads=n_threads
):
@jjerphan (Member)

If the nogil, parallel context suggested above is used, this num_threads must be removed here and in the previous prange loops that I cannot attach suggestions to.

Suggested change
for f_idx in prange(
    n_allowed_features, schedule='static', num_threads=n_threads
):
for f_idx in prange(n_allowed_features, schedule='static'):

Note that this pattern …

@lorentzenchr (Member, Author)

In another PR, maybe. See comment above.

Comment on lines +173 to +176
if has_interaction_cst:
    feature_idx = allowed_features[f_idx]
else:
    feature_idx = f_idx
@jjerphan (Member)

Is there a performance benefit to moving the branching outside the prange loops and having one prange loop per branch?

@lorentzenchr (Member, Author)

I tried that, but I did not observe any runtime difference. Usually the number of features is small (fewer than 10, 100, maybe 1000) while the sample size is large. The inner loop then runs over n_samples, and that is where branching should be avoided.
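
To illustrate with a hedged sketch (simplified, hypothetical names, not the actual histogram code): the branch on has_interaction_cst is taken once per feature in the outer prange, while the hot inner loop over samples stays branch-free.

from cython.parallel import prange

def count_per_bin(const unsigned char[:, ::1] X_binned,
                  const long[::1] allowed_features,
                  long[:, ::1] counts,
                  bint has_interaction_cst,
                  int n_threads):
    cdef Py_ssize_t f_idx, i, feature_idx, n_allowed_features
    if has_interaction_cst:
        n_allowed_features = allowed_features.shape[0]
    else:
        n_allowed_features = X_binned.shape[1]
    for f_idx in prange(n_allowed_features, schedule='static',
                        num_threads=n_threads, nogil=True):
        # Cheap branch: evaluated once per feature, outside the hot loop.
        if has_interaction_cst:
            feature_idx = allowed_features[f_idx]
        else:
            feature_idx = f_idx
        # Hot loop over n_samples: no branching inside.
        for i in range(X_binned.shape[0]):
            counts[feature_idx, X_binned[i, feature_idx]] += 1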

@lorentzenchr added this to the 1.2 milestone on Nov 18, 2022
@lorentzenchr (Member, Author)

@jjerphan I hope it's ready now. All issues/merge conflicts are fixed. I also added a whatsnew entry.

@jjerphan added the "Waiting for Second Reviewer" label (First reviewer is done, need a second one!) on Nov 18, 2022
@jjerphan (Member)

Still LGTM, yes. I have just labelled this PR as "Waiting for Second Reviewer".

@ogrisel merged commit 2da7428 into scikit-learn:main on Nov 22, 2022
@ogrisel deleted the hgbt_seed_up_allowed_features branch on November 22, 2022 at 09:41
@ogrisel (Member) commented Nov 22, 2022

Thanks @lorentzenchr!
