Current situation
The scikit-learn gradient boosting is relatively slow compared to other implementations such as xgboost and lightgbm.
Which bottlenecks have been identified
During some benchmarking, we identified a major difference between scikit-learn and xgboost.
In xgboost, all samples for a given feature are scanned in a single pass, and each sample updates the impurity statistic of the node to which it belongs (cf. here).
In scikit-learn, by contrast, only the samples of the current node are selected (cf. here or here), which leads to repeated, unnecessary scans of the same feature data.
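To make the difference concrete, here is a minimal, hypothetical NumPy sketch of the two accumulation strategies; the gradient statistic and all names are assumptions and do not reflect the actual Cython code of either library.

```python
import numpy as np

def per_node_scan(gradients, samples_per_node):
    # scikit-learn-style (as described above): gather the samples of each node
    # and scan only those; the gather/scan is repeated for every node, so the
    # same feature column is traversed several times per tree level.
    return {node_id: gradients[ids].sum()
            for node_id, ids in samples_per_node.items()}

def single_pass_scan(gradients, node_of_sample, n_nodes):
    # xgboost-style (as described above): scan all samples of a feature once
    # and dispatch each sample's statistic to the node it currently belongs to.
    stats = np.zeros(n_nodes)
    for i in range(gradients.shape[0]):
        stats[node_of_sample[i]] += gradients[i]
    return stats

# Hypothetical usage with 6 samples spread over 2 nodes:
gradients = np.array([0.5, -1.0, 0.2, 0.8, -0.3, 0.1])
node_of_sample = np.array([0, 0, 1, 1, 0, 1])
samples_per_node = {0: np.array([0, 1, 4]), 1: np.array([2, 3, 5])}
assert np.isclose(single_pass_scan(gradients, node_of_sample, 2)[0],
                  per_node_scan(gradients, samples_per_node)[0])
```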
What has been tested
In an effort to speed up the gradient boosting, @ogrisel experimented with subsampling at each split, which lowers the cost of sorting and of computing the split statistics. The results obtained did not yield a time/accuracy trade-off satisfactory enough to compete with the other implementations.
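The exact experiment is not reproduced here; the following is only a hypothetical sketch of what subsampling at each split could look like (the function name, the `max_samples` default, and the returned cumulative statistic are assumptions).

```python
import numpy as np

def split_scan_on_subsample(X_col, gradients, rng, max_samples=10_000):
    # Hypothetical illustration: evaluate candidate thresholds on a random
    # subset of the node's samples, so that both the sort and the cumulative
    # split statistics are computed on fewer points.
    n = X_col.shape[0]
    if n > max_samples:
        keep = rng.choice(n, size=max_samples, replace=False)
        X_col, gradients = X_col[keep], gradients[keep]
    order = np.argsort(X_col)
    # Sorted values and cumulative gradient sums, scanned to pick a threshold.
    return X_col[order], np.cumsum(gradients[order])

rng = np.random.default_rng(0)
values, cum_grads = split_scan_on_subsample(rng.normal(size=100_000),
                                            rng.normal(size=100_000), rng)
```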
Proposal
I would like to contribute with two changes:
- Propose a prototype of gradient boosting that modifies the sample scanning and splitting strategies. It will require a change to the decision tree algorithm (which is why we want a prototype first). This change should allow us to match the performance of xgboost while using the `exact` method. [WIP] Alternative architecture for regression tree #8458
- Implement an approximate splitter by binning the data, similarly to lightgbm and xgboost; this change recently allowed xgboost to match lightgbm's performance (cf. here and here). A minimal sketch of the binning idea follows this list.
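To illustrate the second item, here is a minimal NumPy sketch of the binning idea, assuming quantile bin edges and 255 bins; the function names and parameters are illustrative and not the API of lightgbm, xgboost, or the proposed splitter.

```python
import numpy as np

def bin_feature(X_col, n_bins=255):
    # Map continuous feature values to small integer bins using quantile
    # boundaries; lightgbm and the xgboost 'hist' method do this in a much
    # more optimized form.
    quantiles = np.linspace(0, 100, n_bins + 1)[1:-1]
    edges = np.percentile(X_col, quantiles)
    return np.searchsorted(edges, X_col).astype(np.uint8), edges

def gradient_histogram(binned_col, gradients, n_bins=255):
    # With binned data, split finding reduces to accumulating gradient sums
    # per bin and scanning at most n_bins candidate thresholds.
    return np.bincount(binned_col, weights=gradients, minlength=n_bins)
```

The main gain is that the per-node scan then operates on small integers and a fixed number of candidate thresholds instead of sorted floating-point values.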
Related issue
@GaelVaroquaux @glouppe @ogrisel @jmschrei @raghavrv If you have any comments, I will be happy to hear them.