
[RFC] Gradient Boosting improvement #8231

@glemaitre

Description


Current situation

The scikit-learn gradient boosting implementation is slow compared to other implementations such as xgboost and lightgbm.

Which bottlenecks have been identified

During benchmarking, we identified a major difference between scikit-learn and xgboost.

For xgboost, all samples of a given feature are scanned in a single pass. Each sample updates the impurity statistic of the node to which it belongs (cf. here).
In scikit-learn, on the contrary, the samples belonging to a node are first selected, and only those are scanned (cf. here or here). Deeper nodes therefore re-scan data already scanned at their ancestors, which involves unnecessary repeated scans.
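
To make the difference concrete, here is a minimal sketch of the two scanning strategies; the function names and data layout are hypothetical, not the actual scikit-learn or xgboost code:

```python
import numpy as np

def scan_feature_xgboost_style(sorted_idx, node_of_sample, gradients, n_nodes):
    # One pass over ALL samples of a feature, in sorted order; each
    # sample updates the running gradient sum of the node it belongs to.
    grad_sums = np.zeros(n_nodes)
    for i in sorted_idx:
        grad_sums[node_of_sample[i]] += gradients[i]
        # ...update the best-split candidate of that node here...
    return grad_sums

def scan_node_sklearn_style(samples_in_node, gradients):
    # One pass PER NODE: the node's samples are selected first, then
    # scanned; deeper nodes therefore re-scan subsets of data that
    # were already scanned at their ancestors.
    grad_sum = 0.0
    for i in samples_in_node:
        grad_sum += gradients[i]
        # ...update the best-split candidate of this node here...
    return grad_sum
```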

What has been tested

In an effort to speed up gradient boosting, @ogrisel considered subsampling at each split, lowering the cost of sorting and of computing the statistics (a rough sketch of the idea follows below). The results obtained did not yield a time/accuracy trade-off good enough to compete with the other implementations.
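
For concreteness, a minimal sketch of what split-time subsampling could look like; the helper name, signature, and `max_samples` parameter are hypothetical, not the actual prototype code:

```python
import numpy as np

def split_candidates_on_subsample(X_col, gradients, max_samples=10_000, seed=0):
    # Hypothetical helper: evaluate split thresholds on a random
    # subsample of the node's samples, so the per-split sort costs
    # O(k log k) with k = max_samples instead of O(n log n).
    rng = np.random.default_rng(seed)
    n = X_col.shape[0]
    if n > max_samples:
        keep = rng.choice(n, size=max_samples, replace=False)
        X_col, gradients = X_col[keep], gradients[keep]
    order = np.argsort(X_col)  # sorting dominates the per-split cost
    # ...scan X_col[order], gradients[order] to score thresholds...
    return X_col[order], gradients[order]
```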

Proposal

I would like to contribute two changes:

  • Propose a prototype of gradient boosting with modified sample-scanning and splitting strategies. This will require changing the decision tree algorithm (which is why we want a prototype first). The change should allow us to match the performance of xgboost while using the exact method. See [WIP] Alternative architecture for regression tree #8458.
  • Implement an approximate splitter that bins the data, similarly to lightgbm and xgboost. This change recently allowed xgboost to match lightgbm's performance (cf. here and here). A rough sketch of the binning idea follows this list.
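
As announced above, a minimal sketch of the binning idea for a single feature, assuming squared loss and a simplified gain; the helpers are hypothetical, not lightgbm or xgboost code:

```python
import numpy as np

def bin_feature(X_col, n_bins=256):
    # Map raw feature values to small integer bin indices using
    # quantile edges; done once, before any tree is grown.
    edges = np.quantile(X_col, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, X_col).astype(np.uint8), edges

def best_split_from_histogram(binned, gradients, n_bins=256):
    # One O(n) pass accumulates per-bin gradient sums and counts;
    # afterwards only ~n_bins thresholds are scored instead of one
    # threshold per distinct feature value.
    grad_hist = np.zeros(n_bins)
    count_hist = np.zeros(n_bins)
    np.add.at(grad_hist, binned, gradients)
    np.add.at(count_hist, binned, 1)
    g_left = np.cumsum(grad_hist)[:-1]
    n_left = np.cumsum(count_hist)[:-1]
    g_right = grad_hist.sum() - g_left
    n_right = count_hist.sum() - n_left
    # Proxy gain for squared loss: G_L^2 / n_L + G_R^2 / n_R.
    valid = (n_left > 0) & (n_right > 0)
    gain = np.full(n_bins - 1, -np.inf)
    gain[valid] = g_left[valid] ** 2 / n_left[valid] \
                + g_right[valid] ** 2 / n_right[valid]
    return int(np.argmax(gain))  # index of the best bin boundary
```

With roughly 256 bins, the number of candidate thresholds no longer depends on the number of distinct feature values, which is largely where the histogram-based implementations get their speed.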

Related issue

#5212

@GaelVaroquaux @glouppe @ogrisel @jmschrei @raghavrv If you have any comments, I would be happy to hear them.
