[RFC] Gradient Boosting improvement #8231
Comments
@glouppe was very interested in the approximate splitter. I also think it would be a nice addition that would help speed things up a lot... I'm also pinging @jnothman and @nelson-liu, who might be interested in reviewing this.
Can you elaborate on this? In our case, how does our standard implementation (without presorting) compare with xgboost, both in terms of algorithms and performance?
+1 for binning
I'm also a bit confused about what you mean by repetitive scans. We have two different sorting mechanisms, presorting and not presorting. Presorting works better on smaller datasets with fewer features, whereas not presorting works better on larger amounts of data. XGBoost uses presorting (when I last looked at the code) and we do not, but we provide the option to let the user do so.
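(For context: at the time of this thread, the presorting behaviour could be toggled through the `presort` constructor parameter of the gradient boosting estimators; that parameter has since been deprecated and removed. A minimal usage sketch, assuming a scikit-learn version from that era:)

```python
# Illustration only: toggling presorting via the `presort` parameter as it
# existed in older scikit-learn releases (since deprecated and removed).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=10_000, n_features=50, random_state=0)

# Presorting tends to pay off on small/narrow data; disabling it is usually
# faster on larger datasets.
model = GradientBoostingRegressor(n_estimators=10, presort=False, random_state=0)
model.fit(X, y)
```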
Ok, I think I understand. We could get rid of the …
@glouppe in xgboost, …
Yup, looks like something promising. +1 for both changes. (However, to avoid never-ending PRs, please propose the changes separately, possibly with the least amount of changes to the other parts of the codebase.)
@glouppe Yep, I am gonna make separate PRs.
Can we close this as #12807 (…)?
Yes, this is done.
Current situation
The scikit-learn gradient boosting implementation is relatively slow compared to other implementations such as xgboost and lightgbm.
Identified bottlenecks
During benchmarking, we identified a major difference between scikit-learn and xgboost.
In xgboost, all samples for a given feature are scanned in a single pass. Each sample is used to update the impurity statistics of the node to which it belongs (cf. here).
In scikit-learn, on the contrary, only the samples belonging to a given node are selected and scanned (cf. here or here). This involves repetitive scans over the data which are not necessary.
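To make the difference concrete, here is a simplified NumPy sketch of the two scanning strategies (illustrative only; the function names and the single gradient statistic are my own simplification, not the actual Cython code of either library):

```python
import numpy as np

def xgboost_style_scan(sorted_idx, node_of_sample, grad, n_nodes):
    """One pass over a pre-sorted feature column: every sample updates the
    statistics of the node (leaf) it currently belongs to."""
    stats = np.zeros(n_nodes)
    for i in sorted_idx:                      # single scan over all samples
        stats[node_of_sample[i]] += grad[i]
    return stats

def sklearn_style_scan(node_of_sample, grad, n_nodes):
    """One separate pass per node: the samples of each node are selected
    first, so the sample array is re-read once per node."""
    stats = np.zeros(n_nodes)
    for node in range(n_nodes):               # repeated scans, one per node
        mask = node_of_sample == node
        stats[node] = grad[mask].sum()
    return stats

rng = np.random.RandomState(0)
n_samples, n_nodes = 1_000, 8
feature = rng.rand(n_samples)
sorted_idx = np.argsort(feature)
node_of_sample = rng.randint(0, n_nodes, size=n_samples)
grad = rng.randn(n_samples)

# Both strategies compute the same per-node statistics; they differ only in
# how many times the data is traversed.
assert np.allclose(
    xgboost_style_scan(sorted_idx, node_of_sample, grad, n_nodes),
    sklearn_style_scan(node_of_sample, grad, n_nodes),
)
```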
What has been tested
In an effort to speed up gradient boosting, @ogrisel experimented with subsampling at each split, lowering the cost of sorting and computing the statistics. The results did not yield a time/accuracy trade-off satisfactory enough to compete with the other implementations.
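For context, a minimal sketch of what per-split subsampling could look like (the function and the variance-reduction criterion below are illustrative assumptions, not the actual experimental code):

```python
import numpy as np

def best_threshold_subsampled(x, residuals, max_samples=256, rng=None):
    """Pick a split threshold for one feature using only a random subset of
    the node's samples; the sorting cost is bounded by max_samples."""
    rng = rng if rng is not None else np.random.RandomState(0)
    if len(x) > max_samples:
        sub = rng.choice(len(x), size=max_samples, replace=False)
        x, residuals = x[sub], residuals[sub]
    order = np.argsort(x)
    x, residuals = x[order], residuals[order]
    # Variance-reduction style gain for every candidate cut position.
    left_sum = np.cumsum(residuals)[:-1]
    right_sum = residuals.sum() - left_sum
    n_left = np.arange(1, len(x))
    n_right = len(x) - n_left
    gain = left_sum ** 2 / n_left + right_sum ** 2 / n_right
    best = int(np.argmax(gain))
    return (x[best] + x[best + 1]) / 2.0

rng = np.random.RandomState(0)
x = rng.randn(5_000)
residuals = np.where(x > 0.5, 1.0, -1.0)
print(best_threshold_subsampled(x, residuals, rng=rng))
```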
Proposal
I would like to contribute two changes:
- an approximate splitter based on binning of the data, in addition to the current exact method (a rough sketch of the binning idea is given below, after the related issue);
- an alternative architecture for the regression tree to avoid the repetitive scans described above. [WIP] Alternative architecture for regression tree #8458

Related issue
#5212
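As a rough illustration of the binning idea behind the approximate splitter (a sketch under assumed simplifications, not the actual design that was eventually merged): features are discretized into a small number of bins once, and split finding then operates on per-bin gradient histograms instead of sorted continuous values.

```python
import numpy as np

def bin_feature(x, n_bins=32):
    """Map a continuous feature to integer bin indices using quantile edges."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x), edges

def best_bin_split(binned, grad, n_bins=32):
    """Find the best split position from per-bin gradient histograms."""
    grad_hist = np.bincount(binned, weights=grad, minlength=n_bins)
    count_hist = np.bincount(binned, minlength=n_bins)
    left_grad = np.cumsum(grad_hist)[:-1]
    left_cnt = np.cumsum(count_hist)[:-1]
    right_grad = grad_hist.sum() - left_grad
    right_cnt = count_hist.sum() - left_cnt
    valid = (left_cnt > 0) & (right_cnt > 0)
    gain = np.full(n_bins - 1, -np.inf)
    gain[valid] = (left_grad[valid] ** 2 / left_cnt[valid]
                   + right_grad[valid] ** 2 / right_cnt[valid])
    return int(np.argmax(gain))  # go left if bin index <= returned value

rng = np.random.RandomState(0)
x = rng.randn(10_000)
grad = np.where(x > 0.3, 1.0, -1.0) + 0.1 * rng.randn(10_000)
binned, edges = bin_feature(x)
split_bin = best_bin_split(binned, grad)
print("split threshold ~", edges[split_bin])
```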
@GaelVaroquaux @glouppe @ogrisel @jmschrei @raghavrv If you have any comments, I will be happy to hear them.