I am planning to submit several PRs that merge #5041 incrementally, with the ultimate goal of a clean implementation of multithreaded decision tree building so that gradient boosting can be faster. With one of the main concepts merged (#5203), here is a list of separate PRs which I'd like to merge in the near future.
- Reorganize `_tree.pyx` into several files (see PR [MRG+1] split tree module into several packages #5230) [merged]
- Add proxy impurity improvement methods to both Gini and entropy (see PR [MRG] Proxy improvement methods added to entropy/gini #5233) [closed]
- Reevaluate constant feature caching [closed]
- Support sparse data for gradient boosting (see PR [MRG+1] Merge PresortBestSplitter and BestSplitter #5252)
- Add caching of computation between different split levels to avoid recomputation
- Ensure feature importances converge in ensembles (see PR [MRG+1] Stronger tests for variable importances #5261)
- Add tests to ensure the correctness of impurity values with respect to hand-computed values on toy data
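
To make the last item concrete, here is a minimal sketch of what a hand-computed impurity check on toy data could look like. It is plain Python and does not touch the tree code; the helper names `gini` and `entropy` are hypothetical, and the base-2 logarithm for entropy is an assumption.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 1 - sum_k p_k^2 (hypothetical helper)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Entropy of a node in bits: -sum_k p_k * log2(p_k) (base 2 is assumed)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


# Toy node: three samples of class 0 and one sample of class 1.
toy = [0, 0, 0, 1]

# By hand: 1 - (0.75^2 + 0.25^2) = 1 - 0.625 = 0.375
assert math.isclose(gini(toy), 0.375)

# By hand: -(0.75 * log2(0.75) + 0.25 * log2(0.25)), roughly 0.811278
assert math.isclose(entropy(toy), 0.811278124459, rel_tol=1e-9)

# A perfectly balanced binary node has Gini 0.5 and entropy 1 bit.
assert math.isclose(gini([0, 0, 1, 1]), 0.5)
assert math.isclose(entropy([0, 0, 1, 1]), 1.0)
```

The same reference values could then be compared against the impurities reported by a fitted tree (for example the `tree_.impurity` array), provided the entropy criterion there also uses a base-2 logarithm.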
Longer-range goals that I'd like to work towards, but have no clear plan for right now, are the following:
- Add an approximate splitter
- Add multithreading support for single decision trees
- Add a partial fit method for tree building
- Support categorical variables
- Support missing values
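
This issue does not specify the approximate splitter any further. One common reading, assumed here purely for illustration, is to evaluate only a handful of quantile-based candidate thresholds instead of every midpoint between sorted feature values. The sketch below is plain Python on a single feature; all function names are hypothetical and it does not reflect the existing `Splitter` code.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(k) / n) ** 2 for k in set(labels))


def best_split(xs, ys, thresholds):
    """Return (weighted child impurity, threshold) of the best candidate."""
    best = (float("inf"), None)
    for t in thresholds:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # skip degenerate splits
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if score < best[0]:
            best = (score, t)
    return best


def exact_thresholds(xs):
    """Every midpoint between consecutive distinct sorted values."""
    vals = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(vals, vals[1:])]


def approx_thresholds(xs, n_bins):
    """Upper edges of n_bins equal-frequency bins (assumed approximation)."""
    vals = sorted(xs)
    n = len(vals)
    edges = {vals[max((i * n) // n_bins - 1, 0)] for i in range(1, n_bins)}
    return sorted(edges)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

exact = best_split(xs, ys, exact_thresholds(xs))       # scans 7 candidates
approx = best_split(xs, ys, approx_thresholds(xs, 4))  # scans 3 candidates
```

On this toy feature both searches find a pure split. In general the approximate search trades some split quality for fewer candidate evaluations.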
At that point, it will be clearer to me what specific changes to Splitter, Criterion, and TreeBuilder are needed to make multithreading possible. @glouppe @arjoly @GaelVaroquaux @pprett if you have any comments, I'd love to hear them.