[RFC] Tree module improvements #5212

Open · 5 of 12 tasks
jmschrei opened this issue Sep 4, 2015 · 35 comments

Labels: Enhancement, help wanted, Moderate (anything that requires some knowledge of conventions and best practices), module:tree

Comments

@jmschrei (Member) commented Sep 4, 2015

I am planning to submit several PRs to merge #5041 incrementally, with the ultimate goal of a clean implementation of multithreaded decision tree building so that Gradient Boosting can be faster. With one of the main concepts merged (#5203), here is a list of separate PRs which I'd like to merge in the near future.

Longer range goals which I'd like to work towards (but have no clear plan as of right now) are the following:

  • Add an approximate splitter
  • Add multithreading support for single decision trees
  • Add a partial fit method for tree building
  • Support categorical variables
  • Support missing values

At that point, it will be clearer to me what specific changes to Splitter, Criterion, and TreeBuilder are needed to make multithreading possible. @glouppe @arjoly @GaelVaroquaux @pprett, if you have any comments, I'd love to hear them.

@glouppe (Contributor) commented Sep 4, 2015

I am +1 on your first two proposals.

> Merge BestSplitter and RandomSplitter

I think more code should be factorized, for sure. However, I am not convinced about merging them. There is convenience in having different splitting strategies decoupled into different objects/methods/functions/whatever, but sharing a common API. Having multiple strategies mixed into the same code is not a good thing in my opinion. For example, in addition to best and random splits, I would really like to evaluate "approximately best" splits, i.e., a splitting strategy that only evaluates a fixed number of possible cuts and takes the best of those. With a nicely decoupled API, this could be tried easily without intermixing code.

In particular, one important point is that we should try to keep the splitting strategies in one place and, as much as possible, not let splitting-specific code percolate into other parts of the codebase (e.g., into the criteria).
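
To make the point concrete, here is a purely hypothetical, Python-level sketch (the real splitters are Cython classes with a different signature): each strategy implements the same node_split contract, so the tree builder never needs to know which strategy it was handed, and a new "approximately best" strategy slots in without touching the others.

```python
class SplitterStrategySketch:
    """Common contract: find a split for the samples routed to one node."""

    def node_split(self, X_node, y_node):
        raise NotImplementedError


class BestSplitSketch(SplitterStrategySketch):
    def node_split(self, X_node, y_node):
        ...  # evaluate every threshold on each candidate feature, keep the best


class RandomSplitSketch(SplitterStrategySketch):
    def node_split(self, X_node, y_node):
        ...  # draw one random threshold per candidate feature, keep the best


class ApproxBestSplitSketch(SplitterStrategySketch):
    def node_split(self, X_node, y_node):
        ...  # evaluate only a fixed number of candidate cuts, keep the best
```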

> Add presorting as an option to all splitters to reduce the number of splitters

Yes, this makes sense for splitters that try to find the best splits.

glouppe changed the title from "tree module improvements" to "[RFC] Tree module improvements" on Sep 4, 2015
@arjoly (Member) commented Sep 4, 2015

> Remove constant feature caching should it prove to not actually be helpful (it wasn't in my original tests)

This part is useful for trees built depth-first on datasets with categorical or sparse features. However, it could be removed if benchmarks show that it no longer brings any benefit.

@glouppe (Contributor) commented Sep 4, 2015

Another thing that I would like to do, once @arjoly is done with #5203 and his next PR, is to split this huge _tree.pyx file into several files, so that it becomes easier to understand and review. I am adding that to your list. (I can/will take care of it)

@arjoly (Member) commented Sep 4, 2015

> Another thing that I would like to do, once @arjoly is done with #5203 and his next PR, is to split this huge _tree.pyx file into several files, so that it becomes easier to understand and review. I am adding that to your list.

+1

@arjoly (Member) commented Sep 4, 2015

I would add to your list all the criterion optimizations that you have done in #5041:

  • pre-computing w * y_ik * y_ik and w * y_ik for MSE and FriedmanMSE
  • avoiding re-computing total statistics
  • avoiding re-computing total weights.

All 5 splitters share some common code, such as choosing which features to test for the next update, partitioning the sample array when a split is found, or computing the impurity improvement. I believe this could be refactored while keeping the current code structure. The multithreading logic could probably be put at this higher level.
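
For illustration, a minimal NumPy sketch of the idea behind the first optimization (the function name and signature are made up, not the Cython Criterion API): once the weighted sums sum(w), sum(w*y) and sum(w*y*y) are cached for a node, every candidate split along a sorted feature is scored in O(1) by moving one sample at a time from the right child to the left child, so nothing is recomputed from scratch.

```python
import numpy as np

def best_mse_split(y, w, order):
    """Best MSE split position for one feature; `order` is that feature's argsort
    restricted to the node's samples, and weights are assumed to be positive."""
    y, w = y[order], w[order]
    sum_w, sum_wy, sum_wy2 = w.sum(), (w * y).sum(), (w * y * y).sum()
    lw = lwy = lwy2 = 0.0
    best_pos, best_score = None, np.inf
    for i in range(len(y) - 1):                 # candidate split between i and i + 1
        lw += w[i]
        lwy += w[i] * y[i]
        lwy2 += w[i] * y[i] * y[i]
        rw, rwy, rwy2 = sum_w - lw, sum_wy - lwy, sum_wy2 - lwy2
        # weighted SSE of a child = sum(w*y^2) - (sum(w*y))^2 / sum(w)
        score = (lwy2 - lwy * lwy / lw) + (rwy2 - rwy * rwy / rw)
        if score < best_score:
            best_pos, best_score = i + 1, score
    return best_pos, best_score
```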

@jmschrei (Member, Author) commented Sep 4, 2015

I have updated my list. I think it is appropriate to tackle each issue with a single corresponding PR, sequentially. I will submit a PR for the reorganization as soon as #5203 is merged (which I will review tonight).

@arjoly my benchmarks have shown that constant feature caching makes building a single decision tree up to 20% slower on many datasets, including MNIST and covtype. However, I have not tried it with categorical features or sparse data. The task would be to first determine whether it should be removed, and only remove it if it slows down the tree in the majority of cases.

@glouppe you bring up a good point. In the future I would like to add an approximate splitter as well. I think we're all on the same page, though: reuse code as much as possible, but keep the splitting strategies distinct, one per object.

@glouppe (Contributor) commented Sep 8, 2015

> I will submit a PR for the reorganization as soon as #5203 is merged (which I will review tonight)

Please go ahead :)

@glouppe (Contributor) commented Sep 9, 2015

Looking back at our implementation of GBRT, I realize that in most cases, when the loss is not the squared error loss, all the values and statistics calculated at the leaves are computed for nothing, because they are overridden anyway by subsequent calls to _update_terminal_region in the loss functions. This is certainly something to investigate to make boosting faster, but it may require changes in the internal API of GBRT and trees.

@glouppe (Contributor) commented Sep 9, 2015

In addition, I am under the impression that implementations of _update_terminal_region are not always optimal -- the complexity of each leaf update is lower bounded by the total number of samples in the training set rather than by the number of samples in that leaf, typically because of terminal_region = np.where(terminal_regions == leaf)[0]. I don't know if this is a real issue though. An opinion, @pprett?
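
For illustration, a small NumPy sketch (a hypothetical helper, not the GBRT internals) of how the per-leaf sample indices could be built once in a single O(n log n) pass, instead of scanning the whole terminal_regions array once per leaf:

```python
import numpy as np

def group_samples_by_leaf(terminal_regions):
    """Map each leaf id to the indices of the training samples it contains."""
    order = np.argsort(terminal_regions, kind="stable")
    leaves, starts = np.unique(terminal_regions[order], return_index=True)
    bounds = np.append(starts, len(order))
    return {leaf: order[bounds[i]:bounds[i + 1]] for i, leaf in enumerate(leaves)}
```

Each leaf update would then only touch its own slice of samples.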

@arjoly (Member) commented Sep 10, 2015

I am updating the list to add the test about importance computations.

@jmschrei (Member, Author) commented:

BRANCH
Classification performance:
===========================
Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  79.8781s   0.3793s     0.0219  
CART          15.0324s   0.0216s     0.0444 

MASTER
Classification performance:
===========================
Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  37.2150s   0.4297s     0.0330  
CART          12.1810s   0.0224s     0.0424 

Removing constant-feature caching caused things to slow down, but also caused a gain in accuracy for random forests. The difference could be due to changes in how features are randomly drawn, as I changed f_j = rand_int(n_drawn_constants, f_i - n_found_constants, random_state) to f_j = rand_int(f_i, n_features, random_state). I do get a 10-15% speed-up on datasets which do not have constant features (like my Gaussian dataset), but if there are constant features there can be a significant slowdown. I do not think this is worth pursuing, and it is likely a big reason why my PR was slower.

@glouppe (Contributor) commented Sep 11, 2015

> I do not think this is worth pursuing, and it is likely a big reason why my PR was slower.

Thanks for checking!

@arjoly (Member) commented Sep 11, 2015

Thanks for checking @jmschrei !

@jmschrei (Member, Author) commented:

A style question I have is why we are using SIZE_t, DTYPE_t, DOUBLE_t, double, UINT32_t, INT32_t, unsigned char, and int as different datatypes in these modules. It seems excessively confusing. Would it be useful for one of these PRs to merge all these datatypes into SIZE_t for ints, DTYPE_t for 32-bit floats, and DOUBLE_t for 64-bit floats?

@jmschrei (Member, Author) commented:

@arjoly I think I might've deleted your addition to the list about feature computations. If I did, can you re-add it at your convenience please?

@glouppe (Contributor) commented Sep 12, 2015

> A style question I have is why we are using SIZE_t, DTYPE_t, DOUBLE_t, double, UINT32_t, INT32_t, unsigned char, and int as different datatypes in these modules. It seems excessively confusing. Would it be useful for one of these PRs to merge all these datatypes into SIZE_t for ints, DTYPE_t for 32-bit floats, and DOUBLE_t for 64-bit floats?

These are defined based on NumPy's own types, not on C types. I would not change that, since internal changes in NumPy might otherwise lead to errors in our implementation. This would also be against NumPy guidelines. CC: @larsmans
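
For reference, a rough summary of that correspondence (based on the typedefs in the tree module's .pxd files at the time; details may have changed since):

```python
import numpy as np

# DTYPE_t  ~ np.float32  -- feature values (X)
# DOUBLE_t ~ np.float64  -- targets, sample weights, impurities
# SIZE_t   ~ np.intp     -- array indices, sample and node counts
# INT32_t  ~ np.int32, UINT32_t ~ np.uint32 -- e.g. the random number generator state
DTYPE = np.float32
DOUBLE = np.float64
```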

@jmschrei (Member, Author) commented:

I didn't mean physically merging them. I meant change all integers to be of type SIZE_t, rather than INT32_t, UINT32_t, unsigned char*, and int as well, and change all doubles to be of type DOUBLE_t or DTYPE_t as appropriate. We'd still use the NumPy types.

@glouppe (Contributor) commented Sep 12, 2015

> I meant change all integers to be of type SIZE_t, rather than INT32_t, UINT32_t, unsigned char*, and int as well

I would not do that systematically. In particular, SIZE_t should be used only for indexing arrays and cannot hold negative values.

What are the parts of the code that you find inconsistent? (In most cases, the types were chosen for a good reason, but there may still be some places left where better types could be used.)

@jmschrei (Member, Author) commented:

I've added a new PR (#5278) which cleans up criterion and will make adding caching easier.

It also means that all criteria store three pointers (node_sum, node_sum_left, and node_sum_right), which will need to be handled when multithreading happens. Another concern for multithreading is that, with the current splitters, a single thread pool cannot be shared across splits, because accessing the thread pool requires the GIL. Currently, we would have to create a new thread pool once per candidate split, which can be thousands of times for deep trees.

A possible solution is to create a new splitter which does not release the GIL, used for parallel single decision tree building. I am not fond of this option, and it will likely cause @GaelVaroquaux to wring my neck, but it would be simple.

As a side note, @glouppe, when you say SIZE_t should only be used for indexing arrays, what do you mean? Should it not be used to generically store non-negative integers?

@GaelVaroquaux (Member) commented Sep 16, 2015 via email

@jmschrei (Member, Author) commented:

Also, PR #5252, regarding merging PresortBestSplitter and BestSplitter, is still awaiting review. I would like to have both #5252 and #5278 merged before adding caching across splits.

@jmschrei (Member, Author) commented Oct 9, 2015

Apologies for disappearing temporarily. I should have more time for this now.

@arjoly (Member) commented Oct 20, 2015

@jmschrei Do you plan to do "Add caching of computation between different split levels to avoid recomputation"? I may find some time to work on this at the sprint.

@jmschrei (Member, Author) commented:

I'm halfway through the PR. However, research has gotten in the way temporarily, so I've had far less time than I expected to work on this.

@jmschrei (Member, Author) commented:

I should be able to finish it this weekend though.

@arjoly (Member) commented Oct 20, 2015

@jmschrei Great!!! I can't wait to review your pull request.

glouppe added the "Moderate" label on Oct 21, 2015
@raghavrv (Member) commented Nov 10, 2015

I'll be working (directly or helping complete an existing PR) on the last 3 TODOs + 1 more in the following order...

@jmschrei (Member, Author) commented:

Great! I have been slacking recently, glad that someone else is contributing.

One thing to keep in mind is that eventually the Criterion objects will have to become more functional (stateless) so that they can allow multithreaded building of individual trees. Keep this in mind if you add more stored state to that class.
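
As a purely hypothetical illustration of "more functional": a criterion evaluated as a pure function of a node's sufficient statistics, with no mutable per-object buffers, would be trivially safe to call from several threads at once.

```python
def weighted_mse(sum_w, sum_wy, sum_wy2):
    """Weighted MSE of a node, computed only from its cached sufficient statistics."""
    mean = sum_wy / sum_w
    return sum_wy2 / sum_w - mean * mean
```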

@glouppe (Contributor) commented Sep 29, 2016

I am resurrecting this, but for whoever would like to work on trees, I think the approximate splitter contribution should be assigned high priority. This should not be that difficult to do (if you are familiar with the tree codebase) and could yield huge speedups at very little loss of predictive performance (if any).
The idea is simple and consists of binning the values of a feature before looking for the best split. This means we reduce the number of threshold evaluations by a factor of n_samples / n_bins. It can be seen as an intermediate solution between regular decision trees and extremely randomized trees.

See e.g. http://arxiv.org/abs/1609.06119 where they compare a similar approach to GradientBoostingClassifier.

CC: @nelson-liu
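
A minimal sketch of the binning idea, assuming simple quantile binning (the function name and the n_bins value are illustrative, not a proposed API):

```python
import numpy as np

def binned_thresholds(x, n_bins=32):
    """Candidate thresholds for one feature: at most n_bins - 1 bin edges,
    instead of one threshold per distinct sample value."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(np.quantile(x, quantiles))
```

The splitter would then only evaluate these edges, which is where the n_samples / n_bins reduction comes from.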

@adam2392 (Member) commented Mar 7, 2023

> I am resurrecting this, but for whoever would like to work on trees, I think the approximate splitter contribution should be assigned high priority. This should not be that difficult to do (if you are familiar with the tree codebase) and could yield huge speedups at very little loss of predictive performance (if any). The idea is simple and consists of binning the values of a feature before looking for the best split. This means we reduce the number of threshold evaluations by a factor of n_samples / n_bins. It can be seen as an intermediate solution between regular decision trees and extremely randomized trees.

Hi @glouppe

Is the issue of adding "binning" support to the tree submodule still open and "help wanted"? (i.e. #5212 (comment)) I suspect this would greatly improve tree training/prediction runtime at large sample and feature counts.

Currently, GBDTs have the histogram-based estimators, which have a Cythonized bin-mapper: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_hist_gradient_boosting/_binning.pyx.

This has an exposed Python API: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_hist_gradient_boosting/binning.py.

If binning is added, this would then just require storing the bin mapper in BaseDecisionTree. Testing and documentation should also be added, but the other major component of a PR (if this work is still desired) could be some systematic benchmarking, increasing sample size and feature size, to determine a good recommendation in the user docs for when to activate binning.
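
A hedged usage sketch of that path: _BinMapper is a private class whose signature may change, but roughly it discretizes each feature into at most 255 bins, and feeding the resulting integer matrix to an ordinary tree gives a quick, rough estimate of what binning does to training time (not how a proper integration would look):

```python
import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_binned = _BinMapper(n_bins=256).fit_transform(X)  # one small integer bin index per value
tree = DecisionTreeClassifier().fit(X_binned, y)     # far fewer distinct thresholds to evaluate
```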

@SeniorMars commented:

If I wanted to help, where should I look to gain the background needed to implement something like "Support categorical variables"?

Thanks!

@adrinjalali (Member) commented:

@adam2392 I think you could have a look here and give an update maybe?

@adam2392 (Member) commented Aug 22, 2024

Based on the Issue description:

> Longer range goals which I'd like to work towards (but have no clear plan as of right now) are the following:

Longer term, if we are not thinking of using C++ more, then I think there are possible refactorings that could be explored to allow more general-purpose decision trees (without losing performance, of course).

Is this what you meant by "update", @adrinjalali?

@adrinjalali (Member) commented:

I meant exactly what you posted ❤️

@glemaitre (Member) commented:

> Add an approximate splitter: Unsure what this is.

It comes back to implementing a binning strategy instead of evaluating every possible split, as in HistGradientBoostingClassifier.
