CC: @pprett @amueller @bdholt1
Hi folks,
Everyone will agree that tree-based methods have been shown to perform quite well (e.g., the recent achievement of Peter!) and are increasingly used by our users. However, the tree module still has a major drawback: it is slow as hell in comparison to other machine learning packages.
For that reason, I think we should put some more effort into accelerating the tree module. In particular, I would like to suggest moving the whole Tree class (not the estimators, only our struct-of-arrays representation) from tree.py into Cython in _tree.pyx. First, the code would be a lot faster. Second, it could actually be more readable and maintainable if the whole tree construction process were packaged into a single file, in a single class. Currently, the construction process is split across two files, the estimator classes, the Tree class and all the Cython routines. (imo, this is a mess.)
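For readers less familiar with the term, here is a minimal sketch of what a struct-of-arrays tree looks like: each node is just an index, and every per-node property lives in its own parallel array. The class name `ArrayTree`, its field names, and its methods are hypothetical and chosen for illustration only; they are not the actual attributes of our Tree class. Plain Python lists are used to keep the sketch dependency-free, whereas a real implementation would presumably use typed arrays.

```python
# Illustrative sketch only: names below are assumptions, not the real Tree API.
LEAF = -1  # sentinel child index marking a leaf node


class ArrayTree:
    """Binary decision tree stored as parallel arrays, one slot per node."""

    def __init__(self):
        self.children_left = []   # index of the left child, or LEAF
        self.children_right = []  # index of the right child, or LEAF
        self.feature = []         # feature index tested at the node
        self.threshold = []       # split threshold at the node
        self.value = []           # prediction stored at the node (leaves only)

    def _add_node(self, left, right, feature, threshold, value):
        # Append one entry to every array so they stay aligned.
        self.children_left.append(left)
        self.children_right.append(right)
        self.feature.append(feature)
        self.threshold.append(threshold)
        self.value.append(value)
        return len(self.value) - 1  # id of the node just created

    def add_leaf(self, value):
        return self._add_node(LEAF, LEAF, LEAF, 0.0, value)

    def add_split(self, feature, threshold, left, right):
        return self._add_node(left, right, feature, threshold, None)

    def predict(self, x, root):
        # Walk from the root down to a leaf by following child indices.
        node = root
        while self.children_left[node] != LEAF:
            if x[self.feature[node]] <= self.threshold[node]:
                node = self.children_left[node]
            else:
                node = self.children_right[node]
        return self.value[node]


# Build a one-split tree bottom-up: leaves first, then the root.
tree = ArrayTree()
low = tree.add_leaf(0.0)
high = tree.add_leaf(1.0)
root = tree.add_split(feature=0, threshold=0.5, left=low, right=high)

print(tree.predict([0.2], root))  # goes left
print(tree.predict([0.9], root))  # goes right
```

Because the nodes are nothing more than integer indices into flat arrays, this layout should map fairly directly onto C arrays in Cython, which is part of why moving it out of Python looks attractive.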
To show that the construction process could indeed be a lot faster, I profiled recursive_partition using line-profiler (see link below). Insignificant Python instructions actually take quite some time in comparison to the important parts of the algorithm. E.g., line 314 vs line 320. A mere Python if-statement is only twice as fast as finding the best threshold!!!
I'll let you examine the rest of the profiling report yourselves, but as far as I am concerned, I am convinced that we could significantly speed up the tree module (and be at least 5-10x faster).
http://pastebin.com/0rC1QmPy (toggle text wrapping)
What's your opinion about this? Since I am increasingly using the module myself, I can actually work on that in the days to come.