[RFC] Tree module improvements #5212
I am +1 for your first two proposals.
I think more code should definitely be factored out. However, I am not convinced about merging them. There is convenience in having the different splitting strategies decoupled into different objects/methods/functions/whatever, but sharing a common API. Having multiple strategies mixed into the same code is not a good thing in my opinion. For example, in addition to best and random splits, I would really like to evaluate "approximately best" splits, i.e., a splitting strategy that only evaluates a fixed number of possible cuts and takes the best of those. With a nicely decoupled API, this could be tried easily without intermixing code. In particular, one important point is that we should try to contain the splitting strategies in one place and try as much as possible not to percolate code into other parts of the codebase (e.g., having splitting-specific code in criteria).
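As a rough illustration of the "approximately best" idea, here is a minimal pure-Python sketch (the name `approx_best_split` and its signature are hypothetical, not part of the actual Splitter API): instead of scanning every possible cut on a feature, only a fixed number of randomly chosen candidate thresholds are scored.

```python
import numpy as np

def approx_best_split(x, y, n_candidates=10, rng=None):
    """Score only a fixed number of candidate cuts on one feature and
    return the best, instead of scanning every possible cut."""
    rng = np.random.default_rng(rng)
    values = np.unique(x)
    k = min(n_candidates, len(values))
    best_t, best_score = None, np.inf
    for t in rng.choice(values, size=k, replace=False):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        # Weighted within-child variance (the MSE criterion for regression).
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

Under this interface, best and random splits become the two extremes of the same strategy: score all possible cuts, or a single random cut per feature.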
Yes this makes sense for splitters that try to find best splits.
This part is useful for trees built depth-first on datasets having categorical features or sparse features. However, this could be removed if benchmarks show that there is no longer any interest in it.
I would add to your list all the criterion optimizations that you have done in #5041:
All five splitters share some common code, such as choosing which features to test for the next update, partitioning the sample array when a split is found, or computing the impurity improvement. I believe this could be refactored while keeping the current code structure. The multi-threading logic could probably be put at this higher level.
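To make the shared partitioning step concrete, here is a small Python sketch of the in-place rearrangement each splitter performs once a cut has been chosen (`partition_samples` is a hypothetical name; the real code is Cython and operates on the splitter's internal sample index array):

```python
def partition_samples(samples, X, feature, threshold, start, end):
    """Rearrange samples[start:end] in place so that indices going to the
    left child (X[s, feature] <= threshold) come first, and return the
    boundary position. Mirrors the step every splitter performs once a
    cut has been chosen."""
    p, partition_end = start, end
    while p < partition_end:
        if X[samples[p], feature] <= threshold:
            p += 1
        else:
            partition_end -= 1
            samples[p], samples[partition_end] = samples[partition_end], samples[p]
    return p
```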
I have updated my list. I think it is appropriate to tackle each issue with a single corresponding PR, sequentially. I will submit a PR for the reorganization as soon as #5203 is merged (which I will review tonight).

@arjoly my benchmarks have shown that constant feature caching makes building a single decision tree up to 20% slower on many datasets, including MNIST and covtype. However, I have not tried with categorical features or sparse data. The first step would be to determine whether it should be removed, and only remove it if it slows down the tree in the majority of cases.

@glouppe you bring up a good point. In the future I would like to add an approximate splitter as well. I think we're all on the same page though: reuse code as much as possible, but keep each splitting strategy in its own object.
Please go ahead :)
Looking back at our implementation of GBRT, I realize that in most cases, when the loss is not the squared error loss, all values and statistics calculated at the leaves are mostly calculated for nothing, because they are anyway overridden by subsequent calls to …
In addition, I am under the impression that implementations of …
I am updating the list to add the test about importance computations.
Removing constant features caused things to slow down, but also caused a gain in accuracy for random forests. The difference could be due to changes in which features are randomly generated, as I changed …
Thanks for checking!
Thanks for checking @jmschrei!
A style question I have is why we are using SIZE_t, DTYPE_t, DOUBLE_t, double, UINT32_t, INT32_t, unsigned char, and int as different datatypes in these modules. It seems excessively confusing. Would it be useful for one of these PRs to merge all these datatypes into SIZE_t for ints, DTYPE_t for 32-bit floats, and DOUBLE_t for 64-bit floats?
@arjoly I think I might've deleted your addition to the list about feature computations. If I did, can you re-add it at your convenience please?
These are defined based on Numpy's own types, not on C types. I would not change that, since internal changes in Numpy might otherwise lead to errors in our implementation. This would also be against Numpy guidelines. CC: @larsmans
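For readers following along, here is a rough Python-level picture of the mapping under discussion; the real definitions are Cython ctypedefs in the tree module's .pxd files, and the exact choices below are an assumption, not a quote of the source:

```python
import numpy as np

# Assumed correspondence between the tree module's type aliases and
# Numpy dtypes (illustrative only; check the .pxd files for the truth):
DTYPE = np.float32   # DTYPE_t: feature values (X)
DOUBLE = np.float64  # DOUBLE_t: targets, weights, node statistics (y)
SIZE = np.intp       # SIZE_t: array indices and counts
INT32 = np.int32     # INT32_t
UINT32 = np.uint32   # UINT32_t: e.g. random number generator state

X = np.asarray([[0.0, 1.0]], dtype=DTYPE)  # trees expect float32 inputs
y = np.asarray([1.0], dtype=DOUBLE)        # and float64 targets
```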
I didn't mean physically merging them. I meant changing all integers to be of type SIZE_t rather than INT32_t, UINT32_t, unsigned char, and int, and changing all doubles to be of type DOUBLE_t or DTYPE_t as appropriate. We'd still use the numpy types.
I would not do that systematically. In particular, SIZE_t should be used only for indexing arrays and cannot hold negative values. What are the parts of the code that you find inconsistent? (In most cases, types were chosen for a good reason, but there may still be some places left where better types could be used.)
I've added a new PR (#5278) which cleans up criterion and will make adding caching easier. It also means that all criteria store three pointers (node_sum, node_sum_left, and node_sum_right), which will need to be handled when multithreading happens.

Another concern for multithreading is that, with the current splitters, a single thread pool cannot be built and shared across splits, because accessing the thread pool requires the GIL. Currently, we would have to create a new thread pool once per possible split, which can mean thousands of times for deep trees. A possible solution is to create a new splitter which does not release the GIL, used for parallel single-decision-tree building. I am not fond of this option, and it will likely cause @GaelVaroquaux to wring my neck, but it would be simple.

As a side note, @glouppe, when you say SIZE_t should only be used for indexing arrays, what do you mean? Should it not be used to generically store non-negative integers?
it will likely cause @GaelVaroquaux to wring my neck, but it would be simple.
If it's simple, I will probably let you live.
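For illustration, a plain-Python sketch of the pool-reuse concern (helper names are hypothetical; the real constraint lives in Cython nogil sections): building an executor at every node is the wasteful pattern described above, while a single long-lived pool amortizes the setup cost.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def eval_feature(X, y, f):
    # Stand-in for the per-feature split evaluation done at each node.
    return np.var(X[:, f]), f

# Wasteful pattern: build and tear down a pool at every node (split).
def split_node_fresh_pool(X, y, features):
    with ThreadPoolExecutor(max_workers=4) as pool:
        return min(pool.map(lambda f: eval_feature(X, y, f), features))

# Amortized pattern: one long-lived pool reused for every node.
POOL = ThreadPoolExecutor(max_workers=4)

def split_node_shared_pool(X, y, features):
    return min(POOL.map(lambda f: eval_feature(X, y, f), features))
```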
Apologies for disappearing temporarily. I should have more time for this now.
@jmschrei Do you plan to do "Add caching of computation between different split levels to avoid recomputation"? I may find some time to work on this at the sprint.
I'm halfway through the PR. However, research has struck temporarily, and so I've had far less time than I expected to work on this.
I should be able to finish it this weekend though.
@jmschrei Great!!! I can't wait to review your pull request.
I'll be working (directly or helping complete an existing PR) on the last 3 TODOs + 1 more in the following order...
Great! I have been slacking recently; glad that someone else is contributing. One thing to keep in mind is that the Criterion objects will eventually have to become more functional (stateless) so that they can allow for multithreaded building of individual trees. Keep this in mind if you add more stored state to that class.
I am resurrecting this, but for whoever would like to work on trees, I think the approximate splitter contribution should be assigned high priority. This should not be that difficult to do (if you are familiar with the tree codebase) and could yield huge speedups at very little decrease in predictive performance (if any). See e.g. http://arxiv.org/abs/1609.06119 where they compare a similar approach to … CC: @nelson-liu
Hi @glouppe Is the issue of adding "binning" support to the tree submodule still open and "help wanted"? (i.e. #5212 (comment)) I would suspect this would greatly improve tree training/prediction runtime at high sample sizes and feature counts. Currently, GBDTs have the … This has an exposed Python API interface: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_hist_gradient_boosting/binning.py. This then just requires the storage of the BinMapper in …
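For reference, the binning machinery referred to here underpins the public histogram-based estimators (the `_BinMapper` class itself is private); a minimal usage sketch of the public API, assuming a recent scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# max_bins controls how many discrete levels each feature is bucketed
# into before split finding; candidate splits are then evaluated per
# bin boundary rather than per unique feature value.
clf = HistGradientBoostingClassifier(max_bins=255, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```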
If I wanted to help, where should I look to gain the background needed to implement something like "Support categorical variables"? Thanks!
@adam2392 I think you could have a look here and give an update maybe?
Based on the Issue description:
Longer term, if we are not thinking of using C++ more, then I think there are possible refactorings that could be explored to allow more general-purpose decision trees (without losing performance ofc). Is this what you meant by "update", @adrinjalali?
I meant exactly what you posted ❤️
It comes back to implementing a binning strategy instead of evaluating every possible split, as in the …
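A toy sketch of what such a binned split search could look like (`binned_best_split` is a hypothetical name; the real histogram-GBDT code accumulates gradient histograms in Cython): candidate cuts are evaluated only at quantile bin edges rather than at every unique feature value.

```python
import numpy as np

def binned_best_split(x, y, n_bins=32):
    """Bucket one feature into quantile bins and evaluate candidate cuts
    only at the bin edges, instead of at every unique value of x."""
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1]))
    best_edge, best_score = None, np.inf
    for t in edges:
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = len(left) * left.var() + len(right) * right.var()
        if score < best_score:
            best_edge, best_score = t, score
    return best_edge, best_score
```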
I am planning on submitting several PRs in an attempt to gradually merge #5041, with the ultimate goal being a clean implementation of multithreaded decision tree building so that Gradient Boosting can be faster. With one of the main concepts merged (#5203), here is a list of separate PRs which I'd like to merge in the near future.
Longer range goals which I'd like to work towards (but have no clear plan as of right now) are the following:
At this point, it will be clearer to me what specific changes to Splitter, Criteria, and TreeBuilder need to be added to make multithreading a possibility. @glouppe @arjoly @GaelVaroquaux @pprett if you have any comments, I'd love to hear them.