[RFC] Tree module improvements #5212

Open · 5 of 12 tasks
jmschrei opened this issue Sep 4, 2015 · 35 comments

Labels: Enhancement, help wanted, Moderate (anything that requires some knowledge of conventions and best practices), module:tree

Comments

@jmschrei (Member) commented Sep 4, 2015

I am planning to submit several PRs to merge #5041 incrementally, with the ultimate goal of a clean implementation of multithreaded decision tree building so that Gradient Boosting can be faster. With one of the main concepts merged (#5203), here is a list of separate PRs which I'd like to merge in the near future.

Longer range goals which I'd like to work towards (but have no clear plan as of right now) are the following:

  • Add an approximate splitter
  • Add multithreading support for single decision trees
  • Add a partial fit method for tree building
  • Support categorical variables
  • Support missing values

At that point, it will be clearer to me what specific changes to Splitter, Criterion, and TreeBuilder are needed to make multithreading possible. @glouppe @arjoly @GaelVaroquaux @pprett, if you have any comments, I'd love to hear them.

@glouppe (Contributor) commented Sep 4, 2015

I am +1 on your first two proposals.

> Merge BestSplitter and RandomSplitter

I think more code should be factorized, for sure. However, I am not convinced about merging them. There is convenience in having different splitting strategies decoupled into different objects/methods/functions/whatever, but sharing a common API. Having multiple strategies mixed into the same code is not a good thing in my opinion. For example, in addition to best and random splits, I would really like to evaluate "approximately best" splits, i.e., a splitting strategy that only evaluates a fixed number of possible cuts and takes the best of those. With a nicely decoupled API, this could be tried easily without intermixing code.

In particular, one important point is that we should try to keep the splitting strategies in one place and, as much as possible, not let splitting-specific code percolate into other parts of the codebase (e.g., into the criteria).
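
To make the point concrete, here is a purely hypothetical, Python-level sketch (the real splitters are Cython classes with a different signature): each strategy implements the same node_split contract, so the tree builder never needs to know which strategy it was handed, and a new "approximately best" strategy slots in without touching the others.

```python
class SplitterStrategySketch:
    """Common contract: find a split for the samples routed to one node."""

    def node_split(self, X_node, y_node):
        raise NotImplementedError


class BestSplitSketch(SplitterStrategySketch):
    def node_split(self, X_node, y_node):
        ...  # evaluate every threshold on each candidate feature, keep the best


class RandomSplitSketch(SplitterStrategySketch):
    def node_split(self, X_node, y_node):
        ...  # draw one random threshold per candidate feature, keep the best


class ApproxBestSplitSketch(SplitterStrategySketch):
    def node_split(self, X_node, y_node):
        ...  # evaluate only a fixed number of candidate cuts, keep the best
```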

> Add presorting as an option to all splitters to reduce the number of splitters

Yes, this makes sense for splitters that try to find the best splits.

glouppe changed the title from "tree module improvements" to "[RFC] Tree module improvements" on Sep 4, 2015
@arjoly (Member) commented Sep 4, 2015

> Remove constant feature caching should it prove to not actually be helpful (it wasn't in my original tests)

This part is useful for trees built depth-first on datasets with categorical or sparse features. However, it could be removed if benchmarks show that it no longer brings any benefit.

@glouppe (Contributor) commented Sep 4, 2015

Another thing that I would like to do, once @arjoly is done with #5203 and his next PR, is to split this huge _tree.pyx file into several files, so that it becomes easier to understand and review. I am adding that to your list. (I can/will take care of it)

@arjoly (Member) commented Sep 4, 2015

> Another thing that I would like to do, once @arjoly is done with #5203 and his next PR, is to split this huge _tree.pyx file into several files, so that it becomes easier to understand and review. I am adding that to your list.

+1

@arjoly (Member) commented Sep 4, 2015

I would add to your list all the criterion optimizations that you have done in #5041:

  • pre-computing w * y_ik * y_ik and w * y_ik for MSE and FriedmanMSE
  • avoiding re-computing total statistics
  • avoiding re-computing total weights.

All 5 splitters share some common code, such as choosing which features to test for the next update, partitioning the sample array when a split is found, or computing the impurity improvement. I believe this could be refactored while keeping the current code structure. The multithreading logic could probably be put at this higher level.
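
For illustration, a minimal NumPy sketch of the idea behind the first optimization (the function name and signature are made up, not the Cython Criterion API): once the weighted sums sum(w), sum(w*y) and sum(w*y*y) are cached for a node, every candidate split along a sorted feature is scored in O(1) by moving one sample at a time from the right child to the left child, so nothing is recomputed from scratch.

```python
import numpy as np

def best_mse_split(y, w, order):
    """Best MSE split position for one feature; `order` is that feature's argsort
    restricted to the node's samples, and weights are assumed to be positive."""
    y, w = y[order], w[order]
    sum_w, sum_wy, sum_wy2 = w.sum(), (w * y).sum(), (w * y * y).sum()
    lw = lwy = lwy2 = 0.0
    best_pos, best_score = None, np.inf
    for i in range(len(y) - 1):                 # candidate split between i and i + 1
        lw += w[i]
        lwy += w[i] * y[i]
        lwy2 += w[i] * y[i] * y[i]
        rw, rwy, rwy2 = sum_w - lw, sum_wy - lwy, sum_wy2 - lwy2
        # weighted SSE of a child = sum(w*y^2) - (sum(w*y))^2 / sum(w)
        score = (lwy2 - lwy * lwy / lw) + (rwy2 - rwy * rwy / rw)
        if score < best_score:
            best_pos, best_score = i + 1, score
    return best_pos, best_score
```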

@jmschrei (Member, Author) commented Sep 4, 2015

I have updated my list. I think it is appropriate to tackle each issue with a single corresponding PR, sequentially. I will submit a PR for the reorganization as soon as #5203 is merged (which I will review tonight).

@arjoly my benchmarks have shown that constant feature caching makes building a single decision tree up to 20% slower on many datasets, including MNIST and covtype. However, I have not tried it with categorical features or sparse data. The task would be to first determine whether it should be removed, and only remove it if it slows down the tree in the majority of cases.

@glouppe you bring up a good point. In the future I would like to add an approximate splitter as well. I think we're all on the same page, though: reuse code as much as possible, but keep the splitting strategies distinct, one per object.

@glouppe (Contributor) commented Sep 8, 2015

> I will submit a PR for the reorganization as soon as #5203 is merged (which I will review tonight)

Please go ahead :)

@glouppe (Contributor) commented Sep 9, 2015

Looking back at our implementation of GBRT, I realize that in most cases, when the loss is not the squared error loss, all the values and statistics calculated at the leaves are computed for nothing, because they are overridden anyway by subsequent calls to _update_terminal_region in the loss functions. This is certainly something to investigate to make boosting faster, but it may require changes in the internal API of GBRT and trees.

@glouppe (Contributor) commented Sep 9, 2015

In addition, I am under the impression that implementations of _update_terminal_region are not always optimal -- the complexity of each leaf update is lower bounded by the total number of samples in the training set rather than by the number of samples in that leaf, typically because of terminal_region = np.where(terminal_regions == leaf)[0]. I don't know if this is a real issue though. An opinion, @pprett?
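
For illustration, a small NumPy sketch (a hypothetical helper, not the GBRT internals) of how the per-leaf sample indices could be built once in a single O(n log n) pass, instead of scanning the whole terminal_regions array once per leaf:

```python
import numpy as np

def group_samples_by_leaf(terminal_regions):
    """Map each leaf id to the indices of the training samples it contains."""
    order = np.argsort(terminal_regions, kind="stable")
    leaves, starts = np.unique(terminal_regions[order], return_index=True)
    bounds = np.append(starts, len(order))
    return {leaf: order[bounds[i]:bounds[i + 1]] for i, leaf in enumerate(leaves)}
```

Each leaf update would then only touch its own slice of samples.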

@arjoly (Member) commented Sep 10, 2015

I am updating the list to add the test about importance computations.

@jmschrei (Member, Author) commented:

BRANCH
Classification performance:
===========================
Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  79.8781s   0.3793s     0.0219  
CART          15.0324s   0.0216s     0.0444 

MASTER
Classification performance:
===========================
Classifier   train-time test-time error-rate
--------------------------------------------
RandomForest  37.2150s   0.4297s     0.0330  
CART          12.1810s   0.0224s     0.0424 

Removing constant-feature caching caused things to slow down, but also caused a gain in accuracy for random forests. The difference could be due to changes in how features are randomly drawn, as I changed f_j = rand_int(n_drawn_constants, f_i - n_found_constants, random_state) to f_j = rand_int(f_i, n_features, random_state). I do get a 10-15% speed-up on datasets which do not have constant features (like my Gaussian dataset), but if there are constant features there can be a significant slowdown. I do not think this is worth pursuing, and it is likely a big reason why my PR was slower.

@glouppe (Contributor) commented Sep 11, 2015

> I do not think this is worth pursuing, and it is likely a big reason why my PR was slower.

Thanks for checking!

@arjoly (Member) commented Sep 11, 2015

Thanks for checking @jmschrei !

@jmschrei (Member, Author) commented:

A style question I have is why we are using SIZE_t, DTYPE_t, DOUBLE_t, double, UINT32_t, INT32_t, unsigned char, and int as different datatypes in these modules. It seems excessively confusing. Would it be useful for one of these PRs to merge all these datatypes into SIZE_t for ints, DTYPE_t for 32-bit floats, and DOUBLE_t for 64-bit floats?

@jmschrei (Member, Author) commented:

@arjoly I think I might've deleted your addition to the list about feature computations. If I did, can you re-add it at your convenience please?

@glouppe (Contributor) commented Sep 12, 2015

> A style question I have is why we are using SIZE_t, DTYPE_t, DOUBLE_t, double, UINT32_t, INT32_t, unsigned char, and int as different datatypes in these modules. It seems excessively confusing. Would it be useful for one of these PRs to merge all these datatypes into SIZE_t for ints, DTYPE_t for 32-bit floats, and DOUBLE_t for 64-bit floats?

These are defined based on NumPy's own types, not on C types. I would not change that, since internal changes in NumPy might otherwise lead to errors in our implementation. This would also be against NumPy guidelines. CC: @larsmans
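
For reference, a rough summary of that correspondence (based on the typedefs in the tree module's .pxd files at the time; details may have changed since):

```python
import numpy as np

# DTYPE_t  ~ np.float32  -- feature values (X)
# DOUBLE_t ~ np.float64  -- targets, sample weights, impurities
# SIZE_t   ~ np.intp     -- array indices, sample and node counts
# INT32_t  ~ np.int32, UINT32_t ~ np.uint32 -- e.g. the random number generator state
DTYPE = np.float32
DOUBLE = np.float64
```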

@jmschrei (Member, Author) commented:

I didn't mean physically merging them. I meant change all integers to be of type SIZE_t, rather than INT32_t, UINT32_t, unsigned char*, and int as well, and change all doubles to be of type DOUBLE_t or DTYPE_t as appropriate. We'd still use the NumPy types.

@glouppe (Contributor) commented Sep 12, 2015

> I meant change all integers to be of type SIZE_t, rather than INT32_t, UINT32_t, unsigned char*, and int as well

I would not do that systematically. In particular, SIZE_t should be used only for indexing arrays and cannot hold negative values.

What are the parts of the code that you find inconsistent? (In most cases, the types were chosen for a good reason, but there may still be some places left where better types could be used.)

@jmschrei (Member, Author) commented:

I've added a new PR (#5278) which cleans up criterion and will make adding caching easier.

It also means that all criteria store three pointers (node_sum, node_sum_left, and node_sum_right), which will need to be handled when multithreading happens. Another concern for multithreading is that, with the current splitters, a single thread pool cannot be shared across splits, because accessing the thread pool requires the GIL. Currently, we would have to create a new thread pool once per candidate split, which can be thousands of times for deep trees.

A possible solution is to create a new splitter which does not release the GIL, used for parallel single decision tree building. I am not fond of this option, and it will likely cause @GaelVaroquaux to wring my neck, but it would be simple.

As a side note, @glouppe, when you say SIZE_t should only be used for indexing arrays, what do you mean? Should it not be used to generically store non-negative integers?

@GaelVaroquaux (Member) commented Sep 16, 2015 via email

@jmschrei (Member, Author) commented:

Also, PR #5252, regarding merging PresortBestSplitter and BestSplitter, is still awaiting review. I would like to have both #5252 and #5278 merged before adding caching across splits.

@jmschrei (Member, Author) commented Oct 9, 2015

Apologies for disappearing temporarily. I should have more time for this now.

@arjoly (Member) commented Oct 20, 2015

@jmschrei Do you plan to do "Add caching of computation between different split levels to avoid recomputation"? I may find some time to work on this at the sprint.

@jmschrei (Member, Author) commented:

I'm halfway through the PR. However, research has gotten in the way temporarily, so I've had far less time than I expected to work on this.

@jmschrei (Member, Author) commented:

I should be able to finish it this weekend though.

@arjoly (Member) commented Oct 20, 2015

@jmschrei Great!!! I can't wait to review your pull request.

glouppe added the "Moderate" label on Oct 21, 2015
@raghavrv (Member) commented Nov 10, 2015

I'll be working (directly or helping complete an existing PR) on the last 3 TODOs + 1 more in the following order...

@jmschrei (Member, Author) commented:

Great! I have been slacking recently, glad that someone else is contributing.

One thing to keep in mind is that eventually the Criterion objects will have to become more functional (stateless) so that they can allow multithreaded building of individual trees. Keep this in mind if you add more stored state to that class.
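
As a purely hypothetical illustration of "more functional": a criterion evaluated as a pure function of a node's sufficient statistics, with no mutable per-object buffers, would be trivially safe to call from several threads at once.

```python
def weighted_mse(sum_w, sum_wy, sum_wy2):
    """Weighted MSE of a node, computed only from its cached sufficient statistics."""
    mean = sum_wy / sum_w
    return sum_wy2 / sum_w - mean * mean
```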

@glouppe (Contributor) commented Sep 29, 2016

I am resurrecting this, but for whoever would like to work on trees, I think the approximate splitter contribution should be assigned high priority. This should not be that difficult to do (if you are familiar with the tree codebase) and could yield huge speedups at very little loss of predictive performance (if any).
The idea is simple and consists of binning the values of a feature before looking for the best split. This means we reduce the number of threshold evaluations by a factor of n_samples / n_bins. It can be seen as an intermediate solution between regular decision trees and extremely randomized trees.

See e.g. http://arxiv.org/abs/1609.06119 where they compare a similar approach to GradientBoostingClassifier.

CC: @nelson-liu
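
A minimal sketch of the binning idea, assuming simple quantile binning (the function name and the n_bins value are illustrative, not a proposed API):

```python
import numpy as np

def binned_thresholds(x, n_bins=32):
    """Candidate thresholds for one feature: at most n_bins - 1 bin edges,
    instead of one threshold per distinct sample value."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.unique(np.quantile(x, quantiles))
```

The splitter would then only evaluate these edges, which is where the n_samples / n_bins reduction comes from.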

@adam2392 (Member) commented Mar 7, 2023

> I am resurrecting this, but for whoever would like to work on trees, I think the approximate splitter contribution should be assigned high priority. This should not be that difficult to do (if you are familiar with the tree codebase) and could yield huge speedups at very little loss of predictive performance (if any). The idea is simple and consists of binning the values of a feature before looking for the best split. This means we reduce the number of threshold evaluations by a factor of n_samples / n_bins. It can be seen as an intermediate solution between regular decision trees and extremely randomized trees.

Hi @glouppe

Is the issue of adding "binning" support to the tree submodule still open and "help wanted"? (i.e. #5212 (comment)) I suspect this would greatly improve tree training/prediction runtime at large sample and feature counts.

Currently, GBDTs have the histogram-based estimators, which have a Cythonized bin-mapper: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_hist_gradient_boosting/_binning.pyx.

This has an exposed Python API: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_hist_gradient_boosting/binning.py.

If binning is added, this would then just require storing the bin mapper in BaseDecisionTree. Testing and documentation should also be added, but the other major component of a PR (if this work is still desired) could be some systematic benchmarking, increasing sample size and feature size, to determine a good recommendation in the user docs for when to activate binning.
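
A hedged usage sketch of that path: _BinMapper is a private class whose signature may change, but roughly it discretizes each feature into at most 255 bins, and feeding the resulting integer matrix to an ordinary tree gives a quick, rough estimate of what binning does to training time (not how a proper integration would look):

```python
import numpy as np
from sklearn.ensemble._hist_gradient_boosting.binning import _BinMapper
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(10_000, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_binned = _BinMapper(n_bins=256).fit_transform(X)  # one small integer bin index per value
tree = DecisionTreeClassifier().fit(X_binned, y)     # far fewer distinct thresholds to evaluate
```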

@SeniorMars commented:

If I wanted to help, where should I look to gain the background needed to implement something like "Support categorical variables"?

Thanks!

@adrinjalali (Member) commented:

@adam2392 I think you could have a look here and give an update maybe?

@adam2392 (Member) commented Aug 22, 2024

Based on the Issue description:

> Longer range goals which I'd like to work towards (but have no clear plan as of right now) are the following:

Longer term, if we are not thinking of using C++ more, then I think there are possible refactorings that could be explored to allow more general-purpose decision trees (without losing performance, of course).

Is this what you meant by "update", @adrinjalali?

@adrinjalali (Member) commented:

I meant exactly what you posted ❤️

@glemaitre (Member) commented:

> Add an approximate splitter: Unsure what this is.

It comes back to implementing a binning strategy instead of evaluating every possible split, as in HistGradientBoostingClassifier.
