efficient grid search for random forests #3652

Closed

mblondel opened this issue Sep 10, 2014 · 4 comments

@mblondel
Member

To get the best performance out of Random Forests, it is necessary to tune parameters like max_depth, min_samples_split and min_samples_leaf. If we wrap an RF in GridSearchCV, trees are built from scratch every time. However, for depth-first tree induction, deeper trees share the same base as shallower trees. An idea to speed up grid search is thus to not rebuild trees from scratch every time. Here's an example for max_depth=[10, 9, ..., 1].

  1. Set max_depth=10
  2. Build n_estimators fully grown trees
  3. Prune the trees to a maximum depth of max_depth
  4. Create an RF for this max_depth and evaluate it on the current train/test split
  5. Decrease max_depth and go to step 3

Such an algorithm could be wrapped in RandomForestClassifierCV / RandomForestRegressorCV classes.
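
A rough sketch of what steps 2–5 could look like, using only the public tree_ arrays (children_left, children_right, feature, threshold, value). predict_tree_at_depth is a helper made up here for illustration, not an existing API, and it assumes node values are available for internal nodes (see the next comment):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def predict_tree_at_depth(tree, X, max_depth):
    """Predict with one fitted tree, treating nodes at max_depth as leaves."""
    left, right = tree.children_left, tree.children_right
    feature, threshold, value = tree.feature, tree.threshold, tree.value
    out = np.empty(X.shape[0])
    for i, x in enumerate(X):
        node, depth = 0, 0
        while left[node] != -1 and depth < max_depth:
            node = left[node] if x[feature[node]] <= threshold[node] else right[node]
            depth += 1
        out[i] = value[node][0][0]  # needs node values stored at internal nodes
    return out


X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Steps 1-2: build the fully grown trees only once.
forest = RandomForestRegressor(n_estimators=50, max_depth=None, random_state=0)
forest.fit(X_tr, y_tr)

# Steps 3-5: re-evaluate the same trees at decreasing depths, with no refitting.
for depth in range(10, 0, -1):
    preds = np.mean(
        [predict_tree_at_depth(e.tree_, X_te, depth) for e in forest.estimators_],
        axis=0,
    )
    print(depth, r2_score(y_te, preds))
```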

@mblondel
Member Author

Just remembered that our implementation currently only stores node values for terminal nodes. We would need an option to compute the node values for non-terminal nodes too.

Another idea is that trees don't need to be explicitly pruned. It would be easier/faster to add max_depth, min_samples_split and min_samples_leaf options directly to the apply and predict methods of Tree. When evaluating a tree, we can just return the current node's value as soon as the max_depth, min_samples_split or min_samples_leaf stopping condition is reached.
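
A possible shape for that, sketched as a free function rather than a Tree method (apply_constrained is hypothetical, not part of scikit-learn, and min_samples_leaf is left out here): it walks the fitted tree and stops as soon as the requested max_depth or min_samples_split would have stopped the induction.

```python
import numpy as np


def apply_constrained(tree, X, max_depth=None, min_samples_split=2):
    """For each sample, return the id of the node where descent stops."""
    left, right = tree.children_left, tree.children_right
    feature, threshold = tree.feature, tree.threshold
    n_node_samples = tree.n_node_samples
    node_ids = np.empty(X.shape[0], dtype=np.intp)
    for i, x in enumerate(X):
        node, depth = 0, 0
        while left[node] != -1:  # not a real leaf
            if max_depth is not None and depth >= max_depth:
                break
            if n_node_samples[node] < min_samples_split:
                break
            node = left[node] if x[feature[node]] <= threshold[node] else right[node]
            depth += 1
        node_ids[i] = node
    return node_ids
```

Predictions would then just read tree.value at the returned node ids, which again requires values to be stored for internal nodes.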

@glouppe
Contributor

glouppe commented Sep 13, 2014

This would not work for min_samples_leaf, as this criterion discards candidate splits during training that might otherwise have been better, e.g. an unbalanced split at the root.

@glouppe
Contributor

glouppe commented May 11, 2015

Just remembered that our implementation currently only stores node values for terminal nodes. We would need an option to compute the node values for non-terminal nodes too.

This is now no longer true :)
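
A quick check with a recent scikit-learn, using only the public tree_ attributes: value now has one row per node, internal nodes included, so the pruning-at-prediction ideas above can read them directly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0).fit(X, y).tree_
# value has shape (node_count, n_outputs, n_classes), one entry per node
print(tree.node_count, tree.value.shape)
```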

@adrinjalali
Member

I don't think we're adding CV classes for random forests, and the idea of warm_start is discussed in other places. So closing this one.
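
For reference, the warm_start mechanism referred to covers the n_estimators part of the grid: with warm_start=True, each successive fit call keeps the trees built so far and only adds the newly requested ones. A minimal sketch, not a substitute for the max_depth idea above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(warm_start=True, random_state=0)
for n in (25, 50, 100, 200):
    clf.set_params(n_estimators=n)
    clf.fit(X_tr, y_tr)  # only the newly added trees are built
    print(n, clf.score(X_te, y_te))
```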

@adrinjalali closed this as not planned on Apr 18, 2024