RandomForest{Classifier|Regressor}CV to efficiently find the best n_estimators #7243

Closed

raghavrv opened this issue Aug 25, 2016 · 21 comments

@raghavrv
Member

raghavrv commented Aug 25, 2016

This could be expanded to other params once warm_start is introduced for other tree params in DecisionTreeRegressor/DecisionTreeClassifier.

@ogrisel @glouppe @agramfort @jnothman @amueller @MechCoder @jmschrei votes please?

If there is sufficient interest I can raise a PR for this...
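
To illustrate the mechanism this would build on (a minimal sketch, not the proposed API; the dataset, split, and candidate grid are illustrative), warm_start already lets one grow a forest incrementally and score each size on a held-out set without refitting the earlier trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(warm_start=True, random_state=0)
val_scores = {}
for n in [10, 30, 100, 300]:
    forest.set_params(n_estimators=n)   # only the new trees are fit
    forest.fit(X_train, y_train)
    val_scores[n] = forest.score(X_val, y_val)
print(val_scores)
```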

@raghavrv
Member Author

raghavrv commented Aug 25, 2016

Also, do we want an early_stopping option like the one we will (soon) have after #7071 (GBCV)?

@amueller
Member

I'd love this, but it's not release-critical, and I'll only look at release-critical stuff for now. I want multiple-metric support to make it in (at least multiple metrics with one scorer per metric).

@raghavrv
Member Author

Ok I'll focus on that then... :)

@jmschrei
Member

I think we need a more general API for dealing with decisions based on validation performance. This would be useful for post-pruning as well, but it makes more sense to me to share an API between GradientBoosting, this, and post-pruning, rather than make a separate object for each.

@amueller
Member

@jmschrei input very welcome. The usual approach would be to create per-estimator CV objects. I'm not a huge fan of that API, but it's pretty well established by now. I want to enable using these in GridSearchCV, and then maybe at some point we can get a better API.

What is the shared functionality that you'd have in the shared object?

@jmschrei
Member

The shared functionality seems to be that they require splitting your dataset into a training and a validation set, and they use performance on the validation set to make decisions about how to proceed with the training. I'm not sure GridSearchCV can handle these cases, as they require building the models iteratively (adding new trees for RandomForest/GradientBoosting and removing nodes for post-pruning). Do you think that a "ValidationSetBuilder" (terrible name, I know), or something that wraps a model-specific function, might work? For example, for GradientBoosting/RandomForests the model-specific function would add another tree to the ensemble, and for post-pruning it might remove the next node in a series.

I do understand that adding an ObjectCV is pretty well established and seems to work; I'm just worried about a blowup in the number of classes that all do pretty much the same thing.
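
Purely as a sketch of that idea (none of these names or signatures exist in scikit-learn; the wrapper, its arguments, and the stopping rule are all hypothetical), such a helper might look roughly like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


class ValidationSetBuilder:
    """Hypothetical wrapper: hold out a validation set and repeatedly call a
    model-specific 'grow one step' function until the score stops improving."""

    def __init__(self, step_fn, score_fn, patience=3):
        self.step_fn = step_fn    # e.g. add trees to a forest / prune one node
        self.score_fn = score_fn  # e.g. accuracy on the validation set
        self.patience = patience  # steps without improvement before stopping

    def build(self, X, y):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
        best, stale, model, history = -np.inf, 0, None, []
        while stale < self.patience:
            model = self.step_fn(model, X_tr, y_tr)         # one growth step
            score = self.score_fn(model, X_val, y_val)
            history.append(score)
            best, stale = (score, 0) if score > best else (best, stale + 1)
        return model, history


def grow_forest(model, X, y):
    # Hypothetical step function: add 10 warm-started trees per call.
    if model is None:
        model = RandomForestClassifier(n_estimators=10, warm_start=True,
                                       random_state=0)
    else:
        model.set_params(n_estimators=model.n_estimators + 10)
    return model.fit(X, y)
```

`ValidationSetBuilder(grow_forest, lambda m, X, y: m.score(X, y)).build(X, y)` would then grow a forest until the validation accuracy stops improving; post-pruning would plug in a different step function.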

@amueller
Member

I'm also worried about that. Having a generic method for this would be great. Currently GridSearchCV can't handle it. In the ideal case, though, I'd like to be able to put the thing in a pipeline and still do a grid search over it. With an EstimatorCV as it is currently, that performs a nested cross-validation, which is pretty wasteful.

Anyhow, I'd be happy to see a prototype of this. How do you declare the dependency between the parameters in the interface, though? For example, one could imagine that a random forest can be built along two directions: by creating more depth, or by adding more trees.

Basically we want a _fit_more method that takes the one parameter that changes and keeps the rest constant. For warm-startable models that should be relatively straightforward. It still needs to be defined which of the parameters are compatible with warm-starting; that is estimator-specific and therefore needs to be implemented there.

Not all of the CV estimators are based on warm starting, though. There's also RFECV, which works slightly differently. I'm not sure what the best common interface would be.
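
To make the shape of that concrete (a hypothetical sketch only; neither _fit_more nor this mixin exists in scikit-learn), an estimator could declare the single parameter it can grow under warm starting and expose one step of extra fitting along that axis:

```python
from sklearn.ensemble import RandomForestClassifier


class WarmGrowableMixin:
    """Hypothetical mixin: the estimator names the one parameter that may
    grow between calls; all other hyper-parameters stay constant."""

    _growable_param = None  # e.g. "n_estimators" for forests

    def _fit_more(self, X, y, value):
        # Grow the model along the declared axis, reusing the existing fit.
        self.set_params(warm_start=True, **{self._growable_param: value})
        return self.fit(X, y)


class GrowableForest(WarmGrowableMixin, RandomForestClassifier):
    _growable_param = "n_estimators"
```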

@raghavrv
Member Author

raghavrv commented Oct 16, 2016

I think we need a more general API for dealing with decisions based on validation performance.

@jmschrei I think #7305 could be one way to introduce such an API?

For this issue, can we proceed with our general GridSearchCV-like approach? Is there sufficient interest in this amongst the core devs for me to proceed?

@jmschrei
Member

I agree with @amueller that we want a _fit_more or _fit_next method that would be overridden by the specific estimator. I think the way xgboost does it is that it checks whether adding the next best leaf is better than starting another tree, allowing you to optimize over both axes at the same time; so trees should basically be one axis (the complexity of the forest).

@raghavrv I'm not so sure about these methods having a "_best_estimator" that gets used. That kind of makes sense when you can have an ensemble of estimators with different hyperparameters, but with trees the idea is that, since you're building them iteratively, there is only one forest at a time, not multiple estimators. Maybe my understanding of your proposal is incorrect?

@raghavrv
Member Author

Joel's #8230 will fix this...

@glouppe
Contributor

glouppe commented Feb 13, 2017

There is no need to grid-search for n_estimators in forests. Please stop doing that.

The more trees, the better.

(This is not true for boosting, but that is not what this issue was about.)

@GaelVaroquaux
Member

@glouppe : good point.

@glouppe
Contributor

glouppe commented Feb 13, 2017

(One might sometimes get better results with fewer trees, but this is only due to luck. It would be as efficient to grid-search over random_state...)

@raghavrv
Member Author

There is no need to grid-search for n_estimators in forests.

Why shouldn't a user use GridSearchCV to find a lower number of trees that gives nearly the same results?

get better results with fewer trees

But not faster models! What if I am experimenting on a cluster, but my actual deployment is on an IoT device with limited memory and processing power? I'd be very interested in knowing a sufficient number of trees that I can use in my deployment...
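
For reference, the brute-force way to answer that today is a plain GridSearchCV over n_estimators, which refits every candidate forest from scratch; this is exactly the redundant work a warm-start-aware CV class would avoid (a sketch; dataset and grid are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 30, 100, 300]},
    cv=3,
)
search.fit(X, y)  # 4 candidates x 3 folds = 12 fits from scratch (plus a final refit)
print(search.cv_results_["mean_test_score"])
```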

@raghavrv
Member Author

raghavrv commented Feb 13, 2017

Maybe the title should not have been "...find the best n_estimators" (which seems to allude to the theoretical best, which is admittedly max(given_range_of_n_estimators)), but rather "help gather data for a graph like this":

[figure: validation score vs. n_estimators]

which can be used to find the inflection point, so I can settle on a "practically sufficient" number of trees...

Depending on the exact use case, performance at n_estimators=30 may be about as good as performance at n_estimators=100. How do I find this magic number unless I get this graph?
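
A rough way to produce such a graph today with warm_start (a sketch; the dataset, split, and range of sizes are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(warm_start=True, random_state=0)
sizes, scores = list(range(10, 201, 10)), []
for n in sizes:
    forest.set_params(n_estimators=n)  # grow the existing forest
    forest.fit(X_tr, y_tr)
    scores.append(forest.score(X_val, y_val))

plt.plot(sizes, scores)
plt.xlabel("n_estimators")
plt.ylabel("validation accuracy")
plt.show()
```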

@jnothman
Member

jnothman commented Feb 13, 2017 via email

@raghavrv
Member Author

But I think running it via grid search provides more statistics, with the new fit times and multi-metric support... In this regard, I am actually wondering whether it would be useful to add a score_precision parameter to GridSearchCV?

@GaelVaroquaux
Member

But I think running it via grid search provides more statistics, with the new fit times and multi-metric support

I really don't like what is happening to GridSearchCV. It is slowly being transformed into some kind of Swiss-army tool that does too many things and is hard to understand.

As a result, the code is complex, the documentation is complex, and the interface is complex. In addition, it doesn't even solve people's problems, because people cannot figure out how to use it.

This is going to lead us to major maintenance problems and, in the long term, to bugs and design issues that will be very hard to fix.

I know that avoiding this issue is difficult; it's a natural entropic failure. But the right solution is to come up with an ensemble of well-specified, well-suited tools. I am sorry, I don't have any more time to do this, because I am too busy trying to manage stuff. But I encourage everybody to think about how things can be broken up. In particular, I think it is a design flaw to try to put manual model understanding and automatic model selection into the same tool.

[image: Swiss Army Hammer]

@jnothman
Member

jnothman commented Feb 14, 2017 via email

@dengemann
Contributor

There is no need to grid-search for n_estimators in forests. Please stop doing that.

The more trees, the better.

I agree with @glouppe, based on some experience using the sklearn random forests: fit a few thousand trees and be happy, don't tune the number of trees. I think adding an API for something that is not useful is a bad idea and confuses people.

@raghavrv
Member Author

Okay. Thanks all for the comments. I think this can be closed even before #8230.
