RandomForest{Classifier|Regressor}CV to efficiently find the best n_estimators #7243
Comments
Also, do we want an early_stopping like we (will soon) have after #7071 (GBCV)? |
I'd love this, but it's not release critical. I'll only look at release critical stuff for now. I want the multiple metrics to make it in (at least multiple metrics with one scorer per metric). |
Ok I'll focus on that then... :) |
I think we need a more general API for dealing with decisions based on validation performance. This would be useful for post-pruning as well, but it makes more sense to me to share an API between GradientBoosting, this, and post-pruning, rather than make a separate object for each. |
@jmschrei input very welcome. The usual approach would be to create CV objects per estimator. I'm not a super big fan of that API but it's pretty established by now. I want to enable using these in … What is the shared functionality that you'd have in the shared object? |
The shared functionality seems to be that they require splitting your dataset into a training and validation set, and use performance on the validation to make decisions about how to proceed with the training. I'm not sure if GridSearchCV can handle these cases, as they require iterative building of the models (adding new trees for RandomForest/GradientBoosting and removing nodes for post pruning). Do you think that if we had a "ValidationSetBuilder" (terrible name I know) or something which was a wrapper for a model-specific function, that might work? For example, for GradientBoosting/RandomForests, the model-specific function would add another tree to the ensemble, and for the post pruning it might remove the next node in a series. I do understand that adding in ObjectCV is pretty well established and seems to work, I'm just worried about a blowup in the number of classes which are all doing pretty much the same thing. |
I'm also worried about that. Having a generic method for that would be great. Currently GridSearchCV can't handle it. In the ideal case, I'd like to be able to pack the thing in a pipeline and still do grid-search over it, though. With an EstimatorCV as it is currently, that performs a nested cross-validation, which is pretty wasteful. Anyhow, I'd be happy to see a prototype of this. How do you declare the dependency between the parameters in the interface, though? For example, one could imagine that a random forest can be built along two directions: by creating more depth, or by adding more trees. Basically we want a `_fit_more` / `_fit_next`-style method. Not all of the CV estimators are based on warm starting, though. There's also RFECV, which works slightly differently. Not sure what the best common interface would be. |
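For concreteness, here is a minimal sketch of the warm_start-based idea being discussed: grow the forest a few trees at a time and score on a held-out validation set, stopping when the score stops improving. `warm_start` on RandomForestClassifier is real scikit-learn API; the stopping rule (the tolerance and patience values) is purely illustrative, not an existing estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(warm_start=True, random_state=0, n_jobs=-1)

best_score, best_n, stale = -np.inf, 0, 0
for n in range(25, 1001, 25):              # grow the forest 25 trees at a time
    clf.set_params(n_estimators=n)
    clf.fit(X_train, y_train)              # warm_start: only the new trees are fit
    score = clf.score(X_val, y_val)
    if score > best_score + 1e-4:          # tolerance (illustrative)
        best_score, best_n, stale = score, n, 0
    else:
        stale += 1
    if stale >= 3:                         # patience (illustrative)
        break

print(f"stopped at n_estimators={n}; best validation accuracy {best_score:.3f} at {best_n} trees")
```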
@jmschrei I think #7305 could be one way to introduce such an API? For this issue, can we proceed with our general … |
I agree with @amueller that we want a _fit_more or _fit_next method which would be overridden by the specific estimator. I think the way xgboost does it is that it tries to see whether or not adding the next best leaf is better than starting with another tree, allowing you to optimize over both features at the same time, so trees should basically be one axis (complexity of forest). @raghavrv I'm not so sure about these methods having a "_best_estimator" which is used. This kind of makes sense when you can have an ensemble of estimators with different hyperparameters, but with trees the idea is that since you're building them iteratively there is only one forest at a time, not multiple estimators. Maybe my understanding of your proposal is incorrect? |
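To illustrate the kind of shared interface being floated here, the sketch below decouples "grow the model one step" from "decide whether to keep going based on validation performance". Everything in it is hypothetical: neither `fit_more` nor `ValidationStopper` exists in scikit-learn; this only shows how an ensemble, boosting, or post-pruning could plug a model-specific step into one generic loop.

```python
class ValidationStopper:
    """Stop a step-wise fit when the validation score stops improving."""

    def __init__(self, X_val, y_val, tol=1e-4, patience=3):
        self.X_val, self.y_val = X_val, y_val
        self.tol, self.patience = tol, patience
        self.best_score_ = float("-inf")
        self._stale = 0

    def should_stop(self, estimator):
        score = estimator.score(self.X_val, self.y_val)
        if score > self.best_score_ + self.tol:
            self.best_score_, self._stale = score, 0
        else:
            self._stale += 1
        return self._stale >= self.patience


def fit_with_validation(estimator, X, y, stopper, max_steps=100):
    """Repeatedly call a hypothetical per-estimator hook (`fit_more`), e.g.
    "add a few trees" or "prune the next node", until the stopper says stop."""
    for _ in range(max_steps):
        estimator.fit_more(X, y)          # hypothetical, estimator-specific step
        if stopper.should_stop(estimator):
            break
    return estimator
```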
Joel's #8230 will fix this... |
There is no need to grid-search for n_estimators in forests. Please stop doing that. The more trees, the better. (This is not true for boosting, but that is not what this issue was about.) |
@glouppe : good point. |
(One might sometimes get better results with fewer trees, but this is only due to luck. It would be as efficient to grid-search for the random seed.) |
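This point is easy to check empirically with the out-of-bag score, using only existing scikit-learn API (the dataset here is synthetic, just for illustration): the OOB score plateaus as trees are added rather than degrading, so there is no maximum to search for.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, random_state=0)

# The OOB score typically plateaus as trees are added; it does not degrade.
for n in (25, 100, 500, 1000):
    clf = RandomForestClassifier(
        n_estimators=n, oob_score=True, n_jobs=-1, random_state=0
    ).fit(X, y)
    print(n, round(clf.oob_score_, 4))
```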
Why shouldn't a user use …
But not faster-performing models! What if I am testing on a cluster but my actual deployment is on an IoT device with limited memory and processing speed? I'd be very interested to know a number of trees that is sufficient for my deployment... |
Maybe the title should not have been "...find *best* n_estimators" (which seems to allude to the theoretical best, which is admittedly max(range_of_n_estimators)) but rather help gather data for a graph like this (<https://cloud.githubusercontent.com/assets/9487348/22900964/82f65882-f230-11e6-8f22-41e6ac7bda8c.png>), which can be used to find the inflection point so I can fix on a "*practically sufficient* value for" the number of trees... Depending on the exact usecase, performance at |
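One way to gather the data for such a graph with existing tools is `validation_curve` over `n_estimators` (sketch below, on a synthetic dataset). Note that each forest is refit from scratch, which is exactly the redundancy a warm_start-based RandomForest*CV would avoid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=2000, random_state=0)
param_range = [10, 25, 50, 100, 200, 400]

# Score vs. n_estimators; each forest is refit from scratch.
train_scores, val_scores = validation_curve(
    RandomForestClassifier(random_state=0), X, y,
    param_name="n_estimators", param_range=param_range, cv=3, n_jobs=-1,
)
for n, s in zip(param_range, val_scores.mean(axis=1)):
    print(n, round(float(s), 4))
```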
incidentally (and probably accidentally) validation_curve doesn't clone its
estimator so you can use warm_start there without modification.
|
But I think running it in grid search provides more statistics, with the new fit-time and multi-metrics... In this regard, I am actually wondering if it would be useful to add a … |
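For reference, the extra statistics mentioned here (fit times and multiple metrics) are exposed by GridSearchCV via `cv_results_` in current scikit-learn, roughly as in the sketch below; the dataset and parameter grid are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200]},
    scoring={"acc": "accuracy", "auc": "roc_auc"},  # multiple metrics
    refit="acc",                                    # metric used to pick the final model
    cv=3,
)
search.fit(X, y)

res = search.cv_results_
for n, t, acc, auc in zip(
    res["param_n_estimators"], res["mean_fit_time"],
    res["mean_test_acc"], res["mean_test_auc"],
):
    print(n, round(t, 2), round(acc, 3), round(auc, 3))
```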
I really don't like what is happening to GridSearchCV. It is being slowly transformed into some kind of Swiss-army tool that does too many things and is hard to understand. As a result, the code is complex, the documentation is complex, and the interface is complex. In addition, it doesn't even solve people's problems, because people cannot figure out how to use it. This is going to lead us to major maintenance problems and, in the long term, bugs and design issues that will be very hard to fix. I know that avoiding this issue is difficult. It's a natural entropic failure. But the right solution is to come up with an ensemble of well-specified and well-suited tools. I am sorry, I don't have any more time to devote to this, because I am too busy trying to manage stuff. But I encourage everybody to think about how things can be broken up. In particular, I think that it is a design flaw to try to put manual model understanding and automatic model selection into the same tool. |
Yes, I'm aware of this concern, and that we are building something close to
a space shuttle. However, I also think that facilitating efficient
best-practice parameter search is an extremely valuable contribution of
scikit-learn to the Python ML community.
I'm not sure quite where you want to divide "manual model understanding"
from "automatic model selection". Does this basically mean that features
designed to select a single model from a set of scores, refit and provide a
standard estimator interface should be separated from the evaluation of
those scores? the inspection of other aspects emergent from each parameter
setting? I think there is some sense in separating those features relating
to inspection/diagnostics from those for selecting the best, which is why
I'm backpedalling a bit on allowing multiple values to scoring; but it's
plainly clear that users feel they should be able to get multiple scores
out of a model.
In terms of inspection you have previously (some years ago) suggested that
users wrapping models so that they are memoized enables all the inspection
needs. This is true, but (a) I don't think this is intuitive to a lot of
users; and (b) you have not admitted any reusable solutions to memoization.
This issue relates to efficient search and choosing a best model not just
by one score alone. I'm not sure where that fits between "manual model
understanding" and "automatic model selection".
I would very happily see a cross_val_param_search function instead of
GridSearchCV that doesn't do the automatic model selection, but reports as
much as it can while doing the fitting efficiently to avoid user error in
coding up same. I just don't really see how that benefits us.
The set of parameters to grid search is growing. Numpydoc does not allow us
to group them under headings, beyond "Parameters" and "Other parameters" to
make them more accessible. I don't think it would hurt to change that, and
propose that subheadings be allowed in numpydoc parameter lists. Or we
should just work towards improving our model selection documentation,
perhaps making it a tutorial.
|
I agree with @glouppe based on some experience with using the sklearn random forests; fit a few thousand and be happy, don't tune the number of trees. I think adding an API for something that is not useful is a bad idea and confuses people. |
Okay. Thanks all for the comments. I think this can be closed even before #8230
Which can be expanded to other params after warm_start is introduced for other tree params in DecisionTreeRegressor/DecisionTreeClassifier
@ogrisel @glouppe @agramfort @jnothman @amueller @MechCoder @jmschrei votes please?
If there is sufficient interest I can raise a PR for this...