Feature request: Random Forest node splitting sample size #11993
Comments
I'm surprised there is no existing issue for this... I would consider supporting it as `bootstrap=float` to indicate the fraction of the sample size used to train each tree.
Heh, that sounds pretty good to me; it might be better than an absolute number that I have to fiddle with on differently sized data sets. Maybe "overloading" `bootstrap` with a float is the way to go. Implementation note: hopefully this parameter would be per RF model object... I think Jeremy's trick currently sets a global function, so it affects every forest in the process.
Yes, it would be a parameter of the RF estimators, as `bootstrap` is currently.
Perfect!
One implementation detail that we need to consider is whether sample weights should be considered when sampling. I would argue no, simply on the basis that they are not considered in the current bootstrap.
I don't think you can both use the weights for sampling and use them as weights in the information gain function. I believe scikit-learn currently does the latter, right? So it would make sense to continue doing only that.
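To make the distinction concrete, here is a small sketch in plain numpy (not scikit-learn internals) contrasting the current unweighted bootstrap with the weighted sampling being argued against; the array names are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 8
sample_weight = np.array([4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

# Current behaviour: bootstrap indices are drawn uniformly at random,
# ignoring sample weights entirely.
unweighted_idx = rng.integers(0, n_samples, size=n_samples)

# The alternative being argued against: heavier rows would be drawn
# proportionally more often when building each tree's sample.
weighted_idx = rng.choice(n_samples, size=n_samples,
                          p=sample_weight / sample_weight.sum())
```

As far as I understand, passing `sample_weight` to `fit` instead influences the impurity computation at each split, which is the "latter" behaviour referred to above.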
The title indicates that the requested sample size is at the node level, but the comments lean towards the tree level.
Yes, thanks @MDouriez, that does resolve the issue at the tree level! A node-level option would also be very helpful though... :)
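For readers landing here later: the tree-level resolution referred to above is presumably the `max_samples` parameter added to the forest estimators in scikit-learn 0.22. A minimal usage sketch, assuming that release or later:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=400_000, n_features=20, random_state=0)

# Each tree is grown on a bootstrap sample of 20,000 rows instead of all 400,000.
rf = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,
    max_samples=20_000,  # or a float fraction of the rows, e.g. max_samples=0.05
    n_jobs=-1,
    random_state=0,
)
rf.fit(X, y)
```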
@jph00 Can you clarify how a node-level option would be useful? Is there literature on this you can point to?
Sorry, I don't have any literature I can point to, just my own research over the last few years, which I never got around to writing up, unfortunately.
@cmarmo
@reshamas, checking the pull requests that tried to solve it, I'm under the impression that this issue does not have a straightforward solution. It will probably take someone with some experience to take it over.
@cmarmo
During the development of a model, the faster one can iterate, the better. That means getting your training time down to a manageable level, albeit at the cost of some accuracy. With large training sets, training time for random forests can be an impediment to rapid iteration. @jph00 showed me this awesome trick that convinces scikit-learn's RF implementation to use a subsample of the full training data set when selecting split variables and values at each node:
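Below is a sketch of the kind of monkey-patch being described (not the original snippet), assuming an older scikit-learn (0.21 or earlier) in which `sklearn.ensemble.forest._generate_sample_indices(random_state, n_samples)` is the private helper that draws each tree's bootstrap indices; the helper names `set_rf_samples` and `reset_rf_samples` are illustrative:

```python
from sklearn.ensemble import forest  # module layout in scikit-learn <= 0.21


def set_rf_samples(n):
    """Patch the forests so every tree is built on a random sample of n rows
    instead of a bootstrap sample the size of the full training set."""
    forest._generate_sample_indices = (
        lambda random_state, n_samples:
            forest.check_random_state(random_state).randint(0, n_samples, n)
    )


def reset_rf_samples():
    """Undo set_rf_samples and restore the default full-size bootstrap."""
    forest._generate_sample_indices = (
        lambda random_state, n_samples:
            forest.check_random_state(random_state).randint(0, n_samples, n_samples)
    )


# Typical workflow while iterating on a large data set:
# set_rf_samples(20_000)   # fast experimentation, 20k rows per tree
# ...fit and evaluate RandomForestRegressor / RandomForestClassifier as usual...
# reset_rf_samples()       # back to the full-size bootstrap for the final model
```

Because this replaces a module-level function, it affects every forest built in the process, which is the global-state caveat mentioned earlier in the thread.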
We both find that splitting using, say, 20,000 samples rather than the full 400,000 gives nearly the same results, but with much faster training. Naturally it requires a bit of experimentation on each data set to find a suitable subsample size, but in general we find that feature engineering and other preparation work is totally fine on a subsample. We then reset the sample size for the final model.
We appreciate your consideration of this feature request!