Using a RandomForest's `warm_start` together with `random_state` is poorly documented #22041
Comments
Just to be sure regarding the expectation after reading the docstring: were you expecting the two trees in the forest to be identical due to the warm starting?
This is indeed the expected behaviour.
It was rather that I could not tell from the documentation and had to resort to testing it out with the code snippet. My personal intuition was for it to work as it does, but when it came up in code review I realized that I couldn't even tell from the docs that it was correct.
I figured, I added it for clarity.
OK I see. Improving the documentation would then be interesting.
I am unsure which other estimators have a similar behavior with When |
At least all estimators in the |
I don't really have the time right now to experimentally verify/read docs and figure out how non-ensemble estimators behave. It seems to me that each individual estimator class has the documentation independently (as opposed to documenting shared parameters on the base class from which they are derived). Should the clarification be added to the general |
In linear models, it is just that the optimization will start with some initial weights instead of random weights.
I would start with the tree-based model in the |
Describe the issue linked to the documentation
Consider the following example:
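The original snippet did not survive in this copy of the issue. What follows is only a minimal sketch of the kind of example described, not the reporter's code: the synthetic dataset from `make_classification` and the helper `same_tree` are assumptions introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Assumed toy data; the dataset used in the original report is not known.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Grow the forest one tree at a time with warm_start=True.
warm = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
warm.fit(X, y)
warm.set_params(n_estimators=2)
warm.fit(X, y)  # keeps the first tree and adds a second one

# Fit a two-tree forest in a single call for comparison.
cold = RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)
cold.fit(X, y)


def same_tree(a, b):
    """Hypothetical helper: compare the split structure of two fitted trees."""
    ta, tb = a.tree_, b.tree_
    return (
        ta.node_count == tb.node_count
        and np.array_equal(ta.feature, tb.feature)
        and np.allclose(ta.threshold, tb.threshold)
    )


# Are the two trees inside the warm-started forest different from each other?
trees_differ = not same_tree(warm.estimators_[0], warm.estimators_[1])

# Does the warm-started forest match the forest built in one call?
matches_cold = all(
    same_tree(a, b) for a, b in zip(warm.estimators_, cold.estimators_)
)

print("trees differ:", trees_differ)
print("matches single fit:", matches_cold)
```

The sketch passes an integer `random_state`; passing a `numpy.random.RandomState` instance is not covered here.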
According to controlling randomness, when `random_state` is set:

But calling `fit` multiple times in a warm start setting does not yield the same results (as expected, we want more trees, and we want different trees). The example above produces a forest with two unique trees, and the overall forest is identical to one created at once with `RandomForestClassifier(n_estimators=2, warm_start=False, random_state=0)`. The same behavior is observed when a `numpy.random.RandomState` is used.

However, I found it (at first) impossible to determine this behavior from the documentation alone. As far as I am aware, the only hint that should have helped me is this `warm_start` documentation:
In hindsight, the internal `random_state` object likely counts as a "fitted model attribute", which would allow you to infer the behavior from the documentation.

Suggest a potential alternative/fix
I am not sure if this behavior is consistent across all estimators which support the `warm_start` parameter. A clarification in the `warm_start` section makes the most sense to me. Either a single sentence or a small paragraph depending on whether or not there are differences between the different estimators.

I'd be willing to set up the PR but I figure it makes sense to agree on the action (if any) and wording first.