-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
[MRG+1]: Add clarification on random forest regressor default params #13248
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…emble of bagged trees with its default max_features parameter
Merge branch 'ensemble-doc-edits' of https://github.com/jonoleson/scikit-learn into ensemble-doc-edits. Explicit that by default, the RandomForestRegressor uses all input features when picking the best split instead of just a random subset of input features.
Not sure if this is the appropriate place to ask, is the inconsistent behavior of max_features='auto' between regressor and classifier (max_features=n_features for regressor and max_features = sqrt(n_features) for classifier, scikit-learn/sklearn/ensemble/forest.py Line 1111 in a62775e
scikit-learn/sklearn/ensemble/forest.py Line 1362 in a62775e
|
As far as I know, Leo Breiman himself suggested different default choices for this hyper-parameter of random forests. Perhaps you are right and different default behavior in RFClassifier versus RFRegressor should be made more explicit in the docstring. |
doc/modules/ensemble.rst
Outdated
from a sample drawn with replacement (i.e., a bootstrap sample) from the | ||
training set. When splitting a node during the construction of the tree, the | ||
best split can be computed using either all input features | ||
(``max_features=’auto’``) or using the best feature subset of size |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should indeed be clearer that in RFC, max_features='auto'
indicates using the square root of the number of input features.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it could also make sense to have defaults 'sqrt' (for classifier) and None (for regressor) (or introduce a new string 'max' or 'all' or sth for max_features='features')? And remove the whole 'auto' option? as 'auto' seems ambiguous anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 but this would require a deprecation cycle.
I have further rephrased the paragraph to explain the meaning of |
The circle ci failure is caused by an invalid file in the debian apt repository:
I hope this is transient. I have already restarted the build once but it gave the same error. |
Merge with master should resolve the circleci issue (master uses Debian stretch implicitly) I opened a pr to use Debian stretch explicitly. #13642 |
I restarted the CI since #13642 has been merged. |
from a sample drawn with replacement (i.e., a bootstrap sample) from the | ||
training set. | ||
|
||
Furthermore, when splitting each node during the construction of a tree, the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is difficult to parse "to split on can be". What about:
the best split is found either from all input features or a random subset of size ``max_features``.
apart of this LGTM |
Just pushed a nitpicks and merged it. Thanks @abenbihi |
Continues #5862.
Previous PR specified that the default RandomForestRegressor behavior is to use all the input features to pick the best split. The documentation only mentioned that a random subset of input features is used. This is not the case when using the default
max_features='auto'
. I added the explicitation.