Skip to content

[MRG+1]: Add clarification on random forest regressor default params #13248

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 20, 2019

Conversation

abenbihi
Copy link
Contributor

Continues #5862.
Previous PR specified that the default RandomForestRegressor behavior is to use all the input features to pick the best split. The documentation only mentioned that a random subset of input features is used. This is not the case when using the default max_features='auto'. I added the explicitation.

jonoleson and others added 3 commits November 16, 2015 15:53
…emble of bagged trees with its default max_features parameter
Merge branch 'ensemble-doc-edits' of https://github.com/jonoleson/scikit-learn into ensemble-doc-edits.
Explicit that by default, the RandomForestRegressor uses all input features
when picking the best split instead of just a random subset of input features.
@abenbihi abenbihi changed the title Ensemble doc edits Add clarification on random forest regressor default params Feb 25, 2019
@abenbihi abenbihi changed the title Add clarification on random forest regressor default params DOC: Add clarification on random forest regressor default params Feb 25, 2019
@agramfort agramfort changed the title DOC: Add clarification on random forest regressor default params [MRG+1]: Add clarification on random forest regressor default params Feb 25, 2019
@robintibor
Copy link

robintibor commented Mar 14, 2019

Not sure if this is the appropriate place to ask, is the inconsistent behavior of max_features='auto' between regressor and classifier (max_features=n_features for regressor and max_features = sqrt(n_features) for classifier,

- If "auto", then `max_features=n_features`.
- If "auto", then `max_features=sqrt(n_features)`.
) intended? This caused some confusion when we used it.

@banilo
Copy link
Contributor

banilo commented Mar 16, 2019

intended?

As far as I know, Leo Breiman himself suggested different default choices for this hyper-parameter of random forests. Perhaps you are right and different default behavior in RFClassifier versus RFRegressor should be made more explicit in the docstring.

from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set. When splitting a node during the construction of the tree, the
best split can be computed using either all input features
(``max_features=’auto’``) or using the best feature subset of size
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should indeed be clearer that in RFC, max_features='auto' indicates using the square root of the number of input features.

Copy link

@robintibor robintibor Mar 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it could also make sense to have defaults 'sqrt' (for classifier) and None (for regressor) (or introduce a new string 'max' or 'all' or sth for max_features='features')? And remove the whole 'auto' option? as 'auto' seems ambiguous anyways?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 but this would require a deprecation cycle.

@ogrisel
Copy link
Member

ogrisel commented Apr 18, 2019

I have further rephrased the paragraph to explain the meaning of max_features and the bias-variance trade-off.

@ogrisel
Copy link
Member

ogrisel commented Apr 18, 2019

The circle ci failure is caused by an invalid file in the debian apt repository:

Get:6 http://deb.debian.org jessie/main amd64 Packages [9098 kB]
Fetched 10.1 MB in 1s (9787 kB/s)
W: Failed to fetch http://deb.debian.org/debian/dists/jessie-updates/InRelease  Unable to find expected entry 'main/binary-amd64/Packages' in Release file (Wrong sources.list entry or malformed file)

E: Some index files failed to download. They have been ignored, or old ones used instead.
Exited with code 100

I hope this is transient. I have already restarted the build once but it gave the same error.

@thomasjpfan
Copy link
Member

Merge with master should resolve the circleci issue (master uses Debian stretch implicitly) I opened a pr to use Debian stretch explicitly. #13642

@glemaitre
Copy link
Member

I restarted the CI since #13642 has been merged.

from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set.

Furthermore, when splitting each node during the construction of a tree, the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is difficult to parse "to split on can be". What about:

the best split is found either from all input features or a random subset of size ``max_features``.

@glemaitre
Copy link
Member

apart of this LGTM

@glemaitre glemaitre self-requested a review April 25, 2019 10:04
@glemaitre glemaitre merged commit ab4b4ec into scikit-learn:master Jun 20, 2019
@glemaitre
Copy link
Member

Just pushed a nitpicks and merged it. Thanks @abenbihi

koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Jul 29, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants