Reconsider default of max_features of RandomForestRegressor #20111

Closed
@mayer79

Description

In #7254, there was a long discussion on max_features defaults for random forests. As a consequence, the default "auto" was changed to "sqrt" for RandomForestClassifier, but unfortunately not for RandomForestRegressor. I would like to reconsider this decision.

What to change?

The default of RandomForestRegressor's max_features = "auto" should point to m/3 or sqrt(m), where m is the number of features.
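To make the proposal concrete, here is a minimal sketch of how many features each candidate rule would consider per split. The helper `features_per_split` and its rule names are hypothetical and exist only for this illustration; they are not part of scikit-learn. Rounding down with a floor of one feature is an assumption made here.

```python
import math


def features_per_split(m: int, rule: str) -> int:
    """Number of candidate features per split for m total features.

    Hypothetical helper for illustration only; rounding down with a
    minimum of one feature is an assumption of this sketch.
    """
    if rule == "all":    # every feature is a candidate (bagged trees)
        return m
    if rule == "third":  # m/3
        return max(1, m // 3)
    if rule == "sqrt":   # sqrt(m)
        return max(1, math.isqrt(m))
    raise ValueError(f"unknown rule: {rule!r}")


m = 30
for rule in ("all", "third", "sqrt"):
    print(rule, features_per_split(m, rule))
```

With m = 30, the three rules give 30, 10 and 5 candidate features per split, respectively.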

Why?

  1. Good defaults are essential for random forests. The fact that random forests do well even without hyperparameter tuning is one of their few advantages over boosted trees.

  2. Every implementation in R, as well as h2o, uses sqrt(m) or m/3 as the default. For example, R's ranger package uses sqrt(m) for both regression and classification. https://github.com/imbs-hl/ranger

  3. Column subsampling per split is the main source of randomness, leading to less correlated trees. The current default (all m features) removes this effect. Strictly speaking, the current default does not fit a proper random forest but rather bagged trees. In my experience, random forests perform better than bagged trees in most cases.

  4. Training time is roughly proportional to max_features. With a smaller default, one could easily fit 500 trees instead of 100 in a similar amount of time.

Note: I am not talking about defaults for completely randomized trees, just about proper random forests.
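The comparison behind points 3 and 4 can be tried with a small script like the one below. It is only a sketch: the synthetic dataset, the number of trees and the 3-fold cross-validation are arbitrary choices made for this example, and the outcome will differ between datasets. `max_features` is set explicitly so the script does not depend on whatever the default is in the installed version.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data, purely for illustration; real results depend on the dataset.
X, y = make_regression(
    n_samples=300, n_features=12, n_informative=6, noise=10.0, random_state=0
)
m = X.shape[1]

# None considers all m features at every split, i.e. bagged trees.
settings = {
    "bagged (all m)": None,
    "m/3": max(1, m // 3),
    "sqrt(m)": "sqrt",
}

forests = {}
for name, max_features in settings.items():
    forest = RandomForestRegressor(
        n_estimators=50, max_features=max_features, random_state=0
    )
    # cross_val_score fits clones, so the forest is fitted separately afterwards.
    scores = cross_val_score(forest, X, y, cv=3, scoring="r2")
    forests[name] = forest.fit(X, y)
    per_split = forests[name].estimators_[0].max_features_
    print(f"{name}: {per_split} features per split, mean R^2 = {scores.mean():.3f}")
```

Wrapping the fit in a timer would give a rough check of the training-time claim on the same data.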
