RFC feature subsampling for tree based models #27347

Closed
@lorentzenchr

Background

Currently, our (mostly tree-based) models supporting feature subsampling via max_features are:

  • DecisionTreeClassifier, DecisionTreeRegressor
  • ExtraTreeClassifier, ExtraTreeRegressor
  • BaggingClassifier, BaggingRegressor
  • ExtraTreesClassifier, ExtraTreesRegressor
  • GradientBoostingClassifier, GradientBoostingRegressor
  • IsolationForest
  • RandomForestClassifier, RandomForestRegressor

Soon, HistGradientBoostingClassifier and HistGradientBoostingRegressor will have feature subsampling, too, see #27139.

Problem Statement

max_features can be a float in (0, 1], in which case it specifies the fraction of features that is randomly selected at each split. It can also be an integer, in which case it specifies the number of features selected at each split.
This means that 1 (an integer) and 1. (a float) have very different effects: the former selects a single feature, the latter selects all of them! See #27139 (review).
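To make the int/float ambiguity concrete, here is a minimal sketch of the resolution rule as described in the docs for the tree estimators. `resolve_max_features` is a hypothetical helper written for this issue, not scikit-learn's internal code, and it deliberately ignores string values such as "sqrt".

```python
import numbers


def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of candidate features drawn per split."""
    if max_features is None:
        # None means no subsampling: every feature is a candidate.
        return n_features
    # Integers must be checked before floats, since the type alone
    # decides whether the value is a count or a fraction.
    if isinstance(max_features, numbers.Integral):
        return int(max_features)
    if isinstance(max_features, numbers.Real):
        # Fractions are rounded down, but at least one feature is kept.
        return max(1, int(max_features * n_features))
    raise TypeError(f"unsupported max_features: {max_features!r}")


n_features = 20
print(resolve_max_features(1, n_features))    # 1  (a single feature per split)
print(resolve_max_features(1.0, n_features))  # 20 (all features, no subsampling)
print(resolve_max_features(0.5, n_features))  # 10
```

The only thing separating the first two calls is a trailing dot, which is easy to miss in user code and in reviews.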

Also note that for most estimators the default is max_features=None or 1.0, meaning no subsampling. The notable exceptions are RandomForestClassifier and ExtraTreesClassifier, which default to max_features="sqrt" (which makes sense), but the corresponding regressors still use all features (so they are not really random forests but bagged trees). This was discussed in detail in #20111.

Additionally, for most estimators max_features means subsampling per tree split/node. For BaggingClassifier/BaggingRegressor and IsolationForest, it means subsampling per base model/iteration.
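The difference between the two granularities can be sketched in plain Python. Both functions below are hypothetical illustrations (the names, the `seed` argument and the use of `random.sample` are assumptions), not how any estimator is implemented.

```python
import random


def per_split_subsets(n_features, k, n_splits, seed=0):
    """Per split/node: a fresh feature subset is drawn for every split."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), k)) for _ in range(n_splits)]


def per_model_subsets(n_features, k, n_splits, seed=0):
    """Per model/iteration: one subset is drawn once and reused by all splits."""
    rng = random.Random(seed)
    subset = sorted(rng.sample(range(n_features), k))
    return [subset for _ in range(n_splits)]


# Feature indices available to each of 4 splits of a single tree.
print(per_split_subsets(n_features=10, k=3, n_splits=4))
print(per_model_subsets(n_features=10, k=3, n_splits=4))
```

With per-split sampling a tree can still end up using every feature across its nodes, whereas with per-model sampling a base model never sees the features outside its subset.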

Proposal

Accompanied by a proper deprecation strategy, we could add separate arguments: one for specifying fractions and one for specifying integer counts.

Decision Options

A Do we want to change the current situation?

B Which names shall the new arguments have?

B.1

We could keep max_features for the number of features and add only one new argument for fractions. Alternatively, we could add two new arguments and deprecate max_features completely.

B.2

Possible names for fractions, see also #27139 (comment):

  • colsample_bynode, same name as in XGBoost and LightGBM
  • feature_(sub)sample_per_split
  • feature_fraction_per_split
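As a rough sketch of what option B.1 could look like during a deprecation period, assume max_features is kept for integer counts and feature_fraction_per_split (one of the candidate names above) takes fractions. The helper name, the warning text and the choice of FutureWarning are all assumptions for illustration; string values such as "sqrt" are left out.

```python
import numbers
import warnings


def resolve_split_features(n_features, max_features=None,
                           feature_fraction_per_split=None):
    """Hypothetical resolution of the old and new arguments (option B.1)."""
    if max_features is not None and feature_fraction_per_split is not None:
        raise ValueError(
            "Pass either max_features or feature_fraction_per_split, not both."
        )
    if feature_fraction_per_split is not None:
        if not 0.0 < feature_fraction_per_split <= 1.0:
            raise ValueError("feature_fraction_per_split must be in (0, 1].")
        return max(1, int(feature_fraction_per_split * n_features))
    if max_features is None:
        return n_features
    if isinstance(max_features, numbers.Integral):
        # Integer counts keep their current meaning.
        return int(max_features)
    if not isinstance(max_features, numbers.Real):
        raise TypeError(f"unsupported max_features: {max_features!r}")
    # Floats passed through the old argument still work, but warn.
    warnings.warn(
        "Passing a float to max_features is deprecated in this sketch; "
        "use feature_fraction_per_split instead.",
        FutureWarning,
    )
    return max(1, int(max_features * n_features))


print(resolve_split_features(20, max_features=1))                  # 1
print(resolve_split_features(20, feature_fraction_per_split=1.0))  # 20
```

Under this scheme, 1 and 1.0 can no longer be confused once the deprecation ends, because each argument accepts only one type.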

@scikit-learn/core-devs
