Description
Background
Currently, our (mostly tree-based) models supporting feature subsampling via max_features are:
DecisionTreeClassifier, DecisionTreeRegressor
ExtraTreeClassifier, ExtraTreeRegressor
BaggingClassifier, BaggingRegressor
ExtraTreesClassifier, ExtraTreesRegressor
GradientBoostingClassifier, GradientBoostingRegressor
IsolationForest
RandomForestClassifier, RandomForestRegressor
Soon, HistGradientBoostingClassifier and HistGradientBoostingRegressor will have feature subsampling, too, see #27139.
Problem Statement
max_features can be a float between 0 and 1, in which case a fraction of the features is randomly selected at each split. It can also be an integer, in which case it specifies the number of features considered at each split.
This means that max_features=1 and max_features=1.0 have very different effects! See #27139 (review).
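For illustration, a minimal sketch (assuming a recent scikit-learn version) of how differently the integer and float interpretations behave on a single decision tree:

```python
# Minimal sketch: max_features=1 (int) vs. max_features=1.0 (float) on the
# same data lead to very different amounts of feature subsampling.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Integer: consider exactly 1 randomly drawn feature at each split.
tree_int = DecisionTreeClassifier(max_features=1, random_state=0).fit(X, y)

# Float: consider a fraction of 1.0, i.e. all 20 features, at each split.
tree_float = DecisionTreeClassifier(max_features=1.0, random_state=0).fit(X, y)

print(tree_int.max_features_)    # 1
print(tree_float.max_features_)  # 20
```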
Also note that for most estimators the default is max_features=None or 1.0, meaning no subsampling. The notable exceptions are RandomForestClassifier and ExtraTreesClassifier with default max_features="sqrt" (which makes sense), but the corresponding regressors still use all features by default (so no random forest, just bagged trees). This was discussed in detail in #20111.
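The asymmetric defaults can be seen directly on the estimators (sketch, assuming scikit-learn >= 1.1, where the regressor default is 1.0):

```python
# The classifier subsamples features by default, the regressor does not.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

print(RandomForestClassifier().max_features)  # 'sqrt' -> sqrt(n_features) per split
print(RandomForestRegressor().max_features)   # 1.0    -> all features, i.e. bagged trees
```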
Additionally, for most estimators max_features means subsampling per tree split/node. For the Bagging* models and IsolationForest it means subsampling per model/iteration, as the sketch below illustrates.
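A small sketch of that per-model behaviour for BaggingClassifier (the numbers are illustrative):

```python
# For BaggingClassifier, max_features draws one feature subset per base
# estimator, which is then used for all splits of that estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

bag = BaggingClassifier(
    n_estimators=3,
    max_features=0.5,  # each base tree sees 10 of the 20 features
    random_state=0,
).fit(X, y)

# One feature subset per fitted base estimator.
print([len(features) for features in bag.estimators_features_])  # [10, 10, 10]
```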
Proposal
Accompanied by a proper deprecation strategy, we could add separate arguments: one for specifying fractions and one for integer counts. A hypothetical sketch of what this could look like is given below.
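Purely hypothetical sketch of how the two arguments could coexist during a deprecation period (all names and semantics below, e.g. max_features_fraction, are placeholders, not an agreed API; candidate names are discussed under B.2):

```python
import math
import warnings


def _resolve_n_features(n_features, max_features=None, max_features_fraction=None):
    """Hypothetical helper: turn either argument into a number of features per split."""
    if max_features is not None and max_features_fraction is not None:
        raise ValueError("Pass only one of max_features and max_features_fraction.")
    if isinstance(max_features, float):
        # Deprecation path: float values keep working for a while but warn.
        warnings.warn(
            "Passing a float to max_features is deprecated; use "
            "max_features_fraction instead.",
            FutureWarning,
        )
        max_features_fraction, max_features = max_features, None
    if max_features is not None:
        return max_features                                      # integer count
    if max_features_fraction is not None:
        return max(1, int(max_features_fraction * n_features))   # fraction of features
    return n_features                                            # default: no subsampling


print(_resolve_n_features(20, max_features=1))              # 1
print(_resolve_n_features(20, max_features_fraction=0.05))  # 1
print(_resolve_n_features(20, max_features=1.0))            # 20, with a FutureWarning
```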
Decision Options
A. Do we want to change the current situation?
B. Which names should the new arguments have?
B.1
We could keep max_features for the number of features and add only one new argument for fractions. Or we could add two new arguments and deprecate max_features completely.
B.2
Possible names for fractions, see also #27139 (comment):
colsample_bynode, same name as in XGBoost and LightGBM
feature_(sub)sample_per_split
feature_fraction_per_split
@scikit-learn/core-devs