Description
Describe the workflow you want to enable
#18394 added native support for categorical features to HGBT. To use it, you have to ordinal encode your categoricals, e.g. in a ColumnTransformer (potentially part of a pipeline), and indicate the column positions of the passed X via the parameter categorical_features.
How can we then programmatically, i.e. without manually filling in categorical_features, specify the positions of the categorical (ordinal encoded) columns in the feature matrix X that is finally passed to HGBT?
X, y = ...
ct = make_column_transformer(
    (OrdinalEncoder(),
     make_column_selector(dtype_include='category')),
    remainder='passthrough')
hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=???)
)
How should ??? be filled in?
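To illustrate why the positions are not obvious, here is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not from the original report). It shows that the ColumnTransformer above puts the ordinal-encoded columns first and the passthrough columns after them, so the positions in the transformed X differ from those in the input.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: the categorical column sits in the middle of X.
X = pd.DataFrame({
    "age": [23, 45, 31, 52],
    "city": pd.Categorical(["Paris", "Rome", "Paris", "Oslo"]),
    "income": [30.0, 52.5, 41.0, 60.0],
})

ct = make_column_transformer(
    (OrdinalEncoder(), make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
Xt = ct.fit_transform(X)

# The encoded "city" column moved from input position 1 to output position 0;
# the passthrough columns ("age", "income") follow in their original order.
first_column = Xt[:, 0].tolist()
second_column = Xt[:, 1].tolist()
print(first_column)
print(second_column)
```

So for this input, the index that HGBT needs is 0, not 1, and it changes whenever the transformer layout changes.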
Possible solutions
- Set it manually, e.g. use OrdinalEncoder as the first or last part of a ColumnTransformer. This is currently used in this example, but it's not ideal.
- Pass a callable/function, e.g. HistGradientBoostingRegressor(categorical_features=my_function); see ENH Add Categorical support for HistGradientBoosting #18394 (comment) for details. Sadly, this doesn't work. It breaks when the pipeline is used in e.g. cross_val_score, because the estimators will be cloned there, and thus the callable refers to an unfitted CT.
- Pass feature names once they are available. Even then, you have to know the exact feature names that are created by OrdinalEncoder.
- Pass feature-aligned meta data "this is a categorical feature", similar to SLEP006 and proposed in Feature request: pass meta-data per column/sample through the Pipeline #4196.
- Internally use an OE within the GBDT estimator so that users don't need to create a pipeline.
Further context
One day, this might become relevant for more estimators; for linear models, see #18893.