Programmatically pass categorical_features to HGBT #18894

@lorentzenchr

Describe the workflow you want to enable

#18394 added native support for categorical features to HGBT. To use it, you have to ordinal-encode your categoricals, e.g. in a ColumnTransformer (potentially part of a pipeline), and indicate their column positions in the passed X via the parameter categorical_features.

How can we then programmatically, i.e. without manually filling in categorical_features, specify the positions of categorical (ordinal encoded) columns in the feature matrix X that is finally passed to HGBT?

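from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

X, y = ...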

ct = make_column_transformer(
    (OrdinalEncoder(),
     make_column_selector(dtype_include='category')),
    remainder='passthrough')

hist_native = make_pipeline(
    ct,
    HistGradientBoostingRegressor(categorical_features=???)
)

How should ??? be filled in?

Possible solutions

  1. Set it manually, e.g. use OrdinalEncoder as the first or last part of a ColumnTransformer so that the positions of the encoded columns are known. This is currently used in this example, but it's not ideal.
  2. Pass a callable/function, e.g. HistGradientBoostingRegressor(categorical_features=my_function), see ENH Add Categorical support for HistGradientBoosting #18394 (comment) for details.

    Sadly, this doesn't work. It breaks when the pipeline is used in e.g. cross_val_score, because the estimators are cloned there and the callable then refers to an unfitted CT.

  3. Pass feature names once they are available. Even then, you have to know the exact feature names that are created by OrdinalEncoder.
  4. Pass feature-aligned metadata ("this is a categorical feature"), similar to SLEP006 and as proposed in Feature request: pass meta-data per column/sample through the Pipeline #4196.
  5. Internally use an OE within the GBDT estimator so that users don't need to create a pipeline.

Further context

One day, this might become relevant for more estimators, for linear models see #18893.
