API options for Pandas output

Related to:

- https://github.com/scikit-learn/scikit-learn/issues/5523 pandas in, pandas out
- https://github.com/scikit-learn/scikit-learn/issues/10603 typical data science use case
- https://github.com/scikit-learn/scikit-learn/pull/20100 array out in preprocessing
- #20110 output dataframes in column transformer

This issue summarizes the API options for pandas output, using a typical data science use case:

```python
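from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# `numeric_features` and `categorical_features` are lists of column names
# in the training dataframe.
numeric_transformer = Pipeline(steps=[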
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
```

In all of the following options, `pipe[-1].feature_names_in_` is used to get the feature names used in `LogisticRegression`. All options require `feature_names_in_` to enforce column name consistency between `fit` and `transform`.
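The column-name consistency check mentioned above can be sketched without scikit-learn. This is a toy illustration only, not scikit-learn's implementation: `NameCheckingTransformer` and `_check_feature_names` are hypothetical names, and a plain dict of columns stands in for a dataframe.

```python
class NameCheckingTransformer:
    """Toy transformer that remembers the column names seen during fit."""

    def fit(self, X):
        # X is a dict mapping column name -> list of values (dataframe stand-in).
        self.feature_names_in_ = list(X)
        return self

    def _check_feature_names(self, X):
        # Reject input whose column names or column order differ from fit time.
        if list(X) != self.feature_names_in_:
            raise ValueError(
                f"Expected columns {self.feature_names_in_}, got {list(X)}"
            )

    def transform(self, X):
        self._check_feature_names(X)
        return X


train = {"age": [20, 35], "fare": [7.5, 80.0]}
t = NameCheckingTransformer().fit(train)
accepted = t.transform({"age": [41], "fare": [12.0]})  # same names: accepted

try:
    t.transform({"fare": [12.0], "age": [41]})  # reordered columns: rejected
except ValueError as exc:
    error_message = str(exc)
```

Whether reordered columns should raise or be silently realigned is a design choice; this sketch assumes the strict behavior.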

## Option 1: `output` kwargs in `transform`

All transformers will accept an `output='pandas'` keyword argument in `transform`. To configure the transformers to output dataframes during `fit`:

```python
# passes `output="pandas"` to each step's `transform` during `fit`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing in pandas (all steps except the final classifier)
pipe[:-1].transform(X_train_df, output="pandas")
```

During `fit`, `Pipeline` passes `output="pandas"` to every step's `transform` method, so the original pipeline definition does not need to change. This option requires meta-estimators that contain transformers, such as `Pipeline` and `ColumnTransformer`, to pass `output="pandas"` to every `transformer.transform` call.
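The forwarding that Option 1 asks of meta-estimators could look roughly like the sketch below. Everything here is hypothetical and exists only to show the mechanism: `ToyScaler`, `ToyPipeline`, and `as_columns` are made-up names, a dict of columns stands in for a dataframe, and a list of rows stands in for an ndarray.

```python
def as_columns(X):
    """Return (names, columns) for a dict of columns or a list of rows."""
    if isinstance(X, dict):
        return list(X), [list(col) for col in X.values()]
    return None, [list(col) for col in zip(*X)]


class ToyScaler:
    """Hypothetical transformer that doubles every value."""

    def fit(self, X):
        names, _ = as_columns(X)
        self.feature_names_in_ = names  # None when the input has no labels
        return self

    def transform(self, X, output="default"):
        names, columns = as_columns(X)
        scaled = [[2 * v for v in col] for col in columns]
        if output == "pandas":
            # Labelled output: keep input names, or generate placeholders.
            names = names or [f"x{i}" for i in range(len(scaled))]
            return dict(zip(names, scaled))
        # Default output: bare rows without column names.
        return [list(row) for row in zip(*scaled)]


class ToyPipeline:
    """Hypothetical meta-estimator that forwards `output` to every step."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, transform_output="default"):
        for step in self.steps:
            # Intermediate results keep their labels only if requested.
            X = step.fit(X).transform(X, output=transform_output)
        return self

    def transform(self, X, output="default"):
        for step in self.steps:
            X = step.transform(X, output=output)
        return X


X_train = {"age": [20, 35], "fare": [7.5, 80.0]}
pipe = ToyPipeline([ToyScaler(), ToyScaler()])
pipe.fit(X_train, transform_output="pandas")

labelled = pipe.transform(X_train, output="pandas")
rows = pipe.transform(X_train)
```

Because labels are forwarded during `fit`, the second step in this sketch still sees the original column names in its `feature_names_in_`.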

## Option 2: `__init__` parameter

All transformers will accept a `transform_output` parameter in `__init__`:

```python
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore', transform_output="pandas")

preprocessor = ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

# All transformers are configured to output dataframes
pipe.fit(X_train_df)
```

### Option 2b: Have a global config for `transform_output`

For a better user experience, we can have a global config. By default, `transform_output` is set to `'global'` in all transformers, so every transformer follows the global setting unless it is configured individually.

```python
import sklearn
sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
```
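One way the `'global'` default could be resolved is sketched below. This is a toy model under stated assumptions, not scikit-learn code: `_global_config`, the local `set_config` function, `ToyTransformer`, and `_resolve_output` are hypothetical stand-ins, and the rule that an explicit per-estimator value takes precedence over the global config is an assumption.

```python
_global_config = {"transform_output": "default"}


def set_config(transform_output):
    """Hypothetical stand-in for a global configuration setter."""
    _global_config["transform_output"] = transform_output


class ToyTransformer:
    def __init__(self, transform_output="global"):
        self.transform_output = transform_output

    def _resolve_output(self):
        # "global" defers to the config; any other value is used as given.
        if self.transform_output == "global":
            return _global_config["transform_output"]
        return self.transform_output

    def transform(self, X):
        # X is a dict of columns (dataframe stand-in).
        if self._resolve_output() == "pandas":
            return dict(X)
        return [list(row) for row in zip(*X.values())]


X = {"age": [20, 35], "fare": [7.5, 80.0]}
follows_global = ToyTransformer()
pinned = ToyTransformer(transform_output="default")

before = follows_global.transform(X)  # rows: the config is still "default"
set_config(transform_output="pandas")
after = follows_global.transform(X)   # labelled columns after the config change
still_rows = pinned.transform(X)      # explicit setting ignores the config
```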

## Option 3: Use SLEP 006

Have all transformers request `output`. Similar to Option 1, every transformer needs an `output='pandas'` kwarg in `transform`.

```python
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])

categorical_transformer = OneHotEncoder(handle_unknown='ignore').request_for_transform(output=True)

preprocessor = (ColumnTransformer([
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
        .request_for_transform(output=True))

pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

pipe.fit(X_train_df, output="pandas")
```
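The request-based routing in Option 3 can be modelled with a small toy. The sketch below is not the SLEP 006 implementation: `RequestMixin`, `requested`, `ToyTransformer`, and `route_transform` are hypothetical names, and only the idea that a meta-estimator forwards `output` to steps that asked for it is taken from the proposal.

```python
class RequestMixin:
    """Hypothetical mixin recording which kwargs `transform` wants routed."""

    def request_for_transform(self, **requests):
        self._transform_requests = {
            name for name, wanted in requests.items() if wanted
        }
        return self  # returning self allows chaining at construction time

    def requested(self):
        return getattr(self, "_transform_requests", set())


class ToyTransformer(RequestMixin):
    def transform(self, X, output="default"):
        # X is a dict of columns (dataframe stand-in).
        if output == "pandas":
            return dict(X)
        return [list(row) for row in zip(*X.values())]


def route_transform(step, X, **metadata):
    """Forward only the metadata that `step` explicitly requested."""
    routed = {k: v for k, v in metadata.items() if k in step.requested()}
    return step.transform(X, **routed)


X = {"age": [20, 35]}
with_request = ToyTransformer().request_for_transform(output=True)
without_request = ToyTransformer()

labelled = route_transform(with_request, X, output="pandas")
rows = route_transform(without_request, X, output="pandas")
```

A step that never requested `output` falls back to its default, which is why every transformer in the pipeline has to opt in under this option.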

### Option 3b: Have a global config for request

For a better user experience, we can have a global config:

```python
import sklearn
sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")
```

## Summary

Options 2 and 3 are very similar because both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.

CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux 
