
Support standard data science use-case #10603

@amueller

Description

I brought this up in a couple of places, but I think it's important enough to be tracked as a separate issue. We often cite the tenet "Simple things should be simple, complex things should be possible." However, right now we don't support what I think is the simplest and most common data-science use case with tabular data.

By default, in the data science world, you have a mix of continuous and discrete variables, and there are missing values. Right now, it's impossible to do cross-validation on a dataset like that.
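Concretely, a dataset of that kind looks like this (column names and values are invented for illustration): a mix of numeric and object columns, each with missing values, which no single scikit-learn estimator can consume directly.

```python
# A minimal mixed-type dataset: continuous and categorical columns,
# both containing missing values.
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "income": [50000.0, 62000.0, np.nan, 48000.0],
    "city": ["NYC", "SF", np.nan, "NYC"],
})

# Continuous vs. categorical columns can be told apart by dtype:
categorical = X.dtypes == object
print(categorical.tolist())  # [False, False, True]
```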

With #9012 in the next release this would be possible using something like

categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(Imputer(strategy='median'), StandardScaler()), ~categorical),
    (make_pipeline(CategoricalEncoder(encoding='ordinal'),
                   Imputer(strategy='most_frequent'),
                   CategoricalEncoder()), categorical))
model = make_pipeline(preprocess, LogisticRegression())

For many people, this is the simplest and most basic use case of scikit-learn: fitting a logistic regression on a tabular dataset. Right now, it is not simple.

I think we should strive to support this as a first-class use-case - because it is.

Right now, everywhere I look I see custom implementations to work around this issue. It would be great if we could solve this problem here once and for all. It's actually quite simple.

In #2888 (comment) @jnothman said

This sort of sounds like it belongs in some sklearn-building-blocks library rather than the core library, but I see why you'd want it here. From an "everything is an array" traditional scikit-learn perspective, it looks like automl magic. From an R "everything is columns of different data types" perspective it looks like nothing else should have been considered.

In a sense I'd be happy having a building blocks library. But the question is: what else would there be in it? The only thing that literally every person I talk to needs and that we don't have is being able to handle categorical variables in a reasonable way.

EDIT:

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ~categorical),
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                   OneHotEncoder(handle_unknown='ignore')), categorical))
model = make_pipeline(preprocess, LogisticRegression())

is now possible and mostly does what I wanted it to do.
Finding the correspondence between the coefficients and the original features is still challenging, though, and this is still quite a bit of code. Having a BasicPreprocessor class that does the above might be useful.
