
Support standard data science use-case #10603

@amueller

Description

I brought this up in a couple of places, but I think it's important enough to be tracked as a separate issue. We often cite the tenet "Simple things should be simple, complex things should be possible." However, right now we don't support what I think is the simplest and most common data-science use case with tabular data.

By default, in the data science world, you have a mix of continuous and discrete variables, and there are missing values. Right now, it's impossible to do cross-validation on a dataset like that.
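Concretely, a dataset of that kind looks like this (column names and values are invented for illustration): a mix of numeric and object columns, each with missing values, which no single scikit-learn estimator can consume directly.

```python
# A minimal mixed-type dataset: continuous and categorical columns,
# both containing missing values.
import numpy as np
import pandas as pd

X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "income": [50000.0, 62000.0, np.nan, 48000.0],
    "city": ["NYC", "SF", np.nan, "NYC"],
})

# Continuous vs. categorical columns can be told apart by dtype:
categorical = X.dtypes == object
print(categorical.tolist())  # [False, False, True]
```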

With #9012 in the next release this would be possible using something like

categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(Imputer(strategy='median'), StandardScaler()), ~categorical),
    (make_pipeline(CategoricalEncoder(encoding='ordinal'),
                   Imputer(strategy='most_frequent'),
                   CategoricalEncoder()), categorical))
model = make_pipeline(preprocess, LogisticRegression())

For many people, this is the simplest and most basic use case of scikit-learn: fitting a logistic regression on a tabular dataset. Right now, it is not simple.

I think we should strive to support this as a first-class use-case - because it is.

Right now, everywhere I look I see custom implementations to work around this issue. It would be great if we could solve this problem here once and for all. It's actually quite simple.

In #2888 (comment) @jnothman said

This sort of sounds like it belongs in some sklearn-building-blocks library rather than the core library, but I see why you'd want it here. From an "everything is an array" traditional scikit-learn perspective, it looks like automl magic. From an R "everything is columns of different data types" perspective it looks like nothing else should have been considered.

In a sense I'd be happy having a building blocks library. But the question is: what else would there be in it? The only thing that literally every person I talk to needs and that we don't have is being able to handle categorical variables in a reasonable way.

EDIT:

from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ~categorical),
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                   OneHotEncoder(handle_unknown='ignore')), categorical))
model = make_pipeline(preprocess, LogisticRegression())

is now possible and mostly does what I wanted it to do.
Finding the correspondence between the coefficients and the original features is still challenging, though, and this is still quite a bit of code. Having a BasicPreprocessor class that does the above might be useful.
