I brought this up in a couple of places, but I think it's important enough to track as a separate issue. We often cite the tenet "Simple things should be simple, complex things should be possible." However, right now we don't support what I think is the most simple and trivial data-science use case with tabular data.
By default, in the data science world, you have a mix of continuous and discrete variables, and there are missing values. Right now, it's impossible to do cross-validation on a dataset like that.
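To make the failure concrete, here is a minimal sketch of such a dataset (the column names and values are invented for illustration); passing it straight to an estimator fails because the object-dtype column cannot be coerced to floats:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# A tiny mixed-type table: one continuous column with a missing value,
# one categorical (object-dtype) column.
X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0],
    "city": ["NY", "SF", "NY", "LA"],
})
y = [0, 1, 0, 1]

failed = False
try:
    # Input validation tries to cast everything to float64 and chokes on "NY"
    # (and the NaN in "age" would also need imputing).
    LogisticRegression().fit(X, y)
except ValueError:
    failed = True
print("plain fit failed:", failed)
```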
With #9012 in the next release this would be possible using something like
categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(Imputer(strategy='median'), StandardScaler()), ~categorical),
    (make_pipeline(CategoricalEncoder(encoding='ordinal'),
                   Imputer(strategy='most_frequent'), CategoricalEncoder()),
     categorical))
model = make_pipeline(preprocess, LogisticRegression())
For many people, this is the simplest and most basic use case of scikit-learn: fitting logistic regression on a tabular dataset. This is not simple.
I think we should strive to support this as a first-class use-case - because it is.
Right now, everywhere I look I see custom implementations to work around this issue. It would be great if we could solve this problem here once and for all. It's actually quite simple.
In #2888 (comment) @jnothman said
This sort of sounds like it belongs in some sklearn-building-blocks library rather than the core library, but I see why you'd want it here. From an "everything is an array" traditional scikit-learn perspective, it looks like automl magic. From an R "everything is columns of different data types" perspective it looks like nothing else should have been considered.
In a sense I'd be happy having a building blocks library. But the question is: what else would there be in it? The only thing that literally every person I talk to needs and that we don't have is being able to handle categorical variables in a reasonable way.
EDIT:
categorical = X.dtypes == object
preprocess = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ~categorical),
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                   OneHotEncoder(handle_unknown='ignore')), categorical))
model = make_pipeline(preprocess, LogisticRegression())
is now possible and mostly does what I wanted it to do.
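Spelled out with imports, the same pipeline runs end to end, including cross-validation. A minimal sketch, assuming a released scikit-learn (where the constant-fill imputer is SimpleImputer from sklearn.impute); the data here is invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented mixed-type data: continuous columns with NaNs, one object column.
X = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 33.0, 52.0, 29.0],
    "income": [40.0, 55.0, np.nan, 61.0, 80.0, 47.0],
    "city": ["NY", "SF", "NY", "LA", np.nan, "SF"],
})
y = np.array([0, 1, 0, 1, 1, 0])

categorical = X.dtypes == object
preprocess = make_column_transformer(
    # Continuous columns: impute (mean by default), then scale.
    (make_pipeline(SimpleImputer(), StandardScaler()), ~categorical),
    # Categorical columns: fill missing with a sentinel, then one-hot encode.
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='NA'),
                   OneHotEncoder(handle_unknown='ignore')), categorical))
model = make_pipeline(preprocess, LogisticRegression())

# Cross-validation on mixed-type data with missing values now just works.
scores = cross_val_score(model, X, y, cv=2)
print(scores)
```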
Finding the correspondence between the coefficients and the original features is still challenging, and this is still quite a bit of code. Having a BasicPreprocessor class that does the above might be useful.