-
-
Notifications
You must be signed in to change notification settings - Fork 26.2k
Description
I think we should add a dataset with missing data and string-encoded categorical variables.
These are really common (i.e. I rarely see a dataset that doesn't have these properties), and it would be good to have examples with best practices.
Right now these are really painful to implement with sklearn, but that will hopefully be fixed soon.
(doing separate scaling and imputation for categorical and continuous data within a pipeline ...).
There are some issues tracking this, but independently I'd like to think about a dataset to include. We could also punt and wait for the openml interface, but then we'd still have to decide which dataset to use.
I like adult but that's a bit big. We could use a subset of adult, but maybe that's a bit non-standard?
(Also adult is soooo depressing).
(What I'd like from the example is doing mean imputation for continuous variables, most frequent imputation for the categorical variables, add imputation indicators, scale the continuous variables with StandardScaler and do OneHotEncoder (at some point, probably before imputation?).
I used real-world government, housing and health data in my class and that was the minimum workflow you'd need to most scikit-learn estimators. But the actual content of the example is more related to #3886)