Add example dataset with missing data and string-encoded categorical variables #8888

@amueller

I think we should add a dataset with missing data and string-encoded categorical variables.
These are really common (i.e. I rarely see a dataset that doesn't have these properties), and it would be good to have examples with best practices.

Right now these are really painful to implement with sklearn (doing separate scaling and imputation for categorical and continuous data within a pipeline ...), but that will hopefully be fixed soon.

There are some issues tracking this, but independently I'd like to think about a dataset to include. We could also punt and wait for the openml interface, but then we'd still have to decide which dataset to use.

I like adult but that's a bit big. We could use a subset of adult, but maybe that's a bit non-standard?
(Also adult is soooo depressing).

What I'd like from the example is: mean imputation for the continuous variables, most-frequent imputation for the categorical variables, imputation indicators, scaling the continuous variables with StandardScaler, and OneHotEncoder (at some point, probably before imputation?).
I used real-world government, housing and health data in my class, and that was the minimum workflow you'd need to use most scikit-learn estimators. But the actual content of the example is more related to #3886.
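
For concreteness, here is a rough sketch of that workflow. It assumes a scikit-learn release that ships `ColumnTransformer`, `SimpleImputer` and `MissingIndicator`, which is newer than what was available when this issue was opened. The toy DataFrame and its column names (`age`, `hours`, `workclass`) are made up for illustration and are not a proposal for the dataset. The sketch imputes before one-hot encoding, which is only one possible answer to the ordering question above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two continuous columns and one string-encoded
# categorical column, each with one missing value.
df = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 31.0],
    "hours": [40.0, 35.0, np.nan, 50.0],
    "workclass": pd.Series(["private", np.nan, "gov", "private"], dtype=object),
})
num_cols = ["age", "hours"]
cat_cols = ["workclass"]

# Continuous variables: mean imputation, then standard scaling.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical variables: most-frequent imputation, then one-hot encoding.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
        # One binary "was missing" indicator per original column.
        ("missing", MissingIndicator(features="all"), num_cols + cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xt = preprocess.fit_transform(df)
print(Xt.shape)
```

The output columns are ordered as scaled continuous features, one-hot categorical features, then the missingness indicators. Putting `preprocess` as the first step of a `Pipeline` in front of an estimator keeps the imputation and scaling statistics inside cross-validation.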
