Add built-in dataset with missing values and categorical data? #12433

amueller · 2018-10-22T17:09:46Z

Follow up on #12424:
We don't have a built-in dataset with missing values and/or mixed feature types. I think it would be good to have one, so that we have non-synthetic examples for the column transformer and imputation strategies.
While I'd like to reduce the number of custom fetchers and use fetch_openml as much as possible, I think there's a benefit to having a built-in dataset so that examples can run without an internet connection.

Not sure what good candidates would be. Titanic is the obvious one, though I'm not sure about missingness patterns there. Adult would be nice but might be too large to ship (4 MB, which would double the size of the wheel, so that seems unreasonable).

Maybe the ames housing data would be appropriate, not sure if it has missing values.
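For illustration, here is a minimal sketch of the kind of example such a dataset would serve: a column transformer that imputes and encodes mixed-type columns. The toy frame and its column names are made up for this sketch and do not come from any of the datasets discussed here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing entries in numeric and categorical columns.
X = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan],
    "fare": [7.25, 71.3, np.nan, 26.55, 8.05],
    "embarked": ["S", "C", np.nan, "S", "Q"],
})

# Numeric columns: fill with the median, then standardize.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical columns: fill with the most frequent category, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "fare"]),
        ("cat", categorical, ["embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

With real data, the same preprocessing step would typically be placed in front of an estimator in a pipeline.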


amueller · 2018-10-22T17:13:56Z

On UCI, Adult is 4 MB; on Kaggle it's 400k?
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
https://www.kaggle.com/uciml/adult-census-income

(kaggle is zipped)

jeremiedbb · 2018-10-22T17:46:25Z

> I'm not sure about missingness patterns there

You want a dataset with values missing completely at random? Patterns in missing values could be interesting too.

amueller · 2018-10-22T17:49:08Z

I guess I would be most interested in data that is not missing at random. The main point was to have a non-trivial amount of missing data, possibly across multiple features.
Any dataset that's both small enough and interesting enough should be fine. I wouldn't worry about trying to characterize the missingness statistically (which is probably impossible given that the data is actually missing).

amueller · 2018-10-23T16:37:39Z

I just realized the biggest problem with this issue: either we return a weirdly typed NumPy array, or loading the dataset will have a soft dependency on pandas being installed...

jnothman · 2018-10-28T04:42:27Z

I think if we make it possible to get data frames from fetch_openml, and we have a way to insert missing values in some way that is not completely at random.

amueller · 2018-10-29T01:37:18Z

@jnothman Sorry I don't think I understood your statement.

jnothman · 2018-10-30T02:49:52Z

Yes, my statement was not a complete sentence.

I was basically saying I don't think we especially need a real-world dataset with missing values at this point.

I think we should make it possible to fetch Titanic as a dataframe (#11875).

And I think we should make it possible to insert missing values in a not-completely-at-random fashion (#6284).
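As a rough sketch of what "not completely at random" could mean in practice, the helper below blanks out one column with a probability that depends on another column. The function name, its parameters, and the toy rows are hypothetical and are not the API proposed in #6284.

```python
import math
import random


def mask_not_at_random(rows, target, driver, threshold, p_high, p_low, seed=0):
    """Return copies of `rows` where `target` is set to NaN with a
    probability that depends on the value of `driver`."""
    rng = random.Random(seed)
    out = []
    for row in rows:
        # Rows whose driver value exceeds the threshold are masked more often.
        p = p_high if row[driver] > threshold else p_low
        new_row = dict(row)  # leave the input untouched
        if rng.random() < p:
            new_row[target] = math.nan
        out.append(new_row)
    return out


# Hypothetical toy data: income is more likely to be missing for older people.
rows = [{"age": float(a), "income": 1000.0 * a} for a in range(20, 80)]
masked = mask_not_at_random(
    rows, target="income", driver="age", threshold=50, p_high=0.6, p_low=0.05
)
n_missing = sum(math.isnan(r["income"]) for r in masked)
print(n_missing, "of", len(masked), "income values masked")
```

Passing the same column as both `target` and `driver` makes the missingness depend on the hidden value itself, which is the harder case for imputation.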

amueller · 2018-11-27T17:38:07Z

I agree, we should be able to fetch Titanic. I was a bit disappointed when I realized we're still doing a pandas read_csv from my GitHub repo in an example.
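For reference, a minimal sketch of how this looks in scikit-learn releases where fetch_openml accepts as_frame (an assumption about the installed version; it also needs pandas and a network connection). The helper name is made up for this sketch.

```python
def load_titanic_frame():
    """Fetch Titanic from OpenML and return (features, target) as pandas objects."""
    # Imported inside the function so the dependency is only needed
    # when the data is actually fetched.
    from sklearn.datasets import fetch_openml

    titanic = fetch_openml("titanic", version=1, as_frame=True)
    return titanic.data, titanic.target
```

Calling `X, y = load_titanic_frame()` and then `X.isna().sum()` would give the per-column count of missing entries.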

ogrisel · 2020-02-19T12:46:47Z

> Maybe the ames housing data would be appropriate, not sure if it has missing values.

The Ames housing dataset has missing values in the CSV files that can be found on Kaggle (login required to download the data):

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

There is also a version on OpenML, but it does not have missing values:

https://www.openml.org/d/41211

It seems that they have been imputed with zeros.

The openml version is apparently derived from this R package:

https://cran.r-project.org/web/packages/AmesHousing/index.html

but the data is in .rda format. I am not sure whether the missing values are preserved there.

If you are interested, I trained some baseline models and did some model inspection in the following notebook:

https://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/sklearn_demos/ames_housing.ipynb

ogrisel · 2020-02-20T15:28:05Z

BTW the original dataset is there:

and the paper describing the motivation, design and analysis of this dataset: http://jse.amstat.org/v19n3/decock.pdf

amueller · 2020-02-20T17:31:31Z

OpenML allows having several versions of a dataset, so we could push a version with the missing data.
For the Ames Housing dataset, you need to log-transform and remove outliers. If you don't remove outliers, your cross-validation scores will have crazy variance, in my experience.

ogrisel · 2020-02-20T17:53:29Z

Actually as @maikia found in #16345 there is already another version of the same dataset. But it does not have the missing values either. Maybe the ARFF format is limiting this? I haven't investigated.

ogrisel · 2020-02-20T17:54:39Z

I did not observe crazy variance in the notebook above, but maybe the outliers had already been removed?

amueller · 2020-02-20T18:23:45Z

In your initial prototype they are in the training set, and I guess gradient boosting doesn't care as much.
ARFF definitely supports missing values.
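As a small illustration: ARFF marks a missing value with a question mark in the data section. The sketch below parses a made-up dense ARFF snippet with a deliberately simplistic hand-written parser (no quoting, no sparse rows); the relation and attribute names are hypothetical.

```python
import math

# Hypothetical ARFF content; '?' denotes a missing value.
ARFF_TEXT = """\
@relation toy
@attribute lot_area numeric
@attribute alley {Grvl,Pave}
@data
8450,?
9600,Grvl
?,Pave
"""


def parse_arff_data(text):
    """Parse the @data rows of a simple dense ARFF file, mapping '?' to missing."""
    attrs, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attrs.append((name, kind.lower() in ("numeric", "real", "integer")))
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, is_numeric), value in zip(attrs, values):
                if value == "?":
                    # Missing: NaN for numeric attributes, None for nominal ones.
                    row[name] = math.nan if is_numeric else None
                else:
                    row[name] = float(value) if is_numeric else value
            rows.append(row)
    return rows


rows = parse_arff_data(ARFF_TEXT)
print(rows)
```

So if a given OpenML version lacks missing values, that would come from how the file was prepared rather than from the format.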
These guys here:

See here:
https://github.com/amueller/COMS4995-s20/blob/master/slides/aml-05-linear-models-regression/aml-05-linear-models-regression-Copy1.ipynb

If you do a linear model, I would be curious to see how performance varies across cross-validation splits.
This is what I got for ridge:

maikia · 2020-02-21T08:56:50Z

> Actually as @maikia found in #16345 there is already another version of the same dataset. But it does not have the missing values either. Maybe the ARFF format is limiting this? I haven't investigated.

I think there are quite a few missing values in this dataset:
https://www.openml.org/d/42165

cmarmo added module:datasets New Feature labels Feb 6, 2022

glemaitre mentioned this issue May 16, 2024

Add missing values and categorical features when generating datasets #28952
