Add built-in dataset with missing values and categorical data? #12433

Open
amueller opened this issue Oct 22, 2018 · 15 comments

@amueller
Member

Follow up on #12424:
We don't have a built-in dataset with missing values and/or mixed feature types. I think it would be good to have one so that we have non-synthetic examples for the column transformer and imputation strategies.
While I'd like to reduce the number of custom fetchers and use fetch_openml as much as possible, I think there's a benefit to having a built-in dataset so that examples can run without an internet connection.

Not sure what good candidates would be. Titanic is somewhat obvious, though I'm not sure about missingness patterns there. Adult would be nice but might be too large to ship (4 MB, which would double the size of the wheel, so that seems unreasonable).

Maybe the ames housing data would be appropriate, not sure if it has missing values.
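
A minimal sketch of the kind of example such a dataset would feed, assuming pandas is installed. The toy frame and its column names are made up for illustration and stand in for a real dataset; the pipeline is one possible arrangement, not a recommended one:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column,
# each with at least one missing entry.
X = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0, np.nan],
        "fare": [7.25, 71.28, np.nan, 26.55, 8.05],
        "embarked": ["S", "C", np.nan, "S", "Q"],
    }
)

# Numeric columns: fill gaps with the median, then standardize.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

# Categorical column: fill gaps with the most frequent value, then one-hot encode.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "fare"]),
        ("cat", categorical, ["embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xt = preprocess.fit_transform(X)
print(Xt.shape)
```

The result has two scaled numeric columns plus one indicator column per observed category.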

@amueller
Member Author

amueller commented Oct 22, 2018

On UCI, Adult is 4 MB; on Kaggle it's 400 KB?
https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
https://www.kaggle.com/uciml/adult-census-income

(The Kaggle version is zipped.)

@jeremiedbb
Member

> I'm not sure about missingness patterns there

You want a dataset with values missing completely at random? Patterns in missing values could be interesting too.

@amueller
Member Author

amueller commented Oct 22, 2018

I guess I would be most interested in data that is not missing at random. The main point was to have a non-trivial amount of missing data, possibly across multiple features.
Any dataset that's both small enough and interesting enough should be fine. I wouldn't worry about trying to do statistics (which are probably impossible to compute given that the data is actually missing).

@amueller
Member Author

I just realized the biggest problem with this issue: either we return a weirdly typed NumPy array, or loading the dataset has a soft requirement on pandas being installed.
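
To illustrate what "weirdly typed" can mean in practice, here is a small sketch with made-up rows (not taken from any dataset in this thread). Mixed numeric and string columns in a single NumPy array end up with `object` dtype, and numeric work then needs an explicit cast:

```python
import numpy as np

# Hypothetical rows mixing a float, a string, and missing entries.
rows = [
    (22.0, "S"),
    (np.nan, "C"),
    (35.0, None),
]

# One array for everything forces the generic object dtype.
X = np.array(rows, dtype=object)
print(X.dtype)  # object

# Numeric reductions need the numeric column cast back to float first.
age = X[:, 0].astype(float)
mean_age = np.nanmean(age)
```

A pandas DataFrame would instead keep one dtype per column, which is the trade-off described above.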

@jnothman
Member

jnothman commented Oct 28, 2018 via email

@amueller
Member Author

@jnothman Sorry, I don't think I understood your statement.

@jnothman
Member

Yes, my statement was not a complete sentence.

I was basically saying I don't think we especially need a real-world dataset with missing values at this point.

I think we should make it possible to fetch Titanic as a dataframe (#11875).

And I think we should make it possible to insert missing values in a not-completely-at-random fashion (#6284).
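
As a rough sketch of what inserting values "not completely at random" could look like (this is not the proposal in #6284, and the variable names and logistic link are assumptions chosen for illustration), the probability that one feature is masked can be made to depend on another, observed feature:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: `income` stays fully observed, `age` receives the gaps.
n = 1000
income = rng.normal(size=n)
age = rng.normal(size=n)

# The chance that `age` is missing rises with `income` (logistic link),
# so the mask depends on an observed feature instead of being uniform.
p_missing = 1.0 / (1.0 + np.exp(-2.0 * income))
mask = rng.random(n) < p_missing

age_with_gaps = age.copy()
age_with_gaps[mask] = np.nan
```

With a constant `p_missing` the same code would produce values missing completely at random, which makes the two settings easy to compare.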

@amueller
Member Author

I agree, we should be able to fetch Titanic. I was a bit disappointed when I realized we're still doing a pandas read_csv from my GitHub repo in an example.
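
For reference, a sketch of fetching Titanic as a dataframe, assuming a scikit-learn release in which `fetch_openml` accepts `as_frame` and assuming pandas is installed. The dataset name and version used here are assumptions, and the call needs network access, so it is wrapped in a function rather than run directly:

```python
from sklearn.datasets import fetch_openml


def load_titanic_frame():
    """Download Titanic from OpenML and return (X, y) as pandas objects."""
    # "titanic" / version=1 are assumed identifiers; adjust if OpenML differs.
    return fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)


# Usage (requires an internet connection, so it is left commented out):
# X, y = load_titanic_frame()
# X.isna().sum()  # per-column count of missing values
```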

@ogrisel
Member

ogrisel commented Feb 19, 2020

> Maybe the ames housing data would be appropriate, not sure if it has missing values.

The Ames housing dataset has missing values in the CSV files that can be found on Kaggle (login required to download the data):

https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data

There is also a version on OpenML but it does not have missing values:

https://www.openml.org/d/41211

It seems that they have been imputed with zeros.

The OpenML version is apparently derived from this R package:

https://cran.r-project.org/web/packages/AmesHousing/index.html

but the data is in RDA format. I am not sure if the missing values are there.

If you are interested, I trained some baseline models and did some model inspection in the following notebook:

https://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/sklearn_demos/ames_housing.ipynb

@ogrisel
Member

ogrisel commented Feb 20, 2020

@amueller
Member Author

OpenML allows having several versions of a dataset, so we could push a version with the missing data.
For the Ames Housing dataset, you need to apply a log transform and remove outliers. In my experience, if you don't remove the outliers, your cross-validation score will have crazy variance.
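
A sketch of one way to check this kind of claim, on synthetic data rather than Ames. It assumes the log transform is applied to the target (the comment does not say which variable is meant) and uses `TransformedTargetRegressor` so that scores are computed on the original scale:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for a skewed, strictly positive target such as a price.
X = rng.normal(size=(200, 5))
coef = np.array([0.5, -0.3, 0.2, 0.0, 0.1])
y = np.exp(X @ coef + 0.1 * rng.normal(size=200))

# Fit Ridge on log(y); predictions are mapped back with exp before scoring.
model = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log, inverse_func=np.exp
)

# The spread of the per-fold scores is the quantity of interest.
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

Running the same loop with and without a handful of extreme rows in `X` and `y` would show how much the per-fold spread depends on outliers.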

@ogrisel
Member

ogrisel commented Feb 20, 2020

Actually as @maikia found in #16345 there is already another version of the same dataset. But it does not have the missing values either. Maybe the ARFF format is limiting this? I haven't investigated.

@ogrisel
Member

ogrisel commented Feb 20, 2020

I did not observe crazy variance in the notebook above but maybe the outliers have already been removed?

@amueller
Member Author

In your initial prototype they are in the training set, and I guess gradient boosting doesn't care as much.
ARFF definitely supports missing values.
These guys here:
[image]

See here:
https://github.com/amueller/COMS4995-s20/blob/master/slides/aml-05-linear-models-regression/aml-05-linear-models-regression-Copy1.ipynb

If you do a linear model, I would be curious to see how performance varies across cross-validation splits.
This is what I got for ridge:
[image]
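
On the ARFF point: the format marks a missing entry with `?`. A small sketch using `scipy.io.arff.loadarff` on a made-up two-column file (the relation and attribute names are hypothetical) shows numeric gaps coming back as NaN:

```python
import io

import numpy as np
from scipy.io import arff

# Hypothetical ARFF content; `?` marks a missing value.
text = """@relation houses
@attribute lot_frontage numeric
@attribute sale_price numeric
@data
65,208500
?,181500
68,?
"""

# loadarff accepts a file-like object and returns a structured array.
data, meta = arff.loadarff(io.StringIO(text))
lot_frontage = data["lot_frontage"]
sale_price = data["sale_price"]
print(np.isnan(lot_frontage).sum(), np.isnan(sale_price).sum())
```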

@maikia
Contributor

maikia commented Feb 21, 2020

> Actually as @maikia found in #16345 there is already another version of the same dataset. But it does not have the missing values either. Maybe the ARFF format is limiting this? I haven't investigated.

I think there are quite a few missing values in this dataset:
https://www.openml.org/d/42165

[image: missing_values]
