Add built-in dataset with missing values and categorical data? #12433
Comments
On UCI, Adult is 4 MB; on Kaggle it's 400 kB? (The Kaggle version is zipped.) |
You want a dataset with values missing completely at random? Patterns in missing values could be interesting too. |
I guess I would be most interested in not missing at random. The main point was to have a non-trivial amount of missing data and possibly across multiple features. |
I just realized the biggest problem with this issue: we can either return a weirdly typed numpy array or loading the dataset will have a soft requirement on having pandas installed.... |
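To make the typing concern concrete, here is a minimal sketch using only NumPy. The rows are made up for illustration and are not taken from any real dataset. It shows the two options without pandas: an object-dtype array, which loses the numeric dtype of the numeric column, or a structured array, which keeps per-column dtypes but has no natural missing-value marker for strings.

```python
import numpy as np

# Made-up rows mixing a numeric feature and a categorical feature,
# each with one missing entry.
rows = [(22.0, "S"), (np.nan, "C"), (35.0, None)]

# Option 1: a plain 2D array falls back to dtype=object for every column.
as_object = np.array(rows, dtype=object)

# Option 2: a structured array keeps one dtype per column, but a missing
# string has to be encoded by convention (here: an empty string).
as_structured = np.array(
    [(22.0, "S"), (np.nan, "C"), (35.0, "")],
    dtype=[("age", "f8"), ("embarked", "U1")],
)

print(as_object.dtype)             # object
print(as_structured["age"].dtype)  # float64
```

Neither form is something most estimators consume directly, which is why a data frame is the more natural container here.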
I think if we make it possible to get data frames from fetch_openml, and we
have a way to insert missing values in some way that is not completely at
random.
|
@jnothman Sorry I don't think I understood your statement. |
Yes, my statement was not a complete sentence. I was basically saying I don't think we especially need a real-world dataset with missing values at this point. I think we should make it possible to fetch Titanic as a dataframe (#11875). And I think we should make it possible to insert missing values in a not-completely-at-random fashion (#6284). |
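As a rough illustration of what "not completely at random" could mean in practice, here is a minimal sketch in which the probability of masking one feature depends on the observed value of another. The helper name `insert_missing_not_at_random` and its parameters are hypothetical, not an existing scikit-learn API, and the data is synthetic.

```python
import numpy as np


def insert_missing_not_at_random(X, target_col, driver_col,
                                 base_rate=0.05, high_rate=0.5, seed=0):
    """Hypothetical helper: mask X[:, target_col] with a probability
    that depends on whether X[:, driver_col] is above its median."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float, copy=True)  # never modify the input
    driver = X[:, driver_col]
    above = driver > np.median(driver)
    # Rows with a large driver value are masked much more often.
    p = np.where(above, high_rate, base_rate)
    mask = rng.random(X.shape[0]) < p
    X[mask, target_col] = np.nan
    return X, mask


# Synthetic data, for illustration only.
X = np.random.default_rng(42).normal(size=(1000, 3))
X_missing, mask = insert_missing_not_at_random(X, target_col=0, driver_col=1)
```

Because the mask depends on an observed column, dropping incomplete rows would bias the distribution of that column, which is the kind of pattern a completely-at-random mask cannot exercise.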
I agree, we should be able to fetch Titanic. I was a bit disappointed when I realized we're still doing a pandas read_csv to my GitHub repo in an example. |
The Ames housing dataset has missing values in the CSV files that can be found on Kaggle (login required to download the data): https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data There is also a version on OpenML but it does not have missing values: https://www.openml.org/d/41211 It seems that they have been imputed with zeros. The OpenML version is apparently derived from this R package: https://cran.r-project.org/web/packages/AmesHousing/index.html but the data is in rda format. I am not sure if the missing values are there. If you are interested, I trained some baseline models and did some model inspection in the following notebook: https://nbviewer.jupyter.org/github/ogrisel/notebooks/blob/master/sklearn_demos/ames_housing.ipynb |
BTW the original dataset is there:
and the paper describing the motivation, design and analysis of this dataset: http://jse.amstat.org/v19n3/decock.pdf |
OpenML allows having several versions of a dataset, so we could push a version with the missing data. |
I did not observe crazy variance in the notebook above but maybe the outliers have already been removed? |
I think there are quite few missing values in this dataset |
Follow up on #12424:
We don't have a built-in dataset with missing values and/or mixed feature types. I think it would be good to have one to have non-synthetic examples for the column transformer and imputation strategies.
While I like to reduce the number of custom fetchers and would like to use fetch_openml as much as possible, I think there's a benefit to having a built-in dataset so that examples can run without internet connection.
Not sure what good candidates would be. Titanic is somewhat obvious though I'm not sure about missingness patterns there. Adult would be nice but might be too large to ship (4 MB - would double the size of the wheel so seems unreasonable).
Maybe the Ames housing data would be appropriate; not sure if it has missing values.
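For context, the kind of example such a dataset would support might look roughly like the sketch below. The data frame is a tiny made-up stand-in (column names loosely inspired by Titanic, values invented), and the choice of imputation strategies is an assumption for illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up mixed-type data with missing values in both kinds of columns.
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0, np.nan, 41.0],
    "fare": [7.25, 71.3, np.nan, 26.55, 8.05, 13.0],
    "embarked": ["S", "C", np.nan, "S", "Q", "S"],
})

# Numeric columns: median imputation, then scaling.
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical columns: most-frequent imputation, then one-hot encoding.
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["age", "fare"]),
        ("cat", categorical, ["embarked"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

Xt = preprocess.fit_transform(df)
print(Xt.shape)
```

With a built-in dataset, the hand-written data frame above would be replaced by a single loader call and the rest of the example would stay the same.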