fetch_openml: Add an option to ignore some features, especially STRING type #11819

Closed
jnothman opened this issue Aug 15, 2018 · 17 comments · Fixed by #13902
Labels
Easy (well-defined and straightforward way to resolve), Enhancement
Comments

@jnothman
Member

Users should be able to specify names of features to ignore in fetch_openml, or should be able to specify that it is safe to discard features with STRING type. This will avoid attempting to convert them to float64.

Or we can not do this, and instead just support returning DataFrames as in #11818.
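The first option above (ignoring features by name, or discarding every STRING feature) could be sketched roughly as follows. This is a hypothetical illustration, not scikit-learn's implementation: the `select_features` helper, its `ignore_names` and `ignore_string` parameters, and the metadata layout (dicts with `name` and `data_type` keys) are all assumptions made for the example.

```python
# Hypothetical sketch of the proposed option; not part of scikit-learn.
def select_features(features, ignore_names=(), ignore_string=False):
    """Return the names of the features to keep.

    `features` is a list of dicts with "name" and "data_type" keys,
    loosely modelled on dataset feature metadata (an assumption).
    """
    ignored = set(ignore_names)
    kept = []
    for feature in features:
        if feature["name"] in ignored:
            continue  # explicitly ignored by the user
        if ignore_string and feature["data_type"] == "string":
            continue  # free-form strings are discarded on request
        kept.append(feature["name"])
    return kept


# Made-up metadata, for illustration only.
features = [
    {"name": "age", "data_type": "numeric"},
    {"name": "sex", "data_type": "nominal"},
    {"name": "name", "data_type": "string"},
    {"name": "ticket", "data_type": "string"},
]

print(select_features(features, ignore_names=["ticket"]))
print(select_features(features, ignore_string=True))
```

Filtering before any conversion means the string columns never reach the float64 conversion step at all.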

@jnothman added the Easy (well-defined and straightforward way to resolve), Enhancement, and help wanted labels Aug 15, 2018
@jorisvandenbossche
Member

Just thinking aloud. Another option might be to have an option to specify that the output should be a numerical array? Which would mean a combination of encoding nominal features + dropping string features.
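A minimal sketch of what such a "numerical array" mode might do, assuming each feature is labelled `"numeric"`, `"nominal"` or `"string"`. The `to_numeric_array` helper and the sample rows are invented for illustration, and the integer codes here are derived from the sorted distinct values, which is only one possible encoding.

```python
import numpy as np


# Hypothetical helper; the name and signature are assumptions.
def to_numeric_array(rows, features):
    """Encode nominal columns as integer codes and drop string columns.

    `rows` is a list of tuples with one value per feature, in the same
    order as `features`, a list of (name, data_type) pairs.
    """
    columns = []
    kept_names = []
    for index, (name, data_type) in enumerate(features):
        values = [row[index] for row in rows]
        if data_type == "string":
            continue  # free-form text cannot be encoded; drop it
        if data_type == "nominal":
            # Map each distinct category to an integer code.
            categories = sorted(set(values))
            code_of = {cat: code for code, cat in enumerate(categories)}
            values = [code_of[value] for value in values]
        columns.append(np.asarray(values, dtype=np.float64))
        kept_names.append(name)
    return np.column_stack(columns), kept_names


# Made-up data, for illustration only.
features = [("age", "numeric"), ("sex", "nominal"), ("name", "string")]
rows = [
    (29.0, "female", "Alice A."),
    (41.5, "male", "Bob B."),
    (30.0, "male", "Carol C."),
]
X, names = to_numeric_array(rows, features)
print(names)
print(X)
```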

@amueller
Member

String types are likely categorical data, right?

@jorisvandenbossche
Member

No, so openml distinguishes 'nominal' data and 'string' data. Nominal is categorical data, string data are just strings (can be all unique, varying, ..).
Currently (as merged in master), nominal strings are converted to integers (to have a numerical array as output), and if there are string data it raises an error (which basically means you cannot download such data right now).
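The behaviour described above can be pictured with a small sketch. This is not the code merged in master: `encode_column` and its arguments are hypothetical, and the sketch assumes that a nominal feature comes with a declared list of categories while a string feature does not.

```python
# Hypothetical sketch of the described behaviour; not scikit-learn code.
def encode_column(values, data_type, categories=None):
    """Convert one column to numbers, or fail for string data."""
    if data_type == "numeric":
        return [float(value) for value in values]
    if data_type == "nominal":
        # Nominal data has a fixed, declared set of categories, so each
        # value can be replaced by its position in that list.
        return [categories.index(value) for value in values]
    if data_type == "string":
        # String data may be all unique, so there is no sensible
        # integer encoding; the whole fetch fails instead.
        raise ValueError("STRING attributes are not supported")
    raise ValueError(f"unknown data type: {data_type!r}")


print(encode_column(["male", "female", "male"], "nominal", ["female", "male"]))

try:
    encode_column(["Alice A.", "Bob B."], "string")
except ValueError as exc:
    print("error:", exc)
```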

@amueller
Member

ah, that's fine, there's no string data on OpenML afaik because the backend doesn't support it last time I checked...

@amueller
Member

amueller commented Aug 20, 2018

@janvanrijn and @joaquinvanschoren can tell me if I'm wrong ;)

@jorisvandenbossche
Member

There is .. It is the reason that currently we cannot load the Titanic dataset: #11419 (review) (https://www.openml.org/d/40945)

@janvanrijn
Contributor

> @janvanrijn and @joaquinvanschoren can tell me if I'm wrong ;)

The OpenML backend supports string features without problems. By contrast, openml-python (https://github.com/openml/openml-python) has problems with them (and throws an error if the dataset contains a string feature).

We should probably add this to the fetch_openml fn too.

@jnothman
Member Author

jnothman commented Aug 21, 2018 via email

@amueller
Member

amueller commented Nov 27, 2018

Btw this means we can't load titanic :-/ (fetch_openml("titanic") yields a small subset of the original features unfortunately). cc @janvanrijn ;)

@amueller
Member

I think adding a boolean option to ignore strings as a quick fix would be good.

@oanise93

Hi, I'm new to contributing to open-source projects but wondered if I could take on this issue. I use fetch_openml a lot and would like to give back.

@jnothman
Member Author

Please feel free to submit a PR

oanise93 added a commit to oanise93/scikit-learn that referenced this issue Feb 16, 2019
attributes by providing the option to ignore said attributes.

Right now, an error is raised when a dataset containing string
attributes (e.g., the Titanic dataset) is fetched from OpenML.
This commit allows users to specify whether or not they are okay loading
only a subset of the data. Closes scikit-learn#11819.
oanise93 added seven more commits with the same message to oanise93/scikit-learn that referenced this issue: one on Feb 16, 2019, four on Mar 8, 2019, and two on Mar 9, 2019.
@jnothman jnothman added this to the 0.21 milestone Apr 16, 2019
@jnothman
Member Author

I'm moving this to 0.22 :(

@amueller
Member

I'm removing "help wanted" as I think this will be resolved by #13902

@janvanrijn
Contributor

This issue fell off my radar. Will it be integrated in the next release? If yes, anything I can do to help?

@amueller
Member

@janvanrijn I think it's mostly done, thanks though!

@janvanrijn
Contributor

Feel free to ping me during the sprints if there is anything related to the OpenML fetcher that you want my quick input or code contribution on.
