fetch_openml: Add an option to ignore some features, especially STRING type #11819
Comments
Just thinking aloud. Another possibility might be an option to specify that the output should be a numerical array, which would mean a combination of encoding nominal features and dropping string features. |
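To make the "numerical array" idea concrete, here is a minimal pure-Python sketch of encoding nominal columns and dropping string columns. The helper `to_numeric_rows`, its arguments, and the sample rows are all hypothetical and made up for illustration; this is not how `fetch_openml` is implemented.

```python
def to_numeric_rows(rows, data_types, nominal_values):
    """Drop string columns and integer-encode nominal ones (hypothetical helper).

    rows: list of rows, each a list of raw values
    data_types: per-column type, one of "numeric", "nominal", "string"
    nominal_values: dict mapping column index -> list of allowed categories
    """
    # Keep every column that is not of string type.
    kept = [i for i, t in enumerate(data_types) if t != "string"]
    out = []
    for row in rows:
        encoded = []
        for i in kept:
            if data_types[i] == "nominal":
                # Encode a category as its position in the list of allowed values.
                encoded.append(float(nominal_values[i].index(row[i])))
            else:
                encoded.append(float(row[i]))
        out.append(encoded)
    return kept, out


# Made-up sample data: one string, one nominal, and one numeric column.
rows = [
    ["Alice Smith", "female", 29.0],
    ["Bob Jones", "male", 41.5],
]
data_types = ["string", "nominal", "numeric"]
nominal_values = {1: ["female", "male"]}

kept, numeric = to_numeric_rows(rows, data_types, nominal_values)
print(kept)     # [1, 2]
print(numeric)  # [[0.0, 29.0], [1.0, 41.5]]
```

The string column is silently dropped here, so a caller would only learn which columns survived from the returned `kept` indices.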
String types are likely categorical data, right? |
No, so openml distinguishes 'nominal' data and 'string' data. Nominal is categorical data, string data are just strings (can be all unique, varying, ..). |
ah, that's fine, there's no string data on OpenML afaik because the backend doesn't support it last time I checked... |
@janvanrijn and @joaquinvanschoren can tell me if I'm wrong ;) |
There is. It is the reason that we currently cannot load the Titanic dataset: #11419 (review) (https://www.openml.org/d/40945) |
The OpenML backend supports string features without problems. In contrast, openml-python (https://github.com/openml/openml-python) has problems with them (and throws an error if a dataset contains a string feature). We should probably add this to the fetch_openml function too. |
fetch_openml does currently throw an appropriate error in case of strings. |
Btw this means we can't load titanic :-/ (fetch_openml("titanic") yields a small subset of the original features unfortunately). cc @janvanrijn ;) |
I think adding a boolean option to ignore strings as a quick fix would be good. |
Hi, I'm new to contributing to open-source projects but wondered if I could take on this issue. I use |
Please feel free to submit a PR |
attributes by providing the option to ignore said attributes. Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify whether or not they are okay loading only a subset of the data. Closes scikit-learn#11819.
I'm moving this to 0.22 :( |
I'm removing "help wanted" as I think this will be resolved by #13902 |
This issue dropped off my radar. Will it be integrated in the next release? If yes, is there anything I can do to help? |
@janvanrijn I think it's mostly done, thanks though! |
Feel free to ping me during the sprints if there is anything related to the OpenML fetcher that you want my quick input or code contribution on. |
Users should be able to specify names of features to ignore in fetch_openml, or should be able to specify that it is safe to discard features with STRING type. This will avoid attempting to convert them to float64.
Alternatively, we could skip this and instead just support returning DataFrames, as in #11818.
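As a rough illustration of the first proposal (ignoring features by name, or declaring STRING features safe to discard), the selection step could be sketched as below. The function `select_features`, the parameter names `ignore_names` and `ignore_strings`, and the feature metadata are assumptions made up for this sketch; they are not the actual `fetch_openml` signature.

```python
def select_features(features, ignore_names=(), ignore_strings=False):
    """Return the names of the features to keep (hypothetical helper).

    features: list of dicts with "name" and "data_type" keys
    ignore_names: feature names the caller wants dropped
    ignore_strings: if True, silently drop features of STRING type
    """
    ignore_names = set(ignore_names)
    kept = []
    for feature in features:
        if feature["name"] in ignore_names:
            continue
        if feature["data_type"] == "string":
            if ignore_strings:
                continue
            # Without an explicit opt-in, refuse rather than lose data silently.
            raise ValueError(
                "STRING feature {!r} cannot be converted to float64; list it "
                "in ignore_names or pass ignore_strings=True".format(feature["name"])
            )
        kept.append(feature["name"])
    return kept


# Made-up feature metadata, loosely modelled on a Titanic-like dataset.
features = [
    {"name": "name", "data_type": "string"},
    {"name": "sex", "data_type": "nominal"},
    {"name": "age", "data_type": "numeric"},
]

print(select_features(features, ignore_strings=True))    # ['sex', 'age']
print(select_features(features, ignore_names=["name"]))  # ['sex', 'age']
```

Raising by default keeps the current behaviour (an error on string features), and dropping columns only happens when the caller explicitly asks for it.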