fetch_openml: Add an option to ignore some features, especially STRING type #11819

jnothman · 2018-08-15T08:03:42Z

Users should be able to specify names of features to ignore in fetch_openml, or should be able to specify that it is safe to discard features with STRING type. This will avoid attempting to convert them to float64.

Or we can not do this, and instead just support returning DataFrames as in #11818.
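To make the two proposed options concrete, here is a minimal sketch of the selection step. It is not the fetch_openml implementation: `select_features`, `ignore_names`, and `ignore_string` are hypothetical names, and the metadata dicts are hand-written and only loosely modelled on OpenML's feature descriptions.

```python
def select_features(features, ignore_names=(), ignore_string=False):
    """Return the names of the features to keep.

    `features` is a list of dicts with "name" and "data_type" keys.
    This layout is an assumption made for illustration.
    """
    ignored = set(ignore_names)
    kept = []
    for feature in features:
        if feature["name"] in ignored:
            continue  # explicitly ignored by the user
        if ignore_string and feature["data_type"] == "string":
            continue  # STRING features would not convert to float64
        kept.append(feature["name"])
    return kept


# Hand-written metadata resembling a few Titanic columns (illustrative only).
features = [
    {"name": "pclass", "data_type": "numeric"},
    {"name": "name", "data_type": "string"},
    {"name": "sex", "data_type": "nominal"},
    {"name": "ticket", "data_type": "string"},
]

print(select_features(features, ignore_string=True))
print(select_features(features, ignore_names=["ticket"]))
```

The first call keeps only the numeric and nominal columns; the second drops a single column by name and leaves the other string column in place.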

jorisvandenbossche · 2018-08-17T16:27:25Z

Just thinking aloud. Another option might be to have an option to specify that the output should be a numerical array? Which would mean combination of encoding nominal + dropping string features.

amueller · 2018-08-17T18:12:26Z

String types are likely categorical data, right?

jorisvandenbossche · 2018-08-18T07:51:22Z

No, so openml distinguishes 'nominal' data and 'string' data. Nominal is categorical data, string data are just strings (can be all unique, varying, ..).
Currently (as merged in master), nominal strings are converted to integers (to have a numerical array as output), and if there are string data it raises an error (which basically means you cannot download such data right now).
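The behaviour described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in master: `encode_columns` is a hypothetical helper, and nominal categories are numbered in sorted order here, whereas the real loader would presumably follow the category order declared by the dataset.

```python
def encode_columns(columns, data_types):
    """Convert columns to lists of floats, mimicking the described behaviour.

    columns: dict mapping feature name -> list of raw values
    data_types: dict mapping feature name -> "numeric", "nominal" or "string"
    """
    encoded = {}
    for name, values in columns.items():
        data_type = data_types[name]
        if data_type == "numeric":
            encoded[name] = [float(v) for v in values]
        elif data_type == "nominal":
            # Assumption: categories are numbered in sorted order.
            codes = {c: i for i, c in enumerate(sorted(set(values)))}
            encoded[name] = [float(codes[v]) for v in values]
        elif data_type == "string":
            # Free-form strings have no meaningful numeric encoding.
            raise ValueError(f"STRING feature {name!r} is not supported")
        else:
            raise ValueError(f"Unknown data type {data_type!r} for {name!r}")
    return encoded


columns = {"age": [29, 2], "sex": ["female", "male"]}
data_types = {"age": "numeric", "sex": "nominal"}
print(encode_columns(columns, data_types))
```

Adding a column such as `"name"` with type `"string"` makes the call raise `ValueError`, which is the situation that blocks the Titanic dataset.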

amueller · 2018-08-20T22:20:30Z

ah, that's fine, there's no string data on OpenML afaik because the backend doesn't support it last time I checked...

amueller · 2018-08-20T22:20:52Z

@janvanrijn and @joaquinvanschoren can tell me if I'm wrong ;)

jorisvandenbossche · 2018-08-20T22:27:44Z

There is .. It is the reason that currently we cannot load the Titanic dataset: #11419 (review) (https://www.openml.org/d/40945)

janvanrijn · 2018-08-21T00:42:23Z

> @janvanrijn and @joaquinvanschoren can tell me if I'm wrong ;)

The OpenML backend supports string features without problems. In contrast, openml-python (https://github.com/openml/openml-python) has problems with them (and throws an error if the dataset contains a string feature).

We should probably add this to the fetch_openml fn too.

jnothman · 2018-08-21T01:07:31Z

fetch_openml does currently throw an appropriate error in case of strings.

amueller · 2018-11-27T17:33:33Z

Btw this means we can't load titanic :-/ (fetch_openml("titanic") yields a small subset of the original features unfortunately). cc @janvanrijn ;)

amueller · 2018-11-27T17:39:55Z

I think adding a boolean option to ignore strings as a quick fix would be good.

oanise93 · 2019-02-15T04:21:45Z

Hi, I'm new to contributing to open-source projects but wondered if I could take on this issue. I use fetch_openml a lot and would like to give back.

jnothman · 2019-02-15T06:31:15Z

Please feel free to submit a PR

attributes by providing the option to ignore said attributes. Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify whether or not they are okay loading only a subset of the data. Closes scikit-learn#11819.

jnothman · 2019-04-24T05:19:23Z

I'm moving this to 0.22 :(

amueller · 2019-07-12T15:21:17Z

I'm removing "help wanted" as I think this will be resolved by #13902
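For readers arriving later, a hedged sketch of what the DataFrame route looks like. As far as I know, `as_frame=True` is the option that #13902 added to fetch_openml (scikit-learn 0.22 or later, with pandas installed). The helper names `load_openml_frame` and `numeric_and_categorical` are hypothetical, and the dtypes that string columns end up with may vary between versions, so the filter keeps numeric and categorical columns rather than excluding a specific string dtype.

```python
def numeric_and_categorical(frame):
    """Keep only numeric and categorical columns of a pandas DataFrame."""
    return frame.select_dtypes(include=["number", "category"])


def load_openml_frame(data_id, keep_strings=False):
    """Fetch an OpenML dataset as a DataFrame, optionally dropping strings."""
    # Imported lazily so that defining this helper does not require sklearn.
    from sklearn.datasets import fetch_openml

    X, y = fetch_openml(data_id=data_id, as_frame=True, return_X_y=True)
    if not keep_strings:
        X = numeric_and_categorical(X)
    return X, y


# Example call (needs network access), using the Titanic id linked above:
# X, y = load_openml_frame(40945)
```

With a DataFrame, string columns can simply be kept as they are, which is why no dedicated "ignore" option is needed on this path.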

janvanrijn · 2019-07-12T15:25:29Z

This issue fell off my radar. Will it be integrated in the next release? If so, is there anything I can do to help?

amueller · 2019-07-12T15:46:49Z

@janvanrijn I think it's mostly done, thanks though!

janvanrijn · 2019-07-12T16:14:11Z

Feel free to ping me during the sprints if there is anything related to the OpenML fetcher that you want my quick input or code contribution on.

jnothman added the Easy, Enhancement, and help wanted labels Aug 15, 2018

jorisvandenbossche mentioned this issue Aug 17, 2018

API for returning datasets as DataFrames #10733

Closed

oanise93 mentioned this issue Feb 16, 2019

[MRG] Adds the ability load datasets from OpenML containing string attributes #13177

Closed

jnothman added this to the 0.21 milestone Apr 16, 2019

jnothman modified the milestones: 0.21, 0.22 Apr 24, 2019

thomasjpfan mentioned this issue May 18, 2019

[MRG] Adds fetch_openml pandas dataframe support #13902

Merged

amueller removed the help wanted label Jul 12, 2019

glemaitre closed this as completed in #13902 Jul 12, 2019
