[MRG] Adds the ability to load datasets from OpenML containing string attributes #13177
Conversation
Adds the ability to load datasets from OpenML containing string attributes by providing the option to ignore said attributes. Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify whether or not they are okay with loading only a subset of the data. Closes scikit-learn#11819.
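Under this change, fetching such a dataset would look roughly like the sketch below; `ignore_strings` is the parameter proposed in this PR, and the exact call signature may differ:

```python
from sklearn.datasets import fetch_openml

# Proposed here: load the Titanic dataset, dropping its string
# attributes (with a warning) instead of raising an error.
titanic = fetch_openml('titanic', version=1, ignore_strings=True)
```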
Is there anything I need to do to push this forward, @glemaitre?
I'll review this today
@jorisvandenbossche Could you have a look at this?
```diff
@@ -242,11 +242,10 @@ def _convert_arff_data(arff_data, col_slice_x, col_slice_y, shape=None):
         count = -1
     else:
         count = shape[0] * shape[1]
-    data = np.fromiter(itertools.chain.from_iterable(arff_data),
-                       dtype='float64', count=count)
+    data = np.array(list(itertools.chain.from_iterable(arff_data)))
```
I think that this change will reverse the enhancement from #13312.
@jorisvandenbossche Do you know what would be best here? It seems we could make our own fromiter which selects the columns of X and y while iterating over the iterator. We should know the size in advance as well, shouldn't we?
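For illustration, a minimal sketch of such a column-selecting iterator; the helper name and the reshape step are assumptions, not the actual scikit-learn code:

```python
import numpy as np

def _iter_columns(arff_data, col_indices):
    # Hypothetical generator: yield only the values in the requested
    # columns, so np.fromiter never materialises the full rows.
    for row in arff_data:
        for i in col_indices:
            yield row[i]

# With n_rows known from the dataset metadata, X could be built as:
# X = np.fromiter(_iter_columns(arff_data, col_slice_x), dtype='float64',
#                 count=n_rows * len(col_slice_x)).reshape(n_rows, -1)
```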
Yes, this reverses the recent enhancement to use fromiter. Thinking about it, this change becomes somewhat messier with fromiter, and I'm not sure what the right solution is. The purpose of fromiter was to avoid materialising all the Python objects, which was consuming a lot of temporary memory. But there are other ways to chunk the reading, I suppose.
The bigger problem here is that we're temporarily converting all numeric values to strings, which undoes the conversion work in the arff library.
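As a rough sketch of one such chunking approach (the helper is hypothetical, not the scikit-learn implementation), reading in fixed-size chunks keeps the fromiter memory benefit even when the total count is unknown:

```python
import itertools
import numpy as np

def _chunked_fromiter(values, n_columns, chunk_rows=1024):
    # Hypothetical helper: consume the flat value iterator chunk by
    # chunk so the full list of Python objects is never held at once.
    chunks = []
    while True:
        chunk = np.fromiter(itertools.islice(values, chunk_rows * n_columns),
                            dtype='float64')
        if chunk.size == 0:
            break
        chunks.append(chunk)
    if not chunks:
        return np.empty((0, n_columns))
    return np.concatenate(chunks).reshape(-1, n_columns)

# e.g. data = _chunked_fromiter(
#     itertools.chain.from_iterable(arff_data), n_columns=shape[1])
```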
I suspect that we will need to treat string columns separately before this, with an operation like:

```python
string_col_idxs = ...  # a list of string column indices
if string_col_idxs:
    # Build slices covering the runs of numeric columns between the
    # string columns; each slice starts just past a string column.
    starts = [None] + [i + 1 for i in string_col_idxs]
    stops = string_col_idxs + [None]
    numeric_slices = [slice(start, stop)
                      for start, stop in zip(starts, stops)]
    arff_data = (row[sl] for row in arff_data for sl in numeric_slices)
```
Though this may all be essentially replicating functionality in Pandas, so if we just supported pandas output, we could avoid this.
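For comparison, a rough sketch of the pandas route; the data and attribute names below are illustrative stand-ins for the parsed ARFF content:

```python
import pandas as pd

# Illustrative stand-ins for the parsed ARFF rows and attribute names.
attribute_names = ['age', 'fare', 'name']
arff_data = [[22.0, 7.25, 'Braund'], [38.0, 71.28, 'Cumings']]

# Let pandas hold the mixed-type data, then split off the numeric and
# string columns instead of slicing rows by hand.
df = pd.DataFrame(arff_data, columns=attribute_names)
numeric = df.select_dtypes(include='number')   # age, fare
strings = df.select_dtypes(include='object')   # name
```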
It turns out this may be trickier than we'd thought.
Going to close in light of the work in #13902
Thanks for your efforts @oanise93, and sorry the issue was a bit understated.
Reference Issues/PRs
Closes #11819
What does this implement/fix? Explain your changes.
Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify whether or not they are okay with loading only a subset of the data through a boolean parameter `ignore_strings`.

Any other comments?
I added a warning so that when a user sets `ignore_strings` to `True`, they will be informed of which attributes will be excluded. I'm not sure if this is the best way to make sure users know that they might be getting a subset of the data that is on OpenML.
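For reference, a minimal sketch of how such a warning could be emitted; the helper and the shape of the feature metadata are assumptions, not the actual patch:

```python
import warnings

def _drop_string_attributes(features):
    # Hypothetical helper: `features` is a list of (name, data_type)
    # pairs from the OpenML metadata; warn about the dropped columns.
    ignored = [name for name, dtype in features if dtype == 'string']
    if ignored:
        warnings.warn("ignore_strings=True: excluding string attributes "
                      "%s from the returned data." % ignored)
    return [(name, dtype) for name, dtype in features if dtype != 'string']
```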