
[MRG] Adds the ability to load datasets from OpenML containing string attributes #13177


Closed
wants to merge 1 commit into from

Conversation

oanise93

Reference Issues/PRs

Closes #11819

What does this implement/fix? Explain your changes.

Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify, via a boolean parameter ignore_strings, whether they are okay with loading only a subset of the data.

Any other comments?

I added a warning so that when a user sets ignore_strings to True, they are informed of which attributes will be excluded. I'm not sure whether this is the best way to make sure users know that they might be getting only a subset of the data that is on OpenML.
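As a rough illustration of the behaviour described above, here is a minimal sketch of filtering string attributes and warning about them. The helper name and the attribute-list shape are assumptions for illustration, not the actual code in this PR:

```python
import warnings

def drop_string_attributes(attributes):
    """Hypothetical helper sketching the ignore_strings behaviour:
    keep non-string ARFF attributes and warn about the dropped ones.

    ``attributes`` is assumed to be a list of ``(name, type)`` pairs,
    e.g. ``[('age', 'NUMERIC'), ('name', 'STRING')]``.
    """
    kept = [(name, typ) for name, typ in attributes if typ != 'STRING']
    dropped = [name for name, typ in attributes if typ == 'STRING']
    if dropped:
        # Inform the user which columns they are losing
        warnings.warn("ignore_strings=True: excluding string attributes "
                      + ", ".join(dropped))
    return kept

attrs = [('age', 'NUMERIC'), ('name', 'STRING'), ('fare', 'NUMERIC')]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    kept = drop_string_attributes(attrs)
# kept == [('age', 'NUMERIC'), ('fare', 'NUMERIC')], one warning naming 'name'
```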

@oanise93 oanise93 force-pushed the ignore_strings branch 5 times, most recently from bb390fe to b437471 Compare March 9, 2019 05:36
Adds the ability to load datasets from OpenML containing string attributes by providing the option to ignore said attributes.

Right now, an error is raised when a dataset containing string
attributes (e.g., the Titanic dataset) is fetched from OpenML.
This commit allows users to specify whether or not they are okay loading
only a subset of the data. Closes scikit-learn#11819.
@oanise93
Author

Is there anything I need to do, @glemaitre, to push this forward?

@glemaitre
Member

I'll review this today

@glemaitre glemaitre self-requested a review April 12, 2019 08:02
Member

@glemaitre glemaitre left a comment


@jorisvandenbossche Could you have a look at this?

@@ -242,11 +242,10 @@ def _convert_arff_data(arff_data, col_slice_x, col_slice_y, shape=None):
         count = -1
     else:
         count = shape[0] * shape[1]
-    data = np.fromiter(itertools.chain.from_iterable(arff_data),
-                       dtype='float64', count=count)
+    data = np.array(list(itertools.chain.from_iterable(arff_data)))
Member


I think that this change will reverse the enhancement from #13312.
@jorisvandenbossche Do you know what would be best here? It seems we would need our own fromiter that selects the columns of X and y while iterating over the iterator. We should know the size in advance as well, shouldn't we?

Member


Yes, this reverses the recent enhancement to use fromiter. Thinking about it, this change is somewhat messier with fromiter, and I'm not sure what the right solution is. The purpose of fromiter was to avoid materialising all the Python objects, which was consuming a lot of temporary memory. But there are other ways to chunk the reading, I suppose.

The bigger problem here is that we're temporarily converting all numeric values to strings, which undoes the conversion work in the arff library.
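To illustrate the trade-off being discussed: a toy comparison of the two approaches (stand-in data, not the actual fetch_openml code). The fromiter path fills a preallocated float64 buffer directly; the np.array(list(...)) path materialises every value as a Python object first:

```python
import itertools
import numpy as np

rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # stand-in for the ARFF row iterator
shape = (3, 2)
count = shape[0] * shape[1]

# np.fromiter consumes the chained iterator element by element into a
# preallocated float64 buffer, so no temporary list is built.
data = np.fromiter(itertools.chain.from_iterable(rows),
                   dtype='float64', count=count).reshape(shape)

# np.array(list(...)) first builds the full Python list of values, which is
# the temporary memory cost the fromiter enhancement was meant to avoid.
data_eager = np.array(list(itertools.chain.from_iterable(rows))).reshape(shape)
```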

Member


I suspect that we will need to treat string columns separately before this, with an operation like:

string_col_idxs = ...  # a list of string column indices
if string_col_idxs:
    # slice around the string columns, starting one past each string column
    # so the string values themselves are excluded
    numeric_slices = [slice(start, stop)
                      for start, stop in zip([None] + [i + 1 for i in string_col_idxs],
                                             string_col_idxs + [None])]
    arff_data = (row[sl] for row in arff_data for sl in numeric_slices)

Though this may all be essentially replicating functionality in Pandas, so if we just supported pandas output, we could avoid this.
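A runnable toy version of that slicing idea, with the slice starts shifted one past each string column so the string values themselves are dropped (the data and column index here are invented for illustration):

```python
rows = [[1.0, 'a', 2.0, 3.0],
        [4.0, 'b', 5.0, 6.0]]
string_col_idxs = [1]  # column 1 holds strings in this toy example

# Pair each slice start (one past a string column, or the row start) with the
# next string column index (or the row end): here slice(None, 1), slice(2, None).
numeric_slices = [slice(start, stop)
                  for start, stop in zip([None] + [i + 1 for i in string_col_idxs],
                                         string_col_idxs + [None])]

# Chain the numeric fragments of every row, skipping the string columns
flat = [value for row in rows for sl in numeric_slices for value in row[sl]]
# flat == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```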

Member

@jnothman jnothman left a comment


It turns out this may be trickier than we'd thought.


@oanise93
Author

Going to close in light of the work in #13902

@oanise93 oanise93 closed this May 18, 2019
@jnothman
Member

Thanks for your efforts @oanise93, and sorry the issue was a bit understated.

Development

Successfully merging this pull request may close these issues.

fetch_openml: Add an option to ignore some features, especially STRING type