[MRG] Adds the ability to load datasets from OpenML containing string attributes #13177
Conversation
Adds the ability to load datasets from OpenML containing string attributes by providing the option to ignore said attributes. Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify whether or not they are okay with loading only a subset of the data. Closes scikit-learn#11819.
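Under this change, fetching such a dataset would look roughly like the sketch below; `ignore_strings` is the parameter proposed in this PR, and the exact call signature may differ:

```python
from sklearn.datasets import fetch_openml

# Proposed here: load the Titanic dataset, dropping its string
# attributes (with a warning) instead of raising an error.
titanic = fetch_openml('titanic', version=1, ignore_strings=True)
```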
Is there anything I need to do to push this forward, @glemaitre?
I'll review this today
@jorisvandenbossche Could you have a look at this?
```diff
@@ -242,11 +242,10 @@ def _convert_arff_data(arff_data, col_slice_x, col_slice_y, shape=None):
         count = -1
     else:
         count = shape[0] * shape[1]
-    data = np.fromiter(itertools.chain.from_iterable(arff_data),
-                       dtype='float64', count=count)
+    data = np.array(list(itertools.chain.from_iterable(arff_data)))
```
I think that this change will reverse the enhancement from #13312.
@jorisvandenbossche Do you know what would be best here? It seems we could make our own fromiter which selects the columns of X and y while iterating over the iterator. We should know the size in advance as well, shouldn't we?
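For illustration, a minimal sketch of such a column-selecting iterator; the helper name and the reshape step are assumptions, not the actual scikit-learn code:

```python
import numpy as np

def _iter_columns(arff_data, col_indices):
    # Hypothetical generator: yield only the values in the requested
    # columns, so np.fromiter never materialises the full rows.
    for row in arff_data:
        for i in col_indices:
            yield row[i]

# With n_rows known from the dataset metadata, X could be built as:
# X = np.fromiter(_iter_columns(arff_data, col_slice_x), dtype='float64',
#                 count=n_rows * len(col_slice_x)).reshape(n_rows, -1)
```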
Yes, this reverses the recent enhancement to use fromiter. Thinking about it, this change becomes somewhat messier with fromiter, and I'm not sure what the right solution is. The purpose of fromiter was to avoid materialising all the Python objects, which was consuming a lot of temporary memory. But there are other ways to chunk the reading, I suppose.
The bigger problem here is that we're temporarily converting all numeric values to strings, which undoes the conversion work in the arff library.
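As a rough sketch of one such chunking approach (the helper is hypothetical, not the scikit-learn implementation), reading in fixed-size chunks keeps the fromiter memory benefit even when the total count is unknown:

```python
import itertools
import numpy as np

def _chunked_fromiter(values, n_columns, chunk_rows=1024):
    # Hypothetical helper: consume the flat value iterator chunk by
    # chunk so the full list of Python objects is never held at once.
    chunks = []
    while True:
        chunk = np.fromiter(itertools.islice(values, chunk_rows * n_columns),
                            dtype='float64')
        if chunk.size == 0:
            break
        chunks.append(chunk)
    if not chunks:
        return np.empty((0, n_columns))
    return np.concatenate(chunks).reshape(-1, n_columns)

# e.g. data = _chunked_fromiter(
#     itertools.chain.from_iterable(arff_data), n_columns=shape[1])
```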
I suspect that we will need to treat string columns separately before this, with an operation like:

```python
string_col_idxs = ...  # a list of string column indices
if string_col_idxs:
    # Build slices covering the runs of numeric columns between the
    # string columns; each slice starts just past a string column.
    starts = [None] + [i + 1 for i in string_col_idxs]
    stops = string_col_idxs + [None]
    numeric_slices = [slice(start, stop)
                      for start, stop in zip(starts, stops)]
    arff_data = (row[sl] for row in arff_data for sl in numeric_slices)
```
Though this may all be essentially replicating functionality in Pandas, so if we just supported pandas output, we could avoid this.
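For comparison, a rough sketch of the pandas route; the data and attribute names below are illustrative stand-ins for the parsed ARFF content:

```python
import pandas as pd

# Illustrative stand-ins for the parsed ARFF rows and attribute names.
attribute_names = ['age', 'fare', 'name']
arff_data = [[22.0, 7.25, 'Braund'], [38.0, 71.28, 'Cumings']]

# Let pandas hold the mixed-type data, then split off the numeric and
# string columns instead of slicing rows by hand.
df = pd.DataFrame(arff_data, columns=attribute_names)
numeric = df.select_dtypes(include='number')   # age, fare
strings = df.select_dtypes(include='object')   # name
```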
It turns out this may be trickier than we'd thought.
Going to close in light of the work in #13902
Thanks for your efforts @oanise93, and sorry the issue was a bit understated.
Reference Issues/PRs
Closes #11819
What does this implement/fix? Explain your changes.
Right now, an error is raised when a dataset containing string attributes (e.g., the Titanic dataset) is fetched from OpenML. This commit allows users to specify whether or not they are okay with loading only a subset of the data through a boolean parameter `ignore_strings`.

Any other comments?
I added a warning so that when a user sets `ignore_strings` to `True`, they will be informed of which attributes will be excluded. I'm not sure if this is the best way to make sure users know that they might be getting a subset of the data that is on OpenML.
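For reference, a minimal sketch of how such a warning could be emitted; the helper and the shape of the feature metadata are assumptions, not the actual patch:

```python
import warnings

def _drop_string_attributes(features):
    # Hypothetical helper: `features` is a list of (name, data_type)
    # pairs from the OpenML metadata; warn about the dropped columns.
    ignored = [name for name, dtype in features if dtype == 'string']
    if ignored:
        warnings.warn("ignore_strings=True: excluding string attributes "
                      "%s from the returned data." % ignored)
    return [(name, dtype) for name, dtype in features if dtype != 'string']
```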