[MRG] Openml loader #9908
Conversation
Looking at the flake8 errors, there seem to be genuine problems, e.g.:
Not familiar at all with openml, do you have any outputs you can suggest for
this is not ready for review yet ;) I wanted to share in case @vrishank97 was working on it. Give me like 3 more hours maybe, I'm running between meetings and sprints.
you can try "iris" for a
hm not sure how to catch the error on python2. Seems there are different errors thrown on python2 and python3.
Hm I thought by default returning an active dataset might be a good idea, but then the caching kind of interferes with that. But either way, the caching might interfere with warning people about deactivated datasets. Not sure what to do about it. We could invalidate the cache after a certain amount of time or only use it if there's no internet connection (at least for the meta-data, which can change).
The handling of versions will be much easier if there's another filter added to openml, which will be done hopefully today or tomorrow.
I've also made some progress on this issue, should I make a separate PR or work on this one? And what are the benefits of using arff over csv with json?
sure, can you send a PR so we can merge efforts? It would have been great if you'd sent a PR earlier so we had less duplicate work. I'd really like to get this in some reasonable state this week.
I'm not entirely sure what to do for tests. We have done some mocking for mldata, but here the requests are somewhat more complex. I could do the mocking for all of them if you think that's necessary.
I'm also not sure what would be the best datatype to return for mixed data. If there are strings in there, we can either use a recarray, or cast everything to string, or make it object. By default things become strings, which doesn't seem great :-/
I don't know if that helps, but for example:

```python
from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99()
dataset.data
```

```
array([[0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       ...,
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0]], dtype=object)
```
thanks @albertcthomas. The question was more how to detect that easily, but I could pull it out of the arff data.
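For reference, one way to detect mixed column types is to inspect the dtype kinds of the structured array that `np.genfromtxt(..., dtype=None)` returns — a minimal sketch, with made-up field names and values:

```python
import numpy as np

# toy structured array, similar in shape to what np.genfromtxt(dtype=None)
# would produce for a mixed numeric/string CSV
data = np.array([(0, b'tcp', 0.1), (1, b'http', 0.2)],
                dtype=[('f0', 'i8'), ('f1', 'S4'), ('f2', 'f8')])

# inspect the dtype kind of each field to see whether the columns are mixed
kinds = {data.dtype[name].kind for name in data.dtype.names}
is_mixed = len(kinds) > 1
print(sorted(kinds), is_mixed)  # ['S', 'f', 'i'] True
```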
@lesteve now you can have a look ;) no tests yet, though, and the version stuff is not supported yet. Will bug Jan to make it easier.
maybe miceprotein is not good and I should use adult instead... (miceprotein is trivial to classify)
or you can encode the categorical strings as numbers and report the mapping. In the scikit-learn of a few years ago, I'm sure this is what we would do.
@jnothman this is actually what the openml-python lib does. Right now I opted to convert to object dtype here. I like the dataset fetcher to not do any preprocessing, as that better reflects what people will encounter in "the real world".
Current test failures in this PR are HTTP 404 errors, though.
@jnothman Indeed. I opened an issue about that openml/OpenML#628 |
I agree, @joaquinvanschoren, that this is pretty high priority.
It looks like we can't use numpy's genfromtxt reliably...
```
target_column : string or None, default 'default-target'
    Specify the column name in the data to use as target. If
    'default-target', the standard target column a stored on the server
```
a stored -> as stored
```
    Specify the column name in the data to use as target. If
    'default-target', the standard target column a stored on the server
    is used. If ``None``, all columns are returned as data and the
    tharget is ``None``.
```
*target
```
    is used. If ``None``, all columns are returned as data and the
    tharget is ``None``.
```
```
memory : boolean, default=True
```
We should probably support that case and reference :term:`memory`
```python
data_description = _get_data_description_by_id_(data_id)
if data_description['status'] != "active":
    warn("Version {} of dataset {} is inactive, meaning that issues have"
```
Well, perhaps we should only say "issues have been found in the dataset" if 'inactive'.
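A sketch of the wording change that suggestion implies — only claim that issues were found when the dataset really is deactivated. The status strings and the `data_description` layout here are assumptions about the OpenML metadata, not verified against the API:

```python
import warnings

def _warn_on_status(data_description, version, name):
    # 'deactivated' means issues were found; anything else that is not
    # 'active' gets a generic note (status values are assumptions)
    status = data_description['status']
    if status == 'deactivated':
        warnings.warn("Version {} of dataset {} is deactivated, meaning that "
                      "issues have been found in the dataset.".format(version, name))
    elif status != 'active':
        warnings.warn("Version {} of dataset {} is inactive.".format(version, name))

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    _warn_on_status({'status': 'deactivated'}, 4, 'miceprotein')
print(str(caught[0].message))
```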
```python
def _download_data_csv(file_id):
    response = urlopen("https://openml.org/data/v1/get_csv/{}".format(file_id))
    data = np.genfromtxt(response, names=True, dtype=None, delimiter=',',
                         missing_values='?')
```
what happens to missing value in output??
The missing value handling seems a bit weird: if all values in a column are ints (signed or unsigned), the missing value is -1!!; if one is a float, the missing value is np.nan; if strings, the missing value stays as '?'.
Actually, no, even with usemask=True and missing_values='?' it seems -1 is detected as a missing value!!! I don't trust this...
```python
np.genfromtxt(io.BytesIO(b'a,b\n-1,?\n2,b\n3,c'), names=True, dtype=None,
              delimiter=',', missing_values='?', usemask=True)
```
```python
    dtype = None
else:
    dtype = object
X = np.array([data[c] for c in data_columns], dtype=dtype).T
```
If everything is of the same dtype, it is possible to view the recarray as a scalar-type ndarray without copying like this.
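For illustration, a minimal sketch of that zero-copy view when every field shares one dtype (the array contents are made up):

```python
import numpy as np

# structured array whose fields are all float64
data = np.array([(1.0, 2.0), (3.0, 4.0)],
                dtype=[('a', 'f8'), ('b', 'f8')])

# view the homogeneous structured array as a plain 2-D ndarray, no copy
X = data.view(np.float64).reshape(len(data), -1)
print(X.shape)                     # (2, 2)
print(np.shares_memory(X, data))   # True
```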
We need to support reading Sparse ARFF-formatted CSV... or reject sparse datasets
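For context, a sparse ARFF data row lists only the non-zero entries as `{index value, ...}` pairs. A toy parser for numeric-only rows, just to show the shape of the format (not a full ARFF reader):

```python
def parse_sparse_row(line, n_cols):
    # entries not listed in a sparse ARFF row are implicitly zero
    row = [0.0] * n_cols
    body = line.strip().lstrip('{').rstrip('}').strip()
    if body:
        for pair in body.split(','):
            idx, val = pair.split()
            row[int(idx)] = float(val)
    return row

print(parse_sparse_row('{0 1.5, 3 2.0}', 5))  # [1.5, 0.0, 0.0, 2.0, 0.0]
```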
Given that I have reasons to doubt
To be honest I'm not sure how quickly we'll have that sparse CSV export implemented (or whether you can even read it without adding extra dependencies?). I'd go ahead and either:
Given all the issues with
@amueller can we get someone else to finish this?
I think we need to exploit an existing, tested ARFF reader for sparse and dense. I don't see why we should add a dependency for users, when the package is small. I also would like to see this fetcher in the next release and don't see why we should go to the effort of rolling our own in a first release.
@mfeurer: you have been working a lot with/on liac_arff. What is your opinion?
Unless I am much mistaken, I think we can solve this for now with liac-arff, and support the sparse and dense cases seamlessly. As long as we also exploit any typing information provided by openml as metadata, I would think we can make the interface quite stable, and the fact that we're using liac-arff internally will remain an implementation detail. I'm keen to have this finished and released.
Sorry @joaquinvanschoren, I didn't receive an email for this. Could you please let me know what exactly you want my opinion on. A generic answer: arff has a very bad format specification and many undocumented edge cases. To the best of my knowledge, liac-arff has better support for those than the scipy arff reader, but is also a lot slower. It's not actively developed and I try to devote some time to maintaining it. If liac-arff gets broader attention through its usage in scikit-learn, that would be amazing. @jnothman do you want to integrate liac-arff the same way you're using joblib?
that's my suggestion. openml seems to be relying on arff, at least for now. I think liac-arff is our best shot.
It was probably discussed above, but why not simply download and read in the csv file?
Ah, OK, reading myself :-) So the problem there is the sparse datasets.
The problem with supporting CSV even in the dense case is that np.genfromtxt is troublesome, and we regard pd.read_csv a forbidden dependency.
Regardless of how it is read in, IMO not returning pandas DataFrames for the dense datasets from this functionality would also make it a lot less functional ... (but that's probably for another discussion ;-))
BTW, I wanted to give this another try (although it will probably need to be reworked quite a bit to use liac-arff), but the current version clearly has a bug in that it simply drops non-numeric columns in the `.data` array. For example with `mice = fetch_openml('miceprotein', version=4)`, `mice.data` is missing 4 columns (and there are 4 nominal features).

One of the problems with returning an object array in case of mixed dtypes is that if you then simply convert it to a dataframe with `pd.DataFrame(mice.data, columns=mice.feature_names)`, you still have object dtype. E.g.:

```python
In [24]: pd.DataFrame(np.array([[1, 'a'], [2, 'b']], dtype=object)).dtypes
Out[24]:
0    object
1    object
dtype: object
```

You then would need to do `.infer_objects()` to convert the numerical features to actual numerical columns, which is a bit annoying for the user to have to do (and also inefficient).

I understand relying fully on pandas here is not fully desirable, but would it be possible to add an option to return a DataFrame?
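A small sketch of the `.infer_objects()` round-trip described above (column names are illustrative):

```python
import numpy as np
import pandas as pd

# mixed-type object array, like the fetcher would return for mixed data
data = np.array([[1, 'a'], [2, 'b']], dtype=object)
df = pd.DataFrame(data, columns=['num', 'cat'])
print(df.dtypes.tolist())       # both columns come out as object dtype

fixed = df.infer_objects()      # re-infer a proper dtype per column
print(fixed['num'].dtype)       # the numeric column becomes int64
```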
Yes. See also #10733
ok fine with liac-arff vendoring, I'll let @janvanrijn take this over.
yay!
Closed for #11419
Fixes #9543.
Todo: