-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG] Openml data loader #11419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] Openml data loader #11419
Conversation
But does it need to be the default? |
I am happy @jorisvandenbossche joined in on this, because I don't have much to say about it :) |
I feel very uncomfortable that we are blocking a release for these decisions, because we need such a loader to simply facilitate examples. I wish we had resolved this much earlier. There are a few of us who wish we could make dataframes and heterogeneous data first-class citizens, but it's a big change and we're not there yet. I look forward to being able to return a DataFrame (I don't think I think we can add a I hope you come to agree with me that this is the best we can do for now in terms of compatibility and extensibility, while supporting our immediate needs. But please do raise alternatives (and required extra features before 0.20 final) if you have them. |
In seeing the enormous difference between 'titanic' version 1 and 2, I think we need to require version when fetching by name... |
Instead we can have an issue after merge: Raise a warning along the lines of "Multiple dataset versions match the name BLAH. Versions may be very different. Getting version x." |
I think this is acceptable as is, and I would like to merge this. Can I hear +1/-1 for merge? I'll also prepare a deprecation of |
A problem: scikit-learn doesn't support string targets for multilabel data. So I can't get the yeast example to work... Do we therefore also need to encode targets, or only when there are multiple? :( |
I get a few DeprecationWarnings at import with Python 3.6, In [1]: from sklearn.datasets import fetch_openml
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n , ' (truncated)
''' % {'value_re': value_re})
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n , ' (truncated)
''' % {'value_re': value_re})
sklearn/externals/_arff.py:204: DeprecationWarning: Flags not at the start of the expression '(?x)\n , ' (truncated)
''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n (?:^\\s*' (truncated)
''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n (?:^\\s*' (truncated)
''' % {'value_re': value_re})
sklearn/externals/_arff.py:219: DeprecationWarning: Flags not at the start of the expression '(?x)\n (?:^\\s*' (truncated)
''' % {'value_re': value_re}) In the datasets I tried that are currently used in examples, titanic v0 works, v1 raises exception about unsupported string field (as discussed above), "minst_748", "shuttle", "iris", "leukemia" works. For "yeast" I get,
is this just the first label? Overall I share the sentiment of #11419 (comment), some things are not currently not that great (e.g. default nominal encoding, sparse datasets loading is very slow etc) but this is a good basis, it's marked as experimental, and we need this to fix broken examples CI. We can improve various points in subsequent PRs after the examples are migrated, even if it means changing the API a bit. +1 for merge. |
That's not the same yeast. Need version=4. Will look into the liac-arff
issue.
|
OK, the version parameter can really change the output completely... Maybe leave the multilabel targets as they are, and do the manual encoding in the single example where we need them for now? I am also not so keen on doing default encoding if we can avoid it... |
If the targets all have value {FALSE, TRUE} then it's fairly clear we
should encode it. The question is how stable that convention is in OpenML.
|
Should we error if no version is specified and there are multiple? Or
simply warn?
|
This, and examples based off it, are passing in all tested platforms. Merge? |
Great! Someone needs to press the green button.
Fair enough, also running some detection on targets is probably not too expensive.
+1 for warn (in subsequent PR) |
It's negligible as the {FALSE, TRUE} information is in metadata. |
I'd like to squash and merge, but I don't know who GitHub will attribute the commit to! I'd like it to be @janvanrijn. I can go squash and merge manually... |
According to someone somewhere, "GitHub takes the info of the PR author". Let's do it. |
Thank you @janvanrijn!! |
Wonderful! Thanks @janvanrijn and also @jnothman for doing major fixes on the fly in related projects ! |
Thanks guys, was a nice experience to work with you :) See you all in Paris this year! |
Reference Issues/PRs
Fixes #9543 Fixes #9908
This is the work I performed on issue #9543,
building upon the work that @amueller wants to merge in #9908
What does this implement/fix? Explain your changes.
Adds an OpenMl data loader
Any other comments?
Builds upon the work of @amueller. the latest version of liac-arff package is packed in externals and the unit tests are more extensive.
TO DO AFTER MERGE:
Y
consists of strings. We should maybe identify multilabel data, at least when its values are all {FALSE, TRUE} and encode it to {0, 1}.fetch_openml
in examples in place of fetch_mldatareturn_X_y
load_arff
function?