
[MRG] Openml loader #9908


Closed

wants to merge 32 commits into from

Conversation

@amueller (Member) commented Oct 11, 2017

Fixes #9543.

Todo:

  • docstrings
  • userguide
  • tests
  • handle data set versions
  • caching

@lesteve (Member) commented Oct 11, 2017

Looking at the flake8 errors, there seem to be genuine problems, e.g.:

./sklearn/datasets/openml.py:81:15: F821 undefined name 'urlretrieve'
    json_dl = urlretrieve(json_loc.format(name))[0]
              ^
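A likely fix for that F821 (a sketch, not necessarily the PR's eventual code): urlretrieve moved between Python 2 and 3, so it needs a version-compatible import. The URL template below is hypothetical, just mirroring the flagged line.

```python
# urlretrieve lives in urllib.request on Python 3 and urllib on Python 2,
# so the module needs an explicit, version-compatible import.
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# Hypothetical usage mirroring the flagged line (network call left commented):
json_loc = "https://openml.org/api/v1/json/data/list/data_name/{}"
# json_dl = urlretrieve(json_loc.format("iris"))[0]
print(callable(urlretrieve))  # True
```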

Not familiar at all with openml — do you have any values you can suggest for name_or_id if we want to play with the fetcher?

@amueller (Member, Author)

This is not ready for review yet ;) I wanted to share in case @vrishank97 was working on it. Give me maybe 3 more hours; I'm running between meetings and sprints.

@amueller (Member, Author)

You can try "iris" for name_or_id.

@amueller changed the title from "Openml loader" to "[WIP] Openml loader" on Oct 11, 2017
@amueller (Member, Author)

Hm, not sure how to catch the error on Python 2. Seems there are different errors thrown on Python 2 and Python 3.

@amueller (Member, Author)

Hm, I thought returning an active dataset by default might be a good idea, but the caching kind of interferes with that. Either way, the caching might interfere with warning people about deactivated datasets. Not sure what to do about it. We could invalidate the cache after a certain amount of time, or only use it if there's no internet connection (at least for the meta-data, which can change).

@amueller (Member, Author)

The handling of versions will be much easier if there's another filter added to openml, which will be done hopefully today or tomorrow.

@vrishank97 (Contributor)

I've also made some progress on this issue; should I make a separate PR or work on this one? And what are the benefits of using ARFF over CSV with JSON?

@amueller (Member, Author)

Sure, can you send a PR so we can merge efforts? It would have been great if you'd sent a PR earlier so we had less duplicate work. I'd really like to get this into some reasonable state this week.
ARFF possibly has more meta-information; maybe not really necessary.

@amueller (Member, Author)

I'm not entirely sure what to do for tests. We have done some mocking for mldata, but here the requests are somewhat more complex. I could do the mocking for all of them if you think that's necessary.
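A sketch of what such mocking could look like (hypothetical; the fetcher internals here are made up), using the standard library's unittest.mock to fake the HTTP response so tests never touch the network:

```python
import io
from unittest import mock

# Hypothetical loader function standing in for the real fetcher internals.
def download_description(data_id):
    from urllib.request import urlopen
    return urlopen("https://openml.org/api/v1/json/data/{}".format(data_id)).read()

# In a test, replace urlopen so no network access happens.
fake_body = b'{"data_set_description": {"name": "iris"}}'
with mock.patch("urllib.request.urlopen",
                return_value=io.BytesIO(fake_body)) as fake_urlopen:
    body = download_description(61)

print(body == fake_body)        # True: the fake payload came back
print(fake_urlopen.call_count)  # 1
```

Each distinct request shape (data description, features, the data file itself) would need its own canned payload, which is the extra work compared to the mldata mocking.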

@amueller (Member, Author)

I'm also not sure what would be the best datatype to return for mixed data. If there are strings in there, we can either use a recarray, cast everything to string, or make it object. By default everything becomes strings, which doesn't seem great :-/

@albertcthomas (Contributor)

I don't know if that helps, but for fetch_kddcup99, where data also contains strings, we have dtype=object.

from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99()
dataset.data
array([[0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       ...,
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0]], dtype=object)

@amueller (Member, Author)

Thanks @albertcthomas. The question was more how to detect that easily, but I could pull it out of the ARFF data.
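One way such a check could look (a minimal sketch, not the PR's code; the column data here is made up): inspect the dtype kind of each parsed column and fall back to object only when a non-numeric column is present.

```python
import numpy as np

# Hypothetical parsed columns standing in for the ARFF/CSV payload.
columns = {
    'sepal_length': np.array([5.1, 4.9, 4.7]),
    'species': np.array(['setosa', 'setosa', 'versicolor']),
}

# Numeric dtype kinds: bool, signed/unsigned int, float.
kinds = {col.dtype.kind for col in columns.values()}
all_numeric = kinds <= {'b', 'i', 'u', 'f'}

dtype = None if all_numeric else object
X = np.array([columns[c] for c in columns], dtype=dtype).T
print(X.dtype)  # object, because the 'species' column holds strings
```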

@amueller (Member, Author)

@lesteve now you can have a look ;) No tests yet, though, and the version stuff is not supported yet. Will bug Jan to make it easier.

@amueller (Member, Author) commented Oct 12, 2017

Maybe miceprotein is not good and I should use adult instead... (miceprotein is trivial to classify)

@jnothman (Member) commented Oct 15, 2017 via email

@amueller (Member, Author)

@jnothman this is actually what the openml-python lib does. Right now I opted to convert to object dtype here. I like the dataset fetcher to not do any preprocessing, as that reflects better what people will encounter in "the real world".

@jnothman (Member)

Current test failures in this PR are HTTP 404 errors, though...

@joaquinvanschoren (Contributor)

@jnothman Indeed. I opened an issue about that: openml/OpenML#628

@jnothman (Member) left a comment

I agree, @joaquinvanschoren, that this is pretty high priority.

It looks like we can't use numpy's genfromtxt reliably...


target_column : string or None, default 'default-target'
Specify the column name in the data to use as target. If
'default-target', the standard target column a stored on the server
(Member)
a stored -> as stored

Specify the column name in the data to use as target. If
'default-target', the standard target column a stored on the server
is used. If ``None``, all columns are returned as data and the
tharget is ``None``.
(Member)

*target

is used. If ``None``, all columns are returned as data and the
tharget is ``None``.

memory : boolean, default=True
(Member)

We should probably support that case and reference :term:`memory`


data_description = _get_data_description_by_id_(data_id)
if data_description['status'] != "active":
    warn("Version {} of dataset {} is inactive, meaning that issues have"
(Member)

Well, perhaps we should only say "issues have been found in the dataset" if 'inactive'.

def _download_data_csv(file_id):
    response = urlopen("https://openml.org/data/v1/get_csv/{}".format(file_id))
    data = np.genfromtxt(response, names=True, dtype=None, delimiter=',',
                         missing_values='?')
(Member)

What happens to missing values in the output?

(Member)

The missing value handling seems a bit weird: if all values in a column are ints (signed or unsigned), the missing value is -1!!; if one is a float, the missing value is np.nan; if strings, the missing value stays as '?'.

Actually, no, even with usemask=True and missing_values='?' it seems -1 is detected as a missing value!!! I don't trust this...

np.genfromtxt(io.BytesIO(b'a,b\n-1,?\n2,b\n3,c'), names=True, dtype=None, delimiter=',', missing_values='?', usemask=True)

    dtype = None
else:
    dtype = object
X = np.array([data[c] for c in data_columns], dtype=dtype).T
(Member)

If everything is of the same dtype, it is possible to view the recarray as a scalar-type ndarray without copying like this.
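For reference, one way that no-copy view can be done (a sketch under the assumption that all fields share one scalar dtype; the array below is made up):

```python
import numpy as np

# A structured (record-like) array whose fields all share one dtype...
rec = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
               dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# ...can be viewed as a plain 2-D float array without copying the data.
X = rec.view(np.float64).reshape(len(rec), -1)
print(X.shape)                   # (2, 3)
print(np.shares_memory(X, rec))  # True: it is a view, not a copy
```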

@jnothman (Member)

We need to support reading Sparse ARFF-formatted CSV... or reject sparse datasets

@jnothman (Member)

Given that I have reasons to doubt numpy.genfromtxt and reasons to doubt openml's CSV provisions, I'm strongly in favour of investigating the quality of liac-arff and vendoring it as a separately maintained and tested reader of OpenML's most supported format.

@joaquinvanschoren (Contributor)

To be honest I'm not sure how quickly we'll have that sparse CSV export implemented (or whether you can even read it without adding extra dependencies?).

I'd go ahead and either:

  • Go the CSV route without supporting sparse data for now. It's a simple check and we can add support later.
  • Go the ARFF route with liac_arff

@jnothman (Member) commented Feb 19, 2018

Given all the issues with np.genfromtxt, I'd rather steer clear of generic CSV handling.

@jnothman (Member)

@amueller can we get someone else to finish this?

@amueller (Member, Author) commented Jun 4, 2018

@jnothman so you suggest adding a dependency on liac_arff? And you wanted to vendor it? Why vendor? @raghavrv had started a Cython ARFF reader but I don't think he finished it. It might be interesting?

@jnothman (Member) commented Jun 5, 2018 via email

@joaquinvanschoren (Contributor)

@mfeurer: you have been working a lot with/on liac_arff. What is your opinion?

@jnothman (Member)

Unless I am much mistaken, I think we can solve this for now with liac-arff, and support the sparse and dense cases seamlessly. As long as we also exploit any typing information provided by openml as metadata, I would think we can make the interface quite stable, and the fact that we're using liac-arff internally will remain an implementation detail. I'm keen to have this finished and released.

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2018
@mfeurer (Contributor) commented Jun 14, 2018

Sorry @joaquinvanschoren, I didn't receive an email for this. Could you please let me know what exactly you want my opinion on. A generic answer: ARFF has a very bad format specification and many undocumented edge cases. To the best of my knowledge, liac-arff has better support for those than the scipy ARFF reader, but is also a lot slower. It's not actively developed, and I try to devote some time to maintaining it. If liac-arff gets broader attention through its usage in scikit-learn, that would be amazing. @jnothman do you want to integrate liac-arff the same way you're using joblib?

@jnothman (Member) commented Jun 14, 2018 via email

@jorisvandenbossche (Member)

It was probably discussed above, but why not simply download and read in the CSV file?

@jorisvandenbossche (Member)

Ah, OK, reading for myself :-) So the problem with that is the sparse datasets.

@jnothman (Member) commented Jun 14, 2018 via email

@jorisvandenbossche (Member)

Regardless of how it is read in, IMO not returning pandas DataFrames for the dense datasets from this functionality would also make it a lot less functional ... (but that's probably for another discussion ;-))

@jorisvandenbossche (Member)

BTW, I wanted to give this another try (although it will probably need to be reworked quite a bit to use liac-arff), but the current version clearly has a bug in that it simply drops non-numeric columns in the .data array.
For example, with mice = fetch_openml('miceprotein', version=4), mice.data is missing 4 columns (and there are 4 nominal features).

One of the problems with returning an object array in the case of mixed dtypes is that if you then simply convert it to a dataframe with pd.DataFrame(mice.data, columns=mice.feature_names), you still have object dtype. E.g.:

In [24]: pd.DataFrame(np.array([[1, 'a'], [2, 'b']], dtype=object)).dtypes
Out[24]: 
0    object
1    object
dtype: object

You then would need to call .infer_objects() to convert the numerical features to actual numerical columns, which is a bit annoying for the user to have to do (and also inefficient).
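To illustrate that extra step (a small sketch extending the example above):

```python
import numpy as np
import pandas as pd

# Converting an object-dtype array keeps every column as object dtype...
df = pd.DataFrame(np.array([[1, 'a'], [2, 'b']], dtype=object))
print(list(df.dtypes))  # [dtype('O'), dtype('O')]

# ...until infer_objects() re-inspects the values column by column.
inferred = df.infer_objects()
print(inferred.dtypes[0])  # int64: the numeric column is recovered
print(inferred.dtypes[1])  # object: the string column stays object
```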

I understand relying fully on pandas here is not desirable, but would it be possible to add an option to return a DataFrame?

@jnothman (Member) commented Jun 14, 2018 via email

@amueller (Member, Author)

OK, fine with liac-arff vendoring. I'll let @janvanrijn take this over.

@jnothman (Member) commented Jun 20, 2018 via email

@qinhanmin2014 (Member)

Closed for #11419

10 participants