
[MRG] Openml loader #9908


Closed

wants to merge 32 commits into from

Conversation

@amueller (Member) commented Oct 11, 2017

Fixes #9543.

Todo:

  • docstrings
  • userguide
  • tests
  • handle data set versions
  • caching

@lesteve (Member) commented Oct 11, 2017

Looking at the flake8 errors, there seem to be genuine problems, e.g.:

./sklearn/datasets/openml.py:81:15: F821 undefined name 'urlretrieve'
    json_dl = urlretrieve(json_loc.format(name))[0]
              ^
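A likely fix for that F821 (a sketch, not necessarily the PR's eventual code): urlretrieve moved between Python 2 and 3, so it needs a version-compatible import. The URL template below is hypothetical, just mirroring the flagged line.

```python
# urlretrieve lives in urllib.request on Python 3 and urllib on Python 2,
# so the module needs an explicit, version-compatible import.
try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2

# Hypothetical usage mirroring the flagged line (network call left commented):
json_loc = "https://openml.org/api/v1/json/data/list/data_name/{}"
# json_dl = urlretrieve(json_loc.format("iris"))[0]
print(callable(urlretrieve))  # True
```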

Not familiar at all with openml — do you have any values you can suggest for name_or_id if we want to play with the fetcher?

@amueller (Member, Author)

This is not ready for review yet ;) I wanted to share in case @vrishank97 was working on it. Give me maybe 3 more hours; I'm running between meetings and sprints.

@amueller (Member, Author)

You can try "iris" for name_or_id.

@amueller changed the title from "Openml loader" to "[WIP] Openml loader" on Oct 11, 2017
@amueller (Member, Author)

Hm, not sure how to catch the error on Python 2. Seems there are different errors thrown on Python 2 and Python 3.

@amueller (Member, Author)

Hm, I thought returning an active dataset by default might be a good idea, but the caching kind of interferes with that. Either way, the caching might interfere with warning people about deactivated datasets. Not sure what to do about it. We could invalidate the cache after a certain amount of time, or only use it if there's no internet connection (at least for the meta-data, which can change).

@amueller (Member, Author)

The handling of versions will be much easier if there's another filter added to openml, which will be done hopefully today or tomorrow.

@vrishank97 (Contributor)

I've also made some progress on this issue; should I make a separate PR or work on this one? And what are the benefits of using ARFF over CSV with JSON?

@amueller (Member, Author)

Sure, can you send a PR so we can merge efforts? It would have been great if you'd sent a PR earlier so we had less duplicate work. I'd really like to get this into some reasonable state this week.
ARFF possibly has more meta-information; maybe not really necessary.

@amueller (Member, Author)

I'm not entirely sure what to do for tests. We have done some mocking for mldata, but here the requests are somewhat more complex. I could do the mocking for all of them if you think that's necessary.
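A sketch of what such mocking could look like (hypothetical; the fetcher internals here are made up), using the standard library's unittest.mock to fake the HTTP response so tests never touch the network:

```python
import io
from unittest import mock

# Hypothetical loader function standing in for the real fetcher internals.
def download_description(data_id):
    from urllib.request import urlopen
    return urlopen("https://openml.org/api/v1/json/data/{}".format(data_id)).read()

# In a test, replace urlopen so no network access happens.
fake_body = b'{"data_set_description": {"name": "iris"}}'
with mock.patch("urllib.request.urlopen",
                return_value=io.BytesIO(fake_body)) as fake_urlopen:
    body = download_description(61)

print(body == fake_body)        # True: the fake payload came back
print(fake_urlopen.call_count)  # 1
```

Each distinct request shape (data description, features, the data file itself) would need its own canned payload, which is the extra work compared to the mldata mocking.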

@amueller (Member, Author)

I'm also not sure what would be the best datatype to return for mixed data. If there are strings in there, we can either use a recarray, cast everything to string, or make it object. By default everything becomes strings, which doesn't seem great :-/

@albertcthomas (Contributor)

I don't know if that helps, but for fetch_kddcup99, where data also contains strings, we have dtype=object.

from sklearn.datasets import fetch_kddcup99
dataset = fetch_kddcup99()
dataset.data
array([[0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.0, 0.0, 0.0],
       ...,
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0],
       [0, b'tcp', b'http', ..., 0.01, 0.0, 0.0]], dtype=object)

@amueller (Member, Author)

Thanks @albertcthomas. The question was more how to detect that easily, but I could pull it out of the ARFF data.
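One way such a check could look (a minimal sketch, not the PR's code; the column data here is made up): inspect the dtype kind of each parsed column and fall back to object only when a non-numeric column is present.

```python
import numpy as np

# Hypothetical parsed columns standing in for the ARFF/CSV payload.
columns = {
    'sepal_length': np.array([5.1, 4.9, 4.7]),
    'species': np.array(['setosa', 'setosa', 'versicolor']),
}

# Numeric dtype kinds: bool, signed/unsigned int, float.
kinds = {col.dtype.kind for col in columns.values()}
all_numeric = kinds <= {'b', 'i', 'u', 'f'}

dtype = None if all_numeric else object
X = np.array([columns[c] for c in columns], dtype=dtype).T
print(X.dtype)  # object, because the 'species' column holds strings
```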

@amueller (Member, Author)

@lesteve now you can have a look ;) No tests yet, though, and the version stuff is not supported yet. Will bug Jan to make it easier.

@amueller (Member, Author) commented Oct 12, 2017

Maybe miceprotein is not good and I should use adult instead... (miceprotein is trivial to classify)

@jnothman (Member) commented Oct 15, 2017 via email

@amueller (Member, Author)

@jnothman this is actually what the openml-python lib does. Right now I opted to convert to object dtype here. I like the dataset fetcher to not do any preprocessing, as that reflects better what people will encounter in "the real world".

@jnothman (Member)

Current test failures in this PR are HTTP 404 errors, though...

@joaquinvanschoren (Contributor)

@jnothman Indeed. I opened an issue about that: openml/OpenML#628

@jnothman (Member) left a comment

I agree, @joaquinvanschoren, that this is pretty high priority.

It looks like we can't use numpy's genfromtxt reliably...


target_column : string or None, default 'default-target'
Specify the column name in the data to use as target. If
'default-target', the standard target column a stored on the server
(Member)
a stored -> as stored

Specify the column name in the data to use as target. If
'default-target', the standard target column a stored on the server
is used. If ``None``, all columns are returned as data and the
tharget is ``None``.
(Member)

*target

is used. If ``None``, all columns are returned as data and the
tharget is ``None``.

memory : boolean, default=True
(Member)

We should probably support that case and reference :term:`memory`


data_description = _get_data_description_by_id_(data_id)
if data_description['status'] != "active":
    warn("Version {} of dataset {} is inactive, meaning that issues have"
(Member)

Well, perhaps we should only say "issues have been found in the dataset" if 'inactive'.

def _download_data_csv(file_id):
    response = urlopen("https://openml.org/data/v1/get_csv/{}".format(file_id))
    data = np.genfromtxt(response, names=True, dtype=None, delimiter=',',
                         missing_values='?')
(Member)

What happens to missing values in the output?

(Member)

The missing value handling seems a bit weird: if all values in a column are ints (signed or unsigned), the missing value is -1!!; if one is a float, the missing value is np.nan; if strings, the missing value stays as '?'.

Actually, no, even with usemask=True and missing_values='?' it seems -1 is detected as a missing value!!! I don't trust this...

np.genfromtxt(io.BytesIO(b'a,b\n-1,?\n2,b\n3,c'), names=True, dtype=None, delimiter=',', missing_values='?', usemask=True)

    dtype = None
else:
    dtype = object
X = np.array([data[c] for c in data_columns], dtype=dtype).T
(Member)

If everything is of the same dtype, it is possible to view the recarray as a scalar-type ndarray without copying like this.
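For reference, one way that no-copy view can be done (a sketch under the assumption that all fields share one scalar dtype; the array below is made up):

```python
import numpy as np

# A structured (record-like) array whose fields all share one dtype...
rec = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)],
               dtype=[('a', 'f8'), ('b', 'f8'), ('c', 'f8')])

# ...can be viewed as a plain 2-D float array without copying the data.
X = rec.view(np.float64).reshape(len(rec), -1)
print(X.shape)                   # (2, 3)
print(np.shares_memory(X, rec))  # True: it is a view, not a copy
```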

@jnothman (Member)

We need to support reading Sparse ARFF-formatted CSV... or reject sparse datasets

@jnothman (Member)

Given that I have reasons to doubt numpy.genfromtxt and reasons to doubt openml's CSV provisions, I'm strongly in favour of investigating the quality of liac-arff and vendoring it as a separately maintained and tested reader of OpenML's most supported format.

@joaquinvanschoren (Contributor)

To be honest I'm not sure how quickly we'll have that sparse CSV export implemented (or whether you can even read it without adding extra dependencies?).

I'd go ahead and either:

  • Go the CSV route without supporting sparse data for now. It's a simple check and we can add support later.
  • Go the ARFF route with liac_arff

@jnothman (Member) commented Feb 19, 2018

Given all the issues with np.genfromtxt, I'd rather steer clear of generic CSV handling.

@jnothman (Member)

@amueller can we get someone else to finish this?

@amueller (Member, Author) commented Jun 4, 2018

@jnothman so you suggest adding a dependency on liac_arff? And you wanted to vendor it? Why vendor? @raghavrv had started a Cython ARFF reader but I don't think he finished it. It might be interesting?

@jnothman (Member) commented Jun 5, 2018 via email

@joaquinvanschoren (Contributor)

@mfeurer: you have been working a lot with/on liac_arff. What is your opinion?

@jnothman (Member)

Unless I am much mistaken, I think we can solve this for now with liac-arff, and support the sparse and dense cases seamlessly. As long as we also exploit any typing information provided by openml as metadata, I would think we can make the interface quite stable, and the fact that we're using liac-arff internally will remain an implementation detail. I'm keen to have this finished and released.

@jnothman jnothman added this to the 0.20 milestone Jun 14, 2018
@mfeurer (Contributor) commented Jun 14, 2018

Sorry @joaquinvanschoren, I didn't receive an email for this. Could you please let me know what exactly you want my opinion on. A generic answer: ARFF has a very bad format specification and many undocumented edge cases. To the best of my knowledge, liac-arff has better support for those than the scipy ARFF reader, but is also a lot slower. It's not actively developed, and I try to devote some time to maintaining it. If liac-arff gets broader attention through its usage in scikit-learn, that would be amazing. @jnothman do you want to integrate liac-arff the same way you're using joblib?

@jnothman (Member) commented Jun 14, 2018 via email

@jorisvandenbossche (Member)

It was probably discussed above, but why not simply download and read in the CSV file?

@jorisvandenbossche (Member)

Ah, OK, reading for myself :-) So the problem with that is the sparse datasets.

@jnothman (Member) commented Jun 14, 2018 via email

@jorisvandenbossche (Member)

Regardless of how it is read in, IMO not returning pandas DataFrames for the dense datasets from this functionality would also make it a lot less functional ... (but that's probably for another discussion ;-))

@jorisvandenbossche (Member)

BTW, I wanted to give this another try (although it will probably need to be reworked quite a bit to use liac-arff), but the current version clearly has a bug in that it simply drops non-numeric columns in the .data array.
For example, with mice = fetch_openml('miceprotein', version=4), mice.data is missing 4 columns (and there are 4 nominal features).

One of the problems with returning an object array in the case of mixed dtypes is that if you then simply convert it to a dataframe with pd.DataFrame(mice.data, columns=mice.feature_names), you still have object dtype. E.g.:

In [24]: pd.DataFrame(np.array([[1, 'a'], [2, 'b']], dtype=object)).dtypes
Out[24]: 
0    object
1    object
dtype: object

You then would need to call .infer_objects() to convert the numerical features to actual numerical columns, which is a bit annoying for the user to have to do (and also inefficient).
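To illustrate that extra step (a small sketch extending the example above):

```python
import numpy as np
import pandas as pd

# Converting an object-dtype array keeps every column as object dtype...
df = pd.DataFrame(np.array([[1, 'a'], [2, 'b']], dtype=object))
print(list(df.dtypes))  # [dtype('O'), dtype('O')]

# ...until infer_objects() re-inspects the values column by column.
inferred = df.infer_objects()
print(inferred.dtypes[0])  # int64: the numeric column is recovered
print(inferred.dtypes[1])  # object: the string column stays object
```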

I understand relying fully on pandas here is not desirable, but would it be possible to add an option to return a DataFrame?

@jnothman (Member) commented Jun 14, 2018 via email

@amueller (Member, Author)

OK, fine with liac-arff vendoring. I'll let @janvanrijn take this over.

@jnothman (Member) commented Jun 20, 2018 via email

@qinhanmin2014 (Member)

Closed for #11419

10 participants