
[MRG] Memory usage of OpenML fetcher: use generator from arff #13312

Merged
merged 5 commits into scikit-learn:master on Feb 28, 2019

Conversation

jorisvandenbossche
Member

This reduces the memory usage of fetch_openml (for the case when arrays are returned, not for sparse matrices) by consuming the arff data as a generator instead of a list of lists.

My main question / concern is whether the 'NumberOfFeatures' / 'NumberOfInstances' metadata from OpenML is guaranteed to always be available.

This should indirectly fix #13287 (the hypothesis is that the doc build on CircleCI is failing because of a memory issue when running an example that fetches data from OpenML).
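
For illustration, the gist of the change might look like the following minimal sketch (the helper name and signature here are made up, not the PR's actual code): instead of materializing a list of lists and converting it with np.array, preallocate an array of the known shape and fill it row by row from the generator.

import numpy as np

def _rows_to_array(row_generator, shape):
    # Preallocate the output using the shape derived from the OpenML
    # 'NumberOfInstances' / 'NumberOfFeatures' qualities, then fill it
    # row by row so a full list of lists never exists in memory.
    X = np.empty(shape, dtype=np.float64)
    for i, row in enumerate(row_generator):
        X[i] = row
    return X

# Example usage with any iterable of numeric rows:
# X = _rows_to_array(arff_rows, shape=(n_samples, n_features))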

@jorisvandenbossche
Member Author

@janvanrijn @jnothman @rth any idea how to generate the needed test data?
I need a new response from the OpenML server (https://github.com/scikit-learn/scikit-learn/pull/13312/files#diff-ea672b15dd808c88257c58681d17bb6aR341), and so I also need to generate those responses for the offline tests.

But running them 'online' makes the tests fail (since the offline files are truncated versions), and it also does not seem to generate the responses in my local scikit-learn data home.

@janvanrijn
Contributor

The number of observations and features is usually calculated several minutes after a dataset is uploaded, provided we can parse the dataset on the server.

@jorisvandenbossche
Member Author

@janvanrijn Thanks. So we can assume that this information will always be available, then? (What happens if the data could not be parsed? Is the dataset then also unavailable for download?)

Further, do you remember how you originally constructed the gzipped responses included in the tests/data?

@janvanrijn
Contributor

(on my phone so can't quote nicely)

There's a status field in the dataset description. If the status field is 'active', we could parse the dataset on the server. (Alternatively, the status can be 'in_preparation' or 'deactivated'.)
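
For illustration only (the dict layout and function below are assumptions, not code from the fetcher), a check on that status field could look like:

import warnings

def dataset_is_active(description):
    # 'status' is part of the OpenML dataset description; only 'active'
    # datasets are guaranteed to have been parsed on the server, so only
    # for those can we rely on the qualities (and hence the shape) being set.
    status = description.get('status')
    if status in ('in_preparation', 'deactivated'):
        warnings.warn("Dataset has status %r; its qualities may be missing."
                      % status)
    return status == 'active'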

Can you rephrase your second question? I downloaded them by hand from the OpenML server, removed some records, and gzipped them using the Unix gzip command, but I guess that's not the answer you're looking for.

@jorisvandenbossche
Member Author

Can you rephrase your second question? I downloaded them by hand from the OpenML server, removed some records, and gzipped them using the Unix gzip command, but I guess that's not the answer you're looking for.

I think that is exactly the answer I was looking for (I was just hoping there would be an easier way :-))

@jnothman
Member

jnothman commented Feb 28, 2019 via email

@jorisvandenbossche
Member Author

Yes, so there are certain datasets whose data can be downloaded but that, because of a processing error, might not have the "qualities" filled in (which is where we obtain the expected shape).

Currently, we warn in this case and still return the data.
We could keep that behaviour by falling back to the old dense list of lists in such a case. It's not hard to do, but it of course adds some extra complexity to the code.
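
In sketch form (names and signature are illustrative, not the PR's exact code), that fallback could extend the earlier sketch with a second branch:

import numpy as np

def arff_rows_to_array(rows, shape=None):
    # Stream the generator into a preallocated array when the expected
    # shape is known from the qualities; otherwise fall back to the old
    # behaviour of building a dense list of lists first.
    if shape is None:
        return np.array(list(rows), dtype=np.float64)
    X = np.empty(shape, dtype=np.float64)
    for i, row in enumerate(rows):
        X[i] = row
    return X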

@jorisvandenbossche jorisvandenbossche changed the title Memory usage of OpenML fetcher: use generator from arff [MRG] Memory usage of OpenML fetcher: use generator from arff Feb 28, 2019
@jorisvandenbossche
Member Author

OK, I added a workaround to still process the data when the data qualities are not available, to keep the existing behaviour.

This should be ready to review now.

Member

@jnothman jnothman left a comment

Otherwise LGTM

(I wonder if it's worth squeezing this and the arff update into 0.20.3)

    return None
for d in data_qualities:
    if d['name'] == 'NumberOfFeatures':
        n_features = int(float(d['value']))
Member

Are we doing float because there's a . in the data or something??

Member Author

Yes ..
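
For context, a quick illustration of why the float round-trip is needed (the value "150.0" below is just an example, not taken from a particular dataset): the qualities report counts as strings containing a decimal point, so a plain int() call raises.

int("150.0")         # raises ValueError: invalid literal for int() with base 10: '150.0'
int(float("150.0"))  # returns 150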

# the number of samples / features
if data_qualities is None:
    return None
for d in data_qualities:
Member

How about we make the code more readable/pythonic (?) by starting with data_qualities = {d['name']: d['value'] for d in data_qualities}?

Member Author

Ah, that's better! (It is just a very annoying way that the data is stored .. :-))
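
A sketch of what the suggested refactor could look like (the function name _get_data_shape appears in this PR's diff, but the body below is illustrative, not the merged code verbatim):

def _get_data_shape(data_qualities):
    # Convert the list of {'name': ..., 'value': ...} records into a
    # plain dict so individual qualities can be looked up by name.
    if data_qualities is None:
        return None
    qualities = {d['name']: d['value'] for d in data_qualities}
    try:
        n_samples = int(float(qualities['NumberOfInstances']))
        n_features = int(float(qualities['NumberOfFeatures']))
    except KeyError:
        return None
    return n_samples, n_features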

if not return_sparse:
    data_qualities = _get_data_qualities(data_id, data_home)
    shape = _get_data_shape(data_qualities)
    # if the data qualities were not available, we cans still get the
Member

cans -> can

Member

@adrinjalali adrinjalali left a comment

LGTM, thanks @jorisvandenbossche !

@jnothman I don't think it's a must to have this in 0.20.3, but it certainly doesn't hurt to have it.

Member

@glemaitre glemaitre left a comment

LGTM

@adrinjalali adrinjalali merged commit 1f75ffa into scikit-learn:master Feb 28, 2019
@jorisvandenbossche jorisvandenbossche deleted the openml-generator branch February 28, 2019 14:14
jnothman pushed a commit to jnothman/scikit-learn that referenced this pull request Feb 28, 2019
…earn#13312)

* Memory usage of OpenML fetcher: use generator from arff

* fix actually getting data qualities in all cases

* Add qualities responses

* add workaround for cases where data qualities are not available

* feedback joel
jnothman added a commit to jnothman/scikit-learn that referenced this pull request Feb 28, 2019
jnothman added a commit that referenced this pull request Mar 1, 2019
Successfully merging this pull request may close these issues.

CircleCI keeps failing on master