[MRG] Memory usage of OpenML fetcher: use generator from arff #13312
Conversation
@janvanrijn @jnothman @rth any idea how to generate the needed test data? Running the tests 'online' makes them fail (as offline we use truncated versions), and it also does not seem to generate the responses in my local scikit-learn data home.
Number of observations and features are usually calculated several minutes after a dataset is uploaded, if we can parse the dataset on the server.
@janvanrijn Thanks. So we can assume that this information will always be available then? (What happens if the data could not be parsed? Then the dataset is also not available to download?) Further, do you remember how you originally constructed the gzipped responses included in tests/data?
(On my phone so can't quote nicely.) There's a status field in the dataset description. If the status field is 'active', we could parse the dataset on the server. (Alternatively, the status could be 'in_preparation' or 'deactivated'.) Can you rephrase your second question? I downloaded them by hand from the OpenML server, removed some records and gzipped them using the unix gzip command, but I guess that's not the answer you're looking for.
I think that is exactly the answer I was looking for (I only hoped there would be an easier way :-))
I don't mind changing to error on status != 'active' if that is not current behaviour.
Yes, so there are certain datasets whose data can be downloaded but which, because of a processing error, might not have the "qualities" filled in (which is where we obtain the expected shape). Currently, we warn in this case and still return the data.
OK, I added a workaround to still process the data when the data qualities are not available, to keep the existing behaviour. This should be ready to review now. |
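A minimal sketch of the status handling discussed above, assuming a dataset description shaped like the OpenML API response (the `description` dict here is a hypothetical stand-in, not the actual fetch_openml internals): only 'active' datasets are guaranteed to have been parsed server-side, and the existing behaviour is to warn rather than error for other statuses.

```python
import warnings

# Hypothetical dataset description; 'status' can also be
# 'in_preparation' or 'deactivated' per the discussion above.
description = {"status": "active"}

if description["status"] != "active":
    # existing behaviour: warn and still return the data
    warnings.warn("Version of dataset is inactive, results may be unexpected")
```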
Otherwise LGTM
(I wonder if it's worth squeezing this and the arff update into 0.20.3)
sklearn/datasets/openml.py
Outdated
    return None
for d in data_qualities:
    if d['name'] == 'NumberOfFeatures':
        n_features = int(float(d['value']))
Are we doing float because there's a . in the data or something??
Yes ..
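To illustrate the exchange above: the OpenML qualities metadata reports counts as float-formatted strings (a value like "150.0" is an assumed example here), so a direct `int()` call on the string raises, and going through `float()` first is the workaround.

```python
value = "150.0"  # hypothetical quality value; counts come back float-formatted
try:
    n = int(value)          # raises ValueError: invalid literal for int()
except ValueError:
    n = int(float(value))   # parse as float first, then truncate to int
```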
sklearn/datasets/openml.py
Outdated
# the number of samples / features
if data_qualities is None:
    return None
for d in data_qualities:
How about we make the code more readable/pythonic (?) by starting with data_qualities = {d['name']: d['value'] for d in data_qualities}?
Ah, that's better! (it is just a very annoying way that the data is stored .. :-))
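A sketch of the suggested rewrite, using a hypothetical sample of the list-of-dicts structure the qualities endpoint returns: indexing by name once replaces the per-quality loop.

```python
# Hypothetical sample of the structure returned for data qualities
data_qualities = [
    {"name": "NumberOfFeatures", "value": "4.0"},
    {"name": "NumberOfInstances", "value": "150.0"},
]

# Suggested rewrite: build a name -> value mapping once
qualities = {d["name"]: d["value"] for d in data_qualities}
n_features = int(float(qualities["NumberOfFeatures"]))
n_samples = int(float(qualities["NumberOfInstances"]))
```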
sklearn/datasets/openml.py
Outdated
if not return_sparse:
    data_qualities = _get_data_qualities(data_id, data_home)
    shape = _get_data_shape(data_qualities)
    # if the data qualities were not available, we cans still get the
cans -> can
LGTM, thanks @jorisvandenbossche !
@jnothman I don't think it's a must to have this in 0.20.3, but it certainly doesn't hurt to have it.
LGTM
…earn#13312)
* Memory usage of OpenML fetcher: use generator from arff
* fix actually getting data qualities in all cases
* Add qualities responses
* add workaround for cases where data qualities are not available
* feedback joel
This reverts commit f7345af.
…cikit-learn#13312)" This reverts commit 024c9ba.
This reduces the memory usage of fetch_openml (for the case when arrays are returned, not for sparse matrices) by consuming the arff data as a generator instead of a list of lists.

My main question / concern is whether this 'NumberOfFeatures' / 'NumberOfInstances' metadata from OpenML is guaranteed to always be available.
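A minimal sketch of the idea behind the change, with illustrative names rather than the actual fetch_openml internals: because the expected shape is known ahead of time from the qualities metadata, rows from a generator can fill a preallocated array one at a time, instead of first materializing the whole dataset as a list of lists.

```python
import numpy as np

def row_generator():
    # stands in for the arff generator yielding one data row at a time
    for i in range(5):
        yield [float(i), float(i) * 2.0]

# shape known up front from 'NumberOfInstances' / 'NumberOfFeatures'
shape = (5, 2)
X = np.empty(shape, dtype=np.float64)
for i, row in enumerate(row_generator()):
    X[i] = row  # only one row of Python objects alive at a time
```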
This should indirectly fix #13287 (the hypothesis is that the doc building on Circle CI is failing because of a memory issue when running an example fetching data from OpenML).