[MRG] Memory usage of OpenML fetcher: use generator from arff #13312
Conversation
@janvanrijn @jnothman @rth any idea how to generate the needed test data? Running the tests 'online' makes them fail (as offline we use truncated versions), and it also does not seem to generate the responses in my local scikit-learn data home.
Number of observations and features are usually calculated several minutes after a dataset is uploaded, if we can parse the dataset on the server.
@janvanrijn Thanks. So we can assume that this information will always be available then? (What happens if the data could not be parsed? Then the dataset is also not available to download?) Further, do you remember how you originally constructed the gzipped responses included in tests/data?
(On my phone so can't quote nicely.) There's a status field in the dataset description. If the status field is 'active', we could parse the dataset on the server. (Alternatively, the status could be 'in_preparation' or 'deactivated'.) Can you rephrase your second question? I downloaded them by hand from the OpenML server, removed some records and gzipped them using the unix gzip command, but I guess that's not the answer you're looking for.
I think that is exactly the answer I was looking for (I only hoped there would be an easier way :-))
I don't mind changing to error on status != 'active' if that is not current behaviour.
Yes, so there are certain datasets whose data can be downloaded but which, because of a processing error, might not have the "qualities" filled in (which is where we obtain the expected shape). Currently, we warn in this case and still return the data.
OK, I added a workaround to still process the data when the data qualities are not available, to keep the existing behaviour. This should be ready to review now. |
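A minimal sketch of the status handling discussed above, assuming a dataset description shaped like the OpenML API response (the `description` dict here is a hypothetical stand-in, not the actual fetch_openml internals): only 'active' datasets are guaranteed to have been parsed server-side, and the existing behaviour is to warn rather than error for other statuses.

```python
import warnings

# Hypothetical dataset description; 'status' can also be
# 'in_preparation' or 'deactivated' per the discussion above.
description = {"status": "active"}

if description["status"] != "active":
    # existing behaviour: warn and still return the data
    warnings.warn("Version of dataset is inactive, results may be unexpected")
```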
Otherwise LGTM
(I wonder if it's worth squeezing this and the arff update into 0.20.3)
sklearn/datasets/openml.py
Outdated
    return None
for d in data_qualities:
    if d['name'] == 'NumberOfFeatures':
        n_features = int(float(d['value']))
Are we doing float because there's a . in the data or something??
Yes ..
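To illustrate the exchange above: the OpenML qualities metadata reports counts as float-formatted strings (a value like "150.0" is an assumed example here), so a direct `int()` call on the string raises, and going through `float()` first is the workaround.

```python
value = "150.0"  # hypothetical quality value; counts come back float-formatted
try:
    n = int(value)          # raises ValueError: invalid literal for int()
except ValueError:
    n = int(float(value))   # parse as float first, then truncate to int
```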
sklearn/datasets/openml.py
Outdated
# the number of samples / features
if data_qualities is None:
    return None
for d in data_qualities:
How about we make the code more readable/pythonic (?) by starting with data_qualities = {d['name']: d['value'] for d in data_qualities}?
Ah, that's better! (it is just a very annoying way that the data is stored .. :-))
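A sketch of the suggested rewrite, using a hypothetical sample of the list-of-dicts structure the qualities endpoint returns: indexing by name once replaces the per-quality loop.

```python
# Hypothetical sample of the structure returned for data qualities
data_qualities = [
    {"name": "NumberOfFeatures", "value": "4.0"},
    {"name": "NumberOfInstances", "value": "150.0"},
]

# Suggested rewrite: build a name -> value mapping once
qualities = {d["name"]: d["value"] for d in data_qualities}
n_features = int(float(qualities["NumberOfFeatures"]))
n_samples = int(float(qualities["NumberOfInstances"]))
```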
sklearn/datasets/openml.py
Outdated
if not return_sparse:
    data_qualities = _get_data_qualities(data_id, data_home)
    shape = _get_data_shape(data_qualities)
    # if the data qualities were not available, we cans still get the
cans -> can
LGTM, thanks @jorisvandenbossche !
@jnothman I don't think it's a must to have this in 0.20.3, but it certainly doesn't hurt to have it.
LGTM
…earn#13312)
* Memory usage of OpenML fetcher: use generator from arff
* fix actually getting data qualities in all cases
* Add qualities responses
* add workaround for cases where data qualities are not available
* feedback joel
This reverts commit f7345af.
…cikit-learn#13312)" This reverts commit 024c9ba.
This reduces the memory usage of fetch_openml (for the case when arrays are returned, not for sparse matrices) by consuming the arff data as a generator instead of a list of lists.

My main question / concern is whether this 'NumberOfFeatures' / 'NumberOfInstances' metadata from OpenML is guaranteed to always be available.
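A minimal sketch of the idea behind the change, with illustrative names rather than the actual fetch_openml internals: because the expected shape is known ahead of time from the qualities metadata, rows from a generator can fill a preallocated array one at a time, instead of first materializing the whole dataset as a list of lists.

```python
import numpy as np

def row_generator():
    # stands in for the arff generator yielding one data row at a time
    for i in range(5):
        yield [float(i), float(i) * 2.0]

# shape known up front from 'NumberOfInstances' / 'NumberOfFeatures'
shape = (5, 2)
X = np.empty(shape, dtype=np.float64)
for i, row in enumerate(row_generator()):
    X[i] = row  # only one row of Python objects alive at a time
```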
This should indirectly fix #13287 (the hypothesis is that the doc building on Circle CI is failing because of a memory issue when running an example fetching data from OpenML).