fetch_openml: perhaps cache decoded ARFF #11821


Closed
jnothman opened this issue Aug 15, 2018 · 14 comments · Fixed by #21938
Labels: Enhancement, Moderate, module:datasets

Comments

@jnothman (Member)

We currently store the results of each HTTP request performed by fetch_openml. It can still take a substantial amount of time to decode the fetched ARFF data into numpy arrays or sparse matrices. We could instead (or also) cache the decoded data on disk, so that the ARFF does not need to be parsed on repeated calls.

Alternatively, we can hope that OpenML provides a more compact data representation some time soon, as per openml/OpenML#388.
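A minimal sketch of what such an on-disk cache could look like (the wrapper function, the file layout, and caching only the feature matrix are all hypothetical here, not existing scikit-learn behavior):

import os
import numpy as np
from sklearn.datasets import fetch_openml

def fetch_openml_cached(name, data_home="~/scikit_learn_data"):
    # Hypothetical helper: store the decoded feature matrix next to the
    # raw ARFF so repeated calls skip the slow ARFF parse entirely.
    path = os.path.join(os.path.expanduser(data_home), name + "_X.npy")
    if os.path.exists(path):
        return np.load(path)
    X = fetch_openml(name).data  # downloads (or reuses) and parses the ARFF
    np.save(path, X)
    return X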

@jnothman added the Enhancement, Moderate, and help wanted labels on Aug 15, 2018
@amueller (Member) commented Mar 7, 2019

I just ran into this.

from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784")  # slow even when the HTTP response is already cached on disk

takes 8 seconds on my machine.
I would suggest not waiting for OpenML.

@jnothman (Member, Author) commented Mar 7, 2019 via email

@amueller (Member) commented Mar 8, 2019

Sorry, the last try actually took 40s. Yes, on master.
Maybe the chunking made it slower? Not sure.

@thomasjpfan (Member)

I get around 16 seconds on master, and 20 seconds on 0.20.2.

Regarding the ARFF mnist dataset, the biggest bottlenecks are _arff._decode_values and _arff._parse_values, which together use about 80% of the time. Specifically, _arff._decode_values calls float 70k times and _arff._parse_values runs a regex 70k times.
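For reference, this kind of hotspot breakdown can be reproduced with the standard-library profiler (the "_arff" filter just restricts the report to liac-arff internals):

import cProfile
import pstats
from sklearn.datasets import fetch_openml

cProfile.run('fetch_openml("mnist_784")', "fetch.prof")  # profile the full fetch
pstats.Stats("fetch.prof").sort_stats("cumulative").print_stats("_arff")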

For mnist, the ARFF dataset uses 15.5 MB. If we were to cache the mnist dataset with np.savez_compressed, it would use 23 MB.
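For concreteness, the compressed-cache experiment looks roughly like this (the filename is illustrative, and only the pixel matrix is cached to keep the sketch simple):

import numpy as np
from sklearn.datasets import fetch_openml

X = fetch_openml("mnist_784").data
np.savez_compressed("mnist_784.npz", X=X)  # ~23 MB on disk per the numbers above

with np.load("mnist_784.npz") as cached:   # reloading is much faster than re-parsing
    X = cached["X"]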

If we save the numpy array, it would be good to remove the data from the ARFF file and keep only its metadata.

@amueller (Member) commented Mar 8, 2019

Thanks for checking. How is the numpy file bigger than the arff file? That makes no sense to me.

How much faster is it to cache?
We could also use joblib memory for caching, which I think is what I originally implemented.
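For reference, the joblib route is a one-liner around the existing function (the cache directory is arbitrary, and this pickles the whole returned Bunch rather than just the arrays):

from joblib import Memory
from sklearn.datasets import fetch_openml

memory = Memory("./openml_cache", verbose=0)
cached_fetch_openml = memory.cache(fetch_openml)

mnist = cached_fetch_openml("mnist_784")  # first call parses the ARFF; later calls read from disk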

@jnothman (Member, Author) commented Mar 9, 2019 via email

@rth (Member) commented Mar 9, 2019

It's very possible that we can still speed up liac-arff, even though I tried hard to get it as fast as possible without moving to C.

Comparing its performance to pandas.read_csv, there is certainly room for improvement if we use Cython/C/etc. The question is whether that additional effort would be justified. A mid-term solution would be to use some better file format, which could also solve other issues with ARFF.

@jnothman (Member, Author) commented Mar 10, 2019 via email

@thomasjpfan (Member)

@amueller:

> Thanks for checking. How is the numpy file bigger than the arff file? That makes no sense to me.

I suspect that the numpy file uses more space because it stores the data as float64. The original ARFF file contains mostly 0s and a few small uint8-range values to store the images.
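The dtype effect is easy to quantify (plain arithmetic; mnist_784 is 70k rows of 784 pixel values):

n_values = 70_000 * 784      # rows x columns for mnist_784
print(n_values * 8 / 1e6)    # float64: ~439 MB of raw values to compress
print(n_values * 1 / 1e6)    # uint8:   ~55 MB, and mostly zeros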

> How much faster is it to cache?

The first (caching) call will be slower, because of the following:

  1. Download arff file and save arff to disk.
  2. Read ARFF file and check if we support caching it as a numpy array or a scipy sparse matrix.
  3. Write numpy array into cache. (For mnist this takes ~8 seconds)
  4. Maybe remove the data from the arff file, leaving the metadata.

Loading the numpy array takes ~2 seconds for the mnist dataset.

> We could also use joblib memory for caching, which I think is what I originally implemented.

joblib is not currently used.

For saving sparse matrices, we can use scipy.sparse.save_npz.
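scipy.sparse.save_npz round-trips like this (the random matrix is just a stand-in for a decoded sparse dataset):

import scipy.sparse as sp

X = sp.random(1000, 500, density=0.01, format="csr")  # stand-in sparse data
sp.save_npz("X_sparse.npz", X)
X_back = sp.load_npz("X_sparse.npz")
assert (X != X_back).nnz == 0  # lossless round-trip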

@thomasjpfan (Member)

Related issue: renatopp/liac-arff#75

What is preventing us from using scipy.io.arff.loadarff?
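For context, the scipy API in question looks like this (filename illustrative; one documented limitation is that scipy.io.arff.loadarff does not handle sparse ARFF files, which some OpenML datasets use):

from scipy.io.arff import loadarff

data, meta = loadarff("mnist_784.arff")  # structured array + ARFF metadata
print(meta.names()[:5])                  # attribute names from the header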

@jnothman (Member, Author) commented Mar 30, 2019 via email

@amueller (Member)

Our options are to download arff and convert locally once, or to download a binary format, right?
If we download a binary format, we need to be able to read it, though, which requires a dependency. I'm sure openml would be happy to add an additional endpoint for us to get parquet, but then we need to depend on a parquet loader.
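Hypothetically, if OpenML served parquet, reading it would look like this, with pyarrow as the new dependency (filename illustrative):

import pyarrow.parquet as pq

table = pq.read_table("mnist_784.parquet")  # hypothetical parquet export
X = table.to_pandas().to_numpy()            # decoding is done by the parquet reader, not by us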

@amueller (Member)

In a completely unscientific experiment, scipy took 27s to load the mnist ARFF and liac-arff took 17s.
In a rerun, scipy took 22s and liac-arff again 17s.
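A rough way to reproduce that comparison (filename illustrative; liac-arff is imported as arff):

import time
import arff                           # liac-arff
from scipy.io.arff import loadarff

start = time.perf_counter()
with open("mnist_784.arff") as f:
    liac_result = arff.load(f)
print("liac-arff:", time.perf_counter() - start)

start = time.perf_counter()
scipy_result = loadarff("mnist_784.arff")
print("scipy:", time.perf_counter() - start)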

@raghavrv's pyarff (https://github.com/pyarff/pyarff) takes 0.8 seconds.

Maybe that would be worth taking up for dense datasets?

@amueller (Member)

Before I delegate @thomasjpfan to another thing, he should probably finish up all his amazing half-finished PRs, so feel free if someone else wants to work on this in the meantime ;)
