fetch_openml: perhaps cache decoded ARFF #11821
I just ran into this:

```python
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784")
```

This takes 8 seconds on my machine.
In master, 8s?

Sorry, last try was 40s actually. Yes, master.
I get around 16 seconds on master, and 20 seconds on Regarding the arff mnist dataset, the biggest bottlenecks are the For mnist, the arff dataset uses 15.5 MB. If we were to use If we save the numpy array, it would be good to remove the data from the arff file and let it keep the metadata.
Thanks for checking. How is the numpy file bigger than the arff file? That makes no sense to me. How much faster is it to cache?
It's very possible that we can speed up liac-arff still, even though I tried hard to get it as fast as possible without moving to C. It might also be possible to exploit pandas.read_csv in the implementation.
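To illustrate the pandas.read_csv idea: the `@data` section of a *dense* ARFF file is plain CSV, so one can skip past the header and hand the rest of the stream to pandas. This is only a sketch (the toy ARFF content is made up); it ignores quoting, escapes, and the sparse row format that liac-arff handles.

```python
import io
import pandas as pd

# Toy ARFF file, for illustration only.
arff = """\
@relation toy
@attribute pixel1 numeric
@attribute pixel2 numeric
@attribute label {0,1}
@data
0,255,1
12,0,0
"""

buf = io.StringIO(arff)
names = []
for line in buf:
    token = line.strip().lower()
    if token.startswith("@attribute"):
        names.append(line.split()[1])   # collect column names
    elif token.startswith("@data"):
        break                           # the rest of the stream is CSV

# read_csv resumes from the current position in the file-like object.
df = pd.read_csv(buf, header=None, names=names)
print(df.shape)  # (2, 3)
```

This would only cover dense numeric/nominal data; anything needing the full ARFF grammar still requires a real parser.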
Comparing its performance to pandas.read_csv, there is certainly room for improvement if we use Cython/C/etc. The question is whether that additional effort would be justified. A mid-term solution would be to use some better file format, which could also solve other issues with arff.
Yes, I agree there's not a great deal of benefit to improving liac-arff if there is a more efficient but similarly expressive encoding. But this might require us to commit resources to openml to achieve change in the near future.
I suspect that the numpy file uses more space because it stores the data as float64, while the original ARFF file contains mostly 0s and a few uint8 values for the images.
Loading the numpy array takes ~2 seconds for the mnist dataset.
For saving sparse matrices, we can use scipy.sparse.save_npz.
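To make the size point concrete, here is a small sketch (the array is scaled down 100x from MNIST's 70000x784 to keep the demo light, and the file names are made up) of caching dense data as uint8 and sparse data via scipy.sparse.save_npz:

```python
import os
import tempfile
import numpy as np
import scipy.sparse as sp

tmp = tempfile.mkdtemp()

# Mostly-zero image-like data; the full (70000, 784) float64 array would
# occupy ~419 MiB in memory, but only ~52 MiB as uint8.
X = np.zeros((700, 784))
X[::10, ::10] = 255.0
print(X.nbytes, X.astype(np.uint8).nbytes)  # 8x difference

# Dense cache: store as uint8 instead of the default float64.
np.save(os.path.join(tmp, "X.npy"), X.astype(np.uint8))
X_dense = np.load(os.path.join(tmp, "X.npy"))

# Sparse cache via scipy.sparse.save_npz / load_npz.
X_sp = sp.csr_matrix(X_dense)
sp.save_npz(os.path.join(tmp, "X.npz"), X_sp)
X_back = sp.load_npz(os.path.join(tmp, "X.npz"))
print(X_back.shape)  # (700, 784)
```

Casting to uint8 before saving is what avoids the 8x blow-up relative to the text ARFF noted above.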
Related issue: renatopp/liac-arff#75. What is preventing us from using scipy.io.arff?
Yes, that issue is related, but the loading code in liac-arff has been rewritten to close the gap with scipy (I hope that's still true). The scipy.io implementation does not handle sparse data, and I think there may have been other limitations we cared about.
But the issue here is basically that loading from a text format is unnecessarily slow regardless of the loader.
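For reference, the scipy entry point discussed above can be exercised on a toy file like this (the ARFF content is made up for illustration; real timings only show up on large files like mnist):

```python
import io
from scipy.io import arff as scipy_arff

# Tiny dense numeric ARFF file; scipy.io.arff.loadarff accepts
# file-like objects as well as paths.
toy = io.StringIO(
    "@relation toy\n"
    "@attribute a numeric\n"
    "@attribute b numeric\n"
    "@data\n"
    "1,2\n"
    "3,4\n"
)

data, meta = scipy_arff.loadarff(toy)
print(len(data), meta.names())  # 2 rows, attribute names ['a', 'b']
```

`loadarff` returns a structured numpy array plus metadata, but as noted, it rejects sparse ARFF files, which several OpenML datasets use.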
Our options are to download arff and convert locally once, or to download a binary format, right?
In a completely unscientific experiment, scipy took 27s to load the mnist arff and liac-arff took 17s. There is also @raghavrv's pyarff: https://github.com/pyarff/pyarff. Maybe that would be worth taking up for dense datasets?
Before I delegate @thomasjpfan to another thing, he should probably finish up all his amazing half-finished PRs, so someone else may want to work on this in the meantime ;)
We currently store the results of each HTTP request performed by fetch_openml. It can still take a substantial amount of time to decode the fetched ARFF data into numpy arrays or sparse matrices. We could instead (or also) cache the decoded data on disk, so that the ARFF does not need to be parsed on repeated calls. Alternatively, we can hope that OpenML provides a more compact data representation some time soon, as per openml/OpenML#388.
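A minimal sketch of such a decoded-data cache (the function name `cached_decode` and the toy decoder are hypothetical, not scikit-learn API): the expensive ARFF parse runs once, and subsequent calls reload the numpy arrays directly.

```python
import os
import tempfile
import numpy as np

def cached_decode(name, decode_arff, cache_dir):
    """Return (X, y), running the expensive ARFF decode only on a cache miss."""
    path = os.path.join(cache_dir, f"{name}.npz")
    if os.path.exists(path):
        with np.load(path) as f:          # cache hit: skip parsing entirely
            return f["X"], f["y"]
    X, y = decode_arff()                   # the slow liac-arff pass
    np.savez(path, X=X, y=y)               # or np.savez_compressed
    return X, y

# Toy usage with a stand-in decoder that counts its invocations.
cache = tempfile.mkdtemp()
calls = []

def fake_decode():
    calls.append(1)
    return np.zeros((3, 2), dtype=np.uint8), np.array([0, 1, 0])

X1, y1 = cached_decode("toy", fake_decode, cache)
X2, y2 = cached_decode("toy", fake_decode, cache)
print(len(calls))  # 1: the second call hit the cache
```

A real implementation would also need cache invalidation (e.g. keyed on the dataset version) and sparse support via scipy.sparse.save_npz.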