fetch_openml: perhaps cache decoded ARFF #11821


Closed
jnothman opened this issue Aug 15, 2018 · 14 comments · Fixed by #21938
Labels: Enhancement, Moderate, module:datasets

Comments

@jnothman (Member)

We currently store the results of each HTTP request performed by fetch_openml. It can still take a substantial amount of time to decode the fetched ARFF data into numpy arrays or sparse matrices. We could instead (or also) cache the decoded data on disk, so that the ARFF does not need to be parsed on repeated calls.

Alternatively, we can hope that OpenML provides a more compact data representation some time soon, as per openml/OpenML#388.
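A minimal sketch of what such an on-disk cache could look like (the wrapper function, the file layout, and caching only the feature matrix are all hypothetical here, not existing scikit-learn behavior):

import os
import numpy as np
from sklearn.datasets import fetch_openml

def fetch_openml_cached(name, data_home="~/scikit_learn_data"):
    # Hypothetical helper: store the decoded feature matrix next to the
    # raw ARFF so repeated calls skip the slow ARFF parse entirely.
    path = os.path.join(os.path.expanduser(data_home), name + "_X.npy")
    if os.path.exists(path):
        return np.load(path)
    X = fetch_openml(name).data  # downloads (or reuses) and parses the ARFF
    np.save(path, X)
    return X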

@jnothman added the Enhancement, Moderate, and help wanted labels on Aug 15, 2018
@amueller (Member) commented Mar 7, 2019

I just ran into this.

from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784")  # slow even when the HTTP response is already cached on disk

takes 8 seconds on my machine.
I would suggest not waiting for OpenML.

@jnothman (Member, Author) commented Mar 7, 2019 via email

@amueller (Member) commented Mar 8, 2019

Sorry, the last try actually took 40s. Yes, on master.
Maybe the chunking made it slower? Not sure.

@thomasjpfan (Member)

I get around 16 seconds on master, and 20 seconds on 0.20.2.

Regarding the ARFF mnist dataset, the biggest bottlenecks are _arff._decode_values and _arff._parse_values, which together use about 80% of the time. Specifically, _arff._decode_values calls float 70k times and _arff._parse_values runs a regex 70k times.
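For reference, this kind of hotspot breakdown can be reproduced with the standard-library profiler (the "_arff" filter just restricts the report to liac-arff internals):

import cProfile
import pstats
from sklearn.datasets import fetch_openml

cProfile.run('fetch_openml("mnist_784")', "fetch.prof")  # profile the full fetch
pstats.Stats("fetch.prof").sort_stats("cumulative").print_stats("_arff")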

For mnist, the ARFF dataset uses 15.5 MB. If we were to cache the mnist dataset with np.savez_compressed, it would use 23 MB.
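For concreteness, the compressed-cache experiment looks roughly like this (the filename is illustrative, and only the pixel matrix is cached to keep the sketch simple):

import numpy as np
from sklearn.datasets import fetch_openml

X = fetch_openml("mnist_784").data
np.savez_compressed("mnist_784.npz", X=X)  # ~23 MB on disk per the numbers above

with np.load("mnist_784.npz") as cached:   # reloading is much faster than re-parsing
    X = cached["X"]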

If we save the numpy array, it would be good to remove the data from the ARFF file and keep only its metadata.

@amueller (Member) commented Mar 8, 2019

Thanks for checking. How is the numpy file bigger than the arff file? That makes no sense to me.

How much faster is it to cache?
We could also use joblib memory for caching, which I think is what I originally implemented.
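For reference, the joblib route is a one-liner around the existing function (the cache directory is arbitrary, and this pickles the whole returned Bunch rather than just the arrays):

from joblib import Memory
from sklearn.datasets import fetch_openml

memory = Memory("./openml_cache", verbose=0)
cached_fetch_openml = memory.cache(fetch_openml)

mnist = cached_fetch_openml("mnist_784")  # first call parses the ARFF; later calls read from disk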

@jnothman (Member, Author) commented Mar 9, 2019 via email

@rth (Member) commented Mar 9, 2019

It's very possible that we can still speed up liac-arff, even though I tried hard to get it as fast as possible without moving to C.

Comparing its performance to pandas.read_csv, there is certainly room for improvement if we use Cython/C/etc. The question is whether that additional effort would be justified. A mid-term solution would be to use some better file format, which could also solve other issues with ARFF.

@jnothman (Member, Author) commented Mar 10, 2019 via email

@thomasjpfan (Member)

@amueller:

> Thanks for checking. How is the numpy file bigger than the arff file? That makes no sense to me.

I suspect that the numpy file uses more space because it stores the data as float64. The original ARFF file contains mostly 0s and a few small uint8-range values to store the images.
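The dtype effect is easy to quantify (plain arithmetic; mnist_784 is 70k rows of 784 pixel values):

n_values = 70_000 * 784      # rows x columns for mnist_784
print(n_values * 8 / 1e6)    # float64: ~439 MB of raw values to compress
print(n_values * 1 / 1e6)    # uint8:   ~55 MB, and mostly zeros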

> How much faster is it to cache?

The first (caching) call will be slower, because of the following:

  1. Download arff file and save arff to disk.
  2. Read ARFF file and check if we support caching it as a numpy array or a scipy sparse matrix.
  3. Write numpy array into cache. (For mnist this takes ~8 seconds)
  4. Maybe remove the data from the arff file, leaving the metadata.

Loading the numpy array takes ~2 seconds for the mnist dataset.

> We could also use joblib memory for caching, which I think is what I originally implemented.

joblib is not currently used.

For saving sparse matrices, we can use scipy.sparse.save_npz.
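scipy.sparse.save_npz round-trips like this (the random matrix is just a stand-in for a decoded sparse dataset):

import scipy.sparse as sp

X = sp.random(1000, 500, density=0.01, format="csr")  # stand-in sparse data
sp.save_npz("X_sparse.npz", X)
X_back = sp.load_npz("X_sparse.npz")
assert (X != X_back).nnz == 0  # lossless round-trip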

@thomasjpfan (Member)

Related issue: renatopp/liac-arff#75

What is preventing us from using scipy.io.arff.loadarff?
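For context, the scipy API in question looks like this (filename illustrative; one documented limitation is that scipy.io.arff.loadarff does not handle sparse ARFF files, which some OpenML datasets use):

from scipy.io.arff import loadarff

data, meta = loadarff("mnist_784.arff")  # structured array + ARFF metadata
print(meta.names()[:5])                  # attribute names from the header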

@jnothman (Member, Author) commented Mar 30, 2019 via email

@amueller (Member)

Our options are to download arff and convert locally once, or to download a binary format, right?
If we download a binary format, we need to be able to read it, though, which requires a dependency. I'm sure openml would be happy to add an additional endpoint for us to get parquet, but then we need to depend on a parquet loader.
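Hypothetically, if OpenML served parquet, reading it would look like this, with pyarrow as the new dependency (filename illustrative):

import pyarrow.parquet as pq

table = pq.read_table("mnist_784.parquet")  # hypothetical parquet export
X = table.to_pandas().to_numpy()            # decoding is done by the parquet reader, not by us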

@amueller (Member)

In a completely unscientific experiment, scipy took 27s to load the mnist ARFF and liac-arff took 17s.
In a rerun, scipy took 22s and liac-arff again 17s.
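A rough way to reproduce that comparison (filename illustrative; liac-arff is imported as arff):

import time
import arff                           # liac-arff
from scipy.io.arff import loadarff

start = time.perf_counter()
with open("mnist_784.arff") as f:
    liac_result = arff.load(f)
print("liac-arff:", time.perf_counter() - start)

start = time.perf_counter()
scipy_result = loadarff("mnist_784.arff")
print("scipy:", time.perf_counter() - start)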

@raghavrv's pyarff (https://github.com/pyarff/pyarff) takes 0.8 seconds.

Maybe that would be worth taking up for dense datasets?

@amueller (Member)

Before I delegate @thomasjpfan to another thing, he should probably finish up all his amazing half-finished PRs, so feel free if someone else wants to work on this in the meantime ;)
