Faster fetch_openml cache #14855

rth · 2019-08-30T19:55:09Z

When designing fetch_openml in #11419 we decided to go with fine grained cache of individual HTTP requests. In retrospective that was not the right decision.

The issue is that HTTP fetching is not the only bottleneck. The (pure Python) ARFF parser we use is quite slow. A side note: there could be potentially a gain of 3x or so by using to compiled parser (here in Rust) but that would require a lot of work (on build and to reach feature parity). We are also doing a significant amount of post-processing to convert the parsed arff to something usable. ARFF is a problem overall.

A better solution is to cache the output of fetch_openml directly with joblib.Memory. This PR implements that.

This also has the advantage of delegating most of the caching logic to joblib.

Performance

Loading MNIST from cache with

fetch_openml('mnist_784', version=1)

takes 19s on master and 0.3s in this PR (on SSD).

Loading it the first time with an empty cache will probably also be marginally faster (due to pickling ndarray being more efficient as compared to storing gzipped HTTP responses). My internet is not good enough to try at the moment.

Initial motivation was to make examples in #14300 faster,

TODO

possibly add tests to ensure coverage: we have somewhat less control on where files are stored in the cache, so a bit more tricky to test.
see if the re-trying mechanism on corrupted/outdated pickles can be done in joblib.

rth · 2019-08-31T11:28:57Z

This also makes docs in CircleCI build faster : 26-33min on master and 20-25min in this PR for a full build.

One limitation is that because we cache a large function, any change there (even cosmetic) will trigger the cache to become invalid. I'm not sure if joblib would automatically clean up files from previous version. This will in particular affect developers, but I think paying for performance with some disk space is acceptable.

jnothman

Should we be caching the data request too, or some other portion of the work, to allow the user to tweak the parameters of their request without being penalised? It looks like you've considered that case for return_X_y.

rth · 2019-09-02T11:13:12Z

Should we be caching the data request too, or some other portion of the work, to allow the user to tweak the parameters of their request without being penalised? It looks like you've considered that case for return_X_y.

The ideal case would have been to cache the frame and select the target_column afterwards (or the X, y in the sparse case). The issue is that the way code is written now it is difficult, or at least I haven't managed to do it without a major rewrite).

Instead I re-added a caching of all HTTP requests (with joblib), including the data request, so that should mitigate a bit this problem. Data request will be cached only with joblib >=0.12, as sometimes the data request returned by the arff reader is a generator that cannot be pickled without cloudpickle in newer joblib versions.

rth · 2019-09-02T13:06:20Z

(The failure in test_fastica_simple is unrelated to this PR).

jnothman · 2019-09-01T22:57:47Z

sklearn/datasets/openml.py

+        if data_home is not None:
+            # Use dataset specific cache location, so it can be separately
+            # cleaned
+            self.memory = Memory(join(data_home, 'cache', str(data_id)),


'cache' is a lot less specific than the previous openml.org... Consider the user trying to manually fix cache or disk space issues... better name?

data_home already includes the openml string in the path (see openml.py:L618 on master). Previously it was,

~/scikit_learn_data/openml/openml.org/

which is a bit redundant (openml.org was the only folder there). Now it's

~/scikit_learn_data/openml/cache/

I can remove the cache as well, but then it'll be a bit more difficult to distinguish between the legacy and and these new files.

rth · 2019-09-16T13:44:02Z

see if the re-trying mechanism on corrupted/outdated pickles can be done in joblib.

Actually looking at the code, it sound like joblib.Memory already will call the cached function directly if it fails to load the pickled result from disk
https://github.com/joblib/joblib/blob/470a6372dbae826b2f6107a77f6457e956249fef/joblib/memory.py#L531

so we could remove the corresponding logic from this PR.

@lesteve do you confirm?

lesteve · 2019-09-16T14:27:59Z

Here is what I know (hopefully that can help):

About corrupted pickle, if joblib.Memory fails to load the pickle it will recompute and overwrite the old cache with the pickle of the computation

About "outdated": it depends what you mean by this.

if the function you cache has some arguments that change (when say you get a different version of the dataset on OpenML) then the result will be recomputed and saved. The old cache will still be there but in a different folder. The hash key is based on fully qualified function name (i.e. including module name) + joblib.hash of the input arguments.
there are some simple mechanism to detect if the function code has changed. This is mostly based on the function code source plus maybe a few additional tweaks like which line the function is defined. Sometimes, this simple mechanism is not enough: for example the function code does not change but one of the function you call inside the cached function changes. In this case, the recommended approach is to delete the cache manually ...

If some of what I am saying does not make sense, let me know and I can try to clarify. Another option is to play with a few examples:

from joblib import Memory

memory = Memory(location='/tmp/joblib')

def func(x, y):
    print(f'computing: x={x} y={y}')
    return x + y

cached_func = memory.cache(func)

cached_func(1, 2)
cached_func(1, 2)
cached_func(1, 3)
cached_func(1, 4)

# look at the content of memory.location:
ls -ltr /tmp/joblib/joblib/*/func

I get this: note that you have three hash subdirectories (three different arguments were used) above, the func_code.py contains the function source code.

total 16
drwxr-xr-x 2 lesteve sed 4096 Sep 16 16:25 84e2b58a1d6ad3657bedd42be32837f3
drwxr-xr-x 2 lesteve sed 4096 Sep 16 16:25 17444b56c7c0c792594ca5b8ae59b010
drwxr-xr-x 2 lesteve sed 4096 Sep 16 16:25 64a24e8d2ebfa0084a5e2f867b71ac83
-rw-r--r-- 1 lesteve sed   86 Sep 16 16:25 func_code.py

rth · 2019-09-16T14:39:52Z

Thanks for the explanations @lesteve !

rth · 2019-09-16T15:23:14Z

Further simplified this PR given that joblib.Memory handles re-trying on failed unpickling.

Sometimes, this simple mechanism is not enough: for example the function code does not change but one of the function you call inside the cached function changes. In this case, the recommended approach is to delete the cache manually ...

Edits that change the output of called functions, without changing their length (and therefore the line at which the cached function is defined) should be relatively infrequent. Still I added a warning about it on the top of the openml.py file.

Some care would need to be taken for future changes to this file, but I think not implementing the caching mechanism ourselves is still worth it.

rth · 2019-09-16T15:25:09Z

cc @thomasjpfan in case you are able to have a look )

jnothman

So this is now caching the semi-processed response using joblib in all instances, rather than caching the gzipped HTTP response directly? That's okay, but is the joblib cache of such matter zipped? This also leaves ghosts on the user's computer when upgrading scikit-learn. not sure there's much we can do about that if the user may potentially use multiple versions.

lesteve · 2019-09-17T02:44:18Z

That's okay, but is the joblib cache of such matter zipped?

The saved pickle can be compressed if you want.

You can use the compress argument of joblib.Memory for this:

# default compression (zlib, compression level 3 IIRC)
memory = Memory('/tmp/joblib', compress=True)
# you can set the compression library used  and the compression level if you want
memory = Memory('/tmp/joblib', compress=('lzma', 6))

There is an example for joblib.dump/joblib.load for different compressions libraries here

jnothman · 2019-09-17T03:55:38Z

Let's do that. Do we need to work towards warning about skeletons in the cache.

thomasjpfan · 2019-10-12T17:34:54Z

I am concerned with how this PR places restricts on the openml.py source file. Everytime we update this file we would need to consider: "This will invalidate the cache".

It would be nice to have a way to somehow warn users about the cache. Some ideas:

When cache gets too big, tell users to clean the cache.
Automatically clear the oldest files when cache gets too big.

"Too big" can be defined as a keyword in fetch_openml.

Overall, I am concerned with having to multiple copies of the same data, because we happened to change something in this file.

Another option would be to keep our current (inelegant) http caching, but when we get to the point where we store or load the data, we use joblib.dump/load?

lesteve · 2019-10-21T08:59:47Z

Overall, I am concerned with having to multiple copies of the same data, because we happened to change something in this file.

Can you elaborate on this? If the cache gets invalidated (for example because openml.py changed), the old cache gets replaced by the new one so there are no "multiple copies of the same data". Maybe you mean copies with different versions of the OpenML dataset?

I'd be in favour of trying to use joblib for this although I don't have the bandwidth to work on this unfortunately.

Side-comment: bytes_limit parameter + Memory.reduce_size can be used to explicitly clean the cache.

There were some talks about cleaning the joblib cache in the background that never materialized inside joblib (I don't remember the details fully but you can launch a sub-process that has a life on its own and can live longer than the python process that created it), maybe this is an opportunity to try again?

Of course there will be caveats like multiple cleaning process running at the same time and possibly stepping on each other toes. Note that measuring the size of the cache can take a while (I would need to see if I had this in my notes but in some use cases, the size of the cache can easily reach hundreds of GB of a shared filesystem) so a simple du -sh takes a while.

thomasjpfan · 2019-10-21T19:32:41Z

Can you elaborate on this? If the cache gets invalidated (for example because openml.py changed), the old cache gets replaced by the new one so there are no "multiple copies of the same data".

Ah I misunderstood the behavior of joblib when it comes to caching. As long as the old cache gets replaced then I am more in favor of this PR's solution.

jnothman

This looks good and it would be worthwhile to speed loading large ARFF, as long as openml continues to offer no alternative format.

jnothman · 2020-12-29T13:49:57Z

sklearn/datasets/openml.py

+        # in some cases liac-arff returns a generator with data which
+        # cannot be pickled with joblib < 0.12
+        arff = _download_data_arff(**kwargs)


Is it worth caching this at all? Is it sufficient to cache the full operation without worrying about this – essentially – HTTP caching?

FlorinAndrei · 2021-06-12T22:26:26Z

Please fix this bug. Loading large datasets is very slow, which defeats the purpose of caching. I have to always use this workaround:

from joblib import Memory
memory = Memory('./tmp')
fetch_openml_cached = memory.cache(fetch_openml)

And then loading MNIST is much faster, perhaps over 10x faster:

x, y = fetch_openml_cached('mnist_784', version = 1, cache = True, return_X_y = True, as_frame = False)

rth added 11 commits August 30, 2019 15:54

WIP

5537d99

Refactor fetch_openml to use global caching

2425a2b

Remove HTTP request based caching

ae5aea8

More simplifications

3b25f2e

More code cleanup

93b3d71

More refactoring

c493144

More simplification

529f0e8

Smaller diff

4a12e62

Joblib 0.11 compatibilty

02f0fe6

[doc build] fill CircleCI datasets cache

7822146

[doc build] rebuild examples with cache

c4ea2e4

jnothman reviewed Sep 1, 2019

View reviewed changes

rth added 3 commits September 1, 2019 23:30

Factorize more code out of the inner cached function

ae8e8b8

Also cache ARFF download

c6d5873

Fix joblib 0.11 compatibility

bf8e3fe

Revert unnecessary changes

60b1a63

Test HTTPError

c676fae

jnothman reviewed Sep 2, 2019

View reviewed changes

rth mentioned this pull request Sep 8, 2019

OpenML: avoid warning when all versions are inactive #14829

Open

rth mentioned this pull request Sep 16, 2019

Minimal Generalized linear models implementation (L2 + lbfgs) #14300

Merged

7 tasks

rth added 3 commits September 16, 2019 16:41

Merge remote-tracking branch 'upstream/master' into openml-cache2

76e2164

Delegate retrying on failure to unpickle to joblib

e9744ed

Add disclamer about changes to openml.py

9310280

thomasjpfan self-assigned this Sep 16, 2019

jnothman reviewed Sep 17, 2019

View reviewed changes

github-actions bot added the module:datasets label Mar 2, 2020

rth mentioned this pull request Nov 7, 2020

fetch_openml does not cache the data #18783

Closed

jnothman reviewed Dec 29, 2020

View reviewed changes

Base automatically changed from master to main January 22, 2021 10:51

rth mentioned this pull request Mar 30, 2021

fetch_openml with mnist_784 uses excessive memory #19774

Closed

steinfurt mentioned this pull request Dec 3, 2021

[MRG] Accelerate example plot_sparse_logistic_regression_mnist.py #21862

Closed

glemaitre mentioned this pull request Dec 13, 2021

ENH improve ARFF parser using pandas #21938

Merged

thomasjpfan mentioned this pull request Jan 25, 2022

Wrong 'cache' argument description in sklearn.datasets.fetch_openml doc #22301

Closed

glemaitre closed this in #21938 May 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster fetch_openml cache #14855

Faster fetch_openml cache #14855

rth commented Aug 30, 2019 •

edited

Loading

rth commented Aug 31, 2019

jnothman left a comment

rth commented Sep 2, 2019

rth commented Sep 2, 2019

jnothman Sep 1, 2019

rth Sep 2, 2019

rth commented Sep 16, 2019

lesteve commented Sep 16, 2019

rth commented Sep 16, 2019

rth commented Sep 16, 2019

rth commented Sep 16, 2019

jnothman left a comment

lesteve commented Sep 17, 2019 •

edited

Loading

jnothman commented Sep 17, 2019 via email

thomasjpfan commented Oct 12, 2019

lesteve commented Oct 21, 2019 •

edited

Loading

thomasjpfan commented Oct 21, 2019

jnothman left a comment

jnothman Dec 29, 2020

FlorinAndrei commented Jun 12, 2021

Faster fetch_openml cache #14855

Faster fetch_openml cache #14855

Conversation

rth commented Aug 30, 2019 • edited Loading

rth commented Aug 31, 2019

jnothman left a comment

Choose a reason for hiding this comment

rth commented Sep 2, 2019

rth commented Sep 2, 2019

jnothman Sep 1, 2019

Choose a reason for hiding this comment

rth Sep 2, 2019

Choose a reason for hiding this comment

rth commented Sep 16, 2019

lesteve commented Sep 16, 2019

rth commented Sep 16, 2019

rth commented Sep 16, 2019

rth commented Sep 16, 2019

jnothman left a comment

Choose a reason for hiding this comment

lesteve commented Sep 17, 2019 • edited Loading

jnothman commented Sep 17, 2019 via email

thomasjpfan commented Oct 12, 2019

lesteve commented Oct 21, 2019 • edited Loading

thomasjpfan commented Oct 21, 2019

jnothman left a comment

Choose a reason for hiding this comment

jnothman Dec 29, 2020

Choose a reason for hiding this comment

FlorinAndrei commented Jun 12, 2021

rth commented Aug 30, 2019 •

edited

Loading

lesteve commented Sep 17, 2019 •

edited

Loading

lesteve commented Oct 21, 2019 •

edited

Loading