Faster fetch_openml cache #14855

Closed

wants to merge 19 commits into from

Conversation

rth
Member

@rth rth commented Aug 30, 2019

When designing fetch_openml in #11419 we decided to go with a fine-grained cache of individual HTTP requests. In retrospect, that was not the right decision.

The issue is that HTTP fetching is not the only bottleneck. The (pure Python) ARFF parser we use is quite slow. A side note: there could potentially be a gain of 3x or so by using a compiled parser (here in Rust), but that would require a lot of work (on the build side and to reach feature parity). We are also doing a significant amount of post-processing to convert the parsed ARFF into something usable. ARFF is a problem overall.

A better solution is to cache the output of fetch_openml directly with joblib.Memory. This PR implements that.

This also has the advantage of delegating most of the caching logic to joblib.
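In outline, the idea is something like the following sketch (this is an external illustration, not the PR's internal code; the cache directory and the wrapping of the public function are only illustrative):

from os.path import expanduser, join
from joblib import Memory
from sklearn.datasets import fetch_openml

# Illustrative cache location; the PR keys the cache under data_home.
cache_dir = join(expanduser('~'), 'scikit_learn_data', 'openml', 'cache')
memory = Memory(cache_dir, verbose=0)

# Cache the whole download + ARFF parsing + post-processing as one unit,
# so repeated calls only unpickle the final result from disk.
cached_fetch_openml = memory.cache(fetch_openml)

bunch = cached_fetch_openml('mnist_784', version=1)  # slow on the first call
bunch = cached_fetch_openml('mnist_784', version=1)  # fast: read from the joblib cache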

Performance

Loading MNIST from cache with

fetch_openml('mnist_784', version=1)

takes 19s on master and 0.3s in this PR (on SSD).
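For reference, the timings above can be reproduced with something along these lines (second run, i.e. warm cache):

from time import perf_counter
from sklearn.datasets import fetch_openml

start = perf_counter()
fetch_openml('mnist_784', version=1)
print(f"fetch_openml took {perf_counter() - start:.1f}s")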

Loading it the first time with an empty cache will probably also be marginally faster (pickling an ndarray is more efficient than storing gzipped HTTP responses). My internet connection is not good enough to verify this at the moment.

The initial motivation was to make the examples in #14300 faster.

TODO

  • possibly add tests to ensure coverage: we have somewhat less control over where files are stored in the cache, so it is a bit trickier to test.
  • see if the re-trying mechanism on corrupted/outdated pickles can be done in joblib.

@rth
Member Author

rth commented Aug 31, 2019

This also makes the docs build on CircleCI faster: 26-33 min on master vs 20-25 min in this PR for a full build.

One limitation is that, because we cache a large function, any change to it (even a cosmetic one) will invalidate the cache. I'm not sure whether joblib automatically cleans up files from previous versions. This will particularly affect developers, but I think paying for performance with some disk space is acceptable.

Member

@jnothman jnothman left a comment

Should we be caching the data request too, or some other portion of the work, to allow the user to tweak the parameters of their request without being penalised? It looks like you've considered that case for return_X_y.

@rth
Member Author

rth commented Sep 2, 2019

Should we be caching the data request too, or some other portion of the work, to allow the user to tweak the parameters of their request without being penalised? It looks like you've considered that case for return_X_y.

The ideal case would have been to cache the frame and select the target_column afterwards (or X, y in the sparse case). The issue is that the way the code is written now makes this difficult; at least I haven't managed to do it without a major rewrite.

Instead I re-added caching of all HTTP requests (with joblib), including the data request, which should mitigate this problem a bit. The data request will be cached only with joblib >= 0.12, as the data returned by the ARFF reader is sometimes a generator that cannot be pickled without the cloudpickle support available in newer joblib versions.
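Roughly, the request-level caching looks like this (a sketch with a hypothetical helper name, not the actual code in this PR):

from os.path import expanduser, join
from urllib.request import urlopen

from joblib import Memory

memory = Memory(join(expanduser('~'), 'scikit_learn_data', 'openml', 'cache'),
                verbose=0)

@memory.cache
def _cached_openml_get(url):
    # Each distinct URL gets its own cache entry, so changing e.g.
    # target_column only re-runs the post-processing, not the download.
    with urlopen(url) as response:
        return response.read()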

@rth
Member Author

rth commented Sep 2, 2019

(The failure in test_fastica_simple is unrelated to this PR).

if data_home is not None:
    # Use dataset specific cache location, so it can be separately
    # cleaned
    self.memory = Memory(join(data_home, 'cache', str(data_id)),
Member

'cache' is a lot less specific than the previous openml.org... Consider the user trying to manually fix cache or disk space issues... better name?

Member Author

data_home already includes the openml string in the path (see openml.py:L618 on master). Previously it was,

~/scikit_learn_data/openml/openml.org/

which is a bit redundant (openml.org was the only folder there). Now it's

~/scikit_learn_data/openml/cache/

I can remove the cache part as well, but then it'll be a bit more difficult to distinguish between the legacy files and these new ones.

@rth
Member Author

rth commented Sep 16, 2019

see if the re-trying mechanism on corrupted/outdated pickles can be done in joblib.

Actually, looking at the code, it sounds like joblib.Memory will already call the cached function directly if it fails to load the pickled result from disk:
https://github.com/joblib/joblib/blob/470a6372dbae826b2f6107a77f6457e956249fef/joblib/memory.py#L531

so we could remove the corresponding logic from this PR.

@lesteve do you confirm?
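For reference, a quick way to check this behaviour locally; the output.pkl file name below is joblib's internal layout, so treat this as a throwaway experiment rather than anything this PR relies on:

import glob
from joblib import Memory

memory = Memory('/tmp/joblib_retry_demo', verbose=0)

@memory.cache
def add(x, y):
    print('computing')
    return x + y

add(1, 2)  # computes and caches the result
# Deliberately corrupt every cached result on disk.
for path in glob.glob('/tmp/joblib_retry_demo/**/output.pkl', recursive=True):
    with open(path, 'wb') as fh:
        fh.write(b'not a pickle')
add(1, 2)  # unpickling fails, so joblib recomputes (prints 'computing' again)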

@lesteve
Member

lesteve commented Sep 16, 2019

Here is what I know (hopefully that can help):

About corrupted pickles: if joblib.Memory fails to load the pickle, it will recompute and overwrite the old cache entry with the pickle of the new computation.

About "outdated": it depends what you mean by this.

  • if the function you cache has some arguments that change (say you get a different version of the dataset on OpenML), then the result will be recomputed and saved. The old cache will still be there, but in a different folder. The hash key is based on the fully qualified function name (i.e. including the module name) + joblib.hash of the input arguments.
  • there is a simple mechanism to detect whether the function code has changed. It is mostly based on the function source code, plus a few additional tweaks such as the line at which the function is defined. Sometimes this simple mechanism is not enough: for example, the function code does not change but one of the functions called inside the cached function changes. In that case, the recommended approach is to delete the cache manually ...

If some of what I am saying does not make sense, let me know and I can try to clarify. Another option is to play with a few examples:

from joblib import Memory

memory = Memory(location='/tmp/joblib')

def func(x, y):
    print(f'computing: x={x} y={y}')
    return x + y

cached_func = memory.cache(func)

cached_func(1, 2)
cached_func(1, 2)
cached_func(1, 3)
cached_func(1, 4)

Then look at the content of memory.location (in a shell):

ls -ltr /tmp/joblib/joblib/*/func

I get this. Note that there are three hash subdirectories above (three different sets of arguments were used), and func_code.py contains the function source code.

total 16
drwxr-xr-x 2 lesteve sed 4096 Sep 16 16:25 84e2b58a1d6ad3657bedd42be32837f3
drwxr-xr-x 2 lesteve sed 4096 Sep 16 16:25 17444b56c7c0c792594ca5b8ae59b010
drwxr-xr-x 2 lesteve sed 4096 Sep 16 16:25 64a24e8d2ebfa0084a5e2f867b71ac83
-rw-r--r-- 1 lesteve sed   86 Sep 16 16:25 func_code.py

@rth
Member Author

rth commented Sep 16, 2019

Thanks for the explanations @lesteve !

@rth
Member Author

rth commented Sep 16, 2019

Further simplified this PR given that joblib.Memory handles re-trying on failed unpickling.

Sometimes this simple mechanism is not enough: for example, the function code does not change but one of the functions called inside the cached function changes. In that case, the recommended approach is to delete the cache manually ...

Edits that change the output of called functions without changing their length (and therefore the line at which the cached function is defined) should be relatively infrequent. Still, I added a warning about this at the top of the openml.py file.

Some care will need to be taken with future changes to this file, but I think not implementing the caching mechanism ourselves is still worth it.

@rth
Member Author

rth commented Sep 16, 2019

cc @thomasjpfan in case you are able to have a look.

@thomasjpfan thomasjpfan self-assigned this Sep 16, 2019
Member

@jnothman jnothman left a comment

So this is now caching the semi-processed response using joblib in all instances, rather than caching the gzipped HTTP response directly? That's okay, but is the joblib cache of such matter zipped? This also leaves ghosts on the user's computer when upgrading scikit-learn. Not sure there's much we can do about that if the user may potentially use multiple versions.

@lesteve
Member

lesteve commented Sep 17, 2019

That's okay, but is the joblib cache of such matter zipped?

The saved pickle can be compressed if you want.

You can use the compress argument of joblib.Memory for this:

# default compression (zlib, compression level 3 IIRC)
memory = Memory('/tmp/joblib', compress=True)
# you can set the compression library used and the compression level if you want
memory = Memory('/tmp/joblib', compress=('lzma', 6))

There is an example for joblib.dump/joblib.load with different compression libraries here
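For completeness, a small example using joblib.dump/joblib.load directly (the file names are arbitrary):

import numpy as np
from joblib import dump, load

X = np.random.rand(1000, 100)
dump(X, '/tmp/X.joblib.lzma', compress=('lzma', 6))  # write LZMA-compressed
X_back = load('/tmp/X.joblib.lzma')                  # decompressed transparently on load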

@jnothman
Member

jnothman commented Sep 17, 2019 via email

@thomasjpfan
Member

I am concerned with how this PR places restrictions on the openml.py source file. Every time we update this file we would need to consider: "This will invalidate the cache".

It would be nice to have a way to somehow warn users about the cache. Some ideas:

  1. When cache gets too big, tell users to clean the cache.
  2. Automatically clear the oldest files when cache gets too big.

"Too big" can be defined as a keyword in fetch_openml.

Overall, I am concerned about having multiple copies of the same data because we happened to change something in this file.

Another option would be to keep our current (inelegant) HTTP caching, but when we get to the point where we store or load the data, use joblib.dump/load?
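A rough sketch of that alternative (the helper name and cache path are hypothetical; the real code would hook into the existing cache layout):

import os
from joblib import dump, load

def _load_or_store(cache_path, fetch_and_parse):
    """Return the processed dataset, reading it from disk when available."""
    if os.path.exists(cache_path):
        return load(cache_path)
    data = fetch_and_parse()              # HTTP download + ARFF parsing
    dump(data, cache_path, compress=3)    # persist the processed result
    return data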

@lesteve
Member

lesteve commented Oct 21, 2019

Overall, I am concerned about having multiple copies of the same data because we happened to change something in this file.

Can you elaborate on this? If the cache gets invalidated (for example because openml.py changed), the old cache gets replaced by the new one so there are no "multiple copies of the same data". Maybe you mean copies with different versions of the OpenML dataset?

I'd be in favour of trying to use joblib for this although I don't have the bandwidth to work on this unfortunately.

Side comment: the bytes_limit parameter + Memory.reduce_size can be used to explicitly clean the cache.
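Roughly (the exact signature depends on the joblib version, so treat this as a sketch):

from joblib import Memory

# Cap the cache at about 1 GB; reduce_size() evicts the least recently
# used entries until the cache fits under bytes_limit again.
memory = Memory('/tmp/joblib', bytes_limit='1G', verbose=0)
# ... call memory.cache()-wrapped functions ...
memory.reduce_size()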

There was some talk about cleaning the joblib cache in the background that never materialized inside joblib (I don't remember the details fully, but you can launch a sub-process that has a life of its own and can live longer than the Python process that created it); maybe this is an opportunity to try again?

Of course there will be caveats, like multiple cleaning processes running at the same time and possibly stepping on each other's toes. Note that measuring the size of the cache can take a while (I would need to check my notes, but in some use cases the cache can easily reach hundreds of GB on a shared filesystem), so even a simple du -sh takes a while.

@thomasjpfan
Member

Can you elaborate on this? If the cache gets invalidated (for example because openml.py changed), the old cache gets replaced by the new one so there are no "multiple copies of the same data".

Ah I misunderstood the behavior of joblib when it comes to caching. As long as the old cache gets replaced then I am more in favor of this PR's solution.

Member

@jnothman jnothman left a comment

This looks good, and it would be worthwhile to speed up loading of large ARFF files, as long as OpenML continues to offer no alternative format.

Comment on lines +710 to +712
# in some cases liac-arff returns a generator with data which
# cannot be pickled with joblib < 0.12
arff = _download_data_arff(**kwargs)
Member

Is it worth caching this at all? Is it sufficient to cache the full operation without worrying about this – essentially – HTTP caching?

Base automatically changed from master to main January 22, 2021 10:51
@FlorinAndrei

Please fix this bug. Loading large datasets is very slow, which defeats the purpose of caching. I always have to use this workaround:

from joblib import Memory
from sklearn.datasets import fetch_openml

memory = Memory('./tmp')
fetch_openml_cached = memory.cache(fetch_openml)

And then loading MNIST is much faster, perhaps over 10x faster:

x, y = fetch_openml_cached('mnist_784', version=1, cache=True, return_X_y=True, as_frame=False)
