Faster fetch_openml cache #14855
Conversation
This also makes the docs build on CircleCI faster: 26-33 min on master vs 20-25 min in this PR for a full build. One limitation is that because we cache a large function, any change to it (even cosmetic) will invalidate the cache. I'm not sure whether joblib automatically cleans up files from previous versions. This will in particular affect developers, but I think paying for performance with some disk space is acceptable.
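As far as I understand joblib's behaviour (worth double-checking against the joblib docs), a change to a cached function's source is detected at call time, and the old entries for that function are dropped and recomputed. A minimal sketch, not part of this PR:

```python
from joblib import Memory

memory = Memory('/tmp/joblib_demo', verbose=0)

def load(x):
    print('recomputing')
    return x

cached = memory.cache(load)
cached(1)  # computed and stored on disk

def load(x):  # same name, cosmetically different body
    print('recomputing')
    return x * 1

cached = memory.cache(load)
cached(1)  # joblib detects the source change, so this recomputes
```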
Should we be caching the data request too, or some other portion of the work, to allow the user to tweak the parameters of their request without being penalised? It looks like you've considered that case for return_X_y.
The ideal case would have been to cache the … Instead I re-added caching of all HTTP requests (with joblib), including the data request, which should mitigate this problem a bit. The data request will only be cached with joblib >= 0.12, as the data returned by the arff reader is sometimes a generator that cannot be pickled without the cloudpickle support in newer joblib versions.
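A rough sketch of that idea with purely illustrative names (this is not the PR's actual code): wrap the data request in joblib.Memory only when the installed joblib is at least 0.12.

```python
from distutils.version import LooseVersion

import joblib
from joblib import Memory

memory = Memory('/tmp/openml_cache', verbose=0)

def _download_data(url):
    # stand-in for the real HTTP request + ARFF parsing, which may
    # return a generator over the rows
    return [(i, i ** 2) for i in range(5)]

if LooseVersion(joblib.__version__) >= LooseVersion('0.12'):
    # per the comment above, joblib >= 0.12 can pickle the result of
    # the data request, so it is safe to cache it
    _download_data = memory.cache(_download_data)

rows = _download_data('https://www.openml.org/data/v1/download/example')
```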
(The failure in …)
sklearn/datasets/openml.py (Outdated)

if data_home is not None:
    # Use dataset specific cache location, so it can be separately
    # cleaned
    self.memory = Memory(join(data_home, 'cache', str(data_id)),
'cache' is a lot less specific than the previous openml.org... Consider the user trying to manually fix cache or disk space issues... better name?
data_home already includes the openml string in the path (see openml.py:L618 on master). Previously it was ~/scikit_learn_data/openml/openml.org/, which is a bit redundant (openml.org was the only folder there). Now it's ~/scikit_learn_data/openml/cache/. I can remove the cache part as well, but then it would be a bit more difficult to distinguish between the legacy files and these new ones.
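For the "user trying to manually fix cache or disk space issues" case, cleanup would amount to something like the sketch below (paths as quoted above; this snippet is illustrative and not an API added by this PR):

```python
import shutil
from os.path import expanduser, join

# the new joblib-backed cache location discussed above
cache_dir = join(expanduser('~'), 'scikit_learn_data', 'openml', 'cache')

# remove it entirely; fetch_openml will simply re-download and re-cache
shutil.rmtree(cache_dir, ignore_errors=True)
```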
Actually, looking at the code, it sounds like …, so we could remove the corresponding logic from this PR. @lesteve can you confirm?
Here is what I know (hopefully that can help):
- About corrupted pickles: if joblib.Memory fails to load the pickle, it will recompute and overwrite the old cache with the pickle of the new computation.
- About "outdated": it depends what you mean by this.
If some of what I am saying does not make sense, let me know and I can try to clarify. Another option is to play with a few examples:

from joblib import Memory

memory = Memory(location='/tmp/joblib')

def func(x, y):
    print(f'computing: x={x} y={y}')
    return x + y

cached_func = memory.cache(func)

cached_func(1, 2)
cached_func(1, 2)
cached_func(1, 3)
cached_func(1, 4)

I get this: … Note that you have three hash subdirectories above (three different arguments were used); the …
Thanks for the explanations @lesteve!
Further simplified this PR given that …
Edits that change the output of called functions without changing their length (and therefore the line at which the cached function is defined) should be relatively infrequent. Still, I added a warning about it at the top of the … Some care would need to be taken for future changes to this file, but I think not implementing the caching mechanism ourselves is still worth it.
cc @thomasjpfan in case you are able to have a look.
So this is now caching the semi-processed response using joblib in all instances, rather than caching the gzipped HTTP response directly? That's okay, but is the joblib cache of such data compressed? This also leaves ghosts on the user's computer when upgrading scikit-learn; not sure there's much we can do about that if the user may potentially use multiple versions.
The saved pickle can be compressed if you want. You can use the compress parameter:

# default compression (zlib, compression level 3 IIRC)
memory = Memory('/tmp/joblib', compress=True)

# you can set the compression library used and the compression level if you want
memory = Memory('/tmp/joblib', compress=('lzma', 6))

There is an example for …
Let's do that. Do we need to work towards warning about skeletons in the cache?
I am concerned with how this PR places restrictions on the … It would be nice to have a way to somehow warn users about the cache. Some ideas: …
"Too big" can be defined as a keyword in Overall, I am concerned with having to multiple copies of the same data, because we happened to change something in this file. Another option would be to keep our current (inelegant) http caching, but when we get to the point where we store or load the data, we use |
Can you elaborate on this? If the cache gets invalidated (for example because …) …

I'd be in favour of trying to use joblib for this, although I don't have the bandwidth to work on it, unfortunately.

Side-comment: there were some talks about cleaning the joblib cache in the background that never materialized inside joblib (I don't remember the details fully, but you can launch a sub-process that has a life of its own and can live longer than the Python process that created it); maybe this is an opportunity to try again? Of course there will be caveats, like multiple cleaning processes running at the same time and possibly stepping on each other's toes. Note that measuring the size of the cache can take a while (I would need to check whether I have this in my notes, but in some use cases the size of the cache can easily reach hundreds of …).
Ah, I misunderstood the behavior of joblib when it comes to caching. As long as the old cache gets replaced, I am more in favor of this PR's solution.
This looks good, and it would be worthwhile to speed up loading large ARFF files, as long as openml continues to offer no alternative format.
# in some cases liac-arff returns a generator with data which
# cannot be pickled with joblib < 0.12
arff = _download_data_arff(**kwargs)
Is it worth caching this at all? Is it sufficient to cache the full operation without worrying about this – essentially – HTTP caching?
Please fix this bug. Loading large datasets is very slow, which defeats the purpose of caching. I always have to use this workaround:
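The exact snippet did not survive here, but such a workaround presumably looks something like the following sketch, which only uses public joblib and scikit-learn APIs (the cache directory name is arbitrary):

```python
from joblib import Memory
from sklearn.datasets import fetch_openml

memory = Memory('./mnist_cache', verbose=0)
fetch_openml_cached = memory.cache(fetch_openml)

# the first call downloads and parses the ARFF; subsequent calls load a
# single pickle from disk instead of re-parsing the cached HTTP responses
X, y = fetch_openml_cached('mnist_784', version=1, return_X_y=True)
```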
And then loading MNIST is much faster, perhaps over 10x faster.
When designing fetch_openml in #11419 we decided to go with a fine-grained cache of individual HTTP requests. In retrospect that was not the right decision. The issue is that HTTP fetching is not the only bottleneck: the (pure Python) ARFF parser we use is quite slow. As a side note, there could potentially be a gain of 3x or so by using a compiled parser (here in Rust), but that would require a lot of work (on the build side and to reach feature parity). We are also doing a significant amount of post-processing to convert the parsed ARFF to something usable. ARFF is a problem overall.

A better solution is to cache the output of fetch_openml directly with joblib.Memory. This PR implements that, and it also has the advantage of delegating most of the caching logic to joblib.
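Roughly, the approach looks like the sketch below (illustrative names, not the actual implementation): the expensive download-and-parse step is wrapped in joblib.Memory keyed on its arguments, so a repeated call becomes a single pickle load.

```python
from os.path import expanduser, join
from joblib import Memory

data_home = join(expanduser('~'), 'scikit_learn_data', 'openml')
memory = Memory(join(data_home, 'cache'), verbose=0)

@memory.cache
def _fetch_and_parse(name, version):
    # stand-in for "download ARFF + parse + post-process", the part that
    # used to be redone on every call even when the HTTP cache was warm
    return {'name': name, 'version': version, 'data': list(range(10))}

bunch = _fetch_and_parse('mnist_784', 1)  # first call: computed and cached
bunch = _fetch_and_parse('mnist_784', 1)  # second call: loaded from disk
```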
Performance

Loading MNIST from the cache with fetch_openml takes 19 s on master and 0.3 s in this PR (on an SSD). Loading it the first time with an empty cache will probably also be marginally faster (pickling an ndarray is more efficient than storing gzipped HTTP responses), but my internet connection is not good enough to try at the moment.
The initial motivation was to make the examples in #14300 faster.
TODO