-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
[MRG+2] MAINT: Refactor the converted-image cache #7764
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
On my test host (a single core linode) merging #7748 into this reduced the test time on a warm cache from 349 to 272 seconds. Before:
After:
|
Since that's a new file, just don't list it in |
Oh, is there a decision to deprecate testing with nose? I'm probably not on top of all the conversations now that most of them happen on github tickets instead of the devel mailing list. |
There's ongoing work in #6731; presumably will ramp up after 2.0 is out. |
I removed the file from |
Yes, nose has declared its self un-maintained. |
Current coverage is 63.82% (diff: 100%)@@ master #7764 diff @@
==========================================
Files 174 177 +3
Lines 56028 63194 +7166
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 34805 40336 +5531
- Misses 21223 22858 +1635
Partials 0 0
|
This doesn't play well with pytest-xdist. At least the cache statistics don't get relayed from the slave processes, and I suspect the cache object might not get created either. |
Works with pytest-xdist in my experiments... but now it looks like nose might not be picking up the plugin. Perhaps not worth investigating if we're dropping nose support soon? |
FWIW Inkscape also provides a
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good. I'm 99.9% sure this won't affect and downstream libraries (IOW, pytest-mpl). But it would be nice to get @astrofrog 's eyes on this as well.
The Inkscape shell mode would probably improve performance. With
Ghostscript you can probably do something similar. I wonder: is it
possible with pytest to run part of the test asynchronously, so it would
wait on the result from Inkscape while starting the next test case?
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't finish the review (this patch is long, so several passes over it are in order).
I have two major comments.
-
We don't depend on nose for the testing anymore. I am not sure how you write a plugin for pytest. It's probably just a question of removing the import and minor fixes.
-
This PR adds many new modules and classes to our public API. I think we need to keep this internal, in order to be able to refactor this if need be, without having to keep the API.
I have a bunch of minor comments.
lib/matplotlib/testing/compare.py
Outdated
@@ -76,40 +70,6 @@ def compare_float(expected, actual, relTol=None, absTol=None): | |||
return msg or None | |||
|
|||
|
|||
def get_cache_dir(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a backward incompatible change. While I am happy removing it, we need to deprecate it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 05b5fb5.
@@ -0,0 +1,200 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to make this module private, as well as all the classes in it. We shouldn't expose this API (at least just yet) to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from matplotlib import checkdep_inkscape | ||
|
||
|
||
class ConversionCache(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make this private as well :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in bb22d11.
Parameters | ||
---------- | ||
directory : str, optional | ||
Files are stored in this directory, defaults to `'test_cache'` in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it helps readability to replace the comma with a dot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a8fcb75.
directory : str, optional | ||
Files are stored in this directory, defaults to `'test_cache'` in | ||
the overall Matplotlib cache directory. | ||
max_size : int, optional |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is super useful
self.cachedir = self.get_cache_dir() | ||
self.ensure_cache_dir() | ||
if not isinstance(max_size, int): | ||
raise ValueError("max_size is %s, expected int" % type(max_size)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"max_size is of type %s"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a8fcb75.
""" | ||
|
||
def __init__(self, directory=None, max_size=int(1e8)): | ||
self.gets = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find gets
and hits
quite obscure argument names. If the class is made private, we can easily fix this later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now it's private so the names are easier to change later, but I wonder if you have a proposal for better names? I think "cache hit" is a common term and "get" has a nice symmetry with "hit".
if version_tag: | ||
md5.update(version_tag) | ||
else: | ||
warnings.warn(("Don't know the external converter for %s, cannot " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a user, if I see this message, I assume the "solution" is to delete the cache directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some more explanation in a8fcb75.
`r['gets']` is the set of files queried, | ||
`r['hits']` is the set of files found in the cache | ||
""" | ||
return dict(hits=self.hits, gets=self.gets) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is going to be super useful as well.
unicode_literals) | ||
|
||
from pprint import pprint | ||
from nose.plugins import Plugin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't depend on nose anymore. I am not sure how you convert this to a pytest compatible plugin.
Thanks for the review @NelleV! This already works with pytest, all the required code is in conftest.py. I made a plugin for nose because there was already some nose customization via plugins, and added hooks in conftest.py because that was how the existing pytest customization was done. In fact I think there is some problem in the nose plugin because it outputs strange coverage results, but we can just remove it if nose is otherwise gone. |
I am in favor of removing the nose plugin :) |
from ... import conversion_cache as ccache | ||
|
||
|
||
class ConversionCache(Plugin): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you have two classes both named ConversionCache. While technically OK I guess it'd be more legible to rename e.g. this one as ConversionCachePlugin (or similar).
(Missed the discussion about removing the nose plugin, which would make this comment obsolete :-))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This nose plugin class is getting deleted anyway, per the other comments.
@jkseppan re: asynchronous tests: https://pypi.python.org/pypi/pytest-asyncio? (I guess it's time for me to finally understand how asyncio works. But it looks like only the conversion and comparison stage would need to be adapted to it (not sure).) |
self.gets.add(filename) | ||
hash_value = self._get_file_hash(filename) | ||
cached_file = os.path.join(self.cachedir, hash_value + self.cached_ext) | ||
with cbook.Locked(self.cachedir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not a problem right now but it may be preferable to at some point use (and thus, first, implement...) per-file locking rather than per directory; see #7776 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to hold the locks as briefly as possible, so for example the hashes get computed without holding the lock and only the shutil.copyfile call happens with the lock held. The expire method is the exception, but that one really needs to lock the whole directory to guarantee its file-size computations are accurate. The TeX cache, which #7776 is about, stays locked for the duration of a whole LaTeX run, which can easily take seconds.
cached_file = os.path.join(self.cachedir, hash_value + self.cached_ext) | ||
with cbook.Locked(self.cachedir): | ||
if os.path.exists(cached_file): | ||
shutil.copyfile(cached_file, newname) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we {sym,hard}link instead of copying?
(On Windows, admin rights are needed to symlink but not to hardlink. Python only added Windows support for either in Py3.2 (https://docs.python.org/3.6/library/os.html#os.link, https://docs.python.org/3.6/library/os.html#os.symlink).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would probably be faster and save a little disk space, but it would introduce the possibility to change the cached file later by modifying the linked file, which could lead to hard-to-find bugs. It's unlikely that the testing code would do this on purpose, but perhaps some day someone will write a recursive directory-deletion function that truncates the files first. (I have written such a function myself elsewhere because of... a long story.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some small remarks.
also, do you mind rebasing with master to get the new appveyor stuff in? It's totally optional :)
lib/matplotlib/testing/compare.py
Outdated
@@ -160,16 +121,39 @@ def comparable_formats(): | |||
return ['png'] + list(converter) | |||
|
|||
|
|||
def convert(filename, cache): | |||
@cbook.deprecated('2.1', addendum='Use _ConversionCache instead') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we advocate for the use of a private class/function or just inform the user of its disappearance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, removed the addenda.
|
||
""" | ||
|
||
def __init__(self, directory=None, max_size=int(1e8)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe document self.hits and self.gets here?
(I don't have better names yet, thought I haven't really thought about it extensively)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nose
will need to be removed.
lib/matplotlib/tests/test_cache.py
Outdated
|
||
|
||
def test_cache_basic(): | ||
tmpdir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use tmpdir
as an argument to the function to use the pytest fixture. Note that it's a py.path.local
object, not a string or path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. It does look like a useful object.
lib/matplotlib/tests/test_cache.py
Outdated
|
||
|
||
def test_cache_expire(): | ||
tmpdir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use tmpdir
as a parameter to use the pytest fixture.
lib/matplotlib/tests/test_cache.py
Outdated
@mock.patch('matplotlib.testing._conversion_cache.cbook.mkdirs', | ||
side_effect=OSError) | ||
def test_cache_mkdir_error(mkdirs): | ||
tmpdir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use tmpdir
as a parameter to use the pytest fixture.
lib/matplotlib/tests/test_cache.py
Outdated
@mock.patch('matplotlib.testing._conversion_cache.os.access', | ||
side_effect=[False]) | ||
def test_cache_unwritable_error(access): | ||
tmpdir = tempfile.mkdtemp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use tmpdir
as a parameter to use the pytest fixture.
lib/matplotlib/tests/test_cache.py
Outdated
import stat | ||
import tempfile | ||
|
||
from nose.tools import raises |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove nose
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/matplotlib/tests/test_cache.py
Outdated
assert copy.read() == 'generated from the pdf file' | ||
assert cache.report() == \ | ||
{'gets': {intmp('fake.pdf'), intmp('fake.svg')}, | ||
'hits': set([intmp('fake.pdf')])} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How come this isn't a set
literal like the other ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed. I had probably copied this from above and edited it from the empty set.
lib/matplotlib/tests/test_cache.py
Outdated
pass | ||
|
||
|
||
@raises(_CacheError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use pytest.raises
context manager instead of this decorator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/matplotlib/tests/test_cache.py
Outdated
shutil.rmtree(tmpdir) | ||
|
||
|
||
@raises(_CacheError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use pytest.raises
context manager instead of this decorator.
try: | ||
self.converter_version['.pdf'] = \ | ||
checkdep_ghostscript()[1].encode('utf-8') | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bare except
s are dangerous. What is this actually trying to catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
try: | ||
self.converter_version['.svg'] = \ | ||
checkdep_inkscape().encode('utf-8') | ||
except: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto.
a36436d
to
936f73b
Compare
This way py.test lib/matplotlib/tests/test_artist.py --conversion-cache-report-misses -n 2 works for me even though the options don't get documented in "py.test --help".
bb9a1a3
to
e52cdd2
Compare
I guess it's progress that now it fails on Travis too, so both CI engines are getting the same configuration. The remaining mystery is why it fails there while working for me locally. |
differences in pytest version? |
Do you run just |
lib/matplotlib/tests/test_cache.py
Outdated
assert not os.path.exists(fname('cache', 'onemore.png')) | ||
|
||
finally: | ||
tmpdir.remove(rec=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do this instead of letting the fixture clean up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see in the documentation that it promises to do any cleanup. But it seems that the base directory gets cleaned up after three test runs.
Do you think it would be possible/helpful to use pytest's builtin cache instead? |
@QuLogic You're right, I've been running |
@QuLogic I don't see much savings that are possible by using the pytest cache, unless it manages LRU deletion for us, and I don't see that mentioned. The get/set methods serialize the values using JSON, which probably increases the size of the data by escaping it inside a string. The makedir method seems to just offer a way to make a directory, which we can do without the API. |
It seems that the tmpdir fixture will take care of them after three test runs. http://doc.pytest.org/en/latest/tmpdir.html#the-default-base-temporary-directory
It seems |
It only works there, see pytest-dev/pytest#1427
When the |
Yes, the top-level |
This has grown merge conflicts. |
marked as stale. Assume this is obsolete and should closed? |
I suppose? I don't really remember the details of this PR. |
There is a cache of png files keyed by the MD5 hashes of corresponding svg and pdf files, which helps reduce test suite running times for svg and pdf files that stay exactly the same from one run to the next.
This patch enables caching of test results, not only expected results, which is only useful if the tests are mostly deterministic (see #7748). It adds reporting of cache misses, which can be helpful
in getting tests to stay deterministic, and expiration since the test results are going to change more often than the expected results.