Skip to content

[MRG+2] MAINT: Refactor the converted-image cache #7764

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 19 commits into from

Conversation

jkseppan
Copy link
Member

@jkseppan jkseppan commented Jan 7, 2017

There is a cache of png files keyed by the MD5 hashes of corresponding svg and pdf files, which helps reduce test suite running times for svg and pdf files that stay exactly the same from one run to the next.

This patch enables caching of test results, not only expected results, which is only useful if the tests are mostly deterministic (see #7748). It adds reporting of cache misses, which can be helpful
in getting tests to stay deterministic, and expiration since the test results are going to change more often than the expected results.

@tacaswell tacaswell modified the milestones: 2.2 (next next feature release), 2.1 (next point release) Jan 7, 2017
@tacaswell tacaswell requested a review from Kojoley January 7, 2017 22:16
@jkseppan
Copy link
Member Author

jkseppan commented Jan 8, 2017

On my test host (a single core linode) merging #7748 into this reduced the test time on a warm cache from 349 to 272 seconds.

Before:

Image conversion cache hit rate: 1477/2102
----------------------------------------------------------------------
Ran 5315 tests in 349.495s

After:

Image conversion cache hit rate: 2094/2102
----------------------------------------------------------------------
Ran 5315 tests in 271.976s

@QuLogic
Copy link
Member

QuLogic commented Jan 8, 2017

Since that's a new file, just don't list it in default_test_modules, then you don't have to worry about the optimized case because it'll only be run through pytest.

@jkseppan
Copy link
Member Author

jkseppan commented Jan 8, 2017

Oh, is there a decision to deprecate testing with nose? I'm probably not on top of all the conversations now that most of them happen on github tickets instead of the devel mailing list.

@QuLogic
Copy link
Member

QuLogic commented Jan 8, 2017

There's ongoing work in #6731; presumably will ramp up after 2.0 is out.

@jkseppan
Copy link
Member Author

jkseppan commented Jan 8, 2017

I removed the file from default_test_modules and went back to using bare assert. Rebased to avoid useless back-and-forth commits in the history.

@tacaswell
Copy link
Member

Yes, nose has declared its self un-maintained.

@codecov-io
Copy link

codecov-io commented Jan 8, 2017

Current coverage is 63.82% (diff: 100%)

Merging #7764 into master will increase coverage by 1.70%

@@             master      #7764   diff @@
==========================================
  Files           174        177      +3   
  Lines         56028      63194   +7166   
  Methods           0          0           
  Messages          0          0           
  Branches          0          0           
==========================================
+ Hits          34805      40336   +5531   
- Misses        21223      22858   +1635   
  Partials          0          0           

Powered by Codecov. Last update 115bade...6c9146e

@jkseppan
Copy link
Member Author

This doesn't play well with pytest-xdist. At least the cache statistics don't get relayed from the slave processes, and I suspect the cache object might not get created either.

@jkseppan
Copy link
Member Author

Works with pytest-xdist in my experiments... but now it looks like nose might not be picking up the plugin. Perhaps not worth investigating if we're dropping nose support soon?

@anntzer
Copy link
Contributor

anntzer commented Jan 23, 2017

FWIW Inkscape also provides a --shell mode that may be useful, if starting/stopping the process turns out to contribute significantly to the total time (which the man page suggests):

       --shell With this parameter, Inkscape will enter an interactive command line shell
               mode. In this mode, you type in commands at the prompt and Inkscape executes
               them, without you having to run a new copy of Inkscape for each command. This
               feature is mostly useful for scripting and server uses: it adds no new
               capabilities but allows you to improve the speed and memory requirements of
               any script that repeatedly calls Inkscape to perform command line tasks (such
               as export or conversions). Each command in shell mode must be a complete valid
               Inkscape command line but without the Inkscape program name, for example
               "file.svg --export-pdf=file.pdf".

Copy link
Member

@phobson phobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good. I'm 99.9% sure this won't affect and downstream libraries (IOW, pytest-mpl). But it would be nice to get @astrofrog 's eyes on this as well.

@jkseppan
Copy link
Member Author

jkseppan commented Jan 24, 2017 via email

NelleV
NelleV previously requested changes Jan 25, 2017
Copy link
Member

@NelleV NelleV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't finish the review (this patch is long, so several passes over it are in order).

I have two major comments.

  1. We don't depend on nose for the testing anymore. I am not sure how you write a plugin for pytest. It's probably just a question of removing the import and minor fixes.

  2. This PR adds many new modules and classes to our public API. I think we need to keep this internal, in order to be able to refactor this if need be, without having to keep the API.

I have a bunch of minor comments.

@@ -76,40 +70,6 @@ def compare_float(expected, actual, relTol=None, absTol=None):
return msg or None


def get_cache_dir():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a backward incompatible change. While I am happy removing it, we need to deprecate it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 05b5fb5.

@@ -0,0 +1,200 @@
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to make this module private, as well as all the classes in it. We shouldn't expose this API (at least just yet) to users.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 9ba81dd, bb22d11.

from matplotlib import checkdep_inkscape


class ConversionCache(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this private as well :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in bb22d11.

Parameters
----------
directory : str, optional
Files are stored in this directory, defaults to `'test_cache'` in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it helps readability to replace the comma with a dot.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a8fcb75.

directory : str, optional
Files are stored in this directory, defaults to `'test_cache'` in
the overall Matplotlib cache directory.
max_size : int, optional
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is super useful

self.cachedir = self.get_cache_dir()
self.ensure_cache_dir()
if not isinstance(max_size, int):
raise ValueError("max_size is %s, expected int" % type(max_size))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"max_size is of type %s"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a8fcb75.

"""

def __init__(self, directory=None, max_size=int(1e8)):
self.gets = set()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find gets and hits quite obscure argument names. If the class is made private, we can easily fix this later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now it's private so the names are easier to change later, but I wonder if you have a proposal for better names? I think "cache hit" is a common term and "get" has a nice symmetry with "hit".

if version_tag:
md5.update(version_tag)
else:
warnings.warn(("Don't know the external converter for %s, cannot "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a user, if I see this message, I assume the "solution" is to delete the cache directory?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some more explanation in a8fcb75.

`r['gets']` is the set of files queried,
`r['hits']` is the set of files found in the cache
"""
return dict(hits=self.hits, gets=self.gets)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is going to be super useful as well.

unicode_literals)

from pprint import pprint
from nose.plugins import Plugin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't depend on nose anymore. I am not sure how you convert this to a pytest compatible plugin.

@jkseppan
Copy link
Member Author

Thanks for the review @NelleV! This already works with pytest, all the required code is in conftest.py. I made a plugin for nose because there was already some nose customization via plugins, and added hooks in conftest.py because that was how the existing pytest customization was done. In fact I think there is some problem in the nose plugin because it outputs strange coverage results, but we can just remove it if nose is otherwise gone.

@NelleV
Copy link
Member

NelleV commented Jan 25, 2017

I am in favor of removing the nose plugin :)

from ... import conversion_cache as ccache


class ConversionCache(Plugin):
Copy link
Contributor

@anntzer anntzer Jan 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you have two classes both named ConversionCache. While technically OK I guess it'd be more legible to rename e.g. this one as ConversionCachePlugin (or similar).

(Missed the discussion about removing the nose plugin, which would make this comment obsolete :-))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This nose plugin class is getting deleted anyway, per the other comments.

@anntzer
Copy link
Contributor

anntzer commented Jan 25, 2017

@jkseppan re: asynchronous tests: https://pypi.python.org/pypi/pytest-asyncio? (I guess it's time for me to finally understand how asyncio works. But it looks like only the conversion and comparison stage would need to be adapted to it (not sure).)

self.gets.add(filename)
hash_value = self._get_file_hash(filename)
cached_file = os.path.join(self.cachedir, hash_value + self.cached_ext)
with cbook.Locked(self.cachedir):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not a problem right now but it may be preferable to at some point use (and thus, first, implement...) per-file locking rather than per directory; see #7776 (comment).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to hold the locks as briefly as possible, so for example the hashes get computed without holding the lock and only the shutil.copyfile call happens with the lock held. The expire method is the exception, but that one really needs to lock the whole directory to guarantee its file-size computations are accurate. The TeX cache, which #7776 is about, stays locked for the duration of a whole LaTeX run, which can easily take seconds.

cached_file = os.path.join(self.cachedir, hash_value + self.cached_ext)
with cbook.Locked(self.cachedir):
if os.path.exists(cached_file):
shutil.copyfile(cached_file, newname)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we {sym,hard}link instead of copying?
(On Windows, admin rights are needed to symlink but not to hardlink. Python only added Windows support for either in Py3.2 (https://docs.python.org/3.6/library/os.html#os.link, https://docs.python.org/3.6/library/os.html#os.symlink).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would probably be faster and save a little disk space, but it would introduce the possibility to change the cached file later by modifying the linked file, which could lead to hard-to-find bugs. It's unlikely that the testing code would do this on purpose, but perhaps some day someone will write a recursive directory-deletion function that truncates the files first. (I have written such a function myself elsewhere because of... a long story.)

Copy link
Member

@NelleV NelleV left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some small remarks.

also, do you mind rebasing with master to get the new appveyor stuff in? It's totally optional :)

@@ -160,16 +121,39 @@ def comparable_formats():
return ['png'] + list(converter)


def convert(filename, cache):
@cbook.deprecated('2.1', addendum='Use _ConversionCache instead')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we advocate for the use of a private class/function or just inform the user of its disappearance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removed the addenda.


"""

def __init__(self, directory=None, max_size=int(1e8)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe document self.hits and self.gets here?
(I don't have better names yet, thought I haven't really thought about it extensively)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some comments.

@NelleV NelleV changed the title MAINT: Refactor the converted-image cache [MRG+1] MAINT: Refactor the converted-image cache Jan 29, 2017
Copy link
Member

@QuLogic QuLogic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nose will need to be removed.



def test_cache_basic():
tmpdir = tempfile.mkdtemp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use tmpdir as an argument to the function to use the pytest fixture. Note that it's a py.path.local object, not a string or path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It does look like a useful object.



def test_cache_expire():
tmpdir = tempfile.mkdtemp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use tmpdir as a parameter to use the pytest fixture.

@mock.patch('matplotlib.testing._conversion_cache.cbook.mkdirs',
side_effect=OSError)
def test_cache_mkdir_error(mkdirs):
tmpdir = tempfile.mkdtemp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use tmpdir as a parameter to use the pytest fixture.

@mock.patch('matplotlib.testing._conversion_cache.os.access',
side_effect=[False])
def test_cache_unwritable_error(access):
tmpdir = tempfile.mkdtemp()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use tmpdir as a parameter to use the pytest fixture.

import stat
import tempfile

from nose.tools import raises
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove nose.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

assert copy.read() == 'generated from the pdf file'
assert cache.report() == \
{'gets': {intmp('fake.pdf'), intmp('fake.svg')},
'hits': set([intmp('fake.pdf')])}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come this isn't a set literal like the other ones?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I had probably copied this from above and edited it from the empty set.

pass


@raises(_CacheError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use pytest.raises context manager instead of this decorator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

shutil.rmtree(tmpdir)


@raises(_CacheError)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use pytest.raises context manager instead of this decorator.

try:
self.converter_version['.pdf'] = \
checkdep_ghostscript()[1].encode('utf-8')
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bare excepts are dangerous. What is this actually trying to catch?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

try:
self.converter_version['.svg'] = \
checkdep_inkscape().encode('utf-8')
except:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

This way

  py.test lib/matplotlib/tests/test_artist.py --conversion-cache-report-misses -n 2

works for me even though the options don't get documented in
"py.test --help".
@jkseppan
Copy link
Member Author

I guess it's progress that now it fails on Travis too, so both CI engines are getting the same configuration. The remaining mystery is why it fails there while working for me locally.

@tacaswell
Copy link
Member

differences in pytest version?

@QuLogic
Copy link
Member

QuLogic commented Feb 20, 2017

Do you run just pytest or pytest path/to/something? The former is what CI does, and it doesn't work for me if I do that. If I do pytest lib/matplotlib, then it also fails because pytest_addoption is called twice from the regular tests and the sphinx tests. But If I do pytest lib/matplotlib/tests (or any of the other two tests paths), then it works.

assert not os.path.exists(fname('cache', 'onemore.png'))

finally:
tmpdir.remove(rec=1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this instead of letting the fixture clean up?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see in the documentation that it promises to do any cleanup. But it seems that the base directory gets cleaned up after three test runs.

@QuLogic
Copy link
Member

QuLogic commented Feb 21, 2017

Do you think it would be possible/helpful to use pytest's builtin cache instead?

@jkseppan
Copy link
Member Author

@QuLogic You're right, I've been running pytest path/to/file to avoid running the whole test suite when testing, and it fails when I omit the path. Apparently the conftest.py files in subdirectories are only picked up correctly if there is a path. Is there a reason that the top-level configuration file was removed?

@jkseppan
Copy link
Member Author

@QuLogic I don't see much savings that are possible by using the pytest cache, unless it manages LRU deletion for us, and I don't see that mentioned. The get/set methods serialize the values using JSON, which probably increases the size of the data by escaping it inside a string. The makedir method seems to just offer a way to make a directory, which we can do without the API.

It seems that the tmpdir fixture will take care of them after
three test runs.
http://doc.pytest.org/en/latest/tmpdir.html#the-default-base-temporary-directory
@jkseppan
Copy link
Member Author

It seems pytest_addoption only works in top-level conftest.py files or plugins: pytest-dev/pytest#1427

@jkseppan
Copy link
Member Author

When the pytest_addoption call is at the top level, Travis tests seem to be passing but Appveyor fails with "E ValueError: Plugin already registered: C:\projects\matplotlib\conftest.py=<module 'conftest' from 'C:\projects\matplotlib\conftest.py'>"

@QuLogic
Copy link
Member

QuLogic commented Feb 22, 2017

Yes, the top-level conftest.py was removed because a) we needed fixtures to work on the installed version and b) AppVeyor didn't like having both... For the installed case, you can add a default to getting the option so it works without the option enabled, but I still have no idea how to get the top-level to work right.

@tacaswell tacaswell modified the milestones: 2.1.1 (next bug fix release), 2.1 (next point release) Aug 29, 2017
@tacaswell
Copy link
Member

This has grown merge conflicts.

@tacaswell tacaswell modified the milestones: 2.1.1 (next bug fix release), 2.2 (next feature release) Oct 9, 2017
@jklymak jklymak added the stale label Sep 12, 2020
@jklymak
Copy link
Member

jklymak commented Sep 12, 2020

marked as stale. Assume this is obsolete and should closed?

@jkseppan
Copy link
Member Author

jkseppan commented Sep 13, 2020

I suppose? I don't really remember the details of this PR.

@jkseppan jkseppan closed this Sep 13, 2020
@jkseppan jkseppan deleted the test-caching branch September 13, 2020 11:41
@QuLogic
Copy link
Member

QuLogic commented Sep 25, 2020

@anntzer replaced this in #15932 (though more of a rewrite in spirit than a rebase)

@QuLogic QuLogic removed the stale label Sep 25, 2020
@story645 story645 removed this from the future releases milestone Oct 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants