Make tests faster #778
Conversation
This sounds great. A few notes:

1. A year ago we made the tests run faster, but since then I think the fine-grained nature of the text/mathtext tests has made them pretty slow again (most of the time is spent in …).
2. There's also the issue (related to 1.) of calling inkscape multiple times. I think we've tried looking into this before, about leaving a process running and batching those calls together.
3. It would also be nice to limit the testing to one of PDF, SVG, or PNG; I do this by hand to speed things up.
If (and this is a big if) we can make freetype behave predictably (as we've started to in #779, though more work needs to be done), then would one not expect the SVG files to be identical? In that case we could just compare the contents of the files and not convert them to PNG. (Same for PDF.)
The freetype issue is totally orthogonal to SVG and PDF, since the freetype renderer is not used by the vector backends. The SVG backend only uses the freetype TrueType file parser to get the curves, which come directly from the file and shouldn't be impacted by the version of freetype. The PDF backend doesn't even use that; it uses a library called ttconv.

The conversion of SVG and PDF to PNG will be necessary for the foreseeable future. Since both PDF and SVG use dictionaries and hash-based ids to some extent, the ordering of the elements changes between runs. The backends are constantly tweaked to fix bugs, improve efficiency, etc., which can change the structure of the file in radical ways while (if all goes well) not impacting the final result. The only way to look at the final result in an objective way is to render it to a bitmap and compare it in that space.
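For illustration, a minimal sketch of that kind of bitmap-space comparison, assuming both files have already been rasterized to PNG. The Pillow/NumPy approach and the RMS threshold here are assumptions for the sketch, not matplotlib's actual `compare_images` implementation:

```python
import numpy as np
from PIL import Image

def rms_difference(expected_png, actual_png):
    """Load two PNGs and return the RMS difference between their pixels."""
    expected = np.asarray(Image.open(expected_png).convert("RGB"), dtype=float)
    actual = np.asarray(Image.open(actual_png).convert("RGB"), dtype=float)
    if expected.shape != actual.shape:
        raise ValueError("images differ in size, cannot compare")
    return np.sqrt(np.mean((expected - actual) ** 2))

# The images are "the same" if the RMS difference is below some tolerance,
# which sidesteps byte-level reordering in the source PDF/SVG entirely.
# e.g. assert rms_difference("baseline.png", "result.png") < 10
```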
Here's the first chunk of changes to make the tests faster. As these things tend to be finicky across systems, I'd appreciate as much testing as possible.
It might be useful to leave ghostscript and inkscape running as separate processes. For ghostscript, you'd start it with a command line like `gs -sDEVICE=png16m -dNOPAUSE -sOutputFile=page%d.png` and send it input lines like `(foo/bar/baz.pdf) run`. Trying this from the command line, it seems to work, but I don't know if there are subtle problems where e.g. one PDF file causes a font to be loaded that affects the rendering of the next file. Error recovery is another big question. I think you'd want to use something like pexpect to wait for gs prompts to know that the latest file has been completely written. Inkscape has a `--shell` option that allows you to send it input lines like `input.svg --export-png=output.png`.
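A rough sketch of what driving a single long-lived ghostscript process with pexpect might look like. The prompt string, file names, and the lack of error recovery are simplifications; treat this as an illustration of the batching idea, not tested code:

```python
import pexpect

# Start one persistent ghostscript process that renders every PDF we send it.
gs = pexpect.spawn("gs -sDEVICE=png16m -dNOPAUSE -sOutputFile=page%d.png")
gs.expect("GS>")  # wait for the interactive prompt before sending work

for pdf in ["a.pdf", "b.pdf", "c.pdf"]:  # hypothetical inputs
    gs.sendline("(%s) run" % pdf)
    gs.expect("GS>")  # the prompt returning signals the file was processed

gs.sendline("quit")
gs.close()
```

This avoids paying process-startup cost once per file, which is the whole point of keeping the converter resident.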
Our tests are mainly end-to-end regression tests: you run a script and check that the output stays the same as it was before. That's useful for noticing regressions, but tests of a smaller scope would be faster to run. We could, for example, set up a test suite that includes predefined sequences of backend function calls and checks that the output for each sequence stays the same. That would allow skipping the frontend code when you are working on a specific backend. Likewise, we could have a stub backend that only records the calls made by the frontend code and compares that sequence to a predefined "correct" sequence for a given test case.

A related thought: if the freetype version differences are problematic, we could make most test cases not include any text output (disable tick labels, etc.), and have a smaller number of cases that specifically test text rendering, with a larger set of correct answers for just those tests.
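A hypothetical sketch of such a recording stub, just to illustrate the idea; the class and method names are invented here, and a real version would presumably subclass matplotlib's `RendererBase`:

```python
class RecordingRenderer:
    """Stub renderer that records drawing calls instead of drawing anything."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Return a recorder for whatever method the frontend invokes,
        # so no actual rendering work is done.
        def record(*args, **kwargs):
            self.calls.append((name, args, kwargs))
        return record

# A test would then compare renderer.calls against a stored "correct"
# sequence of calls for that test case, with no rasterization at all.
```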
Hey, it shouldn't be too hard to make a checksummer that's aware of PDF syntax, tolerates trivial changes like reordering of dictionary elements, and ignores things like the CreationDate timestamp. That might speed the usual case up quite a bit.
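A very rough sketch of that normalization idea, assuming we only need to drop the CreationDate and tolerate reordering at the textual level. This is crude (real PDF objects span multiple lines, so a proper version would need an actual PDF parser), but it shows the shape of the approach:

```python
import hashlib
import re

def normalized_pdf_checksum(path):
    """Checksum a PDF while ignoring volatile content like CreationDate."""
    with open(path, "rb") as f:
        data = f.read()
    # Drop the timestamp, which changes on every run.
    data = re.sub(rb"/CreationDate\s*\(.*?\)", b"", data)
    # Hash the lines in sorted order so trivial reordering doesn't matter.
    return hashlib.sha1(b"\n".join(sorted(data.splitlines()))).hexdigest()

# Two runs producing semantically identical PDFs would then compare equal
# without any conversion to PNG.
```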
@jkseppan: These are some good ideas. I had implemented using Inkscape around a year ago, but ultimately reverted it. Inkscape leaked memory with each render, so it needed to be restarted periodically, and I wasn't terribly comfortable with coming up with a heuristic for how often to do that. See 9086ef1 for the original kick at that can. It may be that current versions of Inkscape are less leaky, and it's also possible that my implementation could be improved. I can't find my old numbers, and I don't recall it making a huge improvement in runtime, but that may have been because the memory leaks were causing problems on a memory-constrained machine. I haven't attempted a similar approach with ghostscript, but it's nice to know it can be done.

I think removing text except where it is necessary is a good idea. There are cases where the ticking itself is being tested and we want to make sure the numbers and layout of the ticks are correct, but for most tests the ticking is superfluous.
I have an initial commit that converts all of the PDFs to PNGs using the same gs process. The speedup isn't terribly dramatic: it went from 4:30 to 4:15 wall-clock time on my machine. It would be great if this could be tested; I have no idea how robust this is to gs crashing, etc.
Rebased on current master |
Tests pass on a MacBook Pro running OS X 10.6.8, Python 2.7.2, GPL Ghostscript 8.71: `Ran 1090 tests in 615.499s OK (KNOWNFAIL=2)`
I tried this out on Amazon EC2, comparing the runtime of the test suite before and after this change. On a small instance (single core, pretty slow), "before" took 702 seconds and "after" took 706 seconds, so it was slightly slower but probably within measurement noise. On a high-CPU medium instance (dual core, each supposedly 2.5 times faster than the small instance), "before" took 452 seconds and "after" took 357 seconds. Just to be sure, I ran "before" again, and it took 446 seconds, so I don't think it was a fluke. Based on this, it looks like this makes sense on dual-core but doesn't help on single-core. Perhaps the overhead from starting a new process is similar to that from context switches on single-core.
Depending on how much of an overhead the conversion/storage of PDF & SVG -> PNG is, we could always use an OrderedDict for Python >= 2.7 (there are loads of OrderedDict recipes for earlier versions) to give the SVG/PDF output a consistent ordering.
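For illustration, the kind of thing that suggestion amounts to. The attribute names here are invented; the point is only that insertion-ordered output is byte-for-byte reproducible (note that on modern Python 3.7+, plain dicts already preserve insertion order):

```python
from collections import OrderedDict

# With a plain dict (pre-3.7), these attributes could be emitted in any
# order, producing spurious byte-level diffs between identical SVGs.
attrs = OrderedDict([("width", "640"), ("height", "480"),
                     ("viewBox", "0 0 640 480")])
svg_tag = "<svg %s>" % " ".join('%s="%s"' % kv for kv in attrs.items())
# -> '<svg width="640" height="480" viewBox="0 0 640 480">' on every run
```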
…ff" toward the end of the filename so files are more appropriately grouped when viewed in file managers and thumbnail tools (eg. gthumb)
….6. Continue to use histogram for earlier Numpy versions.
…ache the baseline images, since the test results are never the same twice given different dictionary orderings in the file.
… removes titles and ticks from a figure before comparison. Use this throughout to make more of the tests less dependent on text differences.
…oesn't seem to offer any speed advantage.
@mdboom: What kind of state is this in? I'm prepared to go through these changes if it's going to save us all time in the long run.
@jkseppan: The real advantage of this PR is on subsequent runs, since it caches the results of conversions. For a real speed comparison, you should run each test run twice (cleaning ~/.matplotlib before each pair of runs).

@pelson: I think this is good to merge. It would be nice to be able to get some predictable ordering in SVG and PDF such that the PNG conversions of the test results (rather than just the baseline images) could also be done. But there's no reason that can't be done as a follow-up in a separate PR.
```diff
@@ -127,7 +172,7 @@ def comparable_formats():
        on this system.'''
     return ['png'] + converter.keys()

-def convert(filename):
+def convert(filename, cache):
```
Some high-level comments about `cache` wouldn't go amiss here.
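For instance, something along these lines; the wording is hypothetical, just to indicate the kind of docstring being asked for:

```python
def convert(filename, cache):
    """Convert `filename` (a PDF or SVG baseline image) to PNG.

    If `cache` is True, look for a previously converted copy in the
    cache directory and reuse it, storing newly converted files there,
    so each baseline image is only converted once across test runs.
    """
```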
On the whole I agree with removing the text on some of the test results. However, I wonder if the text in the …
I've got to the stage of reviewing the actual …
@mdboom: I'm going to merge this while it is still fresh (and mergeable). For some reason GitHub isn't showing all the diffs (the test files, for instance), so I had to read the diffs on my clone. There are no show-stoppers, but I do think it is worth addressing some of my comments, either on this PR or in a separate one (if they need code). Good stuff.
This is just a placeholder for some work that would be nice to get done:

- The baseline images are converted from PDF and SVG to PNG every run. These could be cached between runs to save a good portion of the runtime.
- Look into running tests in parallel. Nose supports this, but it wasn't working well last time I tried.
- Maybe we could use lower-resolution PNGs and convert the PDFs and SVGs at lower resolution to make things faster. This would also hopefully gloss over some of the text-related false positives we see on a regular basis.

Any other ideas? Let's brainstorm!