Lower test tolerance #5307

mdboom · 2015-10-23T17:06:13Z

This is a test to see if baseline images generated on a Mac with the new local build of freetype functionality will work on Travis with 0 tolerance. Fingers crossed.

mdboom · 2015-10-26T20:37:04Z

In some ways this experiment was a success. The failures are not due to text differences, but due to a bunch of other differences. Not clear whether it's a Mac vs. Linux issue, just general non-reproducibility or what...

QuLogic · 2015-10-26T20:48:34Z

I have been going through updating Cartopy tests so I've probably seen many of these changes. Probably most of them will be due to a small change in the antialiasing at the corner of plots, and possibly the upgrade to a newer Agg in 2a17839.

mdboom · 2015-10-26T21:06:16Z

I don't think the Agg change is the culprit. There's a certain group of tests that just generate slightly different results from run to run on the same machine/platform/environment. Some are obvious (like the random number generator not being reset in the xkcd test), but some are still an enigma.
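
For the "obvious" category, here is a minimal sketch of why an unseeded random number generator makes a test image differ from run to run. The helper `sketchy_offsets` is a hypothetical stand-in for any drawing code that uses random jitter; it is not the xkcd test itself.

```python
import random


def sketchy_offsets(n, seed=None):
    """Hypothetical stand-in for drawing code that depends on random jitter."""
    # A private generator seeded per call gives the same sequence every run.
    # With seed=None the generator is seeded from the OS, so output varies.
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


first = sketchy_offsets(4, seed=0)
second = sketchy_offsets(4, seed=0)

# Same seed, same jitter: an image built from these offsets is reproducible.
reproducible = first == second
print(reproducible)
```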

QuLogic · 2015-10-26T22:25:03Z

Oh, you mean after updating the result? Yeah, that wouldn't be the Agg change then.

mdboom · 2015-10-27T18:31:19Z

Woot! This is passing. (The Python 2.6 failure is on purpose -- I used OrderedDict without backporting it. Didn't want to expend the effort given that we're killing Python 2.6 in a few days).

To review this PR, I suggest clicking on individual commits except for the one called "Update all test images...". The whole diff here is too large for github to display.

It turns out the random test failures on the same platform/machine were caused by the spines being stored in a dictionary, which made their drawing order non-deterministic. This creates small one-value differences in two pixels in the corner of the axes. Changing this to an ordered dict reduced the random failures from around 68 to 6. The remaining ones with inter-run differences involve boxplot (which I think is a known issue) and axes_grid1. For those, I just turned the tolerance up slightly and filed #5334 to deal with them later. Everything else is now happily running with a tolerance of zero and a direct numpy.array_equal comparison. I think time will tell whether that becomes annoying (as in it catches too many unimportant differences and we end up updating the test images a lot), or whether it gives us more certainty in unit testing.
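
To illustrate how draw order alone can shift a pixel by one value, here is a minimal sketch that is not taken from matplotlib or Agg. The `blend` function, the background value, and the coverage numbers for the two spines are all assumptions chosen for illustration; the real mechanism in the renderer may differ. The point is only that each 8-bit blend step rounds, so compositing the same two shapes in a different order can end up one level apart.

```python
from collections import OrderedDict


def blend(dst, src, coverage):
    """Hypothetical 8-bit 'over' blend with rounding; coverage is 0..255."""
    return (dst * (255 - coverage) + src * coverage + 127) // 255


def draw_corner(spines, order, background=100):
    """Composite each spine onto a single shared corner pixel, in `order`."""
    pixel = background
    for name in order:
        colour, coverage = spines[name]
        pixel = blend(pixel, colour, coverage)
    return pixel


# Made-up (colour, antialiasing coverage) for two spines on one corner pixel.
spines = OrderedDict([("left", (0, 128)), ("bottom", (0, 191))])

# An OrderedDict always iterates in insertion order: left, then bottom.
in_order = draw_corner(spines, list(spines))
# Simulate what an unordered mapping might do on another run.
swapped = draw_corner(spines, reversed(list(spines)))

print(in_order, swapped)
```

With these assumed values the two orders give results one grey level apart, which an exact comparison would flag. Fixing the iteration order makes every run composite the spines the same way.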

I still would advise against merging this until some known-to-change-the-baseline-images PRs are merged first, especially #5301 and #5214. Though #5306 (which just turns on the testing version of freetype without changing the tolerances) can be merged without harm at any time.

WeatherGod · 2015-10-27T18:48:18Z

Very interesting and great work!

Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right? Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype?

Furthermore, do we want to eliminate fuzzy matching completely? Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations? Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs.

mdboom · 2015-10-27T19:03:43Z

> Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right?

Yes. Most end users will continue to build against their system freetype or conda's freetype etc. as they always have. That said, there's no harm in using the special testing freetype if they want to -- but most packagers, esp. Debian, aren't going to do that.

> Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype?

Possibly. But I don't see that as a big issue: I've never seen an issue of that type appear, where the version of freetype was causing some sort of true problem. I've only ever seen small differences in the antialiasing of fonts that cause problems when comparing test images, but otherwise appear fine to a human being. Those are the kind of differences I would not consider a bug, but the natural refinement of the details in freetype over the years.

> Furthermore, do we want to eliminate fuzzy matching completely?

The functionality is still there, and individual tests may still turn it on on an individual basis. (In fact the 6 tests mentioned in #5334 do). The only change is that the default is 0. I think it's possible that we may turn it on on some more tests over time if we learn that exact matching isn't important for them.
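
A rough sketch of how a default tolerance of 0 and a per-test override could coexist. The function name `images_match`, the flat-list image representation, and the RMS metric are assumptions made for illustration; this is not matplotlib's comparison code.

```python
import math


def images_match(expected, actual, tol=0):
    """Hypothetical comparison: exact when tol == 0, RMS-based otherwise.

    Images are flat sequences of 8-bit pixel values.
    """
    if len(expected) != len(actual):
        return False
    if tol == 0:
        # Default: every pixel must be identical.
        return list(expected) == list(actual)
    if not expected:
        # Two empty images trivially match.
        return True
    squared = sum((e - a) ** 2 for e, a in zip(expected, actual))
    rms = math.sqrt(squared / len(expected))
    return rms <= tol


baseline = [255, 255, 13, 0]
rerun = [255, 255, 12, 0]  # one pixel differs by a single level

strict = images_match(baseline, rerun)          # default tolerance of 0
lenient = images_match(baseline, rerun, tol=1)  # opt-in fuzzy matching
print(strict, lenient)
```

Under this sketch, a test that is known to vary slightly between runs would pass a small non-zero `tol`, while everything else keeps the exact default.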

> Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations?

That's an interesting question. At present (thanks to feedback from @jkseppan), they can still run the tests, but many of the tests will fail due to inexact matching. One possibility is to detect the use of a system freetype and turn the default tolerance up to some small but non-zero value. However, my worry there is that we need to communicate that very clearly to all developers, or there will be mismatches when they commit new test images to the repository. I deliberately took a hard line on this because I want to keep the images working well.

We could somehow track this much like one tracks code coverage. I don't know how you do that within nose though without building a bunch of scaffolding on top.

It also might not hurt to do "smoke tests" to just compile with various versions of freetype and make sure there are no API changes that bite us (though freetype is old and super stable wrt API at this point, and it's not like it's a small project that no one much notices anymore).

> Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs.

Yeah -- there have been a number of bugs in the past few weeks that slipped right through the unit tests that we would have found if they were less tolerant. That's my real motivator here.

zblz · 2015-10-28T16:07:24Z

This is great work, will make things much easier! I haven't reviewed the commits, but I found that running python setup.py develop on a clean repo (i.e., before python setup.py build) will fail because it tries to download freetype to the non-existing build directory.

tacaswell · 2015-10-29T00:08:12Z

The spine order is addressed by #4434 as well.

mdboom · 2015-10-29T01:09:48Z

> The spine order is addressed by #4434 as well.

Oh, good to know. The OrderedDict approach is maybe slightly better in that you don't have to remember to sort it in all the places you might iterate over them, but it's a minor enough point. We should be sure to not solve it both ways when this or that PR is merged.

tacaswell · 2015-10-29T01:10:53Z

I would much rather solve this with ordered dicts than by sorting on every draw.

tacaswell · 2015-12-10T16:31:31Z

Sorry for asking lots of annoying picky questions.

Lower test tolerance

mdboom · 2015-12-10T16:57:43Z

\o/ Thanks for merging!

mdboom · 2015-12-10T16:58:45Z

Are you going to backport to 2.0.x, or should I?

tacaswell · 2015-12-10T17:03:55Z

doing it now

mdboom · 2015-12-10T17:05:00Z

Maybe I'm too late with this comment, but it just occurred to me that it might be worth doing a PR of the backport just because of the high potential for breakage here...

tacaswell · 2015-12-10T17:06:07Z

not too late, will do

tacaswell · 2015-12-10T17:06:59Z

It does not back-port cleanly....

mdboom · 2015-12-10T17:08:56Z

Ok -- I can take a crack at it if the conflicts are non-obvious

tacaswell · 2015-12-10T17:20:05Z

I'll leave this to you 😈

Lower test tolerance

Backport #5307 to v2.0.x

QuLogic · 2016-10-16T06:28:14Z

Backported via #5652.

mdboom added the status: needs review label Oct 23, 2015

mdboom force-pushed the lower-tolerance branch 3 times, most recently from 1082685 to 94694f6 Compare October 23, 2015 19:00

mdboom mentioned this pull request Oct 26, 2015

Use a specific version of Freetype for testing #5306

Merged

mdboom force-pushed the lower-tolerance branch from 94694f6 to ab884d4 Compare October 27, 2015 12:19

zblz mentioned this pull request Oct 27, 2015

BUG: Dot should not be spaced when used as a decimal separator #5301

Merged

mdboom force-pushed the lower-tolerance branch from ab884d4 to a4a1c7d Compare October 27, 2015 12:45

zblz mentioned this pull request Oct 27, 2015

Use DejaVu fonts as default for text and mathtext #5214

Merged

mdboom force-pushed the lower-tolerance branch 4 times, most recently from 761fb04 to 7f7578c Compare October 27, 2015 16:02

mdboom mentioned this pull request Oct 27, 2015

Some tests give different results from run-to-run on the same machine #5334

Closed

6 tasks

mdboom force-pushed the lower-tolerance branch from ef54896 to 884aed3 Compare October 27, 2015 18:35

mdboom force-pushed the lower-tolerance branch 3 times, most recently from 36b1ac2 to 700b725 Compare October 28, 2015 21:44

mdboom force-pushed the lower-tolerance branch from 700b725 to b16857c Compare October 29, 2015 14:26

mdboom added 5 commits December 9, 2015 21:18

Don't disable miter limit in SVG

a678b61

Format floats in a Python2/3 compatible way

ee14544

Use np.round for consistent behavior across py2/3

95fda79

Increase tolerance to account for Py2/3 issue

912a34c

Update test images

727aacb

mdboom force-pushed the lower-tolerance branch from 23eacc4 to 727aacb Compare December 10, 2015 02:20

tacaswell added a commit that referenced this pull request Dec 10, 2015

Merge pull request #5307 from mdboom/lower-tolerance

61f0eea

Lower test tolerance

tacaswell merged commit 61f0eea into matplotlib:master Dec 10, 2015

tacaswell removed the status: needs review label Dec 10, 2015

mdboom mentioned this pull request Dec 10, 2015

Shorter svg files #5651

Merged

mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 10, 2015

Merge pull request matplotlib#5307 from mdboom/lower-tolerance

61cc364

Lower test tolerance

mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015

Merge pull request matplotlib#5307 from mdboom/lower-tolerance

65dc69a

Lower test tolerance

mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015

Merge pull request matplotlib#5307 from mdboom/lower-tolerance

57c33f0

Lower test tolerance

mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015

Merge pull request matplotlib#5307 from mdboom/lower-tolerance

a977b8f

Lower test tolerance

mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 13, 2015

Merge pull request matplotlib#5307 from mdboom/lower-tolerance

6583065

Lower test tolerance

mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 13, 2015

Merge pull request matplotlib#5307 from mdboom/lower-tolerance

164222a

Lower test tolerance

tacaswell added a commit that referenced this pull request Dec 13, 2015

Merge pull request #5652 from mdboom/backport-lower-tolerance-2.0

17a3a6d

Backport #5307 to v2.0.x

mdboom mentioned this pull request Dec 14, 2015

Deterministic svg #5671

Merged

WeatherGod mentioned this pull request Jan 25, 2016

new image comparison unit tests matplotlib/basemap#253

Closed

QuLogic added the topic: testing label Sep 28, 2016
