Skip to content

Lower test tolerance #5307

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 10, 2015
Merged

Lower test tolerance #5307

merged 16 commits into from
Dec 10, 2015

Conversation

mdboom
Copy link
Member

@mdboom mdboom commented Oct 23, 2015

This is a test to see if baseline images generated on a Mac with the new local build of freetype functionality will work on Travis with 0 tolerance. Fingers crossed.

@mdboom mdboom force-pushed the lower-tolerance branch 3 times, most recently from 1082685 to 94694f6 Compare October 23, 2015 19:00
@mdboom
Copy link
Member Author

mdboom commented Oct 26, 2015

In some ways this experiment was a success. The failures are not due to text differences, but due to a bunch of other differences. Not clear whether it's a Mac vs. Linux issue, just general non-reproducibility or what...

@QuLogic
Copy link
Member

QuLogic commented Oct 26, 2015

I have been going through updating Cartopy tests so I've probably seen many of these changes. Probably most of them will be due to a small change in the antialiasing at the corner of plots, and possibly the upgrade to a newer Agg in 2a17839.

@mdboom
Copy link
Member Author

mdboom commented Oct 26, 2015

I don't think the Agg change is the culprit. There's a certain group of tests that just generate slightly different results from run to run on the same machine/platform/environment. Some are obvious (like the random number generator not being reset in the xkcd test), but some are still an enigma.

@QuLogic
Copy link
Member

QuLogic commented Oct 26, 2015

Oh, you mean after updating the result? Yea, that wouldn't be the Agg change then.

@mdboom
Copy link
Member Author

mdboom commented Oct 27, 2015

Woot! This is passing. (The Python 2.6 failure is on purpose -- I used OrderedDict without backporting it. Didn't want to expend the effort given that we're killing Python 2.6 in a few days).

To review this PR, I suggest clicking on individual commits except for the one called "Update all test images...". The whole diff here is too large for github to display.

It turns out the source of the random test failures on the same platform/machine were due to the spines being stored in a dictionary, and thus their drawing order being non-deterministic. This creates small one-value differences in two pixels in the corner of the axes. Changing this to an ordered dict resolved this from around 68 random failures to 6. The remaining ones with inter-run differences involve boxplot (which I think is a known issue) and axes_grid1. For those, I just turned the tolerance up slightly and filed #5334 to deal with them later. Everything else is now happily running with a tolerance of zero and a direct numpy.array_equal comparison. I think time will tell whether that becomes annoying (as in it catches too many non-important differences and we end up updating the test images a lot), or it gives us more certainty in unit testing.

I still would advise against merging this until some known-to-change-the-baseline-images PRs are merged first, especially #5301 and #5214. Though #5306 (which just turns on the testing version of freetype without changing the tolerances) can be merged without harm at any time.

@WeatherGod
Copy link
Member

Very interesting and great work!

Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right? Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype?

Furthermore, do we want to eliminate fuzzy matching completely? Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations? Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs.

@mdboom
Copy link
Member Author

mdboom commented Oct 27, 2015

Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right?

Yes. Most end users will continue to build against their system freetype or conda's freetype etc. as they always have. That said there's no harm in using the special testing freetype if they want to -- but most packagers, esp. Debian, aren't going to do that.

Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype?

Possibly. But I don't see that as a big issue: I've never seen an issue of that type appear, where the version of freetype was causing some sort of true problem. I've only ever seen small differences in the antialiasing of fonts that cause problems when comparing test images, but otherwise appear fine to a human being. Those are the kind of differences I would not consider a bug, but the natural refinement of the details in freetype over the years.

Furthermore, do we want to eliminate fuzzy matching completely?

The functionality is still there, and individual tests may still turn it on on an individual basis. (In fact the 6 tests mentioned in #5334 do). The only change is that the default is 0. I think it's possible that we may turn it on on some more tests over time if we learn that exact matching isn't important for them.

Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations?

That's an interesting question. At present (thanks to feedback from @jkseppan), they can still run the tests, but many of the tests will fail due to inexact matching. One possibility is to detect the use of a system freetype and turn the default tolerance up to some small but non-zero value. However, my worry there is that we need to communicate that very clearly to all developers, or there will be mismatches when they commit new test images to the repository. I deliberately took a hard line on this because I want to keep the images working well.

We could somehow track this much like one tracks code coverage. I don't know how you do that within nose though without building a bunch of scaffolding on top.

It also might not hurt to do "smoke tests" to just compile with various versions of freetype and make sure there's no API changes that bite us (though freetype is old and super stable wrt API at this point, and it's not like it's a small project that no one much notices anymore).

Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs.

Yeah -- there have been a number of bugs in the past few weeks that slipped right through the unit tests that we would have found if they were less tolerant. That's my real motivator here.

@zblz
Copy link
Member

zblz commented Oct 28, 2015

This is great work, will make things much easier! I haven't reviewed the commits, but I found that running python setup.py develop on a clean repo (i.e., before python setpu.py build) will fail because it tries to download freetype to the non-existing build directory.

@mdboom mdboom force-pushed the lower-tolerance branch 3 times, most recently from 36b1ac2 to 700b725 Compare October 28, 2015 21:44
@tacaswell
Copy link
Member

The spine order is addressed by #4434 as well.

@mdboom
Copy link
Member Author

mdboom commented Oct 29, 2015

The spine order is addressed by #4434 as well.

Oh, good to know. The OrderedDict approach is maybe slightly better in that you don't have to remember to sort it in all the places you might iterate over them, but it's a minor enough point. We should be sure to not solve it both ways when this or that PR is merged.

@tacaswell
Copy link
Member

I would much rather solve this with ordered dicts than by sorting on every draw.

@tacaswell
Copy link
Member

Sorry for asking lots of annoying picky questions.

tacaswell added a commit that referenced this pull request Dec 10, 2015
@tacaswell tacaswell merged commit 61f0eea into matplotlib:master Dec 10, 2015
@mdboom
Copy link
Member Author

mdboom commented Dec 10, 2015

\o/ Thanks for merging!

@mdboom
Copy link
Member Author

mdboom commented Dec 10, 2015

Are you going to backport to 2.0.x, or should I?

@tacaswell
Copy link
Member

doing it now

@mdboom
Copy link
Member Author

mdboom commented Dec 10, 2015

Maybe I'm too late with this comment, but it just occurred to me that it might be worth doing a PR of the backport just because of the high potential for breakage here...

@tacaswell
Copy link
Member

not too late ,will do

@tacaswell
Copy link
Member

It does not back-port cleanly....

@mdboom
Copy link
Member Author

mdboom commented Dec 10, 2015

Ok -- I can take a crack at it if the conflicts are non-obvious

@tacaswell
Copy link
Member

I'll leave this to you 😈

@mdboom mdboom mentioned this pull request Dec 10, 2015
mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 10, 2015
mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015
mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015
mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 11, 2015
mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 13, 2015
mdboom pushed a commit to mdboom/matplotlib that referenced this pull request Dec 13, 2015
tacaswell added a commit that referenced this pull request Dec 13, 2015
@mdboom mdboom mentioned this pull request Dec 14, 2015
@QuLogic
Copy link
Member

QuLogic commented Oct 16, 2016

Backported via #5652.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants