-
-
Notifications
You must be signed in to change notification settings - Fork 7.9k
Lower test tolerance #5307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Lower test tolerance #5307
Conversation
1082685
to
94694f6
Compare
In some ways this experiment was a success. The failures are not due to text differences, but due to a bunch of other differences. Not clear whether it's a Mac vs. Linux issue, just general non-reproducibility or what... |
I have been going through updating Cartopy tests so I've probably seen many of these changes. Probably most of them will be due to a small change in the antialiasing at the corner of plots, and possibly the upgrade to a newer Agg in 2a17839. |
I don't think the Agg change is the culprit. There's a certain group of tests that just generate slightly different results from run to run on the same machine/platform/environment. Some are obvious (like the random number generator not being reset in the xkcd test), but some are still an enigma. |
Oh, you mean after updating the result? Yea, that wouldn't be the Agg change then. |
94694f6
to
ab884d4
Compare
ab884d4
to
a4a1c7d
Compare
761fb04
to
7f7578c
Compare
Woot! This is passing. (The Python 2.6 failure is on purpose -- I used To review this PR, I suggest clicking on individual commits except for the one called "Update all test images...". The whole diff here is too large for github to display. It turns out the source of the random test failures on the same platform/machine were due to the spines being stored in a dictionary, and thus their drawing order being non-deterministic. This creates small one-value differences in two pixels in the corner of the axes. Changing this to an ordered dict resolved this from around 68 random failures to 6. The remaining ones with inter-run differences involve boxplot (which I think is a known issue) and axes_grid1. For those, I just turned the tolerance up slightly and filed #5334 to deal with them later. Everything else is now happily running with a tolerance of zero and a direct I still would advise against merging this until some known-to-change-the-baseline-images PRs are merged first, especially #5301 and #5214. Though #5306 (which just turns on the testing version of freetype without changing the tolerances) can be merged without harm at any time. |
ef54896
to
884aed3
Compare
Very interesting and great work! Question/Devil's Advocate: If I understand this correctly, there is now a disconnect between the images produced in the testing suite and images produced via normal means (modulo any default style differences). The freetype version is only enforced during the unit tests, right? Wouldn't that make it difficult to track down bugs that are reported by users that have roots in different versions of freetype? Furthermore, do we want to eliminate fuzzy matching completely? Would it make sense to keep it and make it possible for users to run the test suite with their system's freetype? Maybe even record how much of a difference occurs with different system configurations? Again, this is all devil's advocate, and I am really pleased that we can tighten down these knobs. |
Yes. Most end users will continue to build against their system freetype or conda's freetype etc. as they always have. That said there's no harm in using the special testing freetype if they want to -- but most packagers, esp. Debian, aren't going to do that.
Possibly. But I don't see that as a big issue: I've never seen an issue of that type appear, where the version of freetype was causing some sort of true problem. I've only ever seen small differences in the antialiasing of fonts that cause problems when comparing test images, but otherwise appear fine to a human being. Those are the kind of differences I would not consider a bug, but the natural refinement of the details in freetype over the years.
The functionality is still there, and individual tests may still turn it on on an individual basis. (In fact the 6 tests mentioned in #5334 do). The only change is that the default is 0. I think it's possible that we may turn it on on some more tests over time if we learn that exact matching isn't important for them.
That's an interesting question. At present (thanks to feedback from @jkseppan), they can still run the tests, but many of the tests will fail due to inexact matching. One possibility is to detect the use of a system freetype and turn the default tolerance up to some small but non-zero value. However, my worry there is that we need to communicate that very clearly to all developers, or there will be mismatches when they commit new test images to the repository. I deliberately took a hard line on this because I want to keep the images working well. We could somehow track this much like one tracks code coverage. I don't know how you do that within nose though without building a bunch of scaffolding on top. It also might not hurt to do "smoke tests" to just compile with various versions of freetype and make sure there's no API changes that bite us (though freetype is old and super stable wrt API at this point, and it's not like it's a small project that no one much notices anymore).
Yeah -- there have been a number of bugs in the past few weeks that slipped right through the unit tests that we would have found if they were less tolerant. That's my real motivator here. |
This is great work, will make things much easier! I haven't reviewed the commits, but I found that running |
36b1ac2
to
700b725
Compare
The spine order is addressed by #4434 as well. |
Oh, good to know. The OrderedDict approach is maybe slightly better in that you don't have to remember to sort it in all the places you might iterate over them, but it's a minor enough point. We should be sure to not solve it both ways when this or that PR is merged. |
I would much rather solve this with ordered dicts than by sorting on every draw. |
700b725
to
b16857c
Compare
23eacc4
to
727aacb
Compare
Sorry for asking lots of annoying picky questions. |
\o/ Thanks for merging! |
Are you going to backport to 2.0.x, or should I? |
doing it now |
Maybe I'm too late with this comment, but it just occurred to me that it might be worth doing a PR of the backport just because of the high potential for breakage here... |
not too late ,will do |
It does not back-port cleanly.... |
Ok -- I can take a crack at it if the conflicts are non-obvious |
I'll leave this to you 😈 |
Lower test tolerance
Lower test tolerance
Lower test tolerance
Lower test tolerance
Lower test tolerance
Lower test tolerance
Backported via #5652. |
This is a test to see if baseline images generated on a Mac with the new local build of freetype functionality will work on Travis with 0 tolerance. Fingers crossed.