Skip rather than xfail tests in workflows #90
Maybe implementers like @tomwhite could comment here, but my impression is that XFAILs are more useful than outright skips because they can alert you that the test should be removed from the file when the feature in question is implemented/fixed. I agree that false positives are a problem, so you might not necessarily want to use the pytest flag that fails the tests if there is an xpass. I also wonder if it's possible to automatically increase the number of max iterations for XFAIL so that if it does xpass, it will do so in a less flaky way.
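For reference, the "pytest flag" mentioned above is presumably strict xfail, which can be enabled per test with the `strict=True` marker argument or suite-wide with the `xfail_strict` ini option. A minimal sketch (the test name and reason below are made up for illustration):

```python
import pytest

# With strict=True, an unexpected pass (XPASS) is reported as a test failure
# instead of being silently noted in the log; without it, XPASS is informational only.
@pytest.mark.xfail(reason="hypothetical known bug", strict=True)
def test_feature_with_known_bug():
    assert 1 + 1 == 2  # if the "bug" is fixed and this passes, strict xfail fails the run
```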
At best we can have both skips.txt and xfails.txt. Personally I think we should be xfailing rather than skipping the NumPy tests we run on our CI, for the reason I mentioned above.
Yeah, the utility of xfails is something I wouldn't want to lose, but this seems a hard limitation with Hypothesis. I had actually written this issue after I saw Hypothesis's own CI having flaky tests skipped instead of xfailed.
I think the burden of distinguishing between a flaky test and a test that will fail outright is too great for folks not well-versed in how Hypothesis works, like some of the array library developers I've seen use the suite (and even then it's pretty hard). Splitting skips and xfails also makes CI adoption more confusing and cumbersome.
I imagine it would be way too much effort for something that is still not that stable.
But the flakiness is only a concern if you actually configure your tests to fail when there are XPASSes. Otherwise, it just shows the XPASSes in the log. They might differ occasionally from run to run, but that's not a huge deal. The point is that you can notice this if you pay attention to the build, or if you run the tests locally. With skips, there's no way to notice it at all. The only way I know how to avoid flakiness with Hypothesis is to use
It depends on how easy pytest would make this, but what I'm imagining isn't too complicated. We just configure the Hypothesis settings for each XFAIL test to have a higher max-examples. My understanding is that Hypothesis won't actually run that many examples if they're all failing (is that correct?), so it would only actually run that many when the test is flaky or when it xpasses.
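As a rough illustration of that idea (not how the suite is currently set up; the test and numbers here are made up): Hypothesis's `@settings(max_examples=...)` can be stacked with pytest's xfail marker, raising the example count only for the tests expected to fail.

```python
import pytest
from hypothesis import given, settings, strategies as st

@pytest.mark.xfail(reason="hypothetical known bug", strict=False)
@settings(max_examples=1000)  # default is 100; more examples make the XFAIL less likely to flake into an XPASS
@given(st.integers())
def test_known_buggy_feature(x):
    # Toy stand-in for a real suite test: the "bug" only triggers on a narrow
    # slice of inputs, which is exactly what makes such an xfail flaky.
    assert x % 1000 != 0
```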
Semantically, So I would be fine going back to using
You're probably better at this stuff than I am, but yeah it seems very tricky.
Ah, if it was just "if this test has been marked as xfail, increase examples", then yeah, I can see that, but it still won't be consistent. Obviously a test that is marginally flaky (i.e. fails in roughly 1 in 10 runs) would require larger max examples than a fail-1-in-2-runs flaky test, so there would be some awkward balancing between performance concerns and getting the flaky test to trigger.
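A back-of-envelope way to see that balancing act (my own numbers and failure model, not from the discussion): if each generated example independently trips the bug with probability p, a run of N examples hits at least one failure with probability 1 - (1 - p)^N, so N has to grow roughly in inverse proportion to p.

```python
import math

def examples_needed(p, target=0.9):
    """Examples needed so a run hits at least one failing example with probability
    `target`, assuming each example trips the bug independently with probability p."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(examples_needed(0.01))   # ~230 examples when ~1% of examples fail
print(examples_needed(0.001))  # ~2302 examples when ~0.1% of examples fail
```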
I believe it's correct to say that an already-failing example won't be run many times, just enough to check for flakiness in the code being tested. But this isn't that useful to us, as whether a distinct failing example gets generated at all is much more likely to be what causes flakiness in CI.
I have no idea how popular fail-on-xpass is. Personally I would certainly always enable xpass failures for a mature-enough project's CI. But even if we polled all current array implementers on whether they would ever do this and the answer was no, I think it's good semantically to heavily discourage flaky xfails.

xfail seems more trouble than it's worth for a flaky test. You xfail so you get xpasses, but if those xpasses are not consistent then all it tells you is that Hypothesis is sometimes not generating failing examples. And if you're not familiar with Hypothesis, which is the case for many of our existing users, then we're just unnecessarily frustrating you by making you learn all this stuff. (Typically you might xfail a part of a test's logic if you want to test other aspects of your code, but that doesn't apply to an unmodified vendored test suite.) Even I still have to re-familiarise myself with the relationship between Hypothesis and flakiness when I accidentally xfail a test I could have just skipped and written an issue/TODO for.

I really like the idea of removing from users the burden of having to distinguish which tests should be xfailed and which should be skipped, by just supporting (and documenting) a single skips file. We're a compliance suite after all, and shouldn't worry too much about library-specific issues, which are better dealt with elsewhere, like in the compatibility tracker issues we are seeing.
I can't speak for other library authors, but for our own NumPy tests, I'd rather be using xfails, so that we are actually alerted when tests should be enabled. Probably the best solution here is to allow both skips.txt and xfails.txt.
+1 I think this would be a good compromise.
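A minimal sketch of how both files could be honoured from a conftest.py hook (the file names come from this thread; the hook-based approach and the helper function are assumptions for illustration, not necessarily how the suite implements it):

```python
# conftest.py
import os
import pytest

def _read_ids(path):
    """Return the test IDs listed in `path`, ignoring blank lines and # comments."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip() and not line.startswith("#")}

SKIP_IDS = _read_ids("skips.txt")
XFAIL_IDS = _read_ids("xfails.txt")

def pytest_collection_modifyitems(config, items):
    # Attach skip/xfail marks to any collected test whose node ID appears in the files.
    for item in items:
        if item.nodeid in SKIP_IDS:
            item.add_marker(pytest.mark.skip(reason="listed in skips.txt"))
        elif item.nodeid in XFAIL_IDS:
            item.add_marker(pytest.mark.xfail(reason="listed in xfails.txt"))
```

Matching on full node IDs keeps the files copy-pasteable from pytest output; prefix or glob matching would be an easy extension if whole test files need to be listed.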
Currently our NumPy workflow has a flaky test in test_creation_functions.py::test_linspace, which means sometimes we'll get an XFAIL and sometimes an XPASS. The random-ish nature of Hypothesis means this is often a possibility for bugs the test suite identifies. Therefore I think instead of xfailing tests for a workflow, we should skip them completely. We could also mix-and-match, but I think it's best to set a precedent that's simple for outsiders to understand and use.