Skip rather than xfail tests in workflows #90

Closed
honno opened this issue Feb 2, 2022 · 8 comments · Fixed by #92

Comments

@honno
Member

honno commented Feb 2, 2022

Currently our NumPy workflow has a flaky test in test_creation_functions.py::test_linspace, which means we'll sometimes get an XFAIL and sometimes an XPASS. The random-ish nature of Hypothesis means this is a real possibility for many of the bugs the test suite identifies. Therefore I think that instead of xfailing tests for a workflow, we should skip them completely. We could also mix-and-match, but I think it's best to set a precedent that's simple for outsiders to understand and use.
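For illustration, a minimal sketch of why this happens (a made-up property, not the real test_linspace): the assertion only fails for a tiny slice of the generated inputs, so whether a given run reports XFAIL or XPASS comes down to which examples Hypothesis happens to draw.

```python
import pytest
from hypothesis import given, strategies as st

# Hypothetical stand-in for a flaky xfail: the property fails only for a tiny
# fraction of the input space (multiples of 997), so some runs find a
# counterexample (XFAIL) and others don't (XPASS), purely depending on which
# examples Hypothesis generates.
@pytest.mark.xfail(reason="known bug for a narrow range of inputs", strict=False)
@given(st.integers(min_value=1, max_value=100_000))
def test_flaky_property(n):
    assert n % 997 != 0
```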

@honno honno changed the title Skip rather than xfail tests in workflo Skip rather than xfail tests in workflows Feb 2, 2022
@honno honno closed this as completed in #92 Feb 3, 2022
@asmeurer
Member

asmeurer commented Feb 24, 2022

Maybe implementers like @tomwhite could comment here, but my impression is that XFAILs are more useful than outright skips because they can alert you that the test should be removed from the file when the feature in question is implemented/fixed.

I agree that false positives are a problem, so you might not necessarily want to use the pytest flag that fails the tests if there is an xpass. I also wonder if it's possible to automatically increase the number of max iterations for XFAIL so that if it does xpass, it will do so in a less flaky way.
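The flag referred to here is presumably pytest's xfail_strict ini option (or strict=True on the mark itself), which turns an XPASS into a hard failure. A minimal sketch of the per-mark form:

```python
import pytest

# strict=True makes an unexpected pass (XPASS) fail the run; the same behaviour
# can be enabled suite-wide with `xfail_strict = true` in pytest.ini. For flaky
# Hypothesis-driven tests you would likely leave strict at its default of False.
@pytest.mark.xfail(reason="feature not implemented yet", strict=True)
def test_not_yet_implemented():
    raise NotImplementedError
```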

@asmeurer asmeurer reopened this Feb 24, 2022
@asmeurer
Member

At best we can have both skips.txt and xfails.txt. Personally I think we should be xfailing rather than skipping the NumPy tests we run on our CI, for the reason I mentioned above.

@honno
Member Author

honno commented Feb 25, 2022

Yeah, the utility of xfails is something I wouldn't want to lose, but this seems like a hard limitation with Hypothesis. I had actually written this issue after seeing Hypothesis' own CI skip flaky tests rather than xfail them.

At best we can have both skips.txt and xfails.txt.

I think the burden of distinguishing between a flaky test and a test that will consistently fail is too great for folks not well-versed in how Hypothesis works, like some of the array library developers I've seen use the suite (even then it's pretty hard). Splitting skips and xfails also makes CI adoption more confusing/cumbersome.

I also wonder if it's possible to automatically increase the number of max iterations for XFAIL so that if it does xpass, it will do so in a less flaky way.

I imagine it would be way too much effort for something that is still not that stable.

@asmeurer
Member

But the flakiness is only a concern if you actually configure your tests to fail when there are XPASSes. Otherwise, the XPASSes just show up in the log. They might differ occasionally from run to run, but that's not a huge deal. The point is that you can notice this if you pay attention to the build, or if you run the tests locally. With skips, there's no way to notice it at all.

The only way I know how to avoid flakiness with Hypothesis is to use @example so that you ensure certain types of input are always picked. I'm not sure if it's worth us doing that. There's also the flaky pytest extension, which I've used successfully in other projects (not related to Hypothesis), but that just reruns the test multiple times and would be equivalent to just increasing the number of examples.
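A minimal sketch of the @example approach, with a made-up property: the pinned input runs on every invocation in addition to the randomly generated examples, so the known failure is hit every run rather than only when Hypothesis happens to generate it.

```python
import pytest
from hypothesis import example, given, strategies as st

# Hypothetical property with one known-bad input: @example(997) guarantees that
# input is always tried, so the xfail no longer depends on random generation.
@pytest.mark.xfail(reason="known failure for n == 997", strict=False)
@given(st.integers(min_value=1, max_value=100_000))
@example(997)
def test_no_special_cases(n):
    assert n % 997 != 0
```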

I imagine it would be way too much effort for something that is still not that stable.

It depends on how easy pytest would make this, but what I'm imagining isn't too complicated. We just configure the Hypothesis settings for each XFAIL test to have a higher max_examples.

My understanding is that hypothesis won't actually run that many examples if they all are failing (is that correct?), so it would only actually run that many when the test is flaky or when it xpasses.
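A sketch of the per-test version of that idea; max_examples is the real Hypothesis setting (default 100), though the number here is arbitrary. Doing this automatically for every xfailed test would need something like a conftest.py hook, which is where the pytest-dependent effort would come in.

```python
import pytest
from hypothesis import given, settings, strategies as st

# A larger example budget makes it more likely that a rare failing input is
# found on any given run, at the cost of runtime.
@pytest.mark.xfail(reason="fails for a narrow range of inputs", strict=False)
@settings(max_examples=1000)
@given(st.integers(min_value=1, max_value=100_000))
def test_flaky_property_with_bigger_budget(n):
    assert n % 997 != 0
```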

@tomwhite
Contributor

Semantically, xfail is a better fit for features that are not expected to pass yet (either because they are not implemented or because an edge case is causing a problem). Having them show as xpass when things change would be useful, even if that doesn't fail the test. Seeing a random xpass due to Hypothesis randomness isn't necessarily a problem.

So I would be fine going back to using xfail.

@honno
Member Author

honno commented Feb 28, 2022

The only way I know how to avoid flakiness with Hypothesis is to use @example so that you ensure certain types of input are always picked. I'm not sure if it's worth us doing that. There's also the flaky pytest extension, which I've used successfully in other projects (not related to Hypothesis), but that just reruns the test multiple times and would be equivalent to just increasing the number of examples.

You're probably better at this stuff than I am, but yeah it seems very tricky.

I imagine it would be way too much effort for something [triggering failures by increasing examples] that is still not that stable.

It depends on how easy pytest would make this, but what I'm imagining isn't too complicated. We just configure the Hypothesis settings for each XFAIL test to have a higher max_examples.

Ah, if it were just "if this test has been marked as xfail, increase examples", then yeah, I can see that, but it still won't be consistent. Obviously a marginally flaky test (i.e. one that fails roughly 1 in 10 runs) would require a larger max_examples than one that fails 1 in 2 runs, so there would be some awkward balancing between performance concerns and getting the flaky test to trigger.

My understanding is that hypothesis won't actually run that many examples if they all are failing (is that correct?), so it would only actually run that many when the test is flaky or when it xpasses.

I believe it's correct to say that an already-failing example won't be run many times, just enough to check for flakiness in the code being tested. But this isn't that useful to us, as whether Hypothesis generates a failing example at all is much more likely to be the cause of flakiness in CI.


Semantically, xfail is a better fit for features that are not expected to pass yet (either because they are not implemented or because an edge case is causing a problem). Having them show as xpass when things change would be useful, even if that doesn't fail the test. Seeing a random xpass due to Hypothesis randomness isn't necessarily a problem.

But the flakiness is only a concern if you actually configure your tests to fail when there are XPASSes. Otherwise, the XPASSes just show up in the log. They might differ occasionally from run to run, but that's not a huge deal. The point is that you can notice this if you pay attention to the build, or if you run the tests locally. With skips, there's no way to notice it at all.

I have no idea how popular fail-on-XPASS is. Personally I would certainly always enable XPASS failures for a mature-enough project's CI. But even if we polled all the current array library implementers and found that none of them would ever do this, I think it's still good to semantically heavily discourage flaky xfails.

xfail seems more trouble than it's worth for a flaky test. You xfail so you get XPASSes, but if those XPASSes are not consistent then all it tells you is that Hypothesis is sometimes not generating a failing example. And if you're not familiar with Hypothesis, which is the case for many of our existing users, then we're just unnecessarily frustrating you by making you learn all this stuff.

(Typically you might xfail a part of a test's logic if you want to test other aspects of your code, but that doesn't apply for an unmodified vendored test suite.)

Even I still have to re-familiarise myself with the relationship between Hypothesis and flakiness when I accidentally xfail a test I could have just skipped and written an issue/TODO for. I really like the idea of removing from users the burden of having to distinguish which tests should be xfailed and which should be skipped, by just supporting (and documenting) a single skips.txt file.

We're a compliance suite after all and shouldn't worry too much about library-specific issues, which are better dealt with elsewhere, like in the compatibility tracker issues we're seeing.

@asmeurer
Member

asmeurer commented Mar 1, 2022

I can't speak for other library authors, but for our own NumPy tests, I'd rather be using xfails, so that we're actually alerted when tests should be enabled.

Probably the best solution here is to allow both skips.txt and xfails.txt and document the caveats of both.
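A minimal sketch of how that could look in conftest.py, assuming one test ID per line in each file; the file names, format, and matching rule are illustrative rather than the suite's actual implementation.

```python
# conftest.py -- illustrative sketch; skips.txt/xfails.txt format is assumed
from pathlib import Path

import pytest


def _read_ids(path):
    """Return non-empty, non-comment lines of a file, or an empty set if absent."""
    p = Path(path)
    if not p.exists():
        return set()
    ids = set()
    for line in p.read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            ids.add(line)
    return ids


SKIP_IDS = _read_ids("skips.txt")
XFAIL_IDS = _read_ids("xfails.txt")


def pytest_collection_modifyitems(config, items):
    # Mark collected tests whose node ID matches an entry in either file.
    for item in items:
        if any(test_id in item.nodeid for test_id in SKIP_IDS):
            item.add_marker(pytest.mark.skip(reason="listed in skips.txt"))
        elif any(test_id in item.nodeid for test_id in XFAIL_IDS):
            item.add_marker(pytest.mark.xfail(reason="listed in xfails.txt", strict=False))
```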

@tomwhite
Contributor

tomwhite commented Mar 2, 2022

Probably the best solution here is to allow both skips.txt and xfails.txt and document the caveats of both.

+1 I think this would be a good compromise.
