Skip rather than xfail tests in workflows #90
Maybe implementers like @tomwhite could comment here, but my impression is that XFAILs are more useful than outright skips because they can alert you that the test should be removed from the file when the feature in question is implemented/fixed. I agree that false positives are a problem, so you might not necessarily want to use the pytest flag that fails the tests if there is an xpass. I also wonder if it's possible to automatically increase the number of max iterations for XFAIL so that if it does xpass, it will do so in a less flaky way.
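For reference, the "pytest flag" mentioned above is presumably strict xfail, which can be enabled per test with the `strict=True` marker argument or suite-wide with the `xfail_strict` ini option. A minimal sketch (the test name and reason below are made up for illustration):

```python
import pytest

# With strict=True, an unexpected pass (XPASS) is reported as a test failure
# instead of being silently noted in the log; without it, XPASS is informational only.
@pytest.mark.xfail(reason="hypothetical known bug", strict=True)
def test_feature_with_known_bug():
    assert 1 + 1 == 2  # if the "bug" is fixed and this passes, strict xfail fails the run
```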
At best we can have both skips.txt and xfails.txt. Personally I think we should be xfailing rather than skipping the NumPy tests we run on our CI, for the reason I mentioned above.
Yeah, the utility of xfails is something I wouldn't want to lose, but this seems a hard limitation with Hypothesis. I had actually written this issue after I saw Hypothesis's own CI having flaky tests skipped instead of xfailed.
I think the burden of distinguishing between a flaky test and a test that will fail outright is too great for folks not well-versed in how Hypothesis works, like some of the array library developers I've seen use the suite (and even then it's pretty hard). Splitting skips and xfails also makes CI adoption more confusing and cumbersome.
I imagine it would be way too much effort for something that is still not that stable.
But the flakiness is only a concern if you actually configure your tests to fail when there are XPASSes. Otherwise, it just shows the XPASSes in the log. They might differ occasionally from run to run, but that's not a huge deal. The point is that you can notice this if you pay attention to the build, or if you run the tests locally. With skips, there's no way to notice it at all. The only way I know how to avoid flakiness with Hypothesis is to use
It depends on how easy pytest would make this, but what I'm imagining isn't too complicated. We just configure the Hypothesis settings for each XFAIL test to have a higher max-examples. My understanding is that Hypothesis won't actually run that many examples if they're all failing (is that correct?), so it would only actually run that many when the test is flaky or when it xpasses.
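As a rough illustration of that idea (not how the suite is currently set up; the test and numbers here are made up): Hypothesis's `@settings(max_examples=...)` can be stacked with pytest's xfail marker, raising the example count only for the tests expected to fail.

```python
import pytest
from hypothesis import given, settings, strategies as st

@pytest.mark.xfail(reason="hypothetical known bug", strict=False)
@settings(max_examples=1000)  # default is 100; more examples make the XFAIL less likely to flake into an XPASS
@given(st.integers())
def test_known_buggy_feature(x):
    # Toy stand-in for a real suite test: the "bug" only triggers on a narrow
    # slice of inputs, which is exactly what makes such an xfail flaky.
    assert x % 1000 != 0
```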
Semantically, So I would be fine going back to using
You're probably better at this stuff than I am, but yeah it seems very tricky.
Ah, if it was just "if this test has been marked as xfail, increase examples", then yeah, I can see that, but it still won't be consistent. Obviously a test that is marginally flaky (i.e. fails in roughly 1 in 10 runs) would require larger max examples than a fail-1-in-2-runs flaky test, so there would be some awkward balancing between performance concerns and getting the flaky test to trigger.
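A back-of-envelope way to see that balancing act (my own numbers and failure model, not from the discussion): if each generated example independently trips the bug with probability p, a run of N examples hits at least one failure with probability 1 - (1 - p)^N, so N has to grow roughly in inverse proportion to p.

```python
import math

def examples_needed(p, target=0.9):
    """Examples needed so a run hits at least one failing example with probability
    `target`, assuming each example trips the bug independently with probability p."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))

print(examples_needed(0.01))   # ~230 examples when ~1% of examples fail
print(examples_needed(0.001))  # ~2302 examples when ~0.1% of examples fail
```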
I believe it's correct to say that an already-failing example won't be run many times, just enough to check for flakiness in the code being tested. But this isn't that useful to us, as whether a distinct failing example gets generated at all is much more likely to be what causes flakiness in CI.
I have no idea how popular fail-on-xpass is. Personally I would certainly always enable xpass failures for a mature-enough project's CI. But even if we polled all current array implementers on whether they would ever do this and the answer was no, I think it's good semantically to heavily discourage flaky xfails.

xfail seems more trouble than it's worth for a flaky test. You xfail so you get xpasses, but if those xpasses are not consistent then all it tells you is that Hypothesis is sometimes not generating failing examples. And if you're not familiar with Hypothesis, which is the case for many of our existing users, then we're just unnecessarily frustrating you by making you learn all this stuff. (Typically you might xfail a part of a test's logic if you want to test other aspects of your code, but that doesn't apply to an unmodified vendored test suite.) Even I still have to re-familiarise myself with the relationship between Hypothesis and flakiness when I accidentally xfail a test I could have just skipped and written an issue/TODO for.

I really like the idea of removing from users the burden of having to distinguish which tests should be xfailed and which should be skipped, by just supporting (and documenting) a single skips file. We're a compliance suite after all, and shouldn't worry too much about library-specific issues, which are better dealt with elsewhere, like in the compatibility tracker issues we are seeing.
I can't speak for other library authors, but for our own NumPy tests, I'd rather be using xfails, so that we are actually alerted when tests should be enabled. Probably the best solution here is to allow both skips.txt and xfails.txt.
+1 I think this would be a good compromise.
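A minimal sketch of how both files could be honoured from a conftest.py hook (the file names come from this thread; the hook-based approach and the helper function are assumptions for illustration, not necessarily how the suite implements it):

```python
# conftest.py
import os
import pytest

def _read_ids(path):
    """Return the test IDs listed in `path`, ignoring blank lines and # comments."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip() and not line.startswith("#")}

SKIP_IDS = _read_ids("skips.txt")
XFAIL_IDS = _read_ids("xfails.txt")

def pytest_collection_modifyitems(config, items):
    # Attach skip/xfail marks to any collected test whose node ID appears in the files.
    for item in items:
        if item.nodeid in SKIP_IDS:
            item.add_marker(pytest.mark.skip(reason="listed in skips.txt"))
        elif item.nodeid in XFAIL_IDS:
            item.add_marker(pytest.mark.xfail(reason="listed in xfails.txt"))
```

Matching on full node IDs keeps the files copy-pasteable from pytest output; prefix or glob matching would be an easy extension if whole test files need to be listed.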
Currently our NumPy workflow has a flaky test in test_creation_functions.py::test_linspace, which means sometimes we'll get an XFAIL and sometimes an XPASS. The random-ish nature of Hypothesis means this is often a possibility for bugs the test suite identifies. Therefore I think instead of xfailing tests for a workflow, we should skip them completely. We could also mix-and-match, but I think it's best to set a precedent that's simple for outsiders to understand and use.