chore: improve CI reliability #16169
Conversation
I don't have any blocking concerns here! 👍 Might be good to get buy-in from our US colleagues before merging.
👍 Consider preparing a follow-up PR to remove `PARALLEL_FLAG`. If CI starts failing, we will revert the PR.
We have an effort underway to replace `dbmem` (#15109), and consequently we've begun running our full test-suite (with Postgres) on all supported OSes - Windows, macOS, and Linux - since #15520.

Since this change, we've seen a marked decrease in the success rate of our builds on `main`; the Windows/macOS failures account for the vast majority of failed builds.

We're still investigating why these OSes are so much less reliable. It's likely that the VMs on which the builds run have different characteristics from our Ubuntu runners, such as disk I/O, network latency, or something else.
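For context, running a test job across all three OSes in GitHub Actions generally looks something like the matrix below. This is only a minimal sketch of the setup being described; the job name, runner labels, and test invocation are assumptions, not the repository's actual `ci.yaml`.

```yaml
# Illustrative sketch of a cross-OS test job; not the repository's actual ci.yaml.
name: ci
on: [push, pull_request]

jobs:
  test-go-pg:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Assumed invocation: the real job also provisions Postgres and sets test flags/timeouts.
      - run: go test ./...
```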
In the meantime, we need to start trusting CI failures in `main` again, as the current failures are too noisy and vague for us to act on.

We've also considered hosting our own runners where possible, so we can get OS-level observability to rule out some possibilities.

See the meeting notes that link to this change for more detail.
This PR introduces several changes:

- Moves the Windows and macOS test runs to the `nightly-gauntlet` workflow.
  - Tradeoff: this means that any regressions may be more difficult to discover, since we merge to `main` several times a day.
- Only runs the CLI tests in `main` on Windows/macOS.
  - `test-go` is still running the full test-suite against all OSes (including the CLI ones), but will soon be removed once #15109 (Remove the in-memory database) is completed, since it uses `dbmem`.
- Moves `nightly-gauntlet` to run at 4AM (see the trigger sketch after this list): we've seen several instances of the runner being stopped externally, and we're guessing this may have something to do with the midnight UTC execution time, when other cron jobs may run.
- Removes the `nightly-gauntlet` jobs that haven't passed in a long time; this indicates that nobody cares enough to fix them and that they don't provide diagnostic value. We can restore them later if necessary.
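For illustration, here is a minimal sketch of what the `nightly-gauntlet` triggers could look like after this change, assuming a plain 04:00 UTC cron plus a manual trigger; the exact schedule expression and trigger set are assumptions rather than the PR's literal diff.

```yaml
# Illustrative sketch of the nightly-gauntlet triggers; not the PR's literal diff.
name: nightly-gauntlet
on:
  schedule:
    # 04:00 UTC rather than midnight, avoiding the window when many other
    # cron jobs fire and runners have been observed getting stopped externally.
    - cron: "0 4 * * *"
  # Allows the workflow to be kicked off manually, e.g. to verify it end-to-end.
  workflow_dispatch:
```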
I've manually run both of these workflows successfully:

- `ci`: https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`: https://github.com/coder/coder/actions/runs/12825539092