chore: improve CI reliability #16169

Merged
dannykopping merged 9 commits into main from dk/ci-reliability
Jan 20, 2025

@dannykopping dannykopping commented Jan 17, 2025

We have an effort underway to replace dbmem (#15109). Consequently, since #15520 we've been running our full test-suite (with Postgres) on all supported OSs: Windows, macOS, and Linux.

Since this change, we've seen a marked decrease in the success rate of our builds on main (note how the Windows/macOS failures account for the vast majority of failed builds):

[image]

We're still investigating why these OSs are so much less reliable. It's likely that the VMs running these builds differ from our Ubuntu runners in disk I/O, network latency, or some other characteristic.

In the meantime, we need to be able to trust CI failures on main again; the current failures are too noisy and vague for us to act on.

We've also considered hosting our own runners where possible so we can get OS-level observability to rule out some possibilities.

See the meeting notes where we looked into this for more detail.

This PR introduces several changes:

  1. Moves the full test-suite with Postgres on Windows/macOS to the nightly-gauntlet workflow
    tradeoff: regressions on these OSs may be harder to discover, since we merge to main several times a day
  2. Runs only the CLI test-suite on each PR / merge to main on Windows/macOS
  3. Leaves test-go running the full test-suite against all OSs (including the CLI tests); it will be removed once #15109 (Remove the in-memory database) is completed, since it uses dbmem
  4. Changes nightly-gauntlet to run at 4AM: we've seen several instances of the runner being stopped externally, and we're guessing this may be related to the midnight UTC execution time, when other cron jobs may run
  5. Removes the existing nightly-gauntlet jobs: they haven't passed in a long time, which indicates that nobody cares enough to fix them and that they don't provide diagnostic value; we can restore them later if necessary
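
As a rough illustration of the split described in items 1 to 3, the sketch below models which test suites run for a given OS and trigger. It is not the workflow configuration from this PR: the function name `suites_for`, the trigger labels, and the suite labels are all hypothetical, and it assumes Linux keeps running the full Postgres suite on every PR / merge to main and has no separate nightly run.

```python
# Hypothetical model of the CI split described above. All names here are
# illustrative and do not come from the repository's workflow files.

PER_CHANGE = "per-change"  # assumed label: every PR / merge to main
NIGHTLY = "nightly"        # assumed label: the nightly-gauntlet schedule

SUPPORTED_OSES = {"linux", "windows", "macos"}


def suites_for(os_name: str, trigger: str) -> list[str]:
    """Return the test suites that would run for an OS under a trigger."""
    if os_name not in SUPPORTED_OSES:
        raise ValueError(f"unsupported OS: {os_name}")
    if trigger not in {PER_CHANGE, NIGHTLY}:
        raise ValueError(f"unknown trigger: {trigger}")

    if trigger == NIGHTLY:
        # Item 1: the full Postgres suite on Windows/macOS moves to nightly.
        # Assumption: Linux needs no nightly run because it is covered per change.
        return [] if os_name == "linux" else ["full-postgres"]

    # Item 3: test-go (dbmem) still runs the full suite on every OS for now.
    suites = ["full-dbmem"]
    if os_name == "linux":
        # Assumption: the full Postgres suite is unchanged on Linux.
        suites.append("full-postgres")
    else:
        # Item 2: only the CLI test-suite runs per change on Windows/macOS.
        suites.append("cli")
    return suites
```

Under these assumptions, `suites_for("windows", PER_CHANGE)` yields the dbmem suite plus the CLI tests, while the full Postgres run for Windows only appears under `NIGHTLY`, which is the tradeoff noted in item 1.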

I've manually run both these new workflows successfully:

…o pass

Signed-off-by: Danny Kopping <danny@coder.com>
@johnstcn johnstcn requested a review from mtojek January 17, 2025 09:36
@dannykopping dannykopping marked this pull request as ready for review January 17, 2025 09:42

@johnstcn johnstcn left a comment

I don't have any blocking concerns here! 👍 Might be good to get buy-in from our US colleagues before merging.

Co-authored-by: Muhammad Atif Ali <atif@coder.com>

@mtojek mtojek left a comment

👍

Consider preparing a follow-up PR to remove PARALLEL_FLAG. If CI starts failing, we will revert the PR.

Signed-off-by: Danny Kopping <danny@coder.com>
@dannykopping dannykopping enabled auto-merge (squash) January 20, 2025 06:47
@dannykopping dannykopping merged commit 5b72a43 into main Jan 20, 2025
32 of 34 checks passed
@dannykopping dannykopping deleted the dk/ci-reliability branch January 20, 2025 07:06
@github-actions github-actions bot locked and limited conversation to collaborators Jan 20, 2025
6 participants