chore: improve CI reliability #16169
Conversation
I don't have any blocking concerns here! 👍 Might be good to get buy-in from our US colleagues before merging.
👍 Consider preparing a follow-up PR to remove `PARALLEL_FLAG`. If CI starts failing, we will revert the PR.
We have an effort underway to replace `dbmem` (#15109), and consequently we've begun running our full test-suite (with Postgres) on all supported OSes - Windows, macOS, and Linux - since #15520.

Since this change, we've seen a marked decrease in the success rate of our builds on `main`; the Windows/macOS failures account for the vast majority of failed builds.

We're still investigating why these OSes are so much less reliable. It's likely that the VMs on which the builds run have different characteristics from our Ubuntu runners, such as disk I/O, network latency, or something else.
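For context, running a test job across all three OSes in GitHub Actions generally looks something like the matrix below. This is only a minimal sketch of the setup being described; the job name, runner labels, and test invocation are assumptions, not the repository's actual `ci.yaml`.

```yaml
# Illustrative sketch of a cross-OS test job; not the repository's actual ci.yaml.
name: ci
on: [push, pull_request]

jobs:
  test-go-pg:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, windows-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      # Assumed invocation: the real job also provisions Postgres and sets test flags/timeouts.
      - run: go test ./...
```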
In the meantime, we need to start trusting CI failures in `main` again, as the current failures are too noisy and vague for us to act on.

We've also considered hosting our own runners where possible, so we can get OS-level observability to rule out some possibilities.

See the meeting notes that link to this change for more detail.
This PR introduces several changes:

- Moves the Windows and macOS test runs to the `nightly-gauntlet` workflow.
  - Tradeoff: this means that any regressions may be more difficult to discover, since we merge to `main` several times a day.
- Only runs the CLI tests in `main` on Windows/macOS.
  - `test-go` is still running the full test-suite against all OSes (including the CLI ones), but will soon be removed once #15109 (Remove the in-memory database) is completed, since it uses `dbmem`.
- Moves `nightly-gauntlet` to run at 4AM (see the trigger sketch after this list): we've seen several instances of the runner being stopped externally, and we're guessing this may have something to do with the midnight UTC execution time, when other cron jobs may run.
- Removes the `nightly-gauntlet` jobs that haven't passed in a long time; this indicates that nobody cares enough to fix them and that they don't provide diagnostic value. We can restore them later if necessary.
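For illustration, here is a minimal sketch of what the `nightly-gauntlet` triggers could look like after this change, assuming a plain 04:00 UTC cron plus a manual trigger; the exact schedule expression and trigger set are assumptions rather than the PR's literal diff.

```yaml
# Illustrative sketch of the nightly-gauntlet triggers; not the PR's literal diff.
name: nightly-gauntlet
on:
  schedule:
    # 04:00 UTC rather than midnight, avoiding the window when many other
    # cron jobs fire and runners have been observed getting stopped externally.
    - cron: "0 4 * * *"
  # Allows the workflow to be kicked off manually, e.g. to verify it end-to-end.
  workflow_dispatch:
```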
I've manually run both of these workflows successfully:

- `ci`: https://github.com/coder/coder/actions/runs/12825874176/job/35764724907
- `nightly-gauntlet`: https://github.com/coder/coder/actions/runs/12825539092