feat: add workspace build timing metrics #15771
I have read the CLA Document and I hereby sign the CLA
Thanks for your contribution @kevinh-canva!
Let's see if we can find a solution to the potential cardinality explosion.
LGTM
@kylecarbs could you please force-merge this PR?
The two failing CI jobs are both related to forks not being able to access secrets.
Signed-off-by: Danny Kopping <danny@coder.com>
@kevinh-canva would you mind commenting this again? The CLA step should now be fixed.
I have read the CLA Document and I hereby sign the CLA
@dannykopping Looks like it's still failing for me somehow.
Sorry about the issues with the CLA; we've been having some trouble lately with secrets in our GitHub Actions workflows.
Manually adding this since our CLA bot is broken
Context
We want to place a tight SLO around Coder workspace build times so that we can detect regressions. However, beefy GPU instances often take much longer to start and stop than general instances, which frequently triggers our SLO alerts even though the cause is only a few (expected) slow GPU builds. This happens because the metric we are using, `coderd_provisionerd_job_timing_seconds`, has no dimension for template name (we have one template for GPU instances and another for general instances).
Looking closer at the code, this metric is also not the correct one to use, because a job can be many different things, not just a workspace build.
Intent
This PR introduces a new Prometheus metric, `workspace_build_timing_seconds`, which specifically reports workspace build times. To reduce cardinality, this metric excludes the `workspace_name` and `workspace_owner` labels that are present on the `workspace_builds_total` metric.
This would allow us to have different (and tight) SLOs for each of our templates (GPU vs non-GPU) by filtering on `template_name` (and optionally the `template_version` tag) as well as on the workspace transition (we noticed that `stop` is often slower than `start`, but users don't care much about `stop` transitions).