Skip to content

feat: add workspace build timing metrics #15771

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kevinh-canva
Copy link
Contributor

@kevinh-canva kevinh-canva commented Dec 6, 2024

Context

We want to place a tight SLO around coder workspace build times, so we can detect regression. However, buffy GPU instances often take a much longer time to start/stop than general instance, which frequently triggered our SLO alerts, even though it's only because of a few (expected) slow GPU builds. This is caused by the metrics we are using coderd_provisionerd_job_timing_seconds not having a dimension for template name (as we have a separate template for GPU and another for general instances).

Looking closer at the code, this metrics is also not the correct one to use either, because a Job can actually be many different things, not just a workspace build.

Intent

This PR introduces a new prometheus metrics for workspace_build_timing_seconds, which specifically reports workspace build times. To reduce cardinality, this metrics excludes workspace_name and workspace_owner that are present on the workspace_builds_total metrics.

This'd allow us to have different (and tight) SLOs for each of our template (GPU vs non-GPU) by filtering on the template_name (optionally template_version tag) as well as the workspace transition (as we noticed stop is often slower than start, but users don't care a lot about stop transitions).

@cdr-bot cdr-bot bot added the community Pull Requests and issues created by the community. label Dec 6, 2024
@kevinh-canva kevinh-canva changed the title Add workspace build timing metrics feat(metrics): Add workspace build timing metrics Dec 6, 2024
@kevinh-canva kevinh-canva changed the title feat(metrics): Add workspace build timing metrics feat(metrics): add workspace build timing metrics Dec 6, 2024
@kevinh-canva kevinh-canva changed the title feat(metrics): add workspace build timing metrics feat(runner): add workspace build timing metrics Dec 6, 2024
@kevinh-canva kevinh-canva changed the title feat(runner): add workspace build timing metrics feat: add workspace build timing metrics Dec 6, 2024
@kevinh-canva
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@dannykopping dannykopping self-requested a review December 6, 2024 08:37
Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution @kevinh-canva!

Let's see if we can find a solution to the potential cardinality explosion.

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kylecarbs could you please force-merge this PR?
The two failing CI jobs are both related to forks not being able to access secrets.

@dannykopping
Copy link
Contributor

dannykopping commented Dec 9, 2024

I have read the CLA Document and I hereby sign the CLA

@kevinh-canva would you mind commenting this again? The CLA step should now be fixed.
Please also rebase on main before doing so.

@kevinh-canva
Copy link
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@kevinh-canva
Copy link
Contributor Author

@dannykopping Looks like it's still failing for me somehow

@dannykopping dannykopping enabled auto-merge (squash) December 11, 2024 05:16
deansheather added a commit to coder/cla that referenced this pull request Dec 11, 2024
@deansheather
Copy link
Member

Sorry about the issues with the CLA, we've been having some troubles lately with secrets in our GitHub actions workflows

dannykopping added a commit to coder/cla that referenced this pull request Dec 11, 2024
Manually adding this since our CLA bot is broken
@dannykopping dannykopping merged commit c528791 into coder:main Dec 11, 2024
29 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
community Pull Requests and issues created by the community.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants