feat: add workspace build timing metrics #15771

kevinh-canva · 2024-12-06T01:17:14Z

Context

We want to place a tight SLO around Coder workspace build times so that we can detect regressions. However, beefy GPU instances often take much longer to start and stop than general instances, and they frequently trigger our SLO alerts even though the cause is only a few (expected) slow GPU builds. The root cause is that the metric we are using, coderd_provisionerd_job_timing_seconds, has no dimension for template name (we have one template for GPU instances and another for general instances).

Looking closer at the code, this metric is also not the right one to use, because a job can be many different things, not just a workspace build.

Intent

This PR introduces a new Prometheus metric, workspace_build_timing_seconds, which specifically reports workspace build times. To reduce cardinality, this metric omits the workspace_name and workspace_owner labels that are present on the workspace_builds_total metric.

This would allow us to set different (and tight) SLOs for each of our templates (GPU vs. non-GPU) by filtering on template_name (and optionally the template_version tag) as well as on the workspace transition: we noticed that stop is often slower than start, but users don't care much about stop transitions.
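As a rough sketch of the per-template SLO idea, the following stdlib-only Go example filters build records by template and transition before comparing them to a threshold. The `build` type, the `fractionWithin` helper, and all sample values are hypothetical; in practice this filtering would be done in the monitoring system's queries rather than in application code.

```go
package main

import "fmt"

// build is a hypothetical record of one finished workspace build.
type build struct {
	Template   string
	Transition string
	Seconds    float64
}

// fractionWithin returns the share of builds for the given template and
// transition that finished within thresholdSeconds. It returns 1 when no
// builds match, so an idle template does not count as an SLO breach.
func fractionWithin(builds []build, template, transition string, thresholdSeconds float64) float64 {
	matched, within := 0, 0
	for _, b := range builds {
		if b.Template != template || b.Transition != transition {
			continue
		}
		matched++
		if b.Seconds <= thresholdSeconds {
			within++
		}
	}
	if matched == 0 {
		return 1
	}
	return float64(within) / float64(matched)
}

func main() {
	builds := []build{
		{Template: "general", Transition: "start", Seconds: 40},
		{Template: "general", Transition: "start", Seconds: 50},
		{Template: "gpu", Transition: "start", Seconds: 400},
		{Template: "gpu", Transition: "stop", Seconds: 900},
	}

	// Each template gets its own threshold; stop transitions are ignored.
	fmt.Println("general start:", fractionWithin(builds, "general", "start", 60))
	fmt.Println("gpu start:", fractionWithin(builds, "gpu", "start", 600))
}
```

With a single shared threshold, the slow GPU build would count against the same objective as the general builds, which is the alerting noise described in the Context section.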

kevinh-canva · 2024-12-06T01:28:47Z

I have read the CLA Document and I hereby sign the CLA

dannykopping

Thanks for your contribution @kevinh-canva!

Let's see if we can find a solution to the potential cardinality explosion.

provisionerd/provisionerd.go

dannykopping

LGTM

@kylecarbs could you please force-merge this PR?
The two failing CI jobs are both related to forks not being able to access secrets.

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping · 2024-12-09T15:31:10Z

I have read the CLA Document and I hereby sign the CLA

@kevinh-canva would you mind commenting this again? The CLA step should now be fixed.
Please also rebase on main before doing so.


kevinh-canva · 2024-12-09T22:55:21Z

I have read the CLA Document and I hereby sign the CLA

kevinh-canva · 2024-12-09T22:56:34Z

@dannykopping Looks like it's still failing for me somehow

deansheather · 2024-12-11T05:19:23Z

Sorry about the issues with the CLA. We've been having some trouble lately with secrets in our GitHub Actions workflows.

Manually adding this since our CLA bot is broken

Add workspace build timing metrics

0f331e7

cdr-bot bot added the community label Dec 6, 2024

github-actions bot assigned kevinh-canva Dec 6, 2024

kevinh-canva changed the title ~~Add workspace build timing metrics~~ feat(metrics): Add workspace build timing metrics Dec 6, 2024

kevinh-canva changed the title ~~feat(metrics): Add workspace build timing metrics~~ feat(metrics): add workspace build timing metrics Dec 6, 2024

kevinh-canva changed the title ~~feat(metrics): add workspace build timing metrics~~ feat(runner): add workspace build timing metrics Dec 6, 2024

kevinh-canva changed the title ~~feat(runner): add workspace build timing metrics~~ feat: add workspace build timing metrics Dec 6, 2024

dannykopping self-requested a review December 6, 2024 08:37

dannykopping reviewed Dec 6, 2024


provisionerd/provisionerd.go (outdated, resolved)

rc

455c61c

kevinh-canva requested a review from dannykopping December 9, 2024 03:01

dannykopping approved these changes Dec 9, 2024


dannykopping and others added 2 commits December 9, 2024 08:41

Nit: consistency

2f77a60

Signed-off-by: Danny Kopping <danny@coder.com>

Merge branch 'main' into kevinh-add-workspace-timing-metrics

50f6300

Merge remote-tracking branch 'upstream/main' into kevinh-add-workspac…

cf8a144

…e-timing-metrics

dannykopping enabled auto-merge (squash) December 11, 2024 05:16

deansheather added a commit to coder/cla that referenced this pull request Dec 11, 2024

@kevinh-canva has signed the CLA in coder/coder#15771

c8fb9b6

dannykopping added a commit to coder/cla that referenced this pull request Dec 11, 2024

@kevinh-canva signed the CLA in coder/coder#15771

857f3bf

Manually adding this since our CLA bot is broken

dannykopping merged commit c528791 into coder:main Dec 11, 2024
29 checks passed

github-actions bot locked and limited conversation to collaborators Dec 11, 2024
