
Commit 0ab345c

feat: add prebuild timing metrics to Prometheus (coder#19503)
## Description

This PR introduces one counter and two histograms related to workspace creation and claiming. The goal is to provide clearer observability into how workspaces are created (regular vs prebuild) and the time cost of those operations.

### `coderd_workspace_creation_total`

* Metric type: Counter
* Name: `coderd_workspace_creation_total`
* Labels: `organization_name`, `template_name`, `preset_name`

This counter tracks whether a regular workspace (not created from a prebuild pool) was created using a preset or not. We already expose `coderd_prebuilt_workspaces_claimed_total` for claimed prebuilt workspaces, but we lack a comparable metric for regular workspace creations. This metric fills that gap, making it possible to compare regular creations against claims.

Implementation notes:

* Exposed as a `coderd_` metric, consistent with other workspace-related metrics (e.g. `coderd_api_workspace_latest_build`: https://github.com/coder/coder/blob/main/coderd/prometheusmetrics/prometheusmetrics.go#L149).
* Every `defaultRefreshRate` (1 minute), the DB query `GetRegularWorkspaceCreateMetrics` is executed to fetch all regular workspaces (not created from a prebuild pool).
* The counter is updated with the total from all time (not just since metric introduction). This differs from the histograms below, which only accumulate from their introduction forward.

### `coderd_workspace_creation_duration_seconds` & `coderd_prebuilt_workspace_claim_duration_seconds`

* Metric types: Histogram
* Names:
  * `coderd_workspace_creation_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`, `type` (`regular`, `prebuild`)
  * `coderd_prebuilt_workspace_claim_duration_seconds`
    * Labels: `organization_name`, `template_name`, `preset_name`

We already have `coderd_provisionerd_workspace_build_timings_seconds`, which tracks build run times for all workspace builds handled by the provisioner daemon. However, in the context of this issue, we are only interested in creation and claim build times, not all transitions. Additionally, that metric does not include `preset_name`, and adding it there would significantly increase cardinality. Therefore, separate, more focused metrics are introduced here:

* `coderd_workspace_creation_duration_seconds`: Build time to create a workspace (either a regular workspace or the initial provisioning build of a workspace into a prebuild pool).
* `coderd_prebuilt_workspace_claim_duration_seconds`: Time to claim a prebuilt workspace from the pool.

The two histograms are kept separate because:

* Creation (regular or prebuild): provisioning builds with similar time magnitude, generally expected to take longer than a claim operation.
* Claim: expected to be a much faster provisioning build.

#### Native histogram usage

Provisioning times vary widely between projects. Using static buckets risks unbalanced or poorly informative histograms. To address this, these metrics use [Prometheus native histograms](https://prometheus.io/docs/specs/native_histograms/):

* First introduced in Prometheus v2.40.0
* Recommended stable usage from v2.45+
* Requires Go client `prometheus/client_golang` v1.15.0+
* Experimental and must be explicitly enabled on the server (`--enable-feature=native-histograms`)

For compatibility, we also retain a classic bucket definition (aligned with the existing provisioner metric: https://github.com/coder/coder/blob/main/provisionerd/provisionerd.go#L182-L189).

* If native histograms are enabled, Prometheus ingests the high-resolution histogram.
* If not, it falls back to the predefined buckets.

Implementation notes:

* Unlike the counter, these histograms are updated in real-time at workspace build job completion.
* They reflect data only from the point of introduction forward (no historical backfill).

## Relates to

Closes: coder#19528

Native histograms tested in observability stack: coder/observability#50
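The "Native histogram usage" rationale in the description can be illustrated with a small, standard-library-only Go sketch. It is not the code from this commit: the bucket bounds, the growth factor of 1.1, and the helper names `classicBucket` and `exponentialBucket` are assumptions chosen for illustration, and the exponential layout is a simplification of how native histograms actually pick their bucket schema. The point is only that two quite different build durations can collapse into one static bucket while an exponential layout keeps them apart.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// classicBucket returns the index of the first upper bound that is >= v,
// or len(bounds) when v falls into the implicit +Inf bucket.
// bounds must be sorted in ascending order.
func classicBucket(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

// exponentialBucket returns the index i with factor^(i-1) < v <= factor^i.
// This mimics, in simplified form, the exponentially growing buckets of a
// native histogram. It assumes v > 0 and factor > 1.
func exponentialBucket(factor, v float64) int {
	return int(math.Ceil(math.Log(v) / math.Log(factor)))
}

func main() {
	// Hypothetical classic bounds in seconds (illustrative only, not
	// necessarily the bounds used by the provisioner metric).
	bounds := []float64{1, 10, 30, 60, 300, 600, 1800, 3600}

	// Two build durations that differ by more than 4x.
	for _, seconds := range []float64{65, 290} {
		fmt.Printf("%.0fs: classic bucket %d, exponential bucket %d\n",
			seconds,
			classicBucket(bounds, seconds),
			exponentialBucket(1.1, seconds))
	}
}
```

With these assumed bounds, both durations fall into the same classic bucket (the one bounded by 60s and 300s), whereas the exponential layout assigns them different bucket indices. This is the loss of resolution the description attributes to static buckets.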
1 parent 9fd33a7 commit 0ab345c

21 files changed: +699 -8 lines changed

cli/server.go

Lines changed: 12 additions & 6 deletions
@@ -62,12 +62,6 @@ import (
 	"github.com/coder/serpent"
 	"github.com/coder/wgtunnel/tunnelsdk"

-	"github.com/coder/coder/v2/coderd/entitlements"
-	"github.com/coder/coder/v2/coderd/notifications/reports"
-	"github.com/coder/coder/v2/coderd/runtimeconfig"
-	"github.com/coder/coder/v2/coderd/webpush"
-	"github.com/coder/coder/v2/codersdk/drpcsdk"
-
 	"github.com/coder/coder/v2/buildinfo"
 	"github.com/coder/coder/v2/cli/clilog"
 	"github.com/coder/coder/v2/cli/cliui"
@@ -83,25 +77,31 @@ import (
 	"github.com/coder/coder/v2/coderd/database/migrations"
 	"github.com/coder/coder/v2/coderd/database/pubsub"
 	"github.com/coder/coder/v2/coderd/devtunnel"
+	"github.com/coder/coder/v2/coderd/entitlements"
 	"github.com/coder/coder/v2/coderd/externalauth"
 	"github.com/coder/coder/v2/coderd/gitsshkey"
 	"github.com/coder/coder/v2/coderd/httpmw"
 	"github.com/coder/coder/v2/coderd/jobreaper"
 	"github.com/coder/coder/v2/coderd/notifications"
+	"github.com/coder/coder/v2/coderd/notifications/reports"
 	"github.com/coder/coder/v2/coderd/oauthpki"
 	"github.com/coder/coder/v2/coderd/prometheusmetrics"
 	"github.com/coder/coder/v2/coderd/prometheusmetrics/insights"
 	"github.com/coder/coder/v2/coderd/promoauth"
+	"github.com/coder/coder/v2/coderd/provisionerdserver"
+	"github.com/coder/coder/v2/coderd/runtimeconfig"
 	"github.com/coder/coder/v2/coderd/schedule"
 	"github.com/coder/coder/v2/coderd/telemetry"
 	"github.com/coder/coder/v2/coderd/tracing"
 	"github.com/coder/coder/v2/coderd/updatecheck"
 	"github.com/coder/coder/v2/coderd/util/ptr"
 	"github.com/coder/coder/v2/coderd/util/slice"
 	stringutil "github.com/coder/coder/v2/coderd/util/strings"
+	"github.com/coder/coder/v2/coderd/webpush"
 	"github.com/coder/coder/v2/coderd/workspaceapps/appurl"
 	"github.com/coder/coder/v2/coderd/workspacestats"
 	"github.com/coder/coder/v2/codersdk"
+	"github.com/coder/coder/v2/codersdk/drpcsdk"
 	"github.com/coder/coder/v2/cryptorand"
 	"github.com/coder/coder/v2/provisioner/echo"
 	"github.com/coder/coder/v2/provisioner/terraform"
@@ -280,6 +280,12 @@ func enablePrometheus(
 		}
 	}

+	provisionerdserverMetrics := provisionerdserver.NewMetrics(logger)
+	if err := provisionerdserverMetrics.Register(options.PrometheusRegistry); err != nil {
+		return nil, xerrors.Errorf("failed to register provisionerd_server metrics: %w", err)
+	}
+	options.ProvisionerdServerMetrics = provisionerdserverMetrics
+
 	//nolint:revive
 	return ServeHandler(
 		ctx, logger, promhttp.InstrumentMetricHandler(

coderd/coderd.go

Lines changed: 3 additions & 0 deletions
@@ -241,6 +241,8 @@ type Options struct {
 	UpdateAgentMetrics func(ctx context.Context, labels prometheusmetrics.AgentMetricLabels, metrics []*agentproto.Stats_Metric)
 	StatsBatcher workspacestats.Batcher

+	ProvisionerdServerMetrics *provisionerdserver.Metrics
+
 	// WorkspaceAppAuditSessionTimeout allows changing the timeout for audit
 	// sessions. Raising or lowering this value will directly affect the write
 	// load of the audit log table. This is used for testing. Default 1 hour.
@@ -1930,6 +1932,7 @@ func (api *API) CreateInMemoryTaggedProvisionerDaemon(dialCtx context.Context, n
 		},
 		api.NotificationsEnqueuer,
 		&api.PrebuildsReconciler,
+		api.ProvisionerdServerMetrics,
 	)
 	if err != nil {
 		return nil, err

coderd/coderdtest/coderdtest.go

Lines changed: 3 additions & 0 deletions
@@ -184,6 +184,8 @@ type Options struct {
 	OIDCConvertKeyCache cryptokeys.SigningKeycache
 	Clock quartz.Clock
 	TelemetryReporter telemetry.Reporter
+
+	ProvisionerdServerMetrics *provisionerdserver.Metrics
 }

 // New constructs a codersdk client connected to an in-memory API instance.
@@ -604,6 +606,7 @@ func NewOptions(t testing.TB, options *Options) (func(http.Handler), context.Can
 			Clock: options.Clock,
 			AppEncryptionKeyCache: options.APIKeyEncryptionCache,
 			OIDCConvertKeyCache: options.OIDCConvertKeyCache,
+			ProvisionerdServerMetrics: options.ProvisionerdServerMetrics,
 		}
 }

coderd/database/dbauthz/dbauthz.go

Lines changed: 7 additions & 0 deletions
@@ -2699,6 +2699,13 @@ func (q *querier) GetQuotaConsumedForUser(ctx context.Context, params database.G
 	return q.db.GetQuotaConsumedForUser(ctx, params)
 }

+func (q *querier) GetRegularWorkspaceCreateMetrics(ctx context.Context) ([]database.GetRegularWorkspaceCreateMetricsRow, error) {
+	if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceWorkspace.All()); err != nil {
+		return nil, err
+	}
+	return q.db.GetRegularWorkspaceCreateMetrics(ctx)
+}
+
 func (q *querier) GetReplicaByID(ctx context.Context, id uuid.UUID) (database.Replica, error) {
 	if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err != nil {
 		return database.Replica{}, err

coderd/database/dbauthz/dbauthz_test.go

Lines changed: 4 additions & 0 deletions
@@ -2177,6 +2177,10 @@ func (s *MethodTestSuite) TestWorkspace() {
 		dbm.EXPECT().GetWorkspaceAgentDevcontainersByAgentID(gomock.Any(), agt.ID).Return([]database.WorkspaceAgentDevcontainer{d}, nil).AnyTimes()
 		check.Args(agt.ID).Asserts(w, policy.ActionRead).Returns([]database.WorkspaceAgentDevcontainer{d})
 	}))
+	s.Run("GetRegularWorkspaceCreateMetrics", s.Subtest(func(_ database.Store, check *expects) {
+		check.Args().
+			Asserts(rbac.ResourceWorkspace.All(), policy.ActionRead)
+	}))
 }

 func (s *MethodTestSuite) TestWorkspacePortSharing() {

coderd/database/dbmetrics/querymetrics.go

Lines changed: 7 additions & 0 deletions
Some generated files are not rendered by default.

coderd/database/dbmock/dbmock.go

Lines changed: 15 additions & 0 deletions
Some generated files are not rendered by default.

coderd/database/querier.go

Lines changed: 3 additions & 0 deletions
Some generated files are not rendered by default.

coderd/database/queries.sql.go

Lines changed: 70 additions & 1 deletion
Some generated files are not rendered by default.

coderd/database/queries/prebuilds.sql

Lines changed: 1 addition & 1 deletion
@@ -230,7 +230,7 @@ HAVING COUNT(*) = @hard_limit::bigint;
 SELECT
 	t.name as template_name,
 	tvp.name as preset_name,
-	o.name as organization_name,
+	o.name as organization_name,
 	COUNT(*) as created_count,
 	COUNT(*) FILTER (WHERE pj.job_status = 'failed'::provisioner_job_status) as failed_count,
 	COUNT(*) FILTER (
