-
Notifications
You must be signed in to change notification settings - Fork 887
feat: expose workspace statuses (with details) as a prometheus metric #12762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…levant detail Light refactoring Signed-off-by: Danny Kopping <danny@coder.com>
Light refactoring Signed-off-by: Danny Kopping <danny@coder.com>
…ed inappropriately slow Signed-off-by: Danny Kopping <danny@coder.com>
Moderate refactoring Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Use stronger rand Signed-off-by: Danny Kopping <danny@coder.com>
if strings.HasPrefix(scanner.Text(), "coderd_api_workspace_latest_build_total") { | ||
hasWorkspaces = true | ||
continue | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side-effect of clearing the gauge when no db rows are loaded.
This wasn't strictly necessary for the test since other manually-registered metrics are included to validate this test.
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
coderd/database/queries.sql.go
Outdated
LatestBuildCanceledAt sql.NullTime `db:"latest_build_canceled_at" json:"latest_build_canceled_at"` | ||
LatestBuildError sql.NullString `db:"latest_build_error" json:"latest_build_error"` | ||
LatestBuildTransition WorkspaceTransition `db:"latest_build_transition" json:"latest_build_transition"` | ||
LatestBuildStatus NullProvisionerJobStatus `db:"latest_build_status" json:"latest_build_status"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to make this be non-nullable in the query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afraid not, looks like the latest_build
attributes are the result of a left join:
coder/coderd/database/queries.sql.go
Lines 12172 to 12218 in 843d650
SELECT | |
workspaces.id, workspaces.created_at, workspaces.updated_at, workspaces.owner_id, workspaces.organization_id, workspaces.template_id, workspaces.deleted, workspaces.name, workspaces.autostart_schedule, workspaces.ttl, workspaces.last_used_at, workspaces.dormant_at, workspaces.deleting_at, workspaces.automatic_updates, workspaces.favorite, | |
COALESCE(template.name, 'unknown') as template_name, | |
latest_build.template_version_id, | |
latest_build.template_version_name, | |
users.username as username, | |
latest_build.completed_at as latest_build_completed_at, | |
latest_build.canceled_at as latest_build_canceled_at, | |
latest_build.error as latest_build_error, | |
latest_build.transition as latest_build_transition, | |
latest_build.job_status as latest_build_status | |
FROM | |
workspaces | |
JOIN | |
users | |
ON | |
workspaces.owner_id = users.id | |
LEFT JOIN LATERAL ( | |
SELECT | |
workspace_builds.id, | |
workspace_builds.transition, | |
workspace_builds.template_version_id, | |
template_versions.name AS template_version_name, | |
provisioner_jobs.id AS provisioner_job_id, | |
provisioner_jobs.started_at, | |
provisioner_jobs.updated_at, | |
provisioner_jobs.canceled_at, | |
provisioner_jobs.completed_at, | |
provisioner_jobs.error, | |
provisioner_jobs.job_status | |
FROM | |
workspace_builds | |
LEFT JOIN | |
provisioner_jobs | |
ON | |
provisioner_jobs.id = workspace_builds.job_id | |
LEFT JOIN | |
template_versions | |
ON | |
template_versions.id = workspace_builds.template_version_id | |
WHERE | |
workspace_builds.workspace_id = workspaces.id | |
ORDER BY | |
build_number DESC | |
LIMIT | |
1 | |
) latest_build ON TRUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose by definition a workspace cannot exist without a build...
I see in workspace_builds
:
create table if not exists public.workspace_builds
(
id uuid not null
primary key,
created_at timestamp with time zone not null,
updated_at timestamp with time zone not null,
workspace_id uuid not null
references public.workspaces
on delete cascade,
...
The workspace foreign key is not null, so I think this left join may be inappropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hhmm. I changed it and ran make gen
but it's still being defined as a nullable field.
Also, one of its sibling fields latest_build_transition
is not nullable and is derived from the same join, so actually all of the above about the join was ostensibly was invalid.
I'll need to dig a little further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah! I was basically on the correct track!
provisioner_jobs
was being left joined as well, and since that also has a non-nullable foreign key with workspace_builds.job_id
, it is safe to use an inner join. I can actually leave the outer lateral join as it is but just change this to correct the generated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -12405,7 +12406,7 @@ WHERE | |||
-- @authorize_filter | |||
), filtered_workspaces_order AS ( | |||
SELECT | |||
fw.id, fw.created_at, fw.updated_at, fw.owner_id, fw.organization_id, fw.template_id, fw.deleted, fw.name, fw.autostart_schedule, fw.ttl, fw.last_used_at, fw.dormant_at, fw.deleting_at, fw.automatic_updates, fw.favorite, fw.template_name, fw.template_version_id, fw.template_version_name, fw.username, fw.latest_build_completed_at, fw.latest_build_canceled_at, fw.latest_build_error, fw.latest_build_transition | |||
fw.id, fw.created_at, fw.updated_at, fw.owner_id, fw.organization_id, fw.template_id, fw.deleted, fw.name, fw.autostart_schedule, fw.ttl, fw.last_used_at, fw.dormant_at, fw.deleting_at, fw.automatic_updates, fw.favorite, fw.template_name, fw.template_version_id, fw.template_version_name, fw.username, fw.latest_build_completed_at, fw.latest_build_canceled_at, fw.latest_build_error, fw.latest_build_transition, fw.latest_build_status |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
follow-up idea: it could be useful to have this field in the codersdk.Workspace
as well :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@johnstcn and I spoke offline to elaborate on this, apparently we convert between the database.Workspace
and codersdk.Workspace
types in some situations (convertWorkspace
, for example), and it would be good to maintain parity between them. This would be an enhancement for a subsequent PR, we agreed.
Signed-off-by: Danny Kopping <danny@coder.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smoke-tested:
After creating workspace:
curl -fsSL http://localhost:21112 | grep coderd_api_workspace_detail
# HELP coderd_api_workspace_detail The current workspace details by template, transition, owner, and status.
# TYPE coderd_api_workspace_detail gauge
coderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="nice_hofstadter1",workspace_name="test",workspace_owner="admin",workspace_transition="start"} 1
After stopping a workspace and waiting approx 1 minute:
curl -fsSL http://localhost:21112 | grep coderd_api_workspace_detail
# HELP coderd_api_workspace_detail The current workspace details by template, transition, owner, and status.
# TYPE coderd_api_workspace_detail gauge
coderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="nice_hofstadter1",workspace_name="test",workspace_owner="admin",workspace_transition="stop"} 1
Based on @johnstcn's comment and a little time away from this, I'm thinking the current metric name is not quite explanatory enough. I'm thinking of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
High level review round
@@ -118,7 +119,7 @@ LEFT JOIN LATERAL ( | |||
provisioner_jobs.job_status | |||
FROM | |||
workspace_builds | |||
LEFT JOIN | |||
JOIN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check this with @mafredri if it was not intentional 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid making this change. I believe there's a window of time when there exists a new build but it's not assigned to a job. So the nullability is something that should be handled.
This logic change essentially results in returning the previous build until the it's been assigned to a job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if that's possible?
create table if not exists workspace_builds
(
id uuid not null
...
workspace_id uuid not null
references public.workspaces
on delete cascade,
...
job_id uuid not null
unique
references public.provisioner_jobs
on delete cascade,
...
);
These keys are all non-nullable.
Is there something I'm missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should say though that I'd rather not make this change if you in any way suspect this will cause issues.
The way I had it prior to #12762 (comment) worked fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Builds are inserted with their respective job in a single transaction. c.f. wsbuilder
Also, Danny's right that job_id is NOT NULL
with a foreign key constraint, so even if we broke wsbuilder
it still wouldn't be possible to have a build with no job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the constraint, the change is perfectly fine 👍. I should’ve checked it myself before commenting. Sorry for the noise @dannykopping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The DB schema should probably be updated to include NOT NULL on the foreign keys, though. Otherwise NULL entries are not enforced by the DB.
Actually, where did you get the schema @dannykopping you referenced above? Doesn’t look the same as our dump which has NOT NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No worries @mafredri!
I just generated the DDL from an existing database in my IDE (GoLand). The above also has not null
but you have to scroll over a bit to the right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha, it was perfectly hidden by the code block size 😂.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know right??! Sorry about that lol
return nil, err | ||
} | ||
|
||
workspaceDetails := prometheus.NewGaugeVec(prometheus.GaugeOpts{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking loud:
During the last scale tests we hit 2000 workspaces, does it mean that we're going to swamp our Prometheus endpoint with details for all of them? If so, maybe we should make it configurable? cc @johnstcn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a time-series per workspace is excessive here. Operators want to know general counts of workspaces in different states. If they want to know specific workspace names they can just hit our API or look at the /workspaces
page in our frontend.
This is especially true since the metrics will be reported by each Coderd, so if you have M workspaces and N Coderd replicas, that's M*N time-series that prometheus needs to track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair. I was basing the decision to include it from the original request so I'll defer to @bpmct before removing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also Spike's comment: #12462 (comment)
I think what they want is a GUAGE that tracks the current number of workspaces in a particular state, not a counter of builds.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow, it seems I misunderstood the initial ask.
I was under the incorrect assumption that @bpmct wanted a metric to match the format described in the issue, and "the builds metric" which he was referring to was the existing coderd_api_workspace_latest_build_total
metric 🤦
coderd_workspace_builds_total
already exists!
# HELP coderd_workspace_builds_total The number of workspaces started, updated, or deleted.
# TYPE coderd_workspace_builds_total counter
coderd_workspace_builds_total{status="failed",template_name="docker",template_version="vibrant_davinci2",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="START"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="2",workspace_owner="danny",workspace_transition="START"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="START"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="STOP"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="mystifying_driscoll3",workspace_name="2",workspace_owner="danny",workspace_transition="STOP"} 1
Plus we don't have to worry about cardinality because these metrics are written by the provisioner when executing a job, not reading from the database state.
I have amended the new metric to drop the workspace
label.
I think this will satisfy the requirements but @bpmct please confirm.
# HELP coderd_workspace_latest_build_status The current workspace statuses by template, transition, and owner.
# TYPE coderd_workspace_latest_build_status gauge
coderd_workspace_latest_build_status{status="failed",template_name="docker",template_version="happy_feynman7",workspace_owner="danny",workspace_transition="stop"} 1
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_owner="danny",workspace_transition="start"
} 2
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_owner="danny",workspace_transition="start"} 1
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="ecstatic_greider3",workspace_owner="danny",workspace_transition="start"} 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will satisfy the requirements but @bpmct please confirm.
# HELP coderd_workspace_latest_build_status The current workspace statuses by template, transition, and owner. # TYPE coderd_workspace_latest_build_status gauge coderd_workspace_latest_build_status{status="failed",template_name="docker",template_version="happy_feynman7",workspace_owner="danny",workspace_transition="stop"} 1 coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_owner="danny",workspace_transition="start" } 2 coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_owner="danny",workspace_transition="start"} 1 coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="ecstatic_greider3",workspace_owner="danny",workspace_transition="start"} 1
Yep!
I wonder if it isn't too late for renaming as people may use it on their dashboards. |
This is a new metric being added by this PR. |
My bad, I assumed that this is |
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
return | ||
} | ||
|
||
workspaceStatuses.Reset() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this has a race condition where if the registry collects the metrics before the for-loop below is complete, it will get partial data. This can result in glitchy stats were they are zeroed out for a single data point and then return to "normal." If anyone puts alerts on those stats it could result in false alerting in addition to visual noise on the dashboard.
I realize this is following the pattern of the existing metric, which has the same bug.
Instead these stats should be a custom collector that queries the database and returns the requested data on demand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead these stats should be a custom collector that queries the database and returns the requested data on demand.
Definitely agree! Called this out in the description as well.
I'm not sure if we can fix this race without the change to the collector pattern. Do you have any ideas?
In reality the race would have to be on the nanosecond scale because this loop will iterate very quickly, but your point is absolutely valid.
I'll add a follow-up issue for this in any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it works as intended 👍 I left some comments on naming, but definitely not blockers for pushing.
workspaceStatuses.Reset() | ||
} | ||
|
||
logger.Warn(ctx, "failed to load active workspaces", slog.Error(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe logger.Error
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't strictly call it an error because it could be an instance of sql.ErrNoRows
which is valid but undesirable. To be the most correct I would make it warn in case of sql.ErrNoRows
and error in all other cases, but that feels unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which is valid but undesirable
Will it also return sql.ErrNoRows
if there are zero workspaces = fresh deployment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Believe so, yes.
Or if all workspaces have been deleted.
Signed-off-by: Danny Kopping <danny@coder.com>
I believe I've addressed all the comments. I'll merge, but if there any subsequent thoughts, I'll amend those in a follow-up. |
Implements #12462
This produces the following metric:
Notes to reviewers:
ActiveUsers
andWorkspaces
metrics to a minute; 5 minutes seemed quite excessive and may not allow operators to act fast enough; not sure if there was a reason these were different to the other metrics in this file/metrics
, for example)_total
suffixes which are generally reserved for countersWorkspace
tests run with a slower interval (25ms instead of 1ms) and a faster cut-off (1s vs 10s)