Skip to content

feat: expose workspace statuses (with details) as a prometheus metric #12762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Apr 2, 2024

Conversation

dannykopping
Copy link
Contributor

@dannykopping dannykopping commented Mar 26, 2024

Implements #12462

This produces the following metric:

coderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_name="test_workspace",workspace_owner="danny",workspace_transition="start"}

Notes to reviewers:

  • Minor behaviour change: previously if a metric existed but then subsequent queries loaded no entries, it would no reset the gauge. Additionally, we set the state of the gauge on startup to help with testing. Both of these would create a weird user-experience in my view so I got rid of them, but it has some small side-effects which I call out in the code
  • I've reduced the refresh rate of ActiveUsers and Workspaces metrics to a minute; 5 minutes seemed quite excessive and may not allow operators to act fast enough; not sure if there was a reason these were different to the other metrics in this file
  • I've also made the Workspace tests run with a slower interval (25ms instead of 1ms) and a faster cut-off (1s vs 10s)

…levant detail

Light refactoring

Signed-off-by: Danny Kopping <danny@coder.com>
Light refactoring

Signed-off-by: Danny Kopping <danny@coder.com>
…ed inappropriately slow

Signed-off-by: Danny Kopping <danny@coder.com>
Moderate refactoring

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Use stronger rand

Signed-off-by: Danny Kopping <danny@coder.com>
Comment on lines -984 to -987
if strings.HasPrefix(scanner.Text(), "coderd_api_workspace_latest_build_total") {
hasWorkspaces = true
continue
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side-effect of clearing the gauge when no db rows are loaded.
This wasn't strictly necessary for the test since other manually-registered metrics are included to validate this test.

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
@dannykopping dannykopping marked this pull request as ready for review March 26, 2024 13:38
LatestBuildCanceledAt sql.NullTime `db:"latest_build_canceled_at" json:"latest_build_canceled_at"`
LatestBuildError sql.NullString `db:"latest_build_error" json:"latest_build_error"`
LatestBuildTransition WorkspaceTransition `db:"latest_build_transition" json:"latest_build_transition"`
LatestBuildStatus NullProvisionerJobStatus `db:"latest_build_status" json:"latest_build_status"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to make this be non-nullable in the query?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Afraid not, looks like the latest_build attributes are the result of a left join:

SELECT
workspaces.id, workspaces.created_at, workspaces.updated_at, workspaces.owner_id, workspaces.organization_id, workspaces.template_id, workspaces.deleted, workspaces.name, workspaces.autostart_schedule, workspaces.ttl, workspaces.last_used_at, workspaces.dormant_at, workspaces.deleting_at, workspaces.automatic_updates, workspaces.favorite,
COALESCE(template.name, 'unknown') as template_name,
latest_build.template_version_id,
latest_build.template_version_name,
users.username as username,
latest_build.completed_at as latest_build_completed_at,
latest_build.canceled_at as latest_build_canceled_at,
latest_build.error as latest_build_error,
latest_build.transition as latest_build_transition,
latest_build.job_status as latest_build_status
FROM
workspaces
JOIN
users
ON
workspaces.owner_id = users.id
LEFT JOIN LATERAL (
SELECT
workspace_builds.id,
workspace_builds.transition,
workspace_builds.template_version_id,
template_versions.name AS template_version_name,
provisioner_jobs.id AS provisioner_job_id,
provisioner_jobs.started_at,
provisioner_jobs.updated_at,
provisioner_jobs.canceled_at,
provisioner_jobs.completed_at,
provisioner_jobs.error,
provisioner_jobs.job_status
FROM
workspace_builds
LEFT JOIN
provisioner_jobs
ON
provisioner_jobs.id = workspace_builds.job_id
LEFT JOIN
template_versions
ON
template_versions.id = workspace_builds.template_version_id
WHERE
workspace_builds.workspace_id = workspaces.id
ORDER BY
build_number DESC
LIMIT
1
) latest_build ON TRUE

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose by definition a workspace cannot exist without a build...

I see in workspace_builds:

create table if not exists public.workspace_builds
(
    id                  uuid                                                                                not null
        primary key,
    created_at          timestamp with time zone                                                            not null,
    updated_at          timestamp with time zone                                                            not null,
    workspace_id        uuid                                                                                not null
        references public.workspaces
            on delete cascade,
...

The workspace foreign key is not null, so I think this left join may be inappropriate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hhmm. I changed it and ran make gen but it's still being defined as a nullable field.
Also, one of its sibling fields latest_build_transition is not nullable and is derived from the same join, so actually all of the above about the join was ostensibly was invalid.

I'll need to dig a little further.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah! I was basically on the correct track!

provisioner_jobs was being left joined as well, and since that also has a non-nullable foreign key with workspace_builds.job_id, it is safe to use an inner join. I can actually leave the outer lateral join as it is but just change this to correct the generated code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -12405,7 +12406,7 @@ WHERE
-- @authorize_filter
), filtered_workspaces_order AS (
SELECT
fw.id, fw.created_at, fw.updated_at, fw.owner_id, fw.organization_id, fw.template_id, fw.deleted, fw.name, fw.autostart_schedule, fw.ttl, fw.last_used_at, fw.dormant_at, fw.deleting_at, fw.automatic_updates, fw.favorite, fw.template_name, fw.template_version_id, fw.template_version_name, fw.username, fw.latest_build_completed_at, fw.latest_build_canceled_at, fw.latest_build_error, fw.latest_build_transition
fw.id, fw.created_at, fw.updated_at, fw.owner_id, fw.organization_id, fw.template_id, fw.deleted, fw.name, fw.autostart_schedule, fw.ttl, fw.last_used_at, fw.dormant_at, fw.deleting_at, fw.automatic_updates, fw.favorite, fw.template_name, fw.template_version_id, fw.template_version_name, fw.username, fw.latest_build_completed_at, fw.latest_build_canceled_at, fw.latest_build_error, fw.latest_build_transition, fw.latest_build_status
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow-up idea: it could be useful to have this field in the codersdk.Workspace as well :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@johnstcn and I spoke offline to elaborate on this, apparently we convert between the database.Workspace and codersdk.Workspace types in some situations (convertWorkspace, for example), and it would be good to maintain parity between them. This would be an enhancement for a subsequent PR, we agreed.

Signed-off-by: Danny Kopping <danny@coder.com>
Copy link
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smoke-tested:

After creating workspace:

curl -fsSL http://localhost:21112 | grep coderd_api_workspace_detail
# HELP coderd_api_workspace_detail The current workspace details by template, transition, owner, and status.
# TYPE coderd_api_workspace_detail gauge
coderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="nice_hofstadter1",workspace_name="test",workspace_owner="admin",workspace_transition="start"} 1

After stopping a workspace and waiting approx 1 minute:

curl -fsSL http://localhost:21112 | grep coderd_api_workspace_detail
# HELP coderd_api_workspace_detail The current workspace details by template, transition, owner, and status.
# TYPE coderd_api_workspace_detail gauge
coderd_api_workspace_detail{status="succeeded",template_name="docker",template_version="nice_hofstadter1",workspace_name="test",workspace_owner="admin",workspace_transition="stop"} 1

@dannykopping
Copy link
Contributor Author

Based on @johnstcn's comment and a little time away from this, I'm thinking the current metric name is not quite explanatory enough.

I'm thinking of coderd_api_workspace_latest_build_status. Thoughts?

Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level review round

@@ -118,7 +119,7 @@ LEFT JOIN LATERAL (
provisioner_jobs.job_status
FROM
workspace_builds
LEFT JOIN
JOIN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check this with @mafredri if it was not intentional 👍

Copy link
Member

@mafredri mafredri Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid making this change. I believe there's a window of time when there exists a new build but it's not assigned to a job. So the nullability is something that should be handled.

This logic change essentially results in returning the previous build until the it's been assigned to a job.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if that's possible?

create table if not exists workspace_builds
(
    id                  uuid                                                                                not null
    ...
    workspace_id        uuid                                                                                not null
        references public.workspaces
            on delete cascade,
    ...
    job_id              uuid                                                                                not null
        unique
        references public.provisioner_jobs
            on delete cascade,
    ...
);

These keys are all non-nullable.
Is there something I'm missing?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should say though that I'd rather not make this change if you in any way suspect this will cause issues.
The way I had it prior to #12762 (comment) worked fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Builds are inserted with their respective job in a single transaction. c.f. wsbuilder

Also, Danny's right that job_id is NOT NULL with a foreign key constraint, so even if we broke wsbuilder it still wouldn't be possible to have a build with no job.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the constraint, the change is perfectly fine 👍. I should’ve checked it myself before commenting. Sorry for the noise @dannykopping.

Copy link
Member

@mafredri mafredri Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DB schema should probably be updated to include NOT NULL on the foreign keys, though. Otherwise NULL entries are not enforced by the DB.

Actually, where did you get the schema @dannykopping you referenced above? Doesn’t look the same as our dump which has NOT NULL.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No worries @mafredri!

I just generated the DDL from an existing database in my IDE (GoLand). The above also has not null but you have to scroll over a bit to the right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha, it was perfectly hidden by the code block size 😂.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know right??! Sorry about that lol

return nil, err
}

workspaceDetails := prometheus.NewGaugeVec(prometheus.GaugeOpts{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking loud:

During the last scale tests we hit 2000 workspaces, does it mean that we're going to swamp our Prometheus endpoint with details for all of them? If so, maybe we should make it configurable? cc @johnstcn

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a time-series per workspace is excessive here. Operators want to know general counts of workspaces in different states. If they want to know specific workspace names they can just hit our API or look at the /workspaces page in our frontend.

This is especially true since the metrics will be reported by each Coderd, so if you have M workspaces and N Coderd replicas, that's M*N time-series that prometheus needs to track.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I was basing the decision to include it from the original request so I'll defer to @bpmct before removing it.

Copy link
Member

@johnstcn johnstcn Mar 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also Spike's comment: #12462 (comment)

I think what they want is a GUAGE that tracks the current number of workspaces in a particular state, not a counter of builds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow, it seems I misunderstood the initial ask.

I was under the incorrect assumption that @bpmct wanted a metric to match the format described in the issue, and "the builds metric" which he was referring to was the existing coderd_api_workspace_latest_build_total metric 🤦

coderd_workspace_builds_total already exists!

# HELP coderd_workspace_builds_total The number of workspaces started, updated, or deleted.
# TYPE coderd_workspace_builds_total counter
coderd_workspace_builds_total{status="failed",template_name="docker",template_version="vibrant_davinci2",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="START"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="2",workspace_owner="danny",workspace_transition="START"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="START"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_name="asfsasfafs",workspace_owner="danny",workspace_transition="STOP"} 1
coderd_workspace_builds_total{status="success",template_name="docker",template_version="mystifying_driscoll3",workspace_name="2",workspace_owner="danny",workspace_transition="STOP"} 1

Plus we don't have to worry about cardinality because these metrics are written by the provisioner when executing a job, not reading from the database state.

I have amended the new metric to drop the workspace label.

I think this will satisfy the requirements but @bpmct please confirm.

# HELP coderd_workspace_latest_build_status The current workspace statuses by template, transition, and owner.
# TYPE coderd_workspace_latest_build_status gauge
coderd_workspace_latest_build_status{status="failed",template_name="docker",template_version="happy_feynman7",workspace_owner="danny",workspace_transition="stop"} 1
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_owner="danny",workspace_transition="start"
} 2
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_owner="danny",workspace_transition="start"} 1
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="ecstatic_greider3",workspace_owner="danny",workspace_transition="start"} 1

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will satisfy the requirements but @bpmct please confirm.

# HELP coderd_workspace_latest_build_status The current workspace statuses by template, transition, and owner.
# TYPE coderd_workspace_latest_build_status gauge
coderd_workspace_latest_build_status{status="failed",template_name="docker",template_version="happy_feynman7",workspace_owner="danny",workspace_transition="stop"} 1
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="beautiful_visvesvaraya1",workspace_owner="danny",workspace_transition="start"
} 2
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="distracted_hopper6",workspace_owner="danny",workspace_transition="start"} 1
coderd_workspace_latest_build_status{status="succeeded",template_name="docker",template_version="ecstatic_greider3",workspace_owner="danny",workspace_transition="start"} 1

Yep!

@mtojek
Copy link
Member

mtojek commented Mar 27, 2024

I'm thinking of coderd_api_workspace_latest_build_status. Thoughts?

I wonder if it isn't too late for renaming as people may use it on their dashboards.

@dannykopping
Copy link
Contributor Author

I'm thinking of coderd_api_workspace_latest_build_status. Thoughts?

I wonder if it isn't too late for renaming as people may use it on their dashboards.

This is a new metric being added by this PR.

@mtojek
Copy link
Member

mtojek commented Mar 27, 2024

My bad, I assumed that this is coderd_workspace_builds_total being refactored.

Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
Signed-off-by: Danny Kopping <danny@coder.com>
return
}

workspaceStatuses.Reset()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this has a race condition where if the registry collects the metrics before the for-loop below is complete, it will get partial data. This can result in glitchy stats were they are zeroed out for a single data point and then return to "normal." If anyone puts alerts on those stats it could result in false alerting in addition to visual noise on the dashboard.

I realize this is following the pattern of the existing metric, which has the same bug.

Instead these stats should be a custom collector that queries the database and returns the requested data on demand.

Copy link
Contributor Author

@dannykopping dannykopping Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead these stats should be a custom collector that queries the database and returns the requested data on demand.
Definitely agree! Called this out in the description as well.

I'm not sure if we can fix this race without the change to the collector pattern. Do you have any ideas?
In reality the race would have to be on the nanosecond scale because this loop will iterate very quickly, but your point is absolutely valid.

I'll add a follow-up issue for this in any case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@mtojek mtojek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it works as intended 👍 I left some comments on naming, but definitely not blockers for pushing.

workspaceStatuses.Reset()
}

logger.Warn(ctx, "failed to load active workspaces", slog.Error(err))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe logger.Error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't strictly call it an error because it could be an instance of sql.ErrNoRows which is valid but undesirable. To be the most correct I would make it warn in case of sql.ErrNoRows and error in all other cases, but that feels unnecessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which is valid but undesirable

Will it also return sql.ErrNoRows if there are zero workspaces = fresh deployment?

Copy link
Contributor Author

@dannykopping dannykopping Mar 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Believe so, yes.
Or if all workspaces have been deleted.

@dannykopping
Copy link
Contributor Author

I believe I've addressed all the comments. I'll merge, but if there any subsequent thoughts, I'll amend those in a follow-up.
Thanks for the reviews!

@dannykopping dannykopping merged commit 79fb8e4 into coder:main Apr 2, 2024
@dannykopping dannykopping deleted the dk/workspace-metric branch April 2, 2024 07:57
@github-actions github-actions bot locked and limited conversation to collaborators Apr 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants