perf: don't call GetUserByID unnecessarily for Agents metrics loops #19395

cstyan · 2025-08-18T16:44:06Z

At the moment, the loop that retrieves and updates the agents metrics calls GetUserByID (a DB query) far more often than it needs to. It first retrieves a list of all workspaces, filtering out inactive agents (it's not entirely clear to me whether this means non-running workspaces or just dead agents), and then iterates over those workspaces to get the rest of the data the metrics need. The next call in that loop is GetUserByID for workspace.OwnerID.

This should at least partially resolve coder/internal#726 ~~by caching seen User uuid.UUID in a map for each iteration of the loop.~~

UPDATE: It looks like, in theory, the calls here for GetUserByID should not even be necessary as we already have a database.Workspace object which also already has the owner ID and Username.

I left comments in both spots as to why the username should never be empty on the workspace again, but I'll reiterate here:

- The owner_id field on the workspaces table is a FK reference to IDs in the users table and has a NOT NULL constraint, so the owner fields of a workspace will always be populated.
- While the users table technically only has a constraint that the username has to be NOT NULL (meaning empty string is valid), at user creation time our httpmw package enforces non-empty usernames (for example it calls codersdk.NameValid, which enforces that the name is at least 1 character and fits the UsernameValidRegex).
- The workspaces_expanded view has an inner join on workspaces.owner_id = visible_users.id, and if the owner is valid in the users table (which visible_users is a view of) then they will have a username set.
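To make the difference concrete, here is a minimal sketch comparing one lookup per workspace with reading the username straight off the row. The `workspaceRow` and `countingStore` types are hypothetical stand-ins invented for illustration; they are not the real `database.Workspace` or store types, and the real loop does considerably more work per workspace.

```go
package main

import "fmt"

// workspaceRow is a hypothetical stand-in for a row of the workspaces_expanded
// view. The real database.Workspace type has many more fields.
type workspaceRow struct {
	OwnerID       string
	OwnerUsername string
	Name          string
}

// countingStore is a hypothetical user store that counts how many lookups it serves.
type countingStore struct {
	usernames map[string]string // owner ID -> username
	lookups   int
}

// GetUserByID simulates a database query and records that it happened.
func (s *countingStore) GetUserByID(id string) (string, error) {
	s.lookups++
	name, ok := s.usernames[id]
	if !ok {
		return "", fmt.Errorf("user %q not found", id)
	}
	return name, nil
}

// usernamesViaLookup issues one query per workspace, even for repeated owners.
func usernamesViaLookup(s *countingStore, rows []workspaceRow) ([]string, error) {
	out := make([]string, 0, len(rows))
	for _, w := range rows {
		name, err := s.GetUserByID(w.OwnerID)
		if err != nil {
			return nil, err
		}
		out = append(out, name)
	}
	return out, nil
}

// usernamesFromRows reads the username already present on each row: no queries.
func usernamesFromRows(rows []workspaceRow) []string {
	out := make([]string, 0, len(rows))
	for _, w := range rows {
		out = append(out, w.OwnerUsername)
	}
	return out
}

func main() {
	store := &countingStore{usernames: map[string]string{"u1": "alice", "u2": "bob"}}
	rows := []workspaceRow{
		{OwnerID: "u1", OwnerUsername: "alice", Name: "dev"},
		{OwnerID: "u1", OwnerUsername: "alice", Name: "staging"},
		{OwnerID: "u2", OwnerUsername: "bob", Name: "dev"},
	}
	viaLookup, err := usernamesViaLookup(store, rows)
	if err != nil {
		panic(err)
	}
	fmt.Println("via lookup:", viaLookup, "queries:", store.lookups)
	fmt.Println("from rows: ", usernamesFromRows(rows), "queries: 0")
}
```

Both paths yield the same usernames as long as the view is populated correctly; the second simply never touches the store.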


johnstcn

Seeing as the only referenced field of user is Username, why not use workspace.OwnerUsername instead?

cstyan · 2025-08-18T17:10:27Z

> Seeing as the only referenced field of user is Username, why not use workspace.OwnerUsername instead?

Good point, I'll make that change so that we're only caching the username in the map 👍
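For reference, the per-iteration cache discussed at this point in the thread could look roughly like the sketch below. The `usernameCache` type and the `fetch` callback are hypothetical names chosen for illustration, and plain strings stand in for `uuid.UUID` keys; this is not the code from the PR.

```go
package main

import "fmt"

// usernameCache is a hypothetical per-iteration cache of owner ID -> username.
// It is meant to be created fresh for each pass of the metrics loop.
type usernameCache struct {
	seen  map[string]string
	fetch func(ownerID string) (string, error) // stands in for a GetUserByID-style query
}

func newUsernameCache(fetch func(string) (string, error)) *usernameCache {
	return &usernameCache{seen: make(map[string]string), fetch: fetch}
}

// get returns the cached username, calling fetch only on the first sighting of an owner.
func (c *usernameCache) get(ownerID string) (string, error) {
	if name, ok := c.seen[ownerID]; ok {
		return name, nil
	}
	name, err := c.fetch(ownerID)
	if err != nil {
		return "", err // failures are not cached, so a later call retries
	}
	c.seen[ownerID] = name
	return name, nil
}

func main() {
	queries := 0
	cache := newUsernameCache(func(ownerID string) (string, error) {
		queries++
		return "user-" + ownerID, nil
	})
	// Four workspaces owned by two distinct users.
	for _, ownerID := range []string{"u1", "u2", "u1", "u1"} {
		name, err := cache.get(ownerID)
		if err != nil {
			panic(err)
		}
		fmt.Println(ownerID, "->", name)
	}
	fmt.Println("queries issued:", queries)
}
```

Storing only the username keeps the map small, which matches the suggestion above to cache just the one field that is actually used.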


johnstcn · 2025-08-19T08:58:53Z

coderd/prometheusmetrics/prometheusmetrics.go

+				if username == "" {
+					logger.Warn(ctx, "in prometheusmetrics.Agents the username on workspace was empty string, this should not be possible",
+						slog.F("workspace_id", workspace.ID),
+						slog.F("template_id", workspace.TemplateID))
+					// Fallback to GetUserByID if OwnerUsername is empty (e.g., in tests)
+					user, err := db.GetUserByID(ctx, workspace.OwnerID)
+					if err != nil {
+						logger.Error(ctx, "can't get user from the database", slog.F("user_id", workspace.OwnerID), slog.Error(err))
+						agentsGauge.WithLabelValues(VectorOperationAdd, 0, user.Username, workspace.Name, templateName, templateVersionName)
+						continue
+					}
+					username = user.Username


As far as I can tell, this situation would only be possible if we completely messed up the view, in which case many other things would also be broken. With this fallback, though, we would not only potentially do a whole bunch of user lookups, we would also spam the logs with errors.

IMO this should just error loudly so it's very obvious.

> As far as I can tell, this situation would only be possible if we completely messed up the view.

Yes, it seems that way, though I'm not sure whether there's some other edge case where this could happen, in which case dropping the fallback path of querying GetUserByID would technically be a breaking change.

Removing the logging (warning level for the "workspace did not have a username set") feels reasonable. I would lean towards keeping the fallback DB query unless we have no requirements around breaking changes for metrics.

> I would lean towards keeping the fallback DB query unless we have no requirements around breaking changes for metrics.

My point is, if we mess up the view in such a way that OwnerUsername is empty, it's very likely that more things would be broken than metrics. Having a non-empty username is a pretty fundamental assumption baked into the codebase right now.

Right, but it could be possible that OwnerUsername is empty for only a very small subset of workspaces.

Is your suggestion that we error updating the metric as a whole (for all found workspace agents) if we see any empty username, or just for those workspaces which have an empty username (and do not call the GetUserByID query as a fallback)?

I see two possible scenarios here:

1. Some users end up actually having empty usernames (which is theoretically possible based on the database schema). In this case, there would be no benefit from calling GetUserByID.

2. Some error updating the view results in an empty string being returned for owner_username. As far as I can tell, this would break things like application routing and SSH access, and would also be distinctly visible in the UI. In this case, we have four options:
a) Continue submitting metrics with an empty username field,
b) Fall back to the GetUserByID query,
c) Skip over the user/agent,
d) Refuse to generate potentially invalid metrics at all, error and exit early.

a) could actually signal the issue more quickly, but at the cost of messed up metrics.
b) would likely correct the issue, but we would suddenly be performing more database queries unexpectedly.
c) would also surface the error, in a way.
d) might actually be overkill now that I think about it.
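The four options can be sketched side by side as a policy switch. Everything here (`policy`, `collect`, `workspaceRow`, `sample`, the `lookup` callback) is a hypothetical name made up for illustration, not code from the PR; in particular, treating a failed fallback lookup as a skip is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// policy names the four options (a-d) for handling an empty owner username.
type policy int

const (
	emitEmpty      policy = iota // a) submit the metric with an empty username label
	fallbackLookup               // b) fall back to a GetUserByID-style query
	skipWorkspace                // c) skip the affected workspace
	abortAll                     // d) refuse to produce any metrics
)

var errEmptyUsername = errors.New("workspace has empty owner username")

type workspaceRow struct{ OwnerID, OwnerUsername, Name string }

type sample struct{ Workspace, Username string }

// collect builds one sample per workspace, applying p whenever OwnerUsername is empty.
func collect(rows []workspaceRow, p policy, lookup func(ownerID string) (string, error)) ([]sample, error) {
	samples := make([]sample, 0, len(rows))
	for _, w := range rows {
		username := w.OwnerUsername
		if username == "" {
			switch p {
			case emitEmpty:
				// Keep the empty label; the gap stays visible in the metrics.
			case fallbackLookup:
				u, err := lookup(w.OwnerID)
				if err != nil {
					continue // assumption: a failed lookup degrades to a skip
				}
				username = u
			case skipWorkspace:
				continue
			case abortAll:
				return nil, fmt.Errorf("workspace %q: %w", w.Name, errEmptyUsername)
			}
		}
		samples = append(samples, sample{Workspace: w.Name, Username: username})
	}
	return samples, nil
}

func main() {
	rows := []workspaceRow{
		{OwnerID: "u1", OwnerUsername: "alice", Name: "dev"},
		{OwnerID: "u2", OwnerUsername: "", Name: "broken"},
	}
	lookup := func(ownerID string) (string, error) { return "bob", nil }
	for _, p := range []policy{emitEmpty, fallbackLookup, skipWorkspace, abortAll} {
		samples, err := collect(rows, p, lookup)
		fmt.Printf("policy=%d samples=%v err=%v\n", p, samples, err)
	}
}
```

With c), workspaces that do have a username still emit samples, while d) drops everything as soon as one row is bad.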

Out of curiosity, I decided to see what would happen if a migration messed up the view and pushed #19426. The main takeaway I got from that is that a number of existing tests failed, but -- most notably -- coderd/prometheusmetrics and coderd/workspacestats didn't fail. If this is an area of concern, I'd suggest modifying the existing tests to guard for this so that we catch it.

> Out of curiosity, I decided to see what would happen if a migration messed up the view and pushed #19426. The main takeaway I got from that is that a number of existing tests failed, but -- most notably -- coderd/prometheusmetrics and coderd/workspacestats didn't fail. If this is an area of concern, I'd suggest modifying the existing tests to guard for this so that we catch it.

IIUC that's because we would still have the GetUserByID call, so the view doesn't matter, we're taking the workspace owner ID and looking up that user.

I'm not sure I understand your point that B "would likely correct the issue, but we would suddenly be performing more database queries unexpectedly", since we're currently making a DB call for every active workspace.

Unless we want to introduce a potentially breaking change I think B is our only option. Otherwise we can remove the fallback query and go with option C, then workspaces with correct usernames set would still emit agent related metrics.

cache the seen user IDs in each iteration of the Agents metric loop to reduce calls to GetUserByID

eb527fe

Signed-off-by: Callum Styan <callumstyan@gmail.com>

github-actions bot assigned cstyan Aug 18, 2025

johnstcn reviewed Aug 18, 2025


cstyan added 3 commits August 18, 2025 17:33

Only cache the username instead of the entire user object

a65de4a

Signed-off-by: Callum Styan <callumstyan@gmail.com>

better optimization for metrics related GetUserByID calls

6376d48

Signed-off-by: Callum Styan <callumstyan@gmail.com>

TYPO not

332acb5

Signed-off-by: Callum Styan <callumstyan@gmail.com>

johnstcn reviewed Aug 19, 2025


cstyan changed the title ~~perf: cache the seen user IDs in each iteration of the Agents metric loop~~ perf: don't call GetUserByID unnecessarily for Agents metrics loops Aug 19, 2025
