-
Notifications
You must be signed in to change notification settings - Fork 968
perf: don't call GetUserByID unnecessarily for Agents metrics loops #19395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
reduce calls to GetUserByID Signed-off-by: Callum Styan <callumstyan@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seeing as the only referenced field of user
is Username
, why not use workspace.OwnerUsername
instead?
Good point, I'll make that change so that we're only caching the username in the map 👍 |
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
if username == "" { | ||
logger.Warn(ctx, "in prometheusmetrics.Agents the username on workspace was empty string, this should not be possible", | ||
slog.F("workspace_id", workspace.ID), | ||
slog.F("template_id", workspace.TemplateID)) | ||
// Fallback to GetUserByID if OwnerUsername is empty (e.g., in tests) | ||
user, err := db.GetUserByID(ctx, workspace.OwnerID) | ||
if err != nil { | ||
logger.Error(ctx, "can't get user from the database", slog.F("user_id", workspace.OwnerID), slog.Error(err)) | ||
agentsGauge.WithLabelValues(VectorOperationAdd, 0, user.Username, workspace.Name, templateName, templateVersionName) | ||
continue | ||
} | ||
username = user.Username |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this situation would only be possible if we completely messed up the view. In this case, many other things would be also broken. However, in this case we not only potentially do a whole bunch of user lookups, we also spam a bunch of log errors.
IMO this should just error loudly so it's very obvious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, this situation would only be possible if we completely messed up the view.
Yes it seems that way, though I'm not sure if there's some other edge case where this could happen, in which case dropping the fallback path of querying GetUserByID
would techncially be a breaking change.
Removing the logging (warning level for the "workspace did not have a username set") feels reasonable. I would lean towards keeping the fallback DB query unless we have no requirements around breaking changes for metrics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would lean towards keeping the fallback DB query unless we have no requirements around breaking changes for metrics.
My point is, if we mess up the view in such a way that OwnerUsername
is empty, it's very likely that more things would be broken than metrics. Having a non-empty username is a pretty fundamental assumption baked into the codebase right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, but it could be possible that OwnerUsername
is empty for only a very small subset of workspaces.
Is your suggestion that we error updating the metric as a whole (for all found workspace agents) if we see any empty username, or just for those workspaces which have an empty username (and do not call the GetUserByID query as a fallback)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see two possible scenarios here:
-
Some users end up actually having empty usernames (which is theoretically possible based on the database schema). In this case, there would be no benefit from calling
GetUserByID
. -
Some error updating the view results in an empty string being returned for
owner_username
. As far as I can tell, this would break things like application routing, SSH access, and also be distinctly visible in the UI. In this case, we have three options:
a) Continue submitting metrics with an emptyusername
field,
b) Fall back to theGetUserByID
query,
c) Skip over the user/agent.
d) Refuse to generate potentially invalid metrics at all, error and exit early.
a) could actually signal the issue more quickly, but at the cost of messed up metrics.
b) would likely correct the issue, but we would suddenly be performing more database queries unexpectedly.
c) would also surface the error in a way
d) might actually be overkill now that I think about it.
Out of curiousity, I decided to see what would happen if a migration messed up the view and pushed #19426. The main takeaway I got from that is that a number of existing tests failed, but -- most notably -- coderd/prometheusmetrics
and coderd/workspacestats
didn't fail. If this is an area of concern, I'd suggest modifying the existing tests to guard for this so that we catch it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Out of curiousity, I decided to see what would happen if a migration messed up the view and pushed #19426. The main takeaway I got from that is that a number of existing tests failed, but -- most notably -- coderd/prometheusmetrics and coderd/workspacestats didn't fail. If this is an area of concern, I'd suggest modifying the existing tests to guard for this so that we catch it.
IIUC that's because we would still have the GetUserByID
call, so the view doesn't matter, we're taking the workspace owner ID and looking up that user.
I'm not sure I understand your point about B would likely correct the issue, but we would suddenly be performing more database queries unexpectedly
since we're currently making a DB call for every active workspace.
Unless we want to introduce a potentially breaking change I think B is our only option. Otherwise we can remove the fallback query and go with option C, then workspaces with correct usernames set would still emit agent related metrics.
At the moment, the loop which retrieves and updates the values of the agents metrics excessively calls
GetUserByID
(a DB query). First it retrieves a list of all workspaces, filtering out inactive agents (not entirely clear to me whether this is non-running workspaces, or just dead agents), and then iterates over those workspaces to get the rest of the relevant data for the metrics. The next call isGetUserByID
forworkspace.OwnerID
This should at least partially resolve coder/internal#726
by caching seen Useruuid.UUID
in a map for each iteration of the loop.UPDATE: It looks like, in theory, the calls here for
GetUserByID
should not even be necessary as we already have adatabase.Workspace
object which also already has the owner ID and Username.I left comments in both spots as to why the username should never be empty on the workspace again, but I'll reiterate here:
owner_id
field on theworkspaces
table is a FK reference to IDs in theusers
table and has aNOT NULL
constraint, so the owner fields of a workspace will always be populatedNOT NULL
(meaning empty string is valid), at user creation time our httpmw package enforces non-empty usernames (for example it callscodersdk.NameValid
which enforces that the name is at least 1 character and fits theUsernameValidRegex
)workspaces_expanded
view has an inner join onworkspaces.owner_id = visible_users.id
, and if the owner is valid in theusers
table (whichvisible_users
is a view of) then they will have a username set