fix(coderd): fix memory leak in watchWorkspaceAgentMetadata #10685
Conversation
coderd/workspaceagents.go
Outdated
```diff
-			newKeysSet := make(map[string]struct{})
-			for _, key := range payload.Keys {
-				newKeysSet[key] = struct{}{}
-			}
-			keys := prev.Keys
-			for _, key := range prev.Keys {
-				if _, ok := newKeysSet[key]; !ok {
-					keys = append(keys, key)
-				}
-			}
-			payload.Keys = keys
+			prevKeysSet := make(map[string]struct{}, len(prev.Keys))
+			for _, key := range prev.Keys {
+				prevKeysSet[key] = struct{}{}
+			}
+			for _, key := range payload.Keys {
+				if _, ok := prevKeysSet[key]; !ok {
+					prev.Keys = append(prev.Keys, key)
+				}
+			}
+			payload.Keys = prev.Keys
```
This is the actual memory leak (logic bug).
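To make the growth concrete, here is a minimal, hedged sketch that mimics the removed merge logic in isolation (the `mergeOld` helper and the key names are illustrative, not coderd code). Each time a pending update is merged, every previous key missing from the new payload is re-appended onto the previous slice, so the key list roughly doubles per merge while the consumer is stuck:

```go
package main

import "fmt"

// mergeOld mirrors the removed merge: start from prev and re-append every
// prev key that is absent from the new payload, duplicating those keys.
func mergeOld(prevKeys, newKeys []string) []string {
	newSet := make(map[string]struct{}, len(newKeys))
	for _, k := range newKeys {
		newSet[k] = struct{}{}
	}
	keys := prevKeys
	for _, k := range prevKeys {
		if _, ok := newSet[k]; !ok {
			keys = append(keys, k)
		}
	}
	return keys
}

func main() {
	prev := []string{"cpu", "mem", "disk"}
	for i := 0; i < 5; i++ {
		// Simulate repeated updates that only refresh "cpu" while the
		// consumer never drains the pending payload.
		prev = mergeOld(prev, []string{"cpu"})
		fmt.Println(len(prev)) // 5, 9, 17, 33, 65: unbounded growth.
	}
}
```

The replacement builds the set from the previous keys instead and only appends payload keys that are genuinely new, so the merged slice is bounded by the number of distinct keys.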
coderd/workspaceagents.go
Outdated
```diff
 		cancel()
 		<-sseSenderClosed
 	}()
+	// Synchronize cancellation from SSE -> context, this lets us simplify the
+	// cancellation logic.
+	go func() {
+		select {
+		case <-ctx.Done():
+		case <-sseSenderClosed:
+			cancel()
+		}
+	}()
```
These changes were added so that we can be sure we're not blocking on `<-sseSenderClosed` and can rely on `ctx` only for shutdown signaling.
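As a standalone illustration of that pattern (a hedged sketch with made-up names, not the handler code), the bridge goroutine folds the sender's close signal into the context, so the rest of the handler only needs to select on `ctx.Done()`:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Stand-in for the SSE sender's "closed" channel.
	senderClosed := make(chan struct{})

	// Bridge: whichever side finishes first, downstream code only has to
	// watch ctx.Done() for shutdown.
	go func() {
		select {
		case <-ctx.Done():
		case <-senderClosed:
			cancel()
		}
	}()

	// Simulate the client going away a little later.
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(senderClosed)
	}()

	<-ctx.Done()
	fmt.Println("handler observed shutdown via ctx:", ctx.Err())
}
```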
```diff
+		// Ignore pings.
+		if sse.Type == ServerSentEventTypePing {
+			continue
+		}
```
This was a bug in `WatchWorkspaceAgentMetadata`, leading to hitting an error below (nil data) and breaking the watcher. This could be a (or the) reason the agent metadata tests have been flaky in the past.
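A minimal sketch of the fixed consumer shape (illustrative event struct, not the actual codersdk types): skip keep-alive pings before decoding, since they carry no data:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative stand-in for the SSE event shape; not the codersdk types.
type event struct {
	Type string
	Data json.RawMessage // nil for pings
}

func main() {
	events := []event{
		{Type: "ping"},                          // keep-alive, no payload
		{Type: "data", Data: []byte(`["cpu"]`)}, // real metadata update
	}
	for _, sse := range events {
		if sse.Type == "ping" {
			// Previously pings were decoded like regular events, the nil
			// data produced an error, and the watcher broke.
			continue
		}
		var keys []string
		if err := json.Unmarshal(sse.Data, &keys); err != nil {
			fmt.Println("decode error:", err)
			return
		}
		fmt.Println("update:", keys)
	}
}
```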
FWIW, I think we'd hit this bug on an ordinary DB fetch error, not just a slow or hung fetch. That's because we had […], and once you call this, the only things that allow […]. So, when we do […].
I like what you've done with squirreling away a cancelable context on the Request, but perhaps it makes sense to just accept a context in […].

One other thing I'd like to see on this PR is a test that reproduces the issue with the old code (e.g. by triggering a database error) and passes with the new code.
Yeah, I agree. I know that wasn't the cause in my repro, though, since my logs were missing the line indicating a DB error happened. Good catch on the […].
Big agree, I wasn't super happy with the squirreling, so I was thinking I'll take a look at refactoring the SSE sender in a separate PR. I do think it should at minimum be closeable, not just "hoping" for a close. If we can find an alternative to using contexts that outlive their function call, I think that'd be good too.

I managed to reproduce the leak by issuing the following cURL command (the agent is v2.2.1, which is more efficient at triggering the ballooning due to the old metadata implementation):

```
curl \
  -H 'Coder-Session-Token: ...' \
  -H 'Accept: text/event-stream' \
  --limit-rate 1 \
  'http://my-coder:7080/api/v2/workspaceagents/025226f8-a637-4ea4-8fea-620ddbf81272/watch-metadata'
```

I then went AFK and my computer went to sleep. When I came back and woke it, I saw Coder's memory usage start ballooning and was able to stop it in its tracks by pressing Ctrl+C on the cURL command. I'm not sure whether `--limit-rate` helped in the reproduction, or if it was mainly due to what happens to the connection when the client machine sleeps.

EDIT: Second repro, using […]
@spikecurtis I added a test in ab3cea2 that reproduces the leak. Using the DB error exit path is not the same mechanism that triggers it in the wild, but I didn't manage to reproduce it in any other way (in the test).
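For reference, a hedged sketch of the kind of leak assertion such a test can make (this is not the test in ab3cea2; it only shows go.uber.org/goleak verifying that no goroutine, e.g. a stranded pubsub consumer, outlives the handler):

```go
package example_test

import (
	"testing"

	"go.uber.org/goleak"
)

func TestNoGoroutineLeak(t *testing.T) {
	// goleak fails the test if any unexpected goroutine is still running
	// when the test ends, which is how a stranded consumer shows up.
	defer goleak.VerifyNone(t)

	done := make(chan struct{})
	go func() {
		<-done // stand-in for a consumer goroutine waiting for shutdown
	}()

	// If this close were skipped (the analogue of never unsubscribing or
	// never signaling teardown), goleak would report the leaked goroutine.
	close(done)
}
```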
```diff
+	//nolint:ineffassign // Release memory.
+	initialMD = nil
```
My understanding here is that we only need the initial slice of metadata from the DB once in order to massage it into a map.
Without this, I assume the initial slice would stay around for the duration of the SSE request?
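A hedged, self-contained sketch of the idea (sizes and names are made up): once the initial slice has been folded into a map, dropping the last reference makes it explicit that the GC is free to reclaim it while the long-lived SSE loop runs:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Stand-in for the initial metadata slice fetched from the DB once.
	initialMD := make([]byte, 64<<20) // 64 MiB
	byKey := map[string]int{"size": len(initialMD)}

	// Drop the last reference before the long-lived streaming loop; the
	// handler only needs byKey from here on.
	//nolint:ineffassign // Release memory.
	initialMD = nil

	runtime.GC()
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("heap in use after GC: %d MiB (map: %v)\n", ms.HeapInuse>>20, byKey)

	// ... imagine the SSE loop running here for the lifetime of the request.
}
```

Whether the slice would already be collectible without the assignment depends on the compiler's liveness analysis; the explicit nil makes the intent unambiguous.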
A change to agent metadata processing introduced in https://github.com/coder/coder/releases/tag/v2.3.1 contained a logic bug leading to a memory leak. However, it's not entirely clear how this can be triggered. I managed to trigger it twice, but the exact steps elude me: one time `coderd` was left with a permanent memory consumption of 1.7 GB, and the other time it was OOM killed (ironically, after I had given up on reproducing the issue).

To trigger the memory leak that was fixed here, the `coderd` watch metadata handler has to 1) subscribe to metadata updates via pubsub and 2) somewhere along the way, begin teardown of the handler or exit the consumer running in a goroutine without getting as far as unsubscribing from the pubsub.

Potential triggers for the bug: […]

This is not guaranteed to fix #10550, but it's a good candidate.

Fixes #10550
Closes https://github.com/coder/customers/issues/334
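As a rough illustration of the property the fix needs to guarantee (an in-memory pubsub sketch; coderd's real pubsub differs, though its Subscribe also returns a cancel function): the subscription must be released on every exit path, including early errors such as a failed initial DB fetch:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative in-memory pubsub with a Subscribe-returns-cancel shape.
type pubsub struct {
	subs map[int]func(msg string)
	next int
}

func (p *pubsub) Subscribe(fn func(msg string)) (cancel func(), err error) {
	id := p.next
	p.next++
	p.subs[id] = fn
	return func() { delete(p.subs, id) }, nil
}

// watch subscribes first, then does work that may fail early.
func watch(p *pubsub) error {
	cancelSub, err := p.Subscribe(func(msg string) { fmt.Println("update:", msg) })
	if err != nil {
		return err
	}
	// The crucial property: unsubscribe on every exit path, so an early
	// error cannot strand the subscription (and everything it references).
	defer cancelSub()

	return errors.New("simulated DB error during initial metadata fetch")
}

func main() {
	p := &pubsub{subs: map[int]func(string){}}
	fmt.Println("watch:", watch(p))
	fmt.Println("remaining subscriptions:", len(p.subs)) // 0: nothing leaked
}
```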