fix(coderd): fix memory leak in watchWorkspaceAgentMetadata #10685


Merged

mafredri merged 5 commits into main from mafredri/fix-coderd-watch-meta-memory-leak on Nov 16, 2023

Conversation

@mafredri (Member) commented Nov 14, 2023

A change to agent metadata processing introduced in https://github.com/coder/coder/releases/tag/v2.3.1 contained a logic bug that leads to a memory leak. However, it's not entirely clear how the leak can be triggered. I managed to trigger it twice, but the exact steps elude me. One time coderd was left with a permanent memory consumption of 1.7 GB, and the other time it was OOM killed (ironically, after I had given up on reproducing the issue).

To trigger the memory leak that was fixed here, the coderd watch-metadata handler has to 1) subscribe to metadata updates via pubsub and 2) somewhere along the way, begin teardown of the handler or exit the consumer running in a goroutine without getting as far as unsubscribing from the pubsub.

Potential triggers for the bug:

  1. Slow/hung database read
  2. Slow/hung write on the HTTP connection (SSE)
  3. Unsubscribe from pubsub failing -- seems unlikely

This is not guaranteed to fix #10550, but it's a good candidate.

Fixes #10550
Closes https://github.com/coder/customers/issues/334
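
As a rough, self-contained illustration of that failure shape (the toy pubsub and all names below are illustrative stand-ins, not coderd's actual API): a subscription whose consumer goroutine has gone away, and which is never cancelled, keeps buffering updates for as long as publishes arrive.

	package main

	import (
		"fmt"
		"sync"
	)

	// Toy stand-in for the real pubsub: callbacks registered here keep running
	// until the subscription is cancelled.
	type pubsub struct {
		mu        sync.Mutex
		listeners map[int]func(keys []string)
		nextID    int
	}

	func (p *pubsub) subscribe(fn func(keys []string)) (cancel func()) {
		p.mu.Lock()
		defer p.mu.Unlock()
		id := p.nextID
		p.nextID++
		p.listeners[id] = fn
		return func() {
			p.mu.Lock()
			defer p.mu.Unlock()
			delete(p.listeners, id)
		}
	}

	func (p *pubsub) publish(keys []string) {
		p.mu.Lock()
		defer p.mu.Unlock()
		for _, fn := range p.listeners {
			fn(keys)
		}
	}

	func main() {
		ps := &pubsub{listeners: map[int]func(keys []string){}}

		var mu sync.Mutex
		pending := []string{}

		// The handler subscribes and buffers keys for a consumer goroutine.
		cancel := ps.subscribe(func(keys []string) {
			mu.Lock()
			defer mu.Unlock()
			pending = append(pending, keys...)
		})
		_ = cancel // The bug shape: the consumer exits early and cancel() is never called.

		// Updates keep arriving after the consumer is gone, so the buffer only grows.
		for i := 0; i < 1000; i++ {
			ps.publish([]string{"cpu", "ram"})
		}

		mu.Lock()
		fmt.Println("keys still buffered with nobody left to read them:", len(pending))
		mu.Unlock()
	}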

Comment on lines 1862 to 1877

Before:

	newKeysSet := make(map[string]struct{})
	for _, key := range payload.Keys {
		newKeysSet[key] = struct{}{}
	}
	keys := prev.Keys
	for _, key := range prev.Keys {
		if _, ok := newKeysSet[key]; !ok {
			keys = append(keys, key)
		}
	}
	payload.Keys = keys

After:

	prevKeysSet := make(map[string]struct{}, len(prev.Keys))
	for _, key := range prev.Keys {
		prevKeysSet[key] = struct{}{}
	}
	for _, key := range payload.Keys {
		if _, ok := prevKeysSet[key]; !ok {
			prev.Keys = append(prev.Keys, key)
		}
	}
	payload.Keys = prev.Keys
@mafredri (Member, Author)

This is the actual memory leak (logic bug).
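
To make the growth concrete, here is a self-contained toy reproduction of the "Before" merge logic above (the buggyMerge name and the sample keys are illustrative, not from the codebase): each time an update is merged into the same pending payload, the key slice grows instead of converging to a union.

	package main

	import "fmt"

	// buggyMerge mirrors the "Before" version above: the intent was presumably
	// to union the pending (prev) keys with the incoming payload keys, but the
	// result is seeded with prev.Keys and then prev keys are appended to it
	// again, so the slice grows on every merge instead of converging.
	func buggyMerge(prevKeys, payloadKeys []string) []string {
		newKeysSet := make(map[string]struct{})
		for _, key := range payloadKeys {
			newKeysSet[key] = struct{}{}
		}
		keys := prevKeys
		for _, key := range prevKeys {
			if _, ok := newKeysSet[key]; !ok {
				keys = append(keys, key)
			}
		}
		return keys
	}

	func main() {
		pending := []string{"cpu", "ram"}
		for i := 1; i <= 5; i++ {
			// Each pubsub update for an unrelated key re-merges the pending keys.
			pending = buggyMerge(pending, []string{"disk"})
			fmt.Printf("merge %d: %d pending keys\n", i, len(pending))
		}
		// Prints 4, 8, 16, 32, 64: the pending key slice doubles on every
		// update that arrives while the consumer isn't draining it.
	}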

Comment on lines 1899 to 1910

		cancel()
		<-sseSenderClosed
	}()
	// Synchronize cancellation from SSE -> context, this lets us simplify the
	// cancellation logic.
	go func() {
		select {
		case <-ctx.Done():
		case <-sseSenderClosed:
			cancel()
		}
	}()
@mafredri (Member, Author) Nov 14, 2023

These changes were added so that we can be sure we're not blocking on <-sseSenderClosed and can rely on ctx only for shutdown signaling.
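
As a standalone illustration of that pattern (toy names, not the handler's actual variables), the bridging goroutine lets everything downstream select on ctx alone:

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	func main() {
		ctx, cancel := context.WithCancel(context.Background())
		defer cancel()

		senderClosed := make(chan struct{})

		// Bridge the sender's closed signal into context cancellation so the
		// rest of the handler only ever has to watch ctx.
		go func() {
			select {
			case <-ctx.Done():
			case <-senderClosed:
				cancel()
			}
		}()

		// Simulate the SSE sender shutting down (e.g. the client disconnecting).
		go func() {
			time.Sleep(10 * time.Millisecond)
			close(senderClosed)
		}()

		<-ctx.Done()
		fmt.Println("handler observed shutdown via ctx:", ctx.Err())
	}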

	// Ignore pings.
	if sse.Type == ServerSentEventTypePing {
		continue
	}
@mafredri (Member, Author)

This was a bug in WatchWorkspaceAgentMetadata: it led to hitting an error below (nil data) and breaking the watcher. This could be a cause (or the cause) of the agent metadata tests having been flaky in the past.
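
A self-contained toy model of that client-side failure (the serverSentEvent type and event names below are illustrative, not codersdk's): a "ping" event carries no data, so without the filter the first keep-alive fails to decode and the watch loop exits.

	package main

	import (
		"encoding/json"
		"fmt"
	)

	// Illustrative stand-in for the SSE event type.
	type serverSentEvent struct {
		Type string
		Data json.RawMessage
	}

	func main() {
		events := []serverSentEvent{
			{Type: "data", Data: json.RawMessage(`[{"key":"cpu","value":"3%"}]`)},
			{Type: "ping"}, // keep-alive event with no payload
			{Type: "data", Data: json.RawMessage(`[{"key":"ram","value":"40%"}]`)},
		}

		for _, sse := range events {
			// Without this check, the ping's nil Data fails to unmarshal and
			// the watcher exits with an error on the first keep-alive.
			if sse.Type == "ping" {
				continue
			}
			var md []map[string]string
			if err := json.Unmarshal(sse.Data, &md); err != nil {
				fmt.Println("watch broken:", err)
				return
			}
			fmt.Println("update:", md)
		}
	}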

@spikecurtis (Contributor)

FWIW, I think we'd hit this bug on an ordinary DB fetch error, not just a slow or hung fetch.

That's because we had

	sseSendEvent, sseSenderClosed, err := httpapi.ServerSentEventSender(rw, r)
	if err != nil {
		httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
			Message: "Internal error setting up server-sent events.",
			Detail:  err.Error(),
		})
		return
	}
	// Prevent handler from returning until the sender is closed.
	defer func() {
		<-sseSenderClosed
	}()

and once you call this, the only things that allow sseSenderClosed to close are r.Context() expiring or a write error while sending an event.

So, when we do

	md, err := api.Database.GetWorkspaceAgentMetadata(ctx, database.GetWorkspaceAgentMetadataParams{
		WorkspaceAgentID: workspaceAgent.ID,
		Keys:             nil,
	})
	if err != nil {
		// If we can't successfully pull the initial metadata, pubsub
		// updates will be no-op so we may as well terminate the
		// connection early.
		httpapi.InternalServerError(rw, err)
		return
	}

httpapi.InternalServerError(rw, err) doesn't actually shut down the SSE and, if anything, probably just corrupts the SSE stream. At that point it sort of depends on what the client does: if they cleanly hang up, then maybe r.Context() gets closed, but if they just ignore the malformed data and keep reading, then we just block, since the server will send no more events.

@spikecurtis (Contributor)

I like what you've done with squirreling away a cancelable context on the Request, but perhaps it makes sense to just accept a context in httpapi.ServerSentEventSender() instead of the Request. That might make it a little more obvious how to cancel.

One other thing I'd like to see on this PR is a test that reproduces the issue with the old code (e.g. by triggering a database error) and passes with the new code.

@mafredri (Member, Author) commented Nov 15, 2023

FWIW, I think we'd hit this bug on an ordinary DB fetch error, not just a slow or hung fetch.

Yeah, I agree. I know that wasn't the cause in my repro though since my logs were missing the line indicating a DB error happened.

Good catch on the httpapi.InternalServerError; we should make sure these invocations fail or are not permitted on SSE connections, or are automatically translated to an SSE error payload (but perhaps that's too much magic).
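
To illustrate the kind of guard being floated here (purely hypothetical; this is not part of the PR and not coderd's httpapi): once the stream has started, an error could be reported as an in-band SSE event instead of a raw JSON body that would corrupt the stream.

	package main

	import (
		"encoding/json"
		"fmt"
		"net/http"
		"net/http/httptest"
	)

	// Hypothetical guard: before the SSE stream starts, a normal JSON error
	// response is fine; afterwards, the error is sent as an SSE "error" event.
	func writeError(rw http.ResponseWriter, sseStarted bool, err error) {
		if !sseStarted {
			rw.Header().Set("Content-Type", "application/json")
			rw.WriteHeader(http.StatusInternalServerError)
			_ = json.NewEncoder(rw).Encode(map[string]string{"message": err.Error()})
			return
		}
		// The stream is already established; interleaving a JSON body here
		// would corrupt it, so report the error in-band instead.
		data, _ := json.Marshal(map[string]string{"message": err.Error()})
		fmt.Fprintf(rw, "event: error\ndata: %s\n\n", data)
		if f, ok := rw.(http.Flusher); ok {
			f.Flush()
		}
	}

	func main() {
		rec := httptest.NewRecorder()
		writeError(rec, true, fmt.Errorf("database unreachable"))
		fmt.Print(rec.Body.String())
		// event: error
		// data: {"message":"database unreachable"}
	}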

I like what you've done with squirreling away a cancelable context on the Request, but perhaps it makes sense to just accept a context in httpapi.ServerSentEventSender() instead of the Request. That might make it a little more obvious how to cancel.

Big agree. I wasn't super happy with the squirreling, so I was thinking I'd take a look at refactoring the SSE sender in a separate PR. I do think it should at minimum be closeable, not just "hoping" for a close. If we can find an alternative to using contexts that outlive their function call, I think that'd be good too.


I managed to reproduce the leak by issuing the following cURL command (the agent is v2.2.1, which is more efficient at triggering the ballooning due to the old metadata implementation):

curl \
  -H 'Coder-Session-Token: ...' \
  -H 'Accept: text/event-stream' \
  --limit-rate 1 \
  'http://my-coder:7080/api/v2/workspaceagents/025226f8-a637-4ea4-8fea-620ddbf81272/watch-metadata'

I then went AFK so my computer went to sleep. When I came back and woke it, I saw Coder's memory usage start ballooning and was able to stop it in its tracks by pressing Ctrl+C on the cURL command. I'm not sure whether --limit-rate helped the reproduction, or if it was mainly due to what happens to the connection when the client machine sleeps.

EDIT: Second repro, using --limit-rate 10, no sleep. Surfaced ~11 minutes after launching cURL.

@mafredri (Member, Author) commented Nov 15, 2023

@spikecurtis I added a test in ab3cea2 that reproduces the leak. Using the DB error exit path is not the same mechanism that triggers it in the wild, but I didn't manage to reproduce it in any other way (in the test).

@mafredri marked this pull request as ready for review November 16, 2023 11:32
Comment on lines +1909 to +1910
//nolint:ineffassign // Release memory.
initialMD = nil
Member

My understanding here is that we only need the initial slice of metadata from the DB once in order to massage it into a map.
Without this, I assume the initial slice would stay around for the duration of the SSE request?
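
For context, the pattern under discussion looks roughly like this (a minimal sketch with an illustrative metadatum type, not the actual database model): the slice is read once to seed the per-key map, and dropping the reference afterwards lets the GC reclaim it rather than pinning it for the duration of the SSE request.

	package main

	import "fmt"

	// Illustrative stand-in for the DB row type.
	type metadatum struct {
		Key   string
		Value string
	}

	func main() {
		// Pretend this slice came from the initial database query.
		initialMD := []metadatum{{Key: "cpu", Value: "3%"}, {Key: "ram", Value: "40%"}}

		// The slice is only needed once, to build the per-key map.
		metadataMap := make(map[string]metadatum, len(initialMD))
		for _, datum := range initialMD {
			metadataMap[datum.Key] = datum
		}
		// Drop the only remaining reference so the GC can reclaim the slice
		// instead of keeping it alive for the lifetime of the request.
		initialMD = nil

		fmt.Println(len(metadataMap), initialMD == nil) // 2 true
	}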

@mafredri merged commit 198b56c into main Nov 16, 2023
@mafredri deleted the mafredri/fix-coderd-watch-meta-memory-leak branch November 16, 2023 15:03
@github-actions bot locked and limited conversation to collaborators Nov 16, 2023
Development

Successfully merging this pull request may close these issues: Potential Memory Leak in 2.3 (#10550)

4 participants