fix(coderd): fix memory leak in watchWorkspaceAgentMetadata #10685


Merged

mafredri merged 5 commits into main from mafredri/fix-coderd-watch-meta-memory-leak on Nov 16, 2023

Conversation

@mafredri (Member) commented Nov 14, 2023

A change to agent metadata processing introduced in https://github.com/coder/coder/releases/tag/v2.3.1 contained a logic bug that leads to a memory leak. However, it's not entirely clear how the leak can be triggered. I managed to trigger it twice, but the exact steps elude me. One time coderd was left with a permanent memory consumption of 1.7 GB, and the other time it was OOM killed (ironically, after I had given up on reproducing the issue).

To trigger the memory leak that was fixed here, the coderd watch-metadata handler has to 1) subscribe to metadata updates via pubsub and 2) somewhere along the way, begin teardown of the handler or exit the consumer running in a goroutine without getting as far as unsubscribing from the pubsub.

Potential triggers for the bug:

  1. Slow/hung database read
  2. Slow/hung write on the HTTP connection (SSE)
  3. Unsubscribe from pubsub failing -- seems unlikely

This is not guaranteed to fix #10550, but it's a good candidate.

Fixes #10550
Closes https://github.com/coder/customers/issues/334
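
As a rough, self-contained illustration of that failure shape (the toy pubsub and all names below are illustrative stand-ins, not coderd's actual API): a subscription whose consumer goroutine has gone away, and which is never cancelled, keeps buffering updates for as long as publishes arrive.

	package main

	import (
		"fmt"
		"sync"
	)

	// Toy stand-in for the real pubsub: callbacks registered here keep running
	// until the subscription is cancelled.
	type pubsub struct {
		mu        sync.Mutex
		listeners map[int]func(keys []string)
		nextID    int
	}

	func (p *pubsub) subscribe(fn func(keys []string)) (cancel func()) {
		p.mu.Lock()
		defer p.mu.Unlock()
		id := p.nextID
		p.nextID++
		p.listeners[id] = fn
		return func() {
			p.mu.Lock()
			defer p.mu.Unlock()
			delete(p.listeners, id)
		}
	}

	func (p *pubsub) publish(keys []string) {
		p.mu.Lock()
		defer p.mu.Unlock()
		for _, fn := range p.listeners {
			fn(keys)
		}
	}

	func main() {
		ps := &pubsub{listeners: map[int]func(keys []string){}}

		var mu sync.Mutex
		pending := []string{}

		// The handler subscribes and buffers keys for a consumer goroutine.
		cancel := ps.subscribe(func(keys []string) {
			mu.Lock()
			defer mu.Unlock()
			pending = append(pending, keys...)
		})
		_ = cancel // The bug shape: the consumer exits early and cancel() is never called.

		// Updates keep arriving after the consumer is gone, so the buffer only grows.
		for i := 0; i < 1000; i++ {
			ps.publish([]string{"cpu", "ram"})
		}

		mu.Lock()
		fmt.Println("keys still buffered with nobody left to read them:", len(pending))
		mu.Unlock()
	}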

Comment on lines 1862 to 1877

Before:

	newKeysSet := make(map[string]struct{})
	for _, key := range payload.Keys {
		newKeysSet[key] = struct{}{}
	}
	keys := prev.Keys
	for _, key := range prev.Keys {
		if _, ok := newKeysSet[key]; !ok {
			keys = append(keys, key)
		}
	}
	payload.Keys = keys

After:

	prevKeysSet := make(map[string]struct{}, len(prev.Keys))
	for _, key := range prev.Keys {
		prevKeysSet[key] = struct{}{}
	}
	for _, key := range payload.Keys {
		if _, ok := prevKeysSet[key]; !ok {
			prev.Keys = append(prev.Keys, key)
		}
	}
	payload.Keys = prev.Keys
@mafredri (Member, Author)

This is the actual memory leak (logic bug).
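
To make the growth concrete, here is a self-contained toy reproduction of the "Before" merge logic above (the buggyMerge name and the sample keys are illustrative, not from the codebase): each time an update is merged into the same pending payload, the key slice grows instead of converging to a union.

	package main

	import "fmt"

	// buggyMerge mirrors the "Before" version above: the intent was presumably
	// to union the pending (prev) keys with the incoming payload keys, but the
	// result is seeded with prev.Keys and then prev keys are appended to it
	// again, so the slice grows on every merge instead of converging.
	func buggyMerge(prevKeys, payloadKeys []string) []string {
		newKeysSet := make(map[string]struct{})
		for _, key := range payloadKeys {
			newKeysSet[key] = struct{}{}
		}
		keys := prevKeys
		for _, key := range prevKeys {
			if _, ok := newKeysSet[key]; !ok {
				keys = append(keys, key)
			}
		}
		return keys
	}

	func main() {
		pending := []string{"cpu", "ram"}
		for i := 1; i <= 5; i++ {
			// Each pubsub update for an unrelated key re-merges the pending keys.
			pending = buggyMerge(pending, []string{"disk"})
			fmt.Printf("merge %d: %d pending keys\n", i, len(pending))
		}
		// Prints 4, 8, 16, 32, 64: the pending key slice doubles on every
		// update that arrives while the consumer isn't draining it.
	}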

Comment on lines 1899 to 1910

		cancel()
		<-sseSenderClosed
	}()
	// Synchronize cancellation from SSE -> context, this lets us simplify the
	// cancellation logic.
	go func() {
		select {
		case <-ctx.Done():
		case <-sseSenderClosed:
			cancel()
		}
	}()
@mafredri (Member, Author) Nov 14, 2023

These changes were added so that we can be sure we're not blocking on <-sseSenderClosed and can rely on ctx only for shutdown signaling.
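
As a standalone illustration of that pattern (toy names, not the handler's actual variables), the bridging goroutine lets everything downstream select on ctx alone:

	package main

	import (
		"context"
		"fmt"
		"time"
	)

	func main() {
		ctx, cancel := context.WithCancel(context.Background())
		defer cancel()

		senderClosed := make(chan struct{})

		// Bridge the sender's closed signal into context cancellation so the
		// rest of the handler only ever has to watch ctx.
		go func() {
			select {
			case <-ctx.Done():
			case <-senderClosed:
				cancel()
			}
		}()

		// Simulate the SSE sender shutting down (e.g. the client disconnecting).
		go func() {
			time.Sleep(10 * time.Millisecond)
			close(senderClosed)
		}()

		<-ctx.Done()
		fmt.Println("handler observed shutdown via ctx:", ctx.Err())
	}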

	// Ignore pings.
	if sse.Type == ServerSentEventTypePing {
		continue
	}
@mafredri (Member, Author)

This was a bug in WatchWorkspaceAgentMetadata: it led to hitting an error below (nil data) and breaking the watcher. This could be a cause (or the cause) of the agent metadata tests having been flaky in the past.
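
A self-contained toy model of that client-side failure (the serverSentEvent type and event names below are illustrative, not codersdk's): a "ping" event carries no data, so without the filter the first keep-alive fails to decode and the watch loop exits.

	package main

	import (
		"encoding/json"
		"fmt"
	)

	// Illustrative stand-in for the SSE event type.
	type serverSentEvent struct {
		Type string
		Data json.RawMessage
	}

	func main() {
		events := []serverSentEvent{
			{Type: "data", Data: json.RawMessage(`[{"key":"cpu","value":"3%"}]`)},
			{Type: "ping"}, // keep-alive event with no payload
			{Type: "data", Data: json.RawMessage(`[{"key":"ram","value":"40%"}]`)},
		}

		for _, sse := range events {
			// Without this check, the ping's nil Data fails to unmarshal and
			// the watcher exits with an error on the first keep-alive.
			if sse.Type == "ping" {
				continue
			}
			var md []map[string]string
			if err := json.Unmarshal(sse.Data, &md); err != nil {
				fmt.Println("watch broken:", err)
				return
			}
			fmt.Println("update:", md)
		}
	}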

@spikecurtis (Contributor)

FWIW, I think we'd hit this bug on an ordinary DB fetch error, not just a slow or hung fetch.

That's because we had

	sseSendEvent, sseSenderClosed, err := httpapi.ServerSentEventSender(rw, r)
	if err != nil {
		httpapi.Write(ctx, rw, http.StatusInternalServerError, codersdk.Response{
			Message: "Internal error setting up server-sent events.",
			Detail:  err.Error(),
		})
		return
	}
	// Prevent handler from returning until the sender is closed.
	defer func() {
		<-sseSenderClosed
	}()

and once you call this, the only things that allow sseSenderClosed to close are r.Context() expiring or a write error while sending an event.

So, when we do

	md, err := api.Database.GetWorkspaceAgentMetadata(ctx, database.GetWorkspaceAgentMetadataParams{
		WorkspaceAgentID: workspaceAgent.ID,
		Keys:             nil,
	})
	if err != nil {
		// If we can't successfully pull the initial metadata, pubsub
		// updates will be no-op so we may as well terminate the
		// connection early.
		httpapi.InternalServerError(rw, err)
		return
	}

httpapi.InternalServerError(rw, err) doesn't actually shut down the SSE and, if anything, probably just corrupts the SSE stream. At that point it sort of depends on what the client does: if they cleanly hang up, then maybe r.Context() gets closed, but if they just ignore the malformed data and keep reading, then we just block, since the server will send no more events.

@spikecurtis (Contributor)

I like what you've done with squirreling away a cancelable context on the Request, but perhaps it makes sense to just accept a context in httpapi.ServerSentEventSender() instead of the Request. That might make it a little more obvious how to cancel.

One other thing I'd like to see on this PR is a test that reproduces the issue with the old code (e.g. by triggering a database error) and passes with the new code.

@mafredri (Member, Author) commented Nov 15, 2023

FWIW, I think we'd hit this bug on an ordinary DB fetch error, not just a slow or hung fetch.

Yeah, I agree. I know that wasn't the cause in my repro though since my logs were missing the line indicating a DB error happened.

Good catch on the httpapi.InternalServerError; we should make sure these invocations fail or are not permitted on SSE connections, or are automatically translated to an SSE error payload (but perhaps that's too much magic).
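
To illustrate the kind of guard being floated here (purely hypothetical; this is not part of the PR and not coderd's httpapi): once the stream has started, an error could be reported as an in-band SSE event instead of a raw JSON body that would corrupt the stream.

	package main

	import (
		"encoding/json"
		"fmt"
		"net/http"
		"net/http/httptest"
	)

	// Hypothetical guard: before the SSE stream starts, a normal JSON error
	// response is fine; afterwards, the error is sent as an SSE "error" event.
	func writeError(rw http.ResponseWriter, sseStarted bool, err error) {
		if !sseStarted {
			rw.Header().Set("Content-Type", "application/json")
			rw.WriteHeader(http.StatusInternalServerError)
			_ = json.NewEncoder(rw).Encode(map[string]string{"message": err.Error()})
			return
		}
		// The stream is already established; interleaving a JSON body here
		// would corrupt it, so report the error in-band instead.
		data, _ := json.Marshal(map[string]string{"message": err.Error()})
		fmt.Fprintf(rw, "event: error\ndata: %s\n\n", data)
		if f, ok := rw.(http.Flusher); ok {
			f.Flush()
		}
	}

	func main() {
		rec := httptest.NewRecorder()
		writeError(rec, true, fmt.Errorf("database unreachable"))
		fmt.Print(rec.Body.String())
		// event: error
		// data: {"message":"database unreachable"}
	}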

I like what you've done with squirreling away a cancelable context on the Request, but perhaps it makes sense to just accept a context in httpapi.ServerSentEventSender() instead of the Request. That might make it a little more obvious how to cancel.

Big agree. I wasn't super happy with the squirreling, so I was thinking I'd take a look at refactoring the SSE sender in a separate PR. I do think it should at minimum be closeable, not just "hoping" for a close. If we can find an alternative to using contexts that outlive their function call, I think that'd be good too.


I managed to reproduce the leak by issuing the following cURL command (the agent is v2.2.1, which is more efficient at triggering the ballooning due to the old metadata implementation):

curl \
  -H 'Coder-Session-Token: ...' \
  -H 'Accept: text/event-stream' \
  --limit-rate 1 \
  'http://my-coder:7080/api/v2/workspaceagents/025226f8-a637-4ea4-8fea-620ddbf81272/watch-metadata'

I then went AFK so my computer went to sleep. When I came back and woke it, I saw Coder's memory usage start ballooning and was able to stop it in its tracks by pressing Ctrl+C on the cURL command. I'm not sure whether --limit-rate helped the reproduction, or if it was mainly due to what happens to the connection when the client machine sleeps.

EDIT: Second repro, using --limit-rate 10, no sleep. Surfaced ~11 minutes after launching cURL.

@mafredri (Member, Author) commented Nov 15, 2023

@spikecurtis I added a test in ab3cea2 that reproduces the leak. Using the DB error exit path is not the same mechanism that triggers it in the wild, but I didn't manage to reproduce it in any other way (in the test).

@mafredri marked this pull request as ready for review November 16, 2023 11:32
Comment on lines +1909 to +1910
//nolint:ineffassign // Release memory.
initialMD = nil
Member

My understanding here is that we only need the initial slice of metadata from the DB once in order to massage it into a map.
Without this, I assume the initial slice would stay around for the duration of the SSE request?
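
For context, the pattern under discussion looks roughly like this (a minimal sketch with an illustrative metadatum type, not the actual database model): the slice is read once to seed the per-key map, and dropping the reference afterwards lets the GC reclaim it rather than pinning it for the duration of the SSE request.

	package main

	import "fmt"

	// Illustrative stand-in for the DB row type.
	type metadatum struct {
		Key   string
		Value string
	}

	func main() {
		// Pretend this slice came from the initial database query.
		initialMD := []metadatum{{Key: "cpu", Value: "3%"}, {Key: "ram", Value: "40%"}}

		// The slice is only needed once, to build the per-key map.
		metadataMap := make(map[string]metadatum, len(initialMD))
		for _, datum := range initialMD {
			metadataMap[datum.Key] = datum
		}
		// Drop the only remaining reference so the GC can reclaim the slice
		// instead of keeping it alive for the lifetime of the request.
		initialMD = nil

		fmt.Println(len(metadataMap), initialMD == nil) // 2 true
	}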

@mafredri merged commit 198b56c into main Nov 16, 2023
@mafredri deleted the mafredri/fix-coderd-watch-meta-memory-leak branch November 16, 2023 15:03
@github-actions bot locked and limited conversation to collaborators Nov 16, 2023
Development

Successfully merging this pull request may close these issues: Potential Memory Leak in 2.3 (#10550)

4 participants