Skip to content

fix: ensure wsproxy MultiAgent is closed when websocket dies #11414

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

coadler
Copy link
Contributor

@coadler coadler commented Jan 4, 2024

The SingleTailnet behavior only checked to see if the MultiAgent was
closed, but the websocket error was not being propogated into the
MultiAgent, causing it to never be swapped for a new working one.

Fixes #11401

Before:

Coder Workspace Proxy v0.0.0-devel+85ff030 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001

View the Web UI: http://127.0.0.1:3001

==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:11:56.376 [warn]  net.workspace-proxy.servertailnet: broadcast server node to agents ...
    error= write message:
               github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*remoteMultiAgentHandler).writeJSON
                   /home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:524
             - failed to write msg: WebSocket closed: failed to read frame header: EOF

After:

Coder Workspace Proxy v0.0.0-devel+12f1878 - Your Self-Hosted Remote Development Platform
Started HTTP listener at http://0.0.0.0:3001

View the Web UI: http://127.0.0.1:3001

==> Logs will stream in below (press ctrl+c to gracefully exit):
2024-01-04 20:26:38.545 [warn]  net.workspace-proxy.servertailnet: multiagent closed, reinitializing
2024-01-04 20:26:38.546 [erro]  net.workspace-proxy.servertailnet: reinit multi agent ...
    error= dial coordinate websocket:
               github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
                   /home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
             - failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:38.587 [erro]  net.workspace-proxy.servertailnet: reinit multi agent ...
    error= dial coordinate websocket:
               github.com/coder/coder/v2/enterprise/wsproxy/wsproxysdk.(*Client).DialCoordinator
                   /home/coder/coder/enterprise/wsproxy/wsproxysdk/wsproxysdk.go:454
             - failed to WebSocket dial: failed to send handshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refusedhandshake request: Get "http://127.0.0.1:3000/api/v2/workspaceproxies/me/coordinate": dial tcp 127.0.0.1:3000: connect: connection refused
2024-01-04 20:26:40.446 [info]  net.workspace-proxy.servertailnet: successfully reinitialized multiagent  agents=0  took=1.900892615s

@coadler
Copy link
Contributor Author

coadler commented Jan 4, 2024

In this stack: 🚀 Improve WebSocket reliability and resource management

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @coadler and the rest of your teammates on Graphite Graphite

@coadler coadler requested a review from spikecurtis January 4, 2024 18:51
@coadler coadler force-pushed the colin/fixensurewsproxyMultiAgentisclosedwhenwebsocketdies branch from 50c92b4 to 12f1878 Compare January 4, 2024 20:24
Copy link
Contributor

@spikecurtis spikecurtis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work tracking down the bug!

I don't see any new tests though that will catch and prevent a regression, so please add

<-ctx.Done()
ma.Close()
}()

go func() {
defer cancel()
dec := json.NewDecoder(nc)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Re: lines 488 to 488]

I think it's worth dropping an INFO log here.

See this comment inline on Graphite.

@@ -472,6 +472,11 @@ func (c *Client) DialCoordinator(ctx context.Context) (agpl.MultiAgentConn, erro
OnRemove: func(agpl.Queue) { conn.Close(websocket.StatusGoingAway, "closed") },
}).Init()

go func() {
<-ctx.Done()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding this correctly, we're depending on the fact that the reader goroutine below cancels the context on a failed read.

I think we should also tear down the multi-agent on a failed write of subscription messages. It's unlikely that we'd have a failure that leaves the connection half-open (e.g. for reads but not writes), but such things are possible and you don't want the proxy limping on unable to subscribe to new agents.

func HeartbeatClose(ctx context.Context, exit func(), conn *websocket.Conn) {
ticker := time.NewTicker(30 * time.Second)
ticker := time.NewTicker(15 * time.Second)
defer ticker.Stop()

for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Re: lines 43 to 43]

Drop an INFO log here

See this comment inline on Graphite.

The `SingleTailnet` behavior only checked to see if the `MultiAgent` was
closed, but the websocket error was not being propogated into the
`MultiAgent`, causing it to never be swapped for a new working one.
@coadler coadler force-pushed the colin/fixensurewsproxyMultiAgentisclosedwhenwebsocketdies branch from 12f1878 to 72ad811 Compare January 11, 2024 16:35
@coadler coadler merged commit 4a08082 into main Jan 11, 2024
@coadler coadler deleted the colin/fixensurewsproxyMultiAgentisclosedwhenwebsocketdies branch January 11, 2024 17:37
@github-actions github-actions bot locked and limited conversation to collaborators Jan 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

workspace proxy fails to proxy; error "ensure agent: subscribe agent"
2 participants