Agent RPC socket failure during handleManifest can lock up the agent #13139


Closed
spikecurtis opened this issue May 3, 2024 · 0 comments · Fixed by #13141
Labels
s4 Internal bugs (e.g. test flakes), extreme edge cases, and bug risks


Seen in a test flake: https://github.com/coder/coder/actions/runs/8921350830/job/24501290702?pr=13128

But the problem is a race condition in our product code.

The agent runs a cascade of routines that all need the RPC connection, and it uses a channel called manifestOK to signal that the manifest has been successfully retrieved.

Due to an "idiosyncratic" property of our WebsocketNetConn, a failed socket can sometimes surface as context.Canceled in the returned error. Our routine management code special-cases context.Canceled in order to preserve routines that need to keep running even after the main context is canceled, so it swallows that error and doesn't propagate it to the other routines.

This leaves us deadlocked if the websocket fails during the handleManifest routine: we never close the manifestOK channel, but we also don't propagate an error to the other routines, so they block forever until something external kills the agent.
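
To make the failure mode concrete, here is a minimal Go sketch. The names mirror the ones above, but the structure is simplified and hypothetical rather than the actual agent code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// handleManifest closes manifestOK only on success. If the RPC socket
// fails, it returns an error and the channel is never closed. Here we
// simulate the WebsocketNetConn quirk: the failure surfaces as
// context.Canceled even though ctx was never canceled.
func handleManifest(ctx context.Context, manifestOK chan struct{}) error {
	return context.Canceled
}

func main() {
	ctx := context.Background()
	manifestOK := make(chan struct{})
	done := make(chan struct{})

	// Dependent routine: waits for the manifest before proceeding.
	go func() {
		defer close(done)
		select {
		case <-manifestOK:
			fmt.Println("got manifest, starting dependent work")
		case <-time.After(2 * time.Second): // stand-in for "hangs forever"
			fmt.Println("deadlocked: manifestOK never closed, no error propagated")
		}
	}()

	err := handleManifest(ctx, manifestOK)
	if errors.Is(err, context.Canceled) {
		// The routine manager's special case: treat cancellation as a
		// graceful shutdown and swallow the error instead of tearing
		// down the other routines.
		err = nil
	}
	_ = err // nothing propagated; the dependent routine stays blocked
	<-done
}
```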

To fix, we need to do at least one of the following:

  1. Fix WebsocketNetConn so it won't return context.Canceled unless the input context really was canceled.
  2. Fix the APIConnRoutineManager so that it doesn't swallow context.Canceled unless the graceful context was actually canceled (see the sketch after this list).
  3. Fix the communication between the handleManifest routine and its dependents, so that when it completes it always sends success or failure signal(s).
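
As a rough sketch of option 2, the manager would check its own graceful context before deciding an error is benign. The function name and gracefulCtx parameter here are illustrative, not the actual coder identifiers:

```go
import (
	"context"
	"errors"
)

// handleRoutineErr sketches option 2: swallow context.Canceled only when
// the graceful context really was canceled.
func handleRoutineErr(gracefulCtx context.Context, err error) error {
	if errors.Is(err, context.Canceled) && gracefulCtx.Err() != nil {
		return nil // expected: graceful shutdown in progress
	}
	return err // unexpected cancellation (e.g. socket failure): propagate it
}
```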
spikecurtis self-assigned this May 3, 2024
coder-labeler bot added the bug label May 3, 2024
spikecurtis added the s4 Internal bugs (e.g. test flakes), extreme edge cases, and bug risks label May 3, 2024
spikecurtis added a commit that referenced this issue May 6, 2024
Fixes #13139

Using a bare channel to signal dependent goroutines means that we can only signal success, not failure, which leads to deadlock if we fail in a way that doesn't cause the whole `apiConnRoutineManager` to tear down routines.

Instead, we use a new object called a `checkpoint` that signals success or failure, so that dependent routines get unblocked if the routine they depend on fails.
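
A minimal sketch of such a checkpoint, assuming a complete/wait API (the actual implementation in the PR may differ in detail):

```go
import (
	"context"
	"sync"
)

// checkpoint records a routine's outcome exactly once, so dependents
// always unblock with either success or failure.
type checkpoint struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newCheckpoint() *checkpoint {
	return &checkpoint{done: make(chan struct{})}
}

// complete records the outcome (nil on success) and unblocks all waiters.
// Calling it on every exit path (e.g. via defer) is what prevents the
// deadlock described above.
func (c *checkpoint) complete(err error) {
	c.once.Do(func() {
		c.err = err
		close(c.done)
	})
}

// wait blocks until the checkpoint completes or ctx is done, returning
// the recorded error so dependents can bail out on failure.
func (c *checkpoint) wait(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-c.done:
		return c.err
	}
}
```

With this, handleManifest completes the checkpoint on every return path, and dependent routines call wait instead of receiving from a bare channel.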
spikecurtis added a commit that referenced this issue May 6, 2024
One cause of #13139 is a peculiar failure mode of `WebsocketNetConn` which causes it to return `context.Canceled` in some circumstances when the underlying websocket fails.  We have special processing for that error in the `agent.run()` routine, which is erroneously being triggered.

Since we don't actually need the returned context from `WebsocketNetConn`, we can simplify and just use the netConn from the `websocket` library directly.
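
For reference, a sketch of what "use the netConn from the `websocket` library directly" can look like with nhooyr.io/websocket (the dial target and options here are illustrative, not the actual coder code):

```go
import (
	"context"
	"net"

	"nhooyr.io/websocket"
)

// dialAgentConn adapts a dialed websocket to a net.Conn via the library's
// own NetConn helper, instead of a custom wrapper that also returns a
// context. Read/write failures then surface as ordinary connection errors
// rather than context.Canceled.
func dialAgentConn(ctx context.Context, url string) (net.Conn, error) {
	ws, _, err := websocket.Dial(ctx, url, nil)
	if err != nil {
		return nil, err
	}
	// ctx bounds the connection's lifetime; it is no longer conflated
	// with individual read/write errors.
	return websocket.NetConn(ctx, ws, websocket.MessageBinary), nil
}
```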