Agent RPC socket failure during handleManifest can lock up the agent #13139

Closed
@spikecurtis

Description

Seen in a test flake: https://github.com/coder/coder/actions/runs/8921350830/job/24501290702?pr=13128

However, the underlying problem is a race condition in our product code, not in the test.

The agent has a cascade of routines that need to use the RPC connection, and uses a channel called manifestOK to communicate that the manifest has been successfully retrieved.

Due to an "idiosyncratic" property of our WebsocketNetConn, a failed socket can sometimes surface as a context.Canceled error. Our routine management code handles context.Canceled specially, to preserve some routines that need to keep running even if the main context is canceled, and so it swallows that error and doesn't propagate it to the other routines.

This leaves us in a deadlock if the websocket fails during the handleManifest routine. We won't close the manifestOK channel, but we also don't propagate an error to the other routines, so they hang forever, deadlocked until something external kills the agent.

To fix, we need to do at least one of the following:

  1. Fix WebsocketNetConn so it won't return context.Canceled unless the input context really was canceled.
  2. Fix the APIConnRoutineManager so that we don't swallow context.Canceled unless the graceful context was actually canceled.
  3. Fix the communication between the handleManifest routine and its dependents, so that when it completes it always sends success or failure signal(s).

Labels

s4: Internal bugs (e.g. test flakes), extreme edge cases, and bug risks
