Description
Seen in a test flake: https://github.com/coder/coder/actions/runs/8921350830/job/24501290702?pr=13128
But, the problem is a race condition in our product code.
The agent has a cascade of routines that need to use the RPC connection, and uses a channel called manifestOK
to communicate that the manifest has been successfully retrieved.
Due to an "idiosyncratic" property of our WebsocketNetConn
, a failed socket can sometimes manifest as a canceled context in the error. Our routine management code has special processing to handle context.Canceled
to preserve some routines that need to keep running even if the main context is canceled, and so it swallows that error and doesn't propagate it to the other routines.
This leaves us in a deadlock if the websocket fails during the handleManifest
routine. We won't close the manifestOK
channel, but we also don't propagate an error to the other routines, so they hang forever, deadlocked until something external kills the agent.
To fix, we need to do at least one of the following:
- Fix
WebsocketNetConn
so it won't returncontext.Canceled
unless the input context really was canceled - Fix the
APIConnRoutineManager
so that we don't swallowcontext.Canceled
unless the graceful context was actually canceled. - Fix the communication between the
handleManifest
routine and its dependents, so that when it completes it always sends success or failure signal(s)