Seen in a test flake: https://github.com/coder/coder/actions/runs/8921350830/job/24501290702?pr=13128

But the problem is a race condition in our product code.
The agent has a cascade of routines that need to use the RPC connection, and uses a channel called `manifestOK` to communicate that the manifest has been successfully retrieved.
Due to an "idiosyncratic" property of our WebsocketNetConn, a failed socket can sometimes manifest as a canceled context in the error. Our routine management code has special processing to handle context.Canceled to preserve some routines that need to keep running even if the main context is canceled, and so it swallows that error and doesn't propagate it to the other routines.
This leaves us in a deadlock if the websocket fails during the handleManifest routine. We won't close the manifestOK channel, but we also don't propagate an error to the other routines, so they hang forever, deadlocked until something external kills the agent.
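To make the failure mode concrete, here is a minimal Go sketch of the pattern (all names are illustrative stand-ins, not the actual agent code): the success-only channel is never closed, and the swallowed `context.Canceled` means the dependent routine has nothing to wake it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

func fetchManifest(ctx context.Context) error {
	// Simulate the WebsocketNetConn failure mode: the socket died, but
	// the error comes back as context.Canceled.
	return context.Canceled
}

func main() {
	ctx := context.Background()
	manifestOK := make(chan struct{})

	// handleManifest closes manifestOK only on success. If the socket
	// fails, it returns an error and the channel is never closed.
	handleManifest := func(ctx context.Context) error {
		if err := fetchManifest(ctx); err != nil {
			return err // manifestOK is never closed
		}
		close(manifestOK) // success-only signal
		return nil
	}

	go func() {
		err := handleManifest(ctx)
		// Stand-in for the routine manager: context.Canceled is swallowed
		// so graceful-shutdown routines keep running. Because the socket
		// error masquerades as context.Canceled, it is silently dropped
		// and never propagated to dependent routines.
		if err != nil && !errors.Is(err, context.Canceled) {
			fmt.Println("propagating error:", err)
		}
	}()

	// Dependent routine: blocks forever, because the channel is never
	// closed and no error ever arrives.
	select {
	case <-manifestOK:
		fmt.Println("manifest ready")
	case <-time.After(time.Second):
		fmt.Println("deadlocked (timed out here only for demo purposes)")
	}
}
```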
To fix, we need to do at least one of the following:

- Fix `WebsocketNetConn` so it won't return `context.Canceled` unless the input context really was canceled.
- Fix the `APIConnRoutineManager` so that we don't swallow `context.Canceled` unless the graceful context was actually canceled.
- Fix the communication between the `handleManifest` routine and its dependents, so that when it completes it always sends success or failure signal(s).
Fixes #13139
Using a bare channel to signal dependent goroutines means that we can only signal success, not failure, which leads to deadlock if we fail in a way that doesn't cause the whole `apiConnRoutineManager` to tear down routines.
Instead, we use a new object called a `checkpoint` that signals success or failure, so that dependent routines get unblocked if the routine they depend on fails.
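For illustration, here is a minimal sketch of what such a `checkpoint` might look like (an assumed stand-in, not the exact implementation merged in coder/coder): completing with an error unblocks waiters just as completing successfully does, so a dependent routine can never hang on a failure.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// checkpoint signals success or failure to dependent goroutines.
type checkpoint struct {
	done chan struct{}
	err  error
}

func newCheckpoint() *checkpoint {
	return &checkpoint{done: make(chan struct{})}
}

// complete marks the checkpoint finished, successfully (nil) or not.
// It must be called exactly once.
func (c *checkpoint) complete(err error) {
	c.err = err
	close(c.done)
}

// wait blocks until the checkpoint completes or ctx is done, and returns
// the error the checkpoint completed with.
func (c *checkpoint) wait(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-c.done:
		return c.err
	}
}

func main() {
	ctx := context.Background()
	manifest := newCheckpoint()

	go func() {
		// Fetching the manifest fails; dependents are still unblocked.
		manifest.complete(errors.New("websocket failed"))
	}()

	if err := manifest.wait(ctx); err != nil {
		fmt.Println("dependent routine unblocked with error:", err)
		return
	}
	fmt.Println("manifest ready")
}
```

Setting `err` before closing `done` gives waiters a happens-before guarantee via the channel close, so the read of `c.err` in `wait` is race-free.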
One cause of #13139 is a peculiar failure mode of `WebsocketNetConn` which causes it to return `context.Canceled` in some circumstances when the underlying websocket fails. We have special processing for that error in the `agent.run()` routine, which is erroneously being triggered.
Since we don't actually need the returned context from `WebsocketNetConn`, we can simplify and just use the netConn from the `websocket` library directly.
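As a rough sketch of that simplification, assuming the `nhooyr.io/websocket` package (the dial site, URL, and function names here are illustrative, not the actual agent code):

```go
package main

import (
	"context"
	"log"

	"nhooyr.io/websocket"
)

func dialAgentRPC(ctx context.Context, url string) error {
	conn, _, err := websocket.Dial(ctx, url, nil)
	if err != nil {
		return err
	}
	// Instead of wrapping the socket in our own WebsocketNetConn (whose
	// returned context conflated socket failure with cancellation), use
	// the library's adapter directly: a failed socket then surfaces as an
	// ordinary read/write error on the net.Conn, not as context.Canceled.
	netConn := websocket.NetConn(ctx, conn, websocket.MessageBinary)
	defer netConn.Close()

	// ... hand netConn to the RPC client ...
	return nil
}

func main() {
	// Placeholder URL for the sketch.
	if err := dialAgentRPC(context.Background(), "wss://example.com/rpc"); err != nil {
		log.Fatal(err)
	}
}
```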