Agent RPC socket failure during handleManifest can lock up the agent #13139


Closed
spikecurtis opened this issue May 3, 2024 · 0 comments · Fixed by #13141
Labels
s4 Internal bugs (e.g. test flakes), extreme edge cases, and bug risks


Seen in a test flake: https://github.com/coder/coder/actions/runs/8921350830/job/24501290702?pr=13128

But the problem is a race condition in our product code.

The agent runs a cascade of routines that all need the RPC connection, and it uses a channel called manifestOK to signal that the manifest has been successfully retrieved.

Due to an "idiosyncratic" property of our WebsocketNetConn, a failed socket can sometimes surface as context.Canceled in the returned error. Our routine management code special-cases context.Canceled in order to preserve routines that need to keep running even after the main context is canceled, so it swallows that error and doesn't propagate it to the other routines.

This leaves us deadlocked if the websocket fails during the handleManifest routine: we never close the manifestOK channel, but we also don't propagate an error to the other routines, so they block forever until something external kills the agent.
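
To make the failure mode concrete, here is a minimal Go sketch. The names mirror the ones above, but the structure is simplified and hypothetical rather than the actual agent code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// handleManifest closes manifestOK only on success. If the RPC socket
// fails, it returns an error and the channel is never closed. Here we
// simulate the WebsocketNetConn quirk: the failure surfaces as
// context.Canceled even though ctx was never canceled.
func handleManifest(ctx context.Context, manifestOK chan struct{}) error {
	return context.Canceled
}

func main() {
	ctx := context.Background()
	manifestOK := make(chan struct{})
	done := make(chan struct{})

	// Dependent routine: waits for the manifest before proceeding.
	go func() {
		defer close(done)
		select {
		case <-manifestOK:
			fmt.Println("got manifest, starting dependent work")
		case <-time.After(2 * time.Second): // stand-in for "hangs forever"
			fmt.Println("deadlocked: manifestOK never closed, no error propagated")
		}
	}()

	err := handleManifest(ctx, manifestOK)
	if errors.Is(err, context.Canceled) {
		// The routine manager's special case: treat cancellation as a
		// graceful shutdown and swallow the error instead of tearing
		// down the other routines.
		err = nil
	}
	_ = err // nothing propagated; the dependent routine stays blocked
	<-done
}
```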

To fix, we need to do at least one of the following:

  1. Fix WebsocketNetConn so it won't return context.Canceled unless the input context really was canceled.
  2. Fix the APIConnRoutineManager so that it doesn't swallow context.Canceled unless the graceful context was actually canceled (see the sketch after this list).
  3. Fix the communication between the handleManifest routine and its dependents, so that when it completes it always sends success or failure signal(s).
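
As a rough sketch of option 2, the manager would check its own graceful context before deciding an error is benign. The function name and gracefulCtx parameter here are illustrative, not the actual coder identifiers:

```go
import (
	"context"
	"errors"
)

// handleRoutineErr sketches option 2: swallow context.Canceled only when
// the graceful context really was canceled.
func handleRoutineErr(gracefulCtx context.Context, err error) error {
	if errors.Is(err, context.Canceled) && gracefulCtx.Err() != nil {
		return nil // expected: graceful shutdown in progress
	}
	return err // unexpected cancellation (e.g. socket failure): propagate it
}
```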
spikecurtis self-assigned this May 3, 2024
coder-labeler bot added the bug label May 3, 2024
spikecurtis added the s4 Internal bugs (e.g. test flakes), extreme edge cases, and bug risks label May 3, 2024
spikecurtis added a commit that referenced this issue May 6, 2024
Fixes #13139

Using a bare channel to signal dependent goroutines means that we can only signal success, not failure, which leads to deadlock if we fail in a way that doesn't cause the whole `apiConnRoutineManager` to tear down routines.

Instead, we use a new object called a `checkpoint` that signals success or failure, so that dependent routines get unblocked if the routine they depend on fails.
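
A minimal sketch of such a checkpoint, assuming a complete/wait API (the actual implementation in the PR may differ in detail):

```go
import (
	"context"
	"sync"
)

// checkpoint records a routine's outcome exactly once, so dependents
// always unblock with either success or failure.
type checkpoint struct {
	once sync.Once
	done chan struct{}
	err  error
}

func newCheckpoint() *checkpoint {
	return &checkpoint{done: make(chan struct{})}
}

// complete records the outcome (nil on success) and unblocks all waiters.
// Calling it on every exit path (e.g. via defer) is what prevents the
// deadlock described above.
func (c *checkpoint) complete(err error) {
	c.once.Do(func() {
		c.err = err
		close(c.done)
	})
}

// wait blocks until the checkpoint completes or ctx is done, returning
// the recorded error so dependents can bail out on failure.
func (c *checkpoint) wait(ctx context.Context) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-c.done:
		return c.err
	}
}
```

With this, handleManifest completes the checkpoint on every return path, and dependent routines call wait instead of receiving from a bare channel.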
spikecurtis added a commit that referenced this issue May 6, 2024
One cause of #13139 is a peculiar failure mode of `WebsocketNetConn` which causes it to return `context.Canceled` in some circumstances when the underlying websocket fails.  We have special processing for that error in the `agent.run()` routine, which is erroneously being triggered.

Since we don't actually need the returned context from `WebsocketNetConn`, we can simplify and just use the netConn from the `websocket` library directly.
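
For reference, a sketch of what "use the netConn from the `websocket` library directly" can look like with nhooyr.io/websocket (the dial target and options here are illustrative, not the actual coder code):

```go
import (
	"context"
	"net"

	"nhooyr.io/websocket"
)

// dialAgentConn adapts a dialed websocket to a net.Conn via the library's
// own NetConn helper, instead of a custom wrapper that also returns a
// context. Read/write failures then surface as ordinary connection errors
// rather than context.Canceled.
func dialAgentConn(ctx context.Context, url string) (net.Conn, error) {
	ws, _, err := websocket.Dial(ctx, url, nil)
	if err != nil {
		return nil, err
	}
	// ctx bounds the connection's lifetime; it is no longer conflated
	// with individual read/write errors.
	return websocket.NetConn(ctx, ws, websocket.MessageBinary), nil
}
```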