-
Notifications
You must be signed in to change notification settings - Fork 887
fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729
Conversation
…ct outdated build Fixes #9721
@@ -703,7 +703,7 @@ func (c *Client) WorkspaceAgentLogsAfter(ctx context.Context, agentID uuid.UUID, | |||
} | |||
return nil, nil, ReadBodyAsError(res) | |||
} | |||
logChunks := make(chan []WorkspaceAgentLog) | |||
logChunks := make(chan []WorkspaceAgentLog, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Review: Bonus change to speed up detection of closed connection.
I'm afraid I don't understand what you mean by "lax pubsub propagation," or what was wrong with the previous test that caused the failure. While I agree it's an improvement to directly receive notification of workspace changes, I'm worried there is still a bug in log notifications. |
I was referring to how we separate different pars of a workspace into separate pubsub channels (in this case workspace and agent log updates), but you’re required to have the domain knowledge of how these channels are connected vs it being handled in places like workspace stop. The domain knowledge here being that there will be no more logs when the workspace is stopped. And what could instead happen is that the workspace stop also sends an eof on the agent log channels, hence lax.
The previous test has that second log push which, I’m guessing (since it’s impossible to confirm from the logs), raced with the newly started workspace build (since we don’t wait for build completion). The log push was intended to trigger the coderd loop to recheck if the agent is outdated which was the racy behavior.
Are you thinking about this particular case that lead to the test flake or some other buggy behavior with the log notifications? |
I don't think this explanation is consistent with the logs from the failed test, which show that the build completed a couple hundred ms before the second log is PATCHed.
That's why I'm still suspicious about the logic that notifies about new logs and triggers updates on the channel |
@spikecurtis that’s what I meant with it not being verifiable in the logs. We log HTTP requests when they end, but not when they start. There’s no telling whether or not it completed this instant or a second ago: Line 62 in b358e3d
Perhaps we should include start time as a kv as well. |
We compute the latency, which, along with the end time should indicate the start time. Line 41 in b358e3d
|
We don’t log the end time either, and we don’t know when that log itself will be scheduled to run. |
Because of lax pubsub propagation (e.g. agents channels aren't signaled that they're outdated), we were hitting the 1 minute timeout to detect that the workspace build was outdated leading to test flakes.
This changes improves outdated build detection by subscribing to workspace updates as well.
Fixes #9721