
fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729


Merged
merged 4 commits into from
Sep 19, 2023

Conversation

mafredri
Member

Because of lax pubsub propagation (e.g. agent channels aren't signaled when they're outdated), we were hitting the 1-minute timeout to detect that the workspace build was outdated, leading to test flakes.

This change improves outdated-build detection by subscribing to workspace updates as well.

Fixes #9721

@mafredri mafredri self-assigned this Sep 18, 2023
@mafredri mafredri changed the title fix(coderd): Subscribe to workspace when streaming agent logs to detect outdated build fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build Sep 18, 2023
@@ -703,7 +703,7 @@ func (c *Client) WorkspaceAgentLogsAfter(ctx context.Context, agentID uuid.UUID,
}
return nil, nil, ReadBodyAsError(res)
}
-	logChunks := make(chan []WorkspaceAgentLog)
+	logChunks := make(chan []WorkspaceAgentLog, 1)
Member Author


Review: Bonus change to speed up detection of closed connection.

@spikecurtis
Contributor

I'm afraid I don't understand what you mean by "lax pubsub propagation," or what was wrong with the previous test that caused the failure. While I agree it's an improvement to directly receive notification of workspace changes, I'm worried there is still a bug in log notifications.

@mafredri
Member Author

mafredri commented Sep 19, 2023

I'm afraid I don't understand what you mean by "lax pubsub propagation,"

I was referring to how we separate different parts of a workspace into separate pubsub channels (in this case, workspace updates and agent log updates). You're required to have the domain knowledge of how these channels are connected, rather than it being handled in places like workspace stop; the domain knowledge here being that there will be no more logs once the workspace is stopped. What could instead happen is that the workspace stop also sends an EOF on the agent log channels, hence "lax".

or what was wrong with the previous test that caused the failure.

The previous test had that second log push which, I’m guessing (since it’s impossible to confirm from the logs), raced with the newly started workspace build (we don’t wait for build completion). The log push was intended to trigger the coderd loop to re-check whether the agent is outdated, and that was the racy behavior.

While I agree it's an improvement to directly receive notification of workspace changes, I'm worried there is still a bug in log notifications.

Are you thinking about this particular case that led to the test flake, or some other buggy behavior with the log notifications?

@spikecurtis
Contributor

The previous test has that second log push which, I’m guessing (since it’s impossible to confirm from the logs), raced with the newly started workspace build (since we don’t wait for build completion).

I don't think this explanation is consistent with the logs from the failed test, which show that the build completed a couple hundred ms before the second log is PATCHed.

    t.go:85: 2023-09-18 07:15:32.500 [debu]  provisionerd.runner: sending CompletedJob  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4  template_name=stupefied-turing1-8pH  template_version=agitated_gould2  workspace_build_id=e000bb28-88be-4438-b78d-18ade7ede426  workspace_id=effb2a01-0e53-4ae2-95e9-5c85d72f9391  workspace_name=competent-wing4-wKu  workspace_owner=testuser  workspace_transition=start
    t.go:85: 2023-09-18 07:15:32.500 [info]  echo: recv done on Session  session_id=37bbc0bd-ce7c-481f-93dc-2fbe25f94c3f  error=EOF
    t.go:85: 2023-09-18 07:15:32.500 [debu]  inmem-provisionerd-serene_chatelet8: stage CompleteJob starting  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4
    t.go:85: 2023-09-18 07:15:32.501 [debu]  inmem-provisionerd-serene_chatelet8: triggering workspace notification after agent timeout  workspace_build_id=e000bb28-88be-4438-b78d-18ade7ede426  timeout=0s
    t.go:85: 2023-09-18 07:15:32.502 [debu]  inmem-provisionerd-serene_chatelet8: stage CompleteJob done  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4
    t.go:85: 2023-09-18 07:15:32.502 [debu]  provisionerd.runner: sent CompletedJob  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4  template_name=stupefied-turing1-8pH  template_version=agitated_gould2  workspace_build_id=e000bb28-88be-4438-b78d-18ade7ede426  workspace_id=effb2a01-0e53-4ae2-95e9-5c85d72f9391  workspace_name=competent-wing4-wKu  workspace_owner=testuser  workspace_transition=start
    t.go:85: 2023-09-18 07:15:32.502 [debu]  provisionerd: start of acquireAndRunOne
    t.go:85: 2023-09-18 07:15:32.502 [debu]  acquirer: acquiring job  worker_id=c11bf591-0daa-45f5-943a-daae2c371e99  provisioner_types="[echo terraform]"  tags={"scope":"organization"}
    t.go:85: 2023-09-18 07:15:32.502 [debu]  acquirer: got clearance to call database  worker_id=c11bf591-0daa-45f5-943a-daae2c371e99  provisioner_types="[echo terraform]"  tags={"scope":"organization"}
    t.go:85: 2023-09-18 07:15:32.502 [debu]  acquirer: no job available  worker_id=c11bf591-0daa-45f5-943a-daae2c371e99  provisioner_types="[echo terraform]"  tags={"scope":"organization"}
    t.go:85: 2023-09-18 07:15:32.505 [debu]  echo: cleaned up work directory  session_id=37bbc0bd-ce7c-481f-93dc-2fbe25f94c3f
    t.go:85: 2023-09-18 07:15:32.765 [debu]  PATCH  host=localhost:55352  path=/api/v2/workspaceagents/me/logs  proto=HTTP/1.1  remote_addr=127.0.0.1  took=0s  status_code=200  latency_ms=0  request_id=78d90611-9ebf-43eb-8eb2-8c14ebb8b7af
    workspaceagents_test.go:419: 
        	Error Trace:	C:/a/coder/coder/coderd/workspaceagents_test.go:419
        	Error:      	context done while waiting for logs close
        	Test:       	TestWorkspaceAgentStartupLogs/Close_logs_on_outdated_build
    t.go:85: 2023-09-18 07:15:51.396 [info]  provisionerd: attempting graceful shutdown
    t.go:85: 2023-09-18 07:15:51.396 [debu]  provisionerd: sending acquire job cancel

That's why I'm still suspicious about the logic that notifies about new logs and triggers updates on the channel.

@mafredri
Member Author

@spikecurtis that’s what I meant by it not being verifiable in the logs. We log HTTP requests when they end, but not when they start.

There’s no telling whether the request completed just now or a second ago:

tracing.RunWithoutSpan(r.Context(), func(ctx context.Context) {

Perhaps we should include the start time as a key-value pair as well.

@spikecurtis
Contributor

There’s no telling whether or not it completed this instant or a second ago

We compute the latency, which, along with the end time, should indicate the start time.

slog.F("latency_ms", float64(time.Since(start)/time.Millisecond)),

@mafredri
Member Author

mafredri commented Sep 19, 2023

We compute the latency, which, along with the end time should indicate the start time.

slog.F("latency_ms", float64(time.Since(start)/time.Millisecond)),

We don’t log the end time either, and we don’t know when that log line itself will be scheduled to run.

@mafredri mafredri merged commit 530dd9d into main Sep 19, 2023
@mafredri mafredri deleted the mafredri/fix-workspace-agent-log-close-on-new-build branch September 19, 2023 17:02
@github-actions github-actions bot locked and limited conversation to collaborators Sep 19, 2023
Development

Successfully merging this pull request may close these issues.

test flake: TestWorkspaceAgentStartupLogs/Close_logs_on_outdated_build