fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729

mafredri · 2023-09-18T11:21:25Z

Because of lax pubsub propagation (e.g. agents channels aren't signaled that they're outdated), we were hitting the 1 minute timeout to detect that the workspace build was outdated leading to test flakes.

This changes improves outdated build detection by subscribing to workspace updates as well.

Fixes #9721

…ct outdated build Fixes #9721

mafredri · 2023-09-18T11:23:17Z

codersdk/workspaceagents.go

@@ -703,7 +703,7 @@ func (c *Client) WorkspaceAgentLogsAfter(ctx context.Context, agentID uuid.UUID,
 		}
 		return nil, nil, ReadBodyAsError(res)
 	}
-	logChunks := make(chan []WorkspaceAgentLog)
+	logChunks := make(chan []WorkspaceAgentLog, 1)


Review: Bonus change to speed up detection of closed connection.

coderd/workspaceagents.go

spikecurtis · 2023-09-19T06:54:55Z

I'm afraid I don't understand what you mean by "lax pubsub propagation," or what was wrong with the previous test that caused the failure. While I agree it's an improvement to directly receive notification of workspace changes, I'm worried there is still a bug in log notifications.

mafredri · 2023-09-19T08:40:25Z

I'm afraid I don't understand what you mean by "lax pubsub propagation,"

I was referring to how we separate different pars of a workspace into separate pubsub channels (in this case workspace and agent log updates), but you’re required to have the domain knowledge of how these channels are connected vs it being handled in places like workspace stop. The domain knowledge here being that there will be no more logs when the workspace is stopped. And what could instead happen is that the workspace stop also sends an eof on the agent log channels, hence lax.

or what was wrong with the previous test that caused the failure.

The previous test has that second log push which, I’m guessing (since it’s impossible to confirm from the logs), raced with the newly started workspace build (since we don’t wait for build completion). The log push was intended to trigger the coderd loop to recheck if the agent is outdated which was the racy behavior.

While I agree it's an improvement to directly receive notification of workspace changes, I'm worried there is still a bug in log notifications.

Are you thinking about this particular case that lead to the test flake or some other buggy behavior with the log notifications?

spikecurtis · 2023-09-19T09:01:33Z

The previous test has that second log push which, I’m guessing (since it’s impossible to confirm from the logs), raced with the newly started workspace build (since we don’t wait for build completion).

I don't think this explanation is consistent with the logs from the failed test, which show that the build completed a couple hundred ms before the second log is PATCHed.

    t.go:85: 2023-09-18 07:15:32.500 [debu]  provisionerd.runner: sending CompletedJob  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4  template_name=stupefied-turing1-8pH  template_version=agitated_gould2  workspace_build_id=e000bb28-88be-4438-b78d-18ade7ede426  workspace_id=effb2a01-0e53-4ae2-95e9-5c85d72f9391  workspace_name=competent-wing4-wKu  workspace_owner=testuser  workspace_transition=start
    t.go:85: 2023-09-18 07:15:32.500 [info]  echo: recv done on Session  session_id=37bbc0bd-ce7c-481f-93dc-2fbe25f94c3f  error=EOF
    t.go:85: 2023-09-18 07:15:32.500 [debu]  inmem-provisionerd-serene_chatelet8: stage CompleteJob starting  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4
    t.go:85: 2023-09-18 07:15:32.501 [debu]  inmem-provisionerd-serene_chatelet8: triggering workspace notification after agent timeout  workspace_build_id=e000bb28-88be-4438-b78d-18ade7ede426  timeout=0s
    t.go:85: 2023-09-18 07:15:32.502 [debu]  inmem-provisionerd-serene_chatelet8: stage CompleteJob done  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4
    t.go:85: 2023-09-18 07:15:32.502 [debu]  provisionerd.runner: sent CompletedJob  job_id=34bea8ee-c5a1-4b78-a78c-1cba64ecb3c4  template_name=stupefied-turing1-8pH  template_version=agitated_gould2  workspace_build_id=e000bb28-88be-4438-b78d-18ade7ede426  workspace_id=effb2a01-0e53-4ae2-95e9-5c85d72f9391  workspace_name=competent-wing4-wKu  workspace_owner=testuser  workspace_transition=start
    t.go:85: 2023-09-18 07:15:32.502 [debu]  provisionerd: start of acquireAndRunOne
    t.go:85: 2023-09-18 07:15:32.502 [debu]  acquirer: acquiring job  worker_id=c11bf591-0daa-45f5-943a-daae2c371e99  provisioner_types="[echo terraform]"  tags={"scope":"organization"}
    t.go:85: 2023-09-18 07:15:32.502 [debu]  acquirer: got clearance to call database  worker_id=c11bf591-0daa-45f5-943a-daae2c371e99  provisioner_types="[echo terraform]"  tags={"scope":"organization"}
    t.go:85: 2023-09-18 07:15:32.502 [debu]  acquirer: no job available  worker_id=c11bf591-0daa-45f5-943a-daae2c371e99  provisioner_types="[echo terraform]"  tags={"scope":"organization"}
    t.go:85: 2023-09-18 07:15:32.505 [debu]  echo: cleaned up work directory  session_id=37bbc0bd-ce7c-481f-93dc-2fbe25f94c3f
    t.go:85: 2023-09-18 07:15:32.765 [debu]  PATCH  host=localhost:55352  path=/api/v2/workspaceagents/me/logs  proto=HTTP/1.1  remote_addr=127.0.0.1  took=0s  status_code=200  latency_ms=0  request_id=78d90611-9ebf-43eb-8eb2-8c14ebb8b7af
    workspaceagents_test.go:419: 
        	Error Trace:	C:/a/coder/coder/coderd/workspaceagents_test.go:419
        	Error:      	context done while waiting for logs close
        	Test:       	TestWorkspaceAgentStartupLogs/Close_logs_on_outdated_build
    t.go:85: 2023-09-18 07:15:51.396 [info]  provisionerd: attempting graceful shutdown
    t.go:85: 2023-09-18 07:15:51.396 [debu]  provisionerd: sending acquire job cancel

That's why I'm still suspicious about the logic that notifies about new logs and triggers updates on the channel

mafredri · 2023-09-19T09:19:10Z

@spikecurtis that’s what I meant with it not being verifiable in the logs. We log HTTP requests when they end, but not when they start.

There’s no telling whether or not it completed this instant or a second ago:

coder/coderd/httpmw/logger.go

Line 62 in b358e3d

tracing.RunWithoutSpan(r.Context(), func(ctx context.Context) {

Perhaps we should include start time as a kv as well.

spikecurtis · 2023-09-19T09:35:14Z

There’s no telling whether or not it completed this instant or a second ago

We compute the latency, which, along with the end time should indicate the start time.

coder/coderd/httpmw/logger.go

Line 41 in b358e3d

slog.F("latency_ms", float64(time.Since(start)/time.Millisecond)),

mafredri · 2023-09-19T09:37:13Z

We compute the latency, which, along with the end time should indicate the start time.

coder/coderd/httpmw/logger.go

Line 41 in b358e3d

slog.F("latency_ms", float64(time.Since(start)/time.Millisecond)),

We don’t log the end time either, and we don’t know when that log itself will be scheduled to run.

…w-build

fix(coderd): Subscribe to workspace when streaming agent logs to dete…

96a912c

…ct outdated build Fixes #9721

mafredri self-assigned this Sep 18, 2023

mafredri requested a review from spikecurtis September 18, 2023 11:21

mafredri changed the title ~~fix(coderd): Subscribe to workspace when streaming agent logs to detect outdated build~~ fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build Sep 18, 2023

mafredri commented Sep 18, 2023

View reviewed changes

coderd/workspaceagents.go Outdated Show resolved Hide resolved

Update workspaceagents.go

2de6c3a

add debug logs and sql no rows handling

e0c0a70

mafredri mentioned this pull request Sep 19, 2023

feat(coderd/httpmw): log start timestamp for http requests #9776

Merged

spikecurtis approved these changes Sep 19, 2023

View reviewed changes

Merge branch 'main' into mafredri/fix-workspace-agent-log-close-on-ne…

6e41ada

…w-build

mafredri merged commit 530dd9d into main Sep 19, 2023

mafredri deleted the mafredri/fix-workspace-agent-log-close-on-new-build branch September 19, 2023 17:02

github-actions bot locked and limited conversation to collaborators Sep 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729

fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729

Uh oh!

mafredri commented Sep 18, 2023

Uh oh!

mafredri Sep 18, 2023

Uh oh!

Uh oh!

spikecurtis commented Sep 19, 2023

Uh oh!

mafredri commented Sep 19, 2023 •

edited

Loading

Uh oh!

spikecurtis commented Sep 19, 2023

Uh oh!

mafredri commented Sep 19, 2023

Uh oh!

spikecurtis commented Sep 19, 2023

Uh oh!

mafredri commented Sep 19, 2023 •

edited

Loading

Uh oh!

Uh oh!

fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729

fix(coderd): subscribe to workspace when streaming agent logs to detect outdated build #9729

Uh oh!

Conversation

mafredri commented Sep 18, 2023

Uh oh!

mafredri Sep 18, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

spikecurtis commented Sep 19, 2023

Uh oh!

mafredri commented Sep 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

spikecurtis commented Sep 19, 2023

Uh oh!

mafredri commented Sep 19, 2023

Uh oh!

spikecurtis commented Sep 19, 2023

Uh oh!

mafredri commented Sep 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

mafredri commented Sep 19, 2023 •

edited

Loading

mafredri commented Sep 19, 2023 •

edited

Loading