streaming job logs race condition means not all logs are returned / TestDelete flake #2603

Closed
@spikecurtis

Description

Relates to #2347

The HTTP endpoint /api/v2/workspacebuilds/<build>/logs sometimes ends the stream without returning all logs.

This causes our TestDelete CLI tests to be flaky, because they wait for the final "Cleaning Up" log, which sometimes never comes, e.g.:

    ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✔ Queued [-18ms]\r\n⧗  \x1b[;mRunning\x1b[0m \r\n⧗  \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✔ Running [345ms]\r\n⧗  \x1b[;mSetting up\x1b[0m \r\n\x1b[1A\r✔ Setting up [73ms]\r\n⧗  \x1b[;mDestroying workspace\x1b[0m \r\n\x1b[1A\r✔ Destroying workspace [204ms]\r\n"
    ptytest.go:97: 
        	Error Trace:	ptytest.go:97
        	            				delete_test.go:75
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	TestDelete/DifferentUser
    t.go:81: 2022-06-23 04:01:14.391 [DEBUG]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:1078>	(*Server).closeWithError	closing server with error	{"error": null}
    --- FAIL: TestDelete/DifferentUser (18.15s)

The fundamental problem is that this endpoint receives logs from the job's PubSub stream while, on a separate 250ms timer, it polls the database to check whether the job has been marked complete. When the completion check fires, there is no guarantee that all logs have been received or processed, so we can miss some.
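A minimal, self-contained Go sketch of the pattern, using hypothetical names (Log, streamLogs, jobDone) rather than the actual coderd code:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// Log stands in for a provisioner job log row (hypothetical type).
type Log struct{ Output string }

// streamLogs mirrors the racy pattern: logs arrive on a pubsub channel
// while a separate 250ms ticker polls the database for job completion.
func streamLogs(ctx context.Context, logs <-chan Log, jobDone func() bool, send func(Log)) {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case l := <-logs:
			send(l)
		case <-ticker.C:
			// Race: the job can already be marked complete in the
			// database while logs are still in flight on the pubsub
			// channel, so returning here silently drops them.
			if jobDone() {
				return
			}
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	var done atomic.Bool
	logs := make(chan Log, 16)

	go func() {
		logs <- Log{Output: "Destroying workspace"}
		done.Store(true) // job marked complete...
		time.Sleep(300 * time.Millisecond)
		logs <- Log{Output: "Cleaning Up"} // ...but a log arrives late.
	}()

	streamLogs(context.Background(), logs, done.Load, func(l Log) {
		fmt.Println(l.Output)
	})
	// "Cleaning Up" is never printed: the poll fired between the
	// completion mark and delivery of the final log.
}
```

With the 300ms delay the final log reliably loses the race against the 250ms poll, which mirrors how TestDelete misses "Cleaning Up".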

Exacerbating the problem, provisionerd calls CompleteJob before the final UpdateJob that logs the successful cleanup, because the latter runs in a defer. But even if these calls were made in the correct order, there would still be a race between receiving and processing the logs and detecting that the job is done.
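A minimal sketch of that ordering, with flushLogs and completeJob as hypothetical stand-ins for the UpdateJob call that carries the final log and the CompleteJob call:

```go
package main

import "fmt"

// flushLogs and completeJob stand in for provisionerd's UpdateJob
// (carrying the final "Cleaning Up" log) and CompleteJob RPCs.
func flushLogs(msg string) { fmt.Println("UpdateJob:", msg) }
func completeJob()         { fmt.Println("CompleteJob") }

// runJob sketches the ordering bug: the final log flush is deferred, so
// it executes only after completeJob has already marked the job done.
func runJob() {
	defer flushLogs("Cleaning Up")
	// ...perform the job...
	completeJob()
}

func main() { runJob() } // prints "CompleteJob" before "UpdateJob: Cleaning Up"
```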

Some fix options:

  1. Introduce an "end of record" log that is inserted into the database and published on pubsub for each job when it is completed, canceled, or failed. This lets us avoid polling the database for job completion, eliminating the associated race between the poll and the log subscription.
  2. Keep track of the most recently processed log ID, and when we detect that the job is complete, query the database for any remaining logs after the last one we processed (see the sketch after this list).
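A sketch of option 2, assuming a hypothetical LogsAfter database query; the loop still polls for completion, but drains any logs persisted after the last one it sent before returning:

```go
package logs

import (
	"context"
	"time"
)

// Log and store are hypothetical stand-ins for coderd's database layer;
// option 2 only needs a "logs for this job created after ID X" query.
type Log struct {
	ID     int64
	Output string
}

type store interface {
	LogsAfter(ctx context.Context, jobID string, afterID int64) ([]Log, error)
}

// streamLogs is the option-2 variant of the racy loop sketched earlier: it
// remembers the ID of the last log it sent, and when the completion poll
// fires it drains any logs persisted after that ID before returning.
func streamLogs(ctx context.Context, db store, jobID string, logs <-chan Log, jobDone func() bool, send func(Log)) error {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	var lastSent int64
	for {
		select {
		case l := <-logs:
			send(l)
			lastSent = l.ID
		case <-ticker.C:
			if !jobDone() {
				continue
			}
			// The job is complete, but trailing logs may still be in
			// flight on pubsub; fetch them from the database instead
			// of dropping them.
			remaining, err := db.LogsAfter(ctx, jobID, lastSent)
			if err != nil {
				return err
			}
			for _, l := range remaining {
				send(l)
			}
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```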

Note that both of the above options depend on fixing the problem where provisionerd sends logs after CompleteJob.
