streaming job logs race condition means not all logs are returned / TestDelete flake #2603

Closed
@spikecurtis

Description

Relates to #2347

The HTTP endpoint /api/v2/workspacebuilds/<build>/logs sometimes ends the stream without returning all logs.

This causes our TestDelete CLI tests to be flaky, because they wait for the final "Cleaning Up" log, which sometimes never comes, e.g.:

    ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✔ Queued [-18ms]\r\n⧗  \x1b[;mRunning\x1b[0m \r\n⧗  \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✔ Running [345ms]\r\n⧗  \x1b[;mSetting up\x1b[0m \r\n\x1b[1A\r✔ Setting up [73ms]\r\n⧗  \x1b[;mDestroying workspace\x1b[0m \r\n\x1b[1A\r✔ Destroying workspace [204ms]\r\n"
    ptytest.go:97: 
        	Error Trace:	ptytest.go:97
        	            				delete_test.go:75
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	TestDelete/DifferentUser
    t.go:81: 2022-06-23 04:01:14.391 [DEBUG]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:1078>	(*Server).closeWithError	closing server with error	{"error": null}
    --- FAIL: TestDelete/DifferentUser (18.15s)

The fundamental problem is that this endpoint receives logs from the job's PubSub stream while, on a separate 250ms timer, it polls the database to check whether the job has been marked complete. When the completion check fires, there is no guarantee that all logs have been received or processed, so we can miss some.
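A minimal, self-contained Go sketch of the pattern, using hypothetical names (Log, streamLogs, jobDone) rather than the actual coderd code:

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
	"time"
)

// Log stands in for a provisioner job log row (hypothetical type).
type Log struct{ Output string }

// streamLogs mirrors the racy pattern: logs arrive on a pubsub channel
// while a separate 250ms ticker polls the database for job completion.
func streamLogs(ctx context.Context, logs <-chan Log, jobDone func() bool, send func(Log)) {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case l := <-logs:
			send(l)
		case <-ticker.C:
			// Race: the job can already be marked complete in the
			// database while logs are still in flight on the pubsub
			// channel, so returning here silently drops them.
			if jobDone() {
				return
			}
		case <-ctx.Done():
			return
		}
	}
}

func main() {
	var done atomic.Bool
	logs := make(chan Log, 16)

	go func() {
		logs <- Log{Output: "Destroying workspace"}
		done.Store(true) // job marked complete...
		time.Sleep(300 * time.Millisecond)
		logs <- Log{Output: "Cleaning Up"} // ...but a log arrives late.
	}()

	streamLogs(context.Background(), logs, done.Load, func(l Log) {
		fmt.Println(l.Output)
	})
	// "Cleaning Up" is never printed: the poll fired between the
	// completion mark and delivery of the final log.
}
```

With the 300ms delay the final log reliably loses the race against the 250ms poll, which mirrors how TestDelete misses "Cleaning Up".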

Exacerbating the problem, provisionerd calls CompleteJob before the final UpdateJob that logs the successful cleanup, because the latter runs in a defer. But even if these calls were made in the correct order, there would still be a race between receiving and processing the logs and detecting that the job is done.
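A minimal sketch of that ordering, with flushLogs and completeJob as hypothetical stand-ins for the UpdateJob call that carries the final log and the CompleteJob call:

```go
package main

import "fmt"

// flushLogs and completeJob stand in for provisionerd's UpdateJob
// (carrying the final "Cleaning Up" log) and CompleteJob RPCs.
func flushLogs(msg string) { fmt.Println("UpdateJob:", msg) }
func completeJob()         { fmt.Println("CompleteJob") }

// runJob sketches the ordering bug: the final log flush is deferred, so
// it executes only after completeJob has already marked the job done.
func runJob() {
	defer flushLogs("Cleaning Up")
	// ...perform the job...
	completeJob()
}

func main() { runJob() } // prints "CompleteJob" before "UpdateJob: Cleaning Up"
```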

Some fix options:

  1. Introduce an "end of record" log that is inserted into the database and published on pubsub for each job when it is completed, canceled, or failed. This lets us avoid polling the database for job completion, eliminating the associated race between the poll and the log subscription.
  2. Keep track of the most recently processed log ID, and when we detect that the job is complete, query the database for any remaining logs after the last one we processed (see the sketch after this list).
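A sketch of option 2, assuming a hypothetical LogsAfter database query; the loop still polls for completion, but drains any logs persisted after the last one it sent before returning:

```go
package logs

import (
	"context"
	"time"
)

// Log and store are hypothetical stand-ins for coderd's database layer;
// option 2 only needs a "logs for this job created after ID X" query.
type Log struct {
	ID     int64
	Output string
}

type store interface {
	LogsAfter(ctx context.Context, jobID string, afterID int64) ([]Log, error)
}

// streamLogs is the option-2 variant of the racy loop sketched earlier: it
// remembers the ID of the last log it sent, and when the completion poll
// fires it drains any logs persisted after that ID before returning.
func streamLogs(ctx context.Context, db store, jobID string, logs <-chan Log, jobDone func() bool, send func(Log)) error {
	ticker := time.NewTicker(250 * time.Millisecond)
	defer ticker.Stop()
	var lastSent int64
	for {
		select {
		case l := <-logs:
			send(l)
			lastSent = l.ID
		case <-ticker.C:
			if !jobDone() {
				continue
			}
			// The job is complete, but trailing logs may still be in
			// flight on pubsub; fetch them from the database instead
			// of dropping them.
			remaining, err := db.LogsAfter(ctx, jobID, lastSent)
			if err != nil {
				return err
			}
			for _, l := range remaining {
				send(l)
			}
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```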

Note that both of the above options depend on fixing the problem where provisionerd sends logs after CompleteJob.
