Description
Relates to #2347
HTTP endpoint /api/v2/workspacebuilds/<build>/logs
will sometimes end without returning all logs.
This causes our TestDelete
cli tests to be flaky because they wait for the final "Cleaning Up" log which sometimes never comes,
e.g.
ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✔ Queued [-18ms]\r\n⧗ \x1b[;mRunning\x1b[0m \r\n⧗ \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✔ Running [345ms]\r\n⧗ \x1b[;mSetting up\x1b[0m \r\n\x1b[1A\r✔ Setting up [73ms]\r\n⧗ \x1b[;mDestroying workspace\x1b[0m \r\n\x1b[1A\r✔ Destroying workspace [204ms]\r\n"
ptytest.go:97:
Error Trace: ptytest.go:97
delete_test.go:75
Error: Received unexpected error:
EOF
Test: TestDelete/DifferentUser
t.go:81: 2022-06-23 04:01:14.391 [DEBUG] (provisionerd) <github.com/coder/coder/provisionerd/provisionerd.go:1078> (*Server).closeWithError closing server with error {"error": null}
--- FAIL: TestDelete/DifferentUser (18.15s)
The fundamental problem is that this endpoint receives logs from the PubSub stream for the job, and asynchronously checks for the job being marked as complete in the database every 250ms. When the trigger to check for job completion fires, there is no guarantee that all logs have been received or processed, and so we can miss some.
Exacerbating the problem is the fact that the provisionerd calls CompleteJob
before the final UpdateJob
that logs successful cleanup, due to the later being in a defer
. But, even if these calls were made in the correct order, there is still a race between receiving and processing the logs and detecting that the job is done.
Some fix options:
- Introduce an "end of record" log that gets inserted into the database and pubsub for each job when it is completed, canceled or failed. This allows us to avoid polling the database for job completion, and so we don't have the associated race between the poll and the log subscription.
- Keep track of the most recently processed log ID, and when we detect the job is complete, query the database for any remaining logs after the one we have processed.
Note that both above options depend on fixing the problem where provisionerd sends logs after CompleteJob
.