database race on create build / acquire job #2436


Closed
spikecurtis opened this issue Jun 16, 2022 · 2 comments · Fixed by #2478
Labels
api Area: HTTP API

Comments

@spikecurtis
Contributor

Summary: Our current postgres code that handles creating builds and sending them out to provisionerd is subtly bugged from an ACID standpoint, and results in provisionerd failing to acquire a job.

Details:

I'm hitting this error in the TestDelete CLI test.

    t.go:81: 2022-06-16 18:41:25.096 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	POST /api/v2/workspaces/7138ac7f-4cbc-49d0-a772-aa2420a0e91d/builds
    t.go:81: 2022-06-16 18:41:25.097 [DEBUG]	(provisionerd-lucid_ptolemy4)	<github.com/coder/coder/coderd/provisionerdaemons.go:154>	(*provisionerdServer).AcquireJob	locked job from database	{"id": "da766fb0-6e98-4a92-892b-a0565570d3cd"}
    t.go:81: 2022-06-16 18:41:25.097 [WARN]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:232>	(*Server).acquireJob	acquire job ...
        "error": request job was invalidated: get workspace build: sql: no rows in result set
                 	storj.io/drpc/drpcwire.UnmarshalError:26
                 	storj.io/drpc/drpcstream.(*Stream).HandlePacket:198
                 	storj.io/drpc/drpcmanager.(*Manager).manageReader:211
    t.go:81: 2022-06-16 18:41:25.097 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
    t.go:81: 2022-06-16 18:41:25.104 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4/logs
✘ Queued [0ms]:53: 
    ptytest.go:53: ⧗  Running 
    ptytest.go:53: ⧗  Running 
    t.go:81: 2022-06-16 18:41:25.357 [DEBUG]	<github.com/coder/coder/coderd/coderd.go:382>	debugLogRequest.func1.1	GET /api/v2/workspacebuilds/f3e2c1b9-4948-4fbc-8d94-e399ba09ded4
    delete_test.go:71: 
        	Error Trace:	delete_test.go:71
        	            				asm_arm64.s:1263
        	Error:      	Target error should be in err chain:
        	            	expected: "EOF"
        	            	in chain: "get workspace build: sql: no rows in result set"
        	Test:       	TestDelete/DifferentUser
✘ Running [0ms]53: 
    ptytest.go:92: match exceeded deadline: wanted "Cleaning Up"; got "\x1b[1A\r✘ Queued [0ms]\r\n⧗  \x1b[;mRunning\x1b[0m \r\n⧗  \x1b[;mRunning\x1b[0m \r\n\x1b[1A\r✘ Running [0ms]\r\n"
    ptytest.go:97: 
        	Error Trace:	ptytest.go:97
        	            				delete_test.go:75
        	Error:      	Received unexpected error:
        	            	EOF
        	Test:       	TestDelete/DifferentUser
    t.go:81: 2022-06-16 18:41:35.098 [DEBUG]	(provisionerd)	<github.com/coder/coder/provisionerd/provisionerd.go:1073>	(*Server).closeWithError	closing server with error	{"error": null}
    --- FAIL: TestDelete/DifferentUser (10.27s)

Coderd tries to acquire a job and fails with `request job was invalidated: get workspace build: sql: no rows in result set`. That is to say, it selected and locked a job in the database, then tried to look up the corresponding workspace build, and the build wasn't there.

How can this be?

When we create the build, we insert the job and build in a single transaction, so you wouldn't expect some other query to be able to find the job but then not the build.
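To make the atomicity argument concrete, here is a minimal Python/sqlite3 sketch of the pattern (sqlite standing in for postgres; the table and column names are hypothetical, not coderd's actual schema). Because the job and build are inserted in one transaction, they become visible to other connections together at commit time:

```python
import sqlite3

# A shared in-memory database stands in for postgres; two connections play
# coderd (writer) and provisionerd (reader). Schema is illustrative only.
uri = "file:jobsdb?mode=memory&cache=shared"
writer = sqlite3.connect(uri, uri=True, isolation_level=None)  # manage txns by hand
reader = sqlite3.connect(uri, uri=True, isolation_level=None)

writer.execute("CREATE TABLE provisioner_jobs (id TEXT PRIMARY KEY)")
writer.execute("CREATE TABLE workspace_builds (id TEXT PRIMARY KEY, job_id TEXT)")

# Insert the job and its build atomically, in a single transaction.
writer.execute("BEGIN")
writer.execute("INSERT INTO provisioner_jobs VALUES ('job-1')")
writer.execute("INSERT INTO workspace_builds VALUES ('build-1', 'job-1')")
writer.execute("COMMIT")

# Any reader that can see the job can therefore also see the build:
# both rows became visible together when the transaction committed.
job = reader.execute("SELECT id FROM provisioner_jobs").fetchone()
build = reader.execute(
    "SELECT id FROM workspace_builds WHERE job_id = ?", (job[0],)
).fetchone()
print(build[0])
```

This is exactly why the observed failure is surprising: there should be no window in which a query finds the job row but not its build row.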

After reading https://www.postgresql.org/docs/current/transaction-iso.html very carefully, I think this has to do with our use of SELECT FOR UPDATE in the locking query.

it is possible for an updating command to see an inconsistent snapshot: it can see the effects of concurrent updating commands on the same rows it is trying to update, but it does not see effects of those commands on other rows in the database. This behavior makes Read Committed mode unsuitable for commands that involve complex search conditions

It seems that the locking query, run in Read Committed mode (default), is able to see the effect of the transaction adding the job prior to that transaction being committed. We may need to upgrade one or both of these transactions to a higher level of isolation.

@spikecurtis spikecurtis added bug api Area: HTTP API labels Jun 16, 2022
@spikecurtis
Contributor Author

Actually, I might be full of it.

I've been chasing postgres stuff all day and realized that this error was hit with our fake database. So, we might be fine in production, but tests are flaky anyway.

Still the language on that postgres documentation page has me worried about our locking query.

@spikecurtis
Contributor Author

Ok, this is fixed in the fake database, but I'll post some notes on postgres for posterity. I think we're ok with our current use, despite the dire warning I quoted in the description.

For the read committed isolation level, which is default:

UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time.

Good so far. The transaction that creates the job must have committed in order for us to acquire the job, and thus we won't hit the bug this issue is about.
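That visibility rule can be demonstrated with the same sqlite3 stand-in (again a simulation, not coderd's real queries): while the inserting transaction is still open, a second connection finds nothing; the job only appears once the insert commits.

```python
import sqlite3, os, tempfile

# A file-backed database so two independent connections share state.
path = os.path.join(tempfile.mkdtemp(), "jobs.db")
writer = sqlite3.connect(path, isolation_level=None)  # explicit BEGIN/COMMIT
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE provisioner_jobs (id TEXT PRIMARY KEY, started_at TEXT)")

writer.execute("BEGIN")
writer.execute("INSERT INTO provisioner_jobs (id) VALUES ('job-1')")

# The insert has not committed, so the acquiring side sees no job row yet.
before = reader.execute("SELECT id FROM provisioner_jobs").fetchone()

writer.execute("COMMIT")

# After commit, the job (and, in the real system, its build, inserted in
# the same transaction) is visible.
after = reader.execute("SELECT id FROM provisioner_jobs").fetchone()
print(before, after)
```

So an acquirer that can see the job row is guaranteed the creating transaction, builds included, has already committed.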

However, there is a slightly related worry with respect to our use of locks to prevent multiple provisionerds from acquiring the same job.

However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.

Yikes! We don't want to acquire the job if some other transaction updated it (e.g. another provisionerd).

The search condition of the command (the WHERE clause) is re-evaluated to see if the updated version of the row still matches the search condition. If so, the second updater proceeds with its operation using the updated version of the row. In the case of SELECT FOR UPDATE and SELECT FOR SHARE, this means it is the updated version of the row that is locked and returned to the client.

And we're back to being ok, since the WHERE clause ensures we only grab jobs that haven't already been grabbed.
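A sketch of why the re-evaluated WHERE clause makes acquisition safe (a Python simulation with illustrative names, not coderd's real acquire query; sqlite serializing the two UPDATEs plays the role of postgres re-checking the condition on the updated row after the first locker commits):

```python
import sqlite3

# One unclaimed job; started_at IS NULL marks it as available.
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE provisioner_jobs (id TEXT PRIMARY KEY, started_at TEXT)")
db.execute("INSERT INTO provisioner_jobs (id) VALUES ('job-1')")

def acquire(worker):
    # Atomic claim: the WHERE clause only matches unclaimed jobs. When the
    # second worker's update runs, it is applied against the row as updated
    # by the first worker, whose started_at is no longer NULL, so it
    # matches nothing.
    cur = db.execute(
        "UPDATE provisioner_jobs SET started_at = ? "
        "WHERE id = 'job-1' AND started_at IS NULL",
        (worker,),
    )
    return cur.rowcount == 1  # True only for the winning claimant

results = [acquire("provisionerd-a"), acquire("provisionerd-b")]
print(results)  # exactly one acquirer wins
```

The key property is that the claim check and the claim write are one atomic statement, so two provisionerds can never both see "unclaimed" and both proceed.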
