-
Notifications
You must be signed in to change notification settings - Fork 887
database race on create build / acquire job #2436
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Actually, I might be full of it. I've been chasing postgres stuff all day and realized that this error was hit with our fake database. So, we might be fine in production, but tests are flaky anyway. Still the language on that postgres documentation page has me worried about our locking query. |
Ok, this is fixed in the fake database, but I'll post some notes on postgres for posterity. I think we're ok with our current use, despite the dire warning I quoted in the description. For the read committed isolation level, which is default:
Good so far. The transaction that creates the job must have committed in order for us to acquire the job, and thus we won't hit the bug this issue is about. However, there is a slightly related worry with respect to our use of locks to prevent multiple provisionerds from acquiring the same job.
Yikes! We don't want to acquire the job if some other transaction updated it (e.g. another provisionerd).
And, we're back to being ok, since the where clause ensures we only grab jobs that haven't already been grabbed. |
Summary: Our current postgres code that handles creating builds and sending them out to provisionerd is subtly bugged from an ACID standpoint, and results in provisionerd failing to acquire a job.
Details:
I'm hitting this error in the
TestDelete
CLI test.Coderd tries to acquire a job and fails with
request job was invalidated: get workspace build: sql: no rows in result set
that is to say, that it selected and locked a job in the database, then tried to look up the corresponding workspace build, and the build wasn't there.How can this be?
When we create the build, we insert the job and build in a single transaction, so you wouldn't expect some other query to be able to find the job but then not the build.
After reading https://www.postgresql.org/docs/current/transaction-iso.html very carefully, I think this has to do with our use of
SELECT FOR UPDATE
in the locking query.It seems that the locking query, run in Read Committed mode (default), is able to see the effect of the transaction adding the job prior to that transaction being committed. We may need to upgrade one or both of these transactions to a higher level of isolation.
The text was updated successfully, but these errors were encountered: