Skip to content

fix: prevent db deadlock when workspaces go dormant #10618

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Nov 13, 2023
Merged

Conversation

sreya
Copy link
Collaborator

@sreya sreya commented Nov 9, 2023

  • Fixes an issue where during the evaluation of workspaces for dormancy, an attempt was made to insert an audit log in the midst of a transaction. If a sufficiently large number of workspaces were found eligible for dormancy a deadlock could occur where each transaction would attempt to write an audit log outside the transaction, consuming a DB connection. We limit the number of concurrent transactions in autobuild to 10 so 10 connections could potentially be consumed leaving none free to handle the audit logging, effectively deadlocking the DB.
  • We also now return errors from the transaction where it makes sense. I can't tell if it was a conscious decision to always return nil or not. Perhaps because it's mostly a read transaction it's less expensive to commit instead of rollback?
  • The feature has not been released so this does not effect production deployments.

The solution is obviously to not do non-transaction DB operations from within a transaction. This was not an intentional implementation decision.

We could potentially add a timeout to the context we use in the transactions we open in autobuild but if a similar bug were to occur it would just provide temporary relief.

I'm a bit hesitant on adding the test for this since it's sort of easy for it to produce a false-positive. If we want to retain it I'll rewrite it to use our new dbfake implementation.

fixes https://github.com/coder/customers/issues/314

@sreya sreya requested a review from kylecarbs November 9, 2023 23:42
@sreya sreya marked this pull request as ready for review November 9, 2023 23:50
Comment on lines -491 to -492
workspaces.template_id = templates.id
AND
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: already in the WHERE cond so 👍

Comment on lines -148 to +149
log.Error(e.ctx, "get workspace autostart failed", slog.Error(err))
return nil
return xerrors.Errorf("get workspace by id: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

review: IIRC the reason this was written this way was to avoid a single error causing an entire iteration to fail. Did the previous logic negatively impact the issue here? If not, I'd advise keeping the scope of the change small.

Copy link
Collaborator Author

@sreya sreya Nov 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only returns errors for the transaction not for the errgroup. Seems like we were trying to commit the transaction regardless of whether we encountered an error or not which is bizarre. That shouldn't cause an entire evaluation iteration to abort early.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah gotcha, I think I was operating on some stale cached assumptions about this code that are no longer true.

@@ -309,6 +312,84 @@ func TestWorkspaceAutobuild(t *testing.T) {
require.True(t, ws.LastUsedAt.After(lastUsedAt))
})

// This test serves as a regression prevention for generating
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍 We could rewrite using the new shiny dbfake but this can be a follow-up.

@sreya sreya requested a review from coadler November 13, 2023 16:25
@sreya sreya merged commit 75ab16d into main Nov 13, 2023
@sreya sreya deleted the jon/fixdeadlock branch November 13, 2023 19:40
@github-actions github-actions bot locked and limited conversation to collaborators Nov 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants