
chore: acquire lock for individual workspace transition #15859


Merged
merged 5 commits into main from dm-lifecycle-executor-race on Dec 13, 2024

Conversation

DanielleMaywood
Contributor

@DanielleMaywood DanielleMaywood commented Dec 13, 2024

When Coder is run in High Availability mode, each Coder instance has a lifecycle executor. These lifecycle executors are all trying to do the same work, and whilst transactions save us from this causing an issue, we are still doing extra work that could be prevented.

This PR adds a TryAcquireLock call for each attempted workspace transition, meaning two Coder instances shouldn't duplicate effort.

This approach does still allow some duplicated effort to occur, though. This is because we aren't locking the entire runOnce function, meaning the following scenario could still occur:

  1. Instance X calls GetWorkspacesEligibleForTransition, returning Workspace W
  2. Instance X acquires lock to transition workspace W
  3. Instance X starts transitioning Workspace W
  4. Instance Y calls GetWorkspacesEligibleForTransition, returning Workspace W
  5. Instance X finishes transitioning Workspace W
  6. Instance X releases lock to transition workspace W
  7. Instance Y acquires lock to transition workspace W
  8. Instance Y starts transitioning Workspace W
  9. Instance Y fails to transition Workspace W
  10. Instance Y releases lock to transition workspace W

I decided against locking runOnce for now as we run each workspace transition in its own transaction. Using nested transactions here would require extra design work and consideration.
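
For illustration, here is a minimal sketch of this per-transition locking pattern using a transaction-scoped Postgres advisory lock (the function name, signature, and lock-key derivation are assumptions for the sketch, not the actual Coder code):

    package lifecycle

    import (
        "context"
        "database/sql"
    )

    // transitionWorkspace (hypothetical) wraps one workspace transition in a
    // transaction guarded by a transaction-scoped advisory lock, so a second
    // instance calling it with the same key returns early instead of
    // duplicating the work.
    func transitionWorkspace(ctx context.Context, db *sql.DB, lockKey int64) error {
        tx, err := db.BeginTx(ctx, nil)
        if err != nil {
            return err
        }
        defer func() { _ = tx.Rollback() }() // no-op after a successful Commit

        // pg_try_advisory_xact_lock never blocks; the lock is released
        // automatically when this transaction commits or rolls back.
        var acquired bool
        if err := tx.QueryRowContext(ctx,
            "SELECT pg_try_advisory_xact_lock($1)", lockKey,
        ).Scan(&acquired); err != nil {
            return err
        }
        if !acquired {
            // Another instance is already transitioning this workspace.
            return nil
        }

        // ... perform the transition queries here, inside the same transaction ...

        return tx.Commit()
    }

Because the lock is transaction-scoped, it cannot be left held if an instance crashes mid-transition; it is dropped along with the transaction.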

@DanielleMaywood DanielleMaywood marked this pull request as ready for review December 13, 2024 11:33
go func() {
	tickChB <- next
	close(tickChB)
}()
Member

Is this potentially racy? We're testing that the lock acquire works but theoretically that might not happen if the first coderd grabs the job, completes it, and then the second one does.

I doubt it matters as I suppose we're happy even if the try acquire is hit only a fraction of the time, but thought I'd flag it anyway.

Contributor Author

Looking again, you're probably right. I ran the test with verbose logging and it looks like this all occurs within 0.05s.

If the test doesn't hit the lock, then we are likely to hit a flake. I'll have a go at increasing this time buffer.

Member

@johnstcn johnstcn Dec 13, 2024

I think you might be able to reduce (but not eliminate) raciness by having a second chan struct{} that you then close after starting both goroutines, making them both wait until it's closed before they start.

e.g.

    startCh := make(chan struct{})
    go func() {
        <-startCh
        tickChA <- next
        close(tickChA)
    }()
    go func() {
        <-startCh
        tickChB <- next
        close(tickChB)
    }()
    close(startCh)

You might also be able to get both of them to tick very closely in time by sharing the same tick channel, and making it buffered with size 2. (Of course then you'd want to avoid closing the channel twice to avoid a panic)
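
A rough sketch of that shared-channel variant (the tickCh/next names and the tick type are assumed from the test context, not taken from this PR):

    tickCh := make(chan time.Time, 2)
    tickCh <- next // buffered sends: neither blocks, so both ticks queue up front
    tickCh <- next
    close(tickCh) // close exactly once to avoid the double-close panic

Both lifecycle executors would then be given tickCh as their tick source, so they fire about as close together as the scheduler allows.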

Contributor Author

I've gone with your proposal @johnstcn.

It looks like for testing we just use an echo provisioner job, so getting that to take artificially longer for this specific test may not be a trivial task.

@DanielleMaywood DanielleMaywood merged commit 50ff06c into main Dec 13, 2024
30 checks passed
@DanielleMaywood DanielleMaywood deleted the dm-lifecycle-executor-race branch December 13, 2024 16:59
@github-actions github-actions bot locked and limited conversation to collaborators Dec 13, 2024