chore: acquire lock for individual workspace transition #15859
go func() {
	tickChB <- next
	close(tickChB)
}()
Is this potentially racy? We're testing that the lock acquire works but theoretically that might not happen if the first coderd grabs the job, completes it, and then the second one does.
I doubt it matters, as I suppose we're happy even if the try acquire is hit only a fraction of the time, but thought I'd flag it anyway.
Looking again, you're probably right. I ran the test with verbose logging and it looks like this all occurs within 0.05s.
If the test doesn't hit the lock, then we are likely to hit a flake. I'll have a go at increasing this time buffer.
I think you might be able to reduce (but not eliminate) raciness by having a second chan struct{} that you then close after starting both goroutines, making them both wait until it's closed to start.
e.g.
startCh := make(chan struct{})
go func() {
	<-startCh
	tickChA <- next
	close(tickChA)
}()
go func() {
	<-startCh
	tickChB <- next
	close(tickChB)
}()
close(startCh)
You might also be able to get both of them to tick very closely in time by sharing the same tick channel and making it buffered with size 2. (Of course, you'd then need to avoid closing the channel twice, which would panic.)
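A minimal sketch of the shared, buffered tick channel idea, under stated assumptions: runExecutors and its goroutines are hypothetical stand-ins for the two lifecycle executors, and each one is assumed to read exactly one tick. If the real executor loops over its tick channel, a single executor could consume both ticks, so this may need adapting.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runExecutors starts n hypothetical executors that each wait for one tick
// on a shared channel, and returns how many ticks were consumed.
func runExecutors(n int) int {
	// Buffered with capacity n, so every tick is queued before any
	// executor goroutine starts reading.
	tickCh := make(chan time.Time, n)
	next := time.Now()
	for i := 0; i < n; i++ {
		tickCh <- next
	}
	// Closed exactly once, by the single sender. A second close would panic.
	close(tickCh)

	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		consumed int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Buffered values are still delivered after close, so ok is
			// true for each of the n queued ticks.
			if _, ok := <-tickCh; ok {
				mu.Lock()
				consumed++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return consumed
}

func main() {
	fmt.Println(runExecutors(2)) // prints 2
}
```

Because the ticks are already queued, neither executor waits on a sender goroutine being scheduled, which narrows (but does not remove) the window between the two executors starting work.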
I've gone with your proposal @johnstcn.
It looks like for testing we just use an echo provisioner job, so getting that to take artificially longer for this specific test may not be a trivial task.
When Coder is run in High Availability mode, each Coder instance has a lifecycle executor. These lifecycle executors are all trying to do the same work, and whilst transactions save us from this causing an issue, we are still doing extra work that could be prevented.
This PR adds a TryAcquireLock call for each attempted workspace transition, meaning two Coder instances shouldn't duplicate effort.

This approach does still allow some duplicated effort to occur, though. This is because we aren't locking the entire runOnce function, meaning the following scenario could still occur:

1. X calls GetWorkspacesEligibleForTransition, returning workspace W
2. X acquires lock to transition workspace W
3. X starts transitioning workspace W
4. Y calls GetWorkspacesEligibleForTransition, returning workspace W
5. X finishes transitioning workspace W
6. X releases lock to transition workspace W
7. Y acquires lock to transition workspace W
8. Y starts transitioning workspace W
9. Y fails to transition workspace W
10. Y releases lock to transition workspace W
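The scenario above can be simulated with a small in-memory sketch. Everything here is hypothetical and for illustration only: lockTable stands in for the database-backed TryAcquireLock, store stands in for the workspace tables, and none of it mirrors the actual Coder code. It shows two outcomes: an attempt by Y while X holds the lock is skipped, while an attempt by Y after X releases the lock still runs and fails because Y's list of eligible workspaces is stale.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// lockTable is a hypothetical in-memory stand-in for a database try-lock.
type lockTable struct {
	mu   sync.Mutex
	held map[string]bool
}

// tryAcquire never blocks; it reports whether the caller now holds the lock.
func (t *lockTable) tryAcquire(key string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.held[key] {
		return false
	}
	t.held[key] = true
	return true
}

func (t *lockTable) release(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.held, key)
}

// store is a hypothetical stand-in for the workspace database.
type store struct {
	mu       sync.Mutex
	eligible map[string]bool
}

// eligibleWorkspaces returns a snapshot that may be stale when it is used.
func (s *store) eligibleWorkspaces() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	ids := make([]string, 0, len(s.eligible))
	for id := range s.eligible {
		ids = append(ids, id)
	}
	return ids
}

// transition fails if the workspace has already been transitioned.
func (s *store) transition(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if !s.eligible[id] {
		return errors.New("workspace no longer eligible")
	}
	delete(s.eligible, id)
	return nil
}

// attempt performs one transition under the per-workspace lock.
func attempt(locks *lockTable, db *store, id string) string {
	if !locks.tryAcquire(id) {
		return "skipped" // another instance holds the lock
	}
	defer locks.release(id)
	if err := db.transition(id); err != nil {
		return "failed"
	}
	return "ok"
}

// runScenario plays out X's and Y's steps in a fixed order.
func runScenario() (duringX string, xOK bool, afterX string) {
	locks := &lockTable{held: map[string]bool{}}
	db := &store{eligible: map[string]bool{"W": true}}

	xList := db.eligibleWorkspaces()       // X lists W
	locks.tryAcquire(xList[0])             // X acquires the lock for W
	yList := db.eligibleWorkspaces()       // Y lists W, which is still eligible
	duringX = attempt(locks, db, yList[0]) // Y tries while X holds the lock
	xOK = db.transition(xList[0]) == nil   // X finishes the transition
	locks.release(xList[0])                // X releases the lock
	afterX = attempt(locks, db, yList[0])  // Y retries using its stale list
	return duringX, xOK, afterX
}

func main() {
	fmt.Println(runScenario()) // prints: skipped true failed
}
```

The per-workspace lock only removes the overlap while a transition is in flight. The wasted attempt after the lock is released comes from the eligibility query running outside the lock.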
I decided against locking runOnce for now, as we run each workspace transition in its own transaction. Using nested transactions here will require extra design work and consideration.