feat: add provisioner job hang detector #7927
Conversation
Nice feature!
I left a few comments and thoughts. One general theme I observed is that this adds a service that races with the provisioner: the provisioner might race to complete the job while the unhanger writes the failure, and both write logs. I wonder if it would make sense to invert some of the behavior so that it's the unhanger that asks the provisioner to do the job? For instance: unhanger inserts a stop request with a reason -> provisioner gets the request -> provisioner fails the job (a rough sketch of this follows below).
Perhaps you've thought about this and there's a good reason not to do it, though; just putting it out there.
Edit: Reading your description again, I'm pretty sure you thought of it. I guess there could be a middle ground between these two implementations, but I'm not sure it's worth the effort of a rewrite for now.
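For illustration, a rough sketch of what that inverted flow could look like. Every name below is hypothetical; this is not how the PR is implemented, it just makes the ownership split concrete:

package example // hypothetical sketch of the inverted unhanger/provisioner flow

import "context"

// stopRequest is a hypothetical record the unhanger would insert instead
// of writing the job failure itself.
type stopRequest struct {
	JobID  string
	Reason string
}

// stopStore is a hypothetical persistence interface for stop requests.
type stopStore interface {
	InsertStopRequest(ctx context.Context, req stopRequest) error
	GetStopRequest(ctx context.Context, jobID string) (stopRequest, bool, error)
}

// Unhanger side: record intent only, so a single component (the
// provisioner) remains the only writer of terminal job state.
func requestStop(ctx context.Context, store stopStore, jobID, reason string) error {
	return store.InsertStopRequest(ctx, stopRequest{JobID: jobID, Reason: reason})
}

// Provisioner side: on its next poll it notices the request and fails
// the job itself, logging exactly once.
func shouldFailJob(ctx context.Context, store stopStore, jobID string) (reason string, fail bool, err error) {
	req, ok, err := store.GetStopRequest(ctx, jobID)
	if err != nil || !ok {
		return "", false, err
	}
	return req.Reason, true, nil
}

The appeal would be that only the provisioner ever writes terminal job state, so the race and the duplicate logging both disappear.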
cli/server.go
Outdated
hangDetectorTicker := time.NewTicker(cfg.JobHangDetectorInterval.Value())
defer hangDetectorTicker.Stop()
hangDetector := unhanger.New(ctx, options.Database, options.Pubsub, logger, hangDetectorTicker.C)
hangDetector.Run()
Since unhanger has no Close method, nor is Run blocking, we have no way to ensure this service is actually shut down on exit. Ideally we would replace the ctx dependence outside the New function with a synchronous Close method to ensure cleanup. Alternatively we could defer <-hangDetector.Wait(), but then the caller must be careful to cancel the context later in the stack (easier to make mistakes).
I added a Close function which cancels the context
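For reference, a minimal sketch of that lifecycle, assuming the Detector keeps an internally derived context and a done channel (the constructor signature and field names here are illustrative, not necessarily what the PR ended up with):

package unhanger // illustrative sketch only; names are hypothetical

import (
	"context"
	"time"
)

// Detector is a minimal ticker-driven loop with a synchronous Close.
type Detector struct {
	ctx    context.Context
	cancel context.CancelFunc
	done   chan struct{}
	tick   <-chan time.Time
}

func New(ctx context.Context, tick <-chan time.Time) *Detector {
	ctx, cancel := context.WithCancel(ctx)
	return &Detector{ctx: ctx, cancel: cancel, done: make(chan struct{}), tick: tick}
}

// Run starts the loop in the background and returns immediately.
func (d *Detector) Run() {
	go func() {
		defer close(d.done)
		for {
			select {
			case <-d.ctx.Done():
				return
			case t := <-d.tick:
				d.run(t)
			}
		}
	}()
}

// run is a placeholder for one detection pass (query jobs not updated
// since t minus the hang threshold and mark them failed).
func (d *Detector) run(t time.Time) {}

// Close cancels the internal context and blocks until the loop exits,
// so callers can defer Close() and know cleanup actually happened.
func (d *Detector) Close() {
	d.cancel()
	<-d.done
}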
FROM
	provisioner_jobs
WHERE
	updated_at < $1
Minor semantic sanity check: do we want an inclusive or exclusive check here? I do feel like <= would make sense here considering the terminology (hung since).
I don't think it matters since it's a time comparison, so the precision is a millisecond (or smaller) and the chance of a collision is tiny (and wouldn't have any consequence anyway)
What's the status here?
Thanks for addressing my previous feedback! I found some corner cases this round that I believe we should try to address. I also raised some questions where I wasn't sure whether something could be a problem or not.
PS. Tests seem to be failing.
coderd/unhanger/detector.go
Outdated
}
}

stats.HungJobIDs = append(stats.HungJobIDs, job.ID)
If you do refactor, stats could track both hung jobs and terminated jobs, given that an attempt to terminate could fail.
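For example, something roughly like this (the TerminatedJobIDs and Error fields are hypothetical additions for illustration, not the PR's actual Stats type):

package unhanger // sketch only; the real Stats type may differ

import "github.com/google/uuid"

// Stats describes one detection pass. Tracking terminated jobs
// separately from hung jobs makes partial failures visible, since an
// attempt to terminate a hung job can itself fail.
type Stats struct {
	HungJobIDs       []uuid.UUID // jobs that exceeded the hang threshold
	TerminatedJobIDs []uuid.UUID // subset successfully marked as failed
	Error            error       // error for the pass as a whole, if any
}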
// the stream so that we can control when to terminate the process.
killCtx, kill := context.WithCancel(context.Background())
defer kill()

// Ensure processes are eventually cleaned up on graceful
// cancellation or disconnect.
go func() {
	<-stream.Context().Done()
	<-ctx.Done()
I need to ask a (perhaps stupid) question since we're relying on exit/hung timeouts to be predictable.
Do we have any way to ensure that this specific context is cancelled in the event that heartbeats or updates are hanging/failing/timing out? Let's say network conditions are such that the stream doesn't die and this stream context remains open, but provisioner heartbeats to coderd are not coming through (perhaps stream writes simply hang).
Or, let's say it takes 3 minutes longer for this context to be cancelled than what the hang detector is expecting. We would then be waiting 3 + 3 minutes and thus still potentially be canceling (SIGINT) the terraform apply for a minute after the job is marked as terminated.
I added a 30-second timeout to updates, and failed heartbeats will cause the stream context to be canceled, which should result in graceful cancellation starting immediately.
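Roughly the shape of that timeout, assuming the update is an RPC-style call that accepts a context; the client and request types below are placeholders, not the actual provisionerd API:

package provisionerd // sketch only; client and request types are placeholders

import (
	"context"
	"fmt"
	"time"
)

type updateRequest struct {
	JobID string
	// logs, timings, etc.
}

// jobClient is a stand-in for whatever RPC client sends job updates.
type jobClient interface {
	UpdateJob(ctx context.Context, req *updateRequest) error
}

// sendUpdate wraps a job update in a hard 30-second deadline so a hung
// stream write cannot stall the provisioner indefinitely; the error
// surfaces to the caller, which can cancel the stream and start
// graceful cleanup right away.
func sendUpdate(ctx context.Context, client jobClient, req *updateRequest) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	if err := client.UpdateJob(ctx, req); err != nil {
		return fmt.Errorf("update job %s: %w", req.JobID, err)
	}
	return nil
}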
}

func (d *Detector) run(t time.Time) Stats {
	ctx, cancel := context.WithTimeout(d.ctx, 5*time.Minute)
In the event that we had, say, 100 hung jobs, a per-job transaction (see the other comment) would also help ensure that this context is never too short to do at least some work (and eventually mark everything).
That would be nice, but having multiple transactions while holding a lock is not possible with the in-memory DB. I tried doing multiple transactions at first but hit this issue. Adding multiple-transaction support to the in-memory DB seems like a bad idea because we don't have table/row locks like PSQL.
I might just add a limit of 10 jobs per run so we don't time out and risk rolling back.
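A sketch of that cap, assuming the hung-job query returns a slice and the whole pass runs in one transaction; the constant, types, and helper below are placeholders:

package unhanger // sketch only; types and helpers below are placeholders

import (
	"context"
	"fmt"
)

// maxJobsPerRun caps how many hung jobs a single pass processes so the
// whole single-transaction run stays well inside its 5-minute timeout;
// any remainder is picked up on the next tick.
const maxJobsPerRun = 10

type provisionerJob struct {
	ID string
}

// failHungJob is a placeholder for writing the failure and a log entry.
func failHungJob(ctx context.Context, job provisionerJob) error {
	return nil
}

func unhangJobs(ctx context.Context, jobs []provisionerJob) error {
	if len(jobs) > maxJobsPerRun {
		jobs = jobs[:maxJobsPerRun]
	}
	for _, job := range jobs {
		if err := failHungJob(ctx, job); err != nil {
			return fmt.Errorf("fail hung job %s: %w", job.ID, err)
		}
	}
	return nil
}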
Nice work! I see some test is still unhappy, but other than that I think this looks good to go!
Adds a new system, unhanger.Detector, which will look for hung provisioner jobs (provisioner jobs that are running and haven't been updated in more than 5 minutes), print a message to the job logs, and mark the job as failed.

This mitigates an issue that happened to a customer which prevented their workspace from being used entirely (could not start, stop, update, restart, cancel build, delete, or orphan), caused by the provisioner dying mid-job or the "job complete" message failing to send from the provisioner to coder.
This fix will apply to jobs/workspaces that are currently in this broken state and prevent future hung workspaces. If this occurs in the future, the user will only be unable to control their workspace for up to 5 minutes.
This also makes a few changes to the terraform provisioner, as jobs weren't being killed after exitTimeout if they were canceled, which seemed wrong to me. It also fixes a leaked 5m time.After instance (a general sketch of that pattern is below, after the TODO items).

TODO:
unhanger.Detector
unhanger.Detector
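On the leaked time.After: as a general Go illustration of the pattern (not the PR's exact code), a 5-minute time.After inside a select keeps its timer alive until it expires even when another case wins, whereas a stoppable timer is released immediately:

package example // general Go illustration, not the PR's exact code

import "time"

// waitOrTimeout waits for done or gives up after five minutes. Using a
// stoppable timer instead of `case <-time.After(5 * time.Minute):`
// means the timer is released as soon as done fires first; time.After's
// timer would otherwise live (and hold memory) until it expires on its own.
func waitOrTimeout(done <-chan struct{}) bool {
	timer := time.NewTimer(5 * time.Minute)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

The same fix applies anywhere a long time.After sits in a select that usually returns via another case.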
Closes https://github.com/coder/v2-customers/issues/190