feat: add provisioner job hang detector #7927

Merged
deansheather merged 13 commits into main from dean/hang-detector on Jun 25, 2023

Conversation

Member

@deansheather deansheather commented Jun 9, 2023

Adds a new unhanger.Detector system that looks for hung provisioner jobs (provisioner jobs that are running but haven't been updated in more than 5 minutes), prints a message to the job logs, and marks the job as failed.
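
For illustration, the detection loop is roughly the following shape (a minimal sketch; the Store interface and its method names are hypothetical stand-ins, not the actual coderd database API):

```go
// Minimal sketch of the hang-detection loop, not the actual coderd code.
// Store and its methods are hypothetical stand-ins for the database layer.
package unhanger

import (
	"context"
	"time"
)

const hungJobDuration = 5 * time.Minute

type Store interface {
	GetHungJobIDs(ctx context.Context, updatedBefore time.Time) ([]string, error)
	AppendJobLog(ctx context.Context, jobID, message string) error
	FailJob(ctx context.Context, jobID string) error
}

type Detector struct {
	ctx  context.Context
	db   Store
	tick <-chan time.Time
}

// Run starts the background loop; it exits when the detector's context is done.
func (d *Detector) Run() {
	go func() {
		for {
			select {
			case <-d.ctx.Done():
				return
			case now := <-d.tick:
				ids, err := d.db.GetHungJobIDs(d.ctx, now.Add(-hungJobDuration))
				if err != nil {
					continue
				}
				for _, id := range ids {
					// Explain in the job logs why the job is being terminated,
					// then mark it as failed so the workspace is unblocked.
					_ = d.db.AppendJobLog(d.ctx, id, "Provisioner job has not been updated for 5 minutes; marking as failed.")
					_ = d.db.FailJob(d.ctx, id)
				}
			}
		}
	}()
}
```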

This mitigates an issue that happened to a customer where their workspace could not be used at all (could not start, stop, update, restart, cancel build, delete, or orphan), caused by the provisioner dying mid-job or the "job complete" message failing to be sent from the provisioner to coder.

This fix applies to jobs/workspaces that are currently in this broken state and prevents future hung workspaces. If this occurs again, the user will be unable to control their workspace for at most 5 minutes.

This also makes a few changes to the terraform provisioner, since jobs weren't being killed after exitTimeout when they were canceled, which seemed wrong to me. It also fixes a leaked 5m time.After instance.
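
For context on the time.After leak: a time.After inside a select keeps its timer allocated until it fires, even when another case wins first. The fix looks roughly like this (illustrative only, not the actual provisioner code):

```go
package main

import (
	"fmt"
	"time"
)

// waitWithTimeout is illustrative only, not the actual terraform provisioner
// code. Before the fix, `case <-time.After(5 * time.Minute)` kept its timer
// alive for the full 5 minutes even when `done` fired first.
func waitWithTimeout(done <-chan struct{}) bool {
	timer := time.NewTimer(5 * time.Minute)
	defer timer.Stop() // releases the timer as soon as we return

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	done := make(chan struct{})
	close(done)
	fmt.Println(waitWithTimeout(done)) // true; the timer is stopped immediately
}
```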

TODO:

  • unhanger.Detector
  • Tests for unhanger.Detector
  • Automatically kill provisioner jobs if they haven't been able to successfully send an update in over 3 minutes
  • Put the constants for detector kill duration and provisioner kill duration in the same place with a comment explaining why they can't be the same

Closes https://github.com/coder/v2-customers/issues/190

@deansheather deansheather requested review from coadler and mafredri June 9, 2023 12:53
Member

@mafredri mafredri left a comment

Nice feature!

I left a few comments and thoughts. One general theme I observed is that this adds a service that races with the provisioner: the provisioner might race to complete the job while the unhanger writes the failure, and both write logs. I wonder if it would make sense to invert some of the behavior so that the unhanger asks the provisioner to do the job? For instance: unhanger -> insert stop request with reason -> provisioner picks up the request -> provisioner fails the job.

Perhaps you've thought about this and there's a good reason not to do it though, just putting it out there.

Edit: Reading your description again, I'm pretty sure you thought of it. I guess there could be a middle ground between these two implementations, but I'm not sure it's worth the effort or rewrite for now.

cli/server.go Outdated
hangDetectorTicker := time.NewTicker(cfg.JobHangDetectorInterval.Value())
defer hangDetectorTicker.Stop()
hangDetector := unhanger.New(ctx, options.Database, options.Pubsub, logger, hangDetectorTicker.C)
hangDetector.Run()
Member

Since unhanger has no Close method and Run isn't blocking, we have no way to ensure this service is actually shut down on exit. Ideally we would replace the ctx dependence outside the New function with a synchronous Close method to ensure cleanup. Alternatively we could defer <-hangDetector.Wait(), but then the caller must be careful to cancel the context later in the stack (easier to make mistakes).

Member Author

I added a Close function that cancels the context.
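
Roughly this shape (a generic sketch of the pattern, not the exact merged code):

```go
package main

import "context"

// service is a generic sketch: Close cancels the internal context and waits
// for the background loop to exit before returning.
type service struct {
	cancel context.CancelFunc
	done   chan struct{}
}

func start(ctx context.Context) *service {
	ctx, cancel := context.WithCancel(ctx)
	s := &service{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(s.done)
		<-ctx.Done() // the real loop would also select on the ticker channel
	}()
	return s
}

// Close is synchronous, so callers know the loop has fully stopped when it returns.
func (s *service) Close() {
	s.cancel()
	<-s.done
}

func main() {
	s := start(context.Background())
	defer s.Close()
}
```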

FROM
provisioner_jobs
WHERE
updated_at < $1
Member

Minor semantic sanity check: do we want an inclusive or exclusive check here? I do feel like <= would make sense here considering the terminology (hung since).

Member Author

I don't think it matters since it's a time comparison: the precision is a millisecond (or smaller), so the chance of a collision is tiny (and it wouldn't have any consequence anyway).

@sreya sreya added this to the 🦛 Sprint 1 milestone Jun 13, 2023
@ammario
Member

ammario commented Jun 19, 2023

What's the status here?

@deansheather deansheather requested review from mafredri and coadler June 20, 2023 14:58
Member

@mafredri mafredri left a comment

Thanks for addressing my previous feedback! I found some corner cases this round that I believe we should try to address, and raised some questions where I wasn't sure whether something could be a problem or not.

PS. Tests seem to be failing.

}
}

stats.HungJobIDs = append(stats.HungJobIDs, job.ID)
Member

If you do refactor, stats could track both hung jobs and terminated jobs, given that an attempt to terminate could fail.
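
Something like this hypothetical shape, for example:

```go
package unhanger

import "github.com/google/uuid"

// Hypothetical shape only; the actual stats struct may differ. Terminated
// jobs are tracked separately from hung jobs since termination can fail.
type Stats struct {
	HungJobIDs       []uuid.UUID // jobs detected as hung during this run
	TerminatedJobIDs []uuid.UUID // jobs successfully marked as failed
	Error            error       // overall error for the run, if any
}
```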

// the stream so that we can control when to terminate the process.
killCtx, kill := context.WithCancel(context.Background())
defer kill()

// Ensure processes are eventually cleaned up on graceful
// cancellation or disconnect.
go func() {
<-stream.Context().Done()
<-ctx.Done()
Member

I need to ask a (perhaps stupid) question since we're relying on exit/hung timeouts to be predictable.

Do we have any way to ensure that this specific context is cancelled in the event that heartbeats or updates are hanging/failing/timing out? Let's say network conditions are such that the stream doesn't die and this stream context remains open, but provisioner heartbeats to coderd are not coming through (perhaps stream writes simply hang).

Or, let's say it takes 3 minutes longer for this context to be cancelled than the hang detector expects. We would then be waiting 3 + 3 minutes, and thus could still be canceling (SIGINT) the terraform apply for a minute after the job has been marked as terminated.

Member Author

I added a 30-second timeout to updates, and failed heartbeats will cause the stream context to be canceled, which should result in graceful cancellation starting immediately.
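
Roughly this (illustrative sketch; the send callback stands in for the real update RPC):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// updateWithTimeout is illustrative only: it bounds a single update send with
// a 30-second timeout so a hung stream write cannot block forever.
func updateWithTimeout(ctx context.Context, send func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	return send(ctx)
}

func main() {
	err := updateWithTimeout(context.Background(), func(ctx context.Context) error {
		// A well-behaved send returns promptly; a hung one would be cut off
		// when ctx expires after 30 seconds.
		return nil
	})
	fmt.Println(err) // <nil>
}
```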

}

func (d *Detector) run(t time.Time) Stats {
ctx, cancel := context.WithTimeout(d.ctx, 5*time.Minute)
Member

In the event that we had, say, 100 hung jobs, a per-job transaction (see other comment) would also help ensure that this context is never too short to at least do some work (and eventually mark everything).

Member Author

That would be nice, but having multiple transactions while holding a lock is not possible with the in-memory DB. I tried doing multiple transactions at first but hit this issue. Adding support for multiple transactions to the in-memory DB seems like a bad idea because we don't have table/row locks like PSQL does.

Member Author

I might just add a limit of 10 jobs per run so we don't time out and risk rolling back.

@deansheather deansheather requested a review from mafredri June 21, 2023 19:12
Member

@mafredri mafredri left a comment

Nice work! I see some test is still unhappy, but other than that I think this looks good to go!

@deansheather deansheather enabled auto-merge (squash) June 25, 2023 13:09
@deansheather deansheather merged commit 98a5ae7 into main Jun 25, 2023
@deansheather deansheather deleted the dean/hang-detector branch June 25, 2023 13:17
@github-actions github-actions bot locked and limited conversation to collaborators Jun 25, 2023