feat: add provisioner job hang detector #7927

Merged
deansheather merged 13 commits into main from dean/hang-detector on Jun 25, 2023

Conversation

Member

@deansheather deansheather commented Jun 9, 2023

Adds a new unhanger.Detector system that looks for hung provisioner jobs (provisioner jobs that are running but haven't been updated in more than 5 minutes), prints a message to the job logs, and marks the job as failed.
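
For illustration, the detection loop is roughly the following shape (a minimal sketch; the Store interface and its method names are hypothetical stand-ins, not the actual coderd database API):

```go
// Minimal sketch of the hang-detection loop, not the actual coderd code.
// Store and its methods are hypothetical stand-ins for the database layer.
package unhanger

import (
	"context"
	"time"
)

const hungJobDuration = 5 * time.Minute

type Store interface {
	GetHungJobIDs(ctx context.Context, updatedBefore time.Time) ([]string, error)
	AppendJobLog(ctx context.Context, jobID, message string) error
	FailJob(ctx context.Context, jobID string) error
}

type Detector struct {
	ctx  context.Context
	db   Store
	tick <-chan time.Time
}

// Run starts the background loop; it exits when the detector's context is done.
func (d *Detector) Run() {
	go func() {
		for {
			select {
			case <-d.ctx.Done():
				return
			case now := <-d.tick:
				ids, err := d.db.GetHungJobIDs(d.ctx, now.Add(-hungJobDuration))
				if err != nil {
					continue
				}
				for _, id := range ids {
					// Explain in the job logs why the job is being terminated,
					// then mark it as failed so the workspace is unblocked.
					_ = d.db.AppendJobLog(d.ctx, id, "Provisioner job has not been updated for 5 minutes; marking as failed.")
					_ = d.db.FailJob(d.ctx, id)
				}
			}
		}
	}()
}
```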

This mitigates an issue that happened to a customer where their workspace could not be used at all (could not start, stop, update, restart, cancel build, delete, or orphan), caused by the provisioner dying mid-job or the "job complete" message failing to be sent from the provisioner to coder.

This fix applies to jobs/workspaces that are currently in this broken state and prevents future hung workspaces. If this occurs again, the user will be unable to control their workspace for at most 5 minutes.

This also makes a few changes to the terraform provisioner, since jobs weren't being killed after exitTimeout when they were canceled, which seemed wrong to me. It also fixes a leaked 5m time.After instance.
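
For context on the time.After leak: a time.After inside a select keeps its timer allocated until it fires, even when another case wins first. The fix looks roughly like this (illustrative only, not the actual provisioner code):

```go
package main

import (
	"fmt"
	"time"
)

// waitWithTimeout is illustrative only, not the actual terraform provisioner
// code. Before the fix, `case <-time.After(5 * time.Minute)` kept its timer
// alive for the full 5 minutes even when `done` fired first.
func waitWithTimeout(done <-chan struct{}) bool {
	timer := time.NewTimer(5 * time.Minute)
	defer timer.Stop() // releases the timer as soon as we return

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

func main() {
	done := make(chan struct{})
	close(done)
	fmt.Println(waitWithTimeout(done)) // true; the timer is stopped immediately
}
```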

TODO:

  • unhanger.Detector
  • Tests for unhanger.Detector
  • Automatically kill provisioner jobs if they haven't been able to successfully send an update in over 3 minutes
  • Put the constants for detector kill duration and provisioner kill duration in the same place with a comment explaining why they can't be the same

Closes https://github.com/coder/v2-customers/issues/190

@deansheather deansheather requested review from coadler and mafredri June 9, 2023 12:53
Member

@mafredri mafredri left a comment

Nice feature!

I left a few comments and thoughts. One general theme I observed is that this adds a service that races with the provisioner: the provisioner might race to complete the job while the unhanger writes the failure, and both write logs. I wonder if it would make sense to invert some of the behavior so that the unhanger asks the provisioner to do the job? For instance: unhanger -> insert stop request with reason -> provisioner picks up the request -> provisioner fails the job.

Perhaps you've thought about this and there's a good reason not to do it though, just putting it out there.

Edit: Reading your description again, I'm pretty sure you thought of it. I guess there could be a middle ground between these two implementations, but I'm not sure it's worth the effort or rewrite for now.

cli/server.go Outdated
hangDetectorTicker := time.NewTicker(cfg.JobHangDetectorInterval.Value())
defer hangDetectorTicker.Stop()
hangDetector := unhanger.New(ctx, options.Database, options.Pubsub, logger, hangDetectorTicker.C)
hangDetector.Run()
Member

Since unhanger has no Close method and Run isn't blocking, we have no way to ensure this service is actually shut down on exit. Ideally we would replace the ctx dependence outside the New function with a synchronous Close method to ensure cleanup. Alternatively we could defer <-hangDetector.Wait(), but then the caller must be careful to cancel the context later in the stack (easier to make mistakes).

Member Author

I added a Close function that cancels the context.
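
Roughly this shape (a generic sketch of the pattern, not the exact merged code):

```go
package main

import "context"

// service is a generic sketch: Close cancels the internal context and waits
// for the background loop to exit before returning.
type service struct {
	cancel context.CancelFunc
	done   chan struct{}
}

func start(ctx context.Context) *service {
	ctx, cancel := context.WithCancel(ctx)
	s := &service{cancel: cancel, done: make(chan struct{})}
	go func() {
		defer close(s.done)
		<-ctx.Done() // the real loop would also select on the ticker channel
	}()
	return s
}

// Close is synchronous, so callers know the loop has fully stopped when it returns.
func (s *service) Close() {
	s.cancel()
	<-s.done
}

func main() {
	s := start(context.Background())
	defer s.Close()
}
```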

FROM
provisioner_jobs
WHERE
updated_at < $1
Member

Minor semantic sanity check: do we want an inclusive or exclusive check here? I do feel like <= would make sense here considering the terminology (hung since).

Member Author

I don't think it matters since it's a time comparison: the precision is a millisecond (or smaller), so the chance of a collision is tiny (and it wouldn't have any consequence anyway).

@sreya sreya added this to the 🦛 Sprint 1 milestone Jun 13, 2023
@ammario
Member

ammario commented Jun 19, 2023

What's the status here?

@deansheather deansheather requested review from mafredri and coadler June 20, 2023 14:58
Member

@mafredri mafredri left a comment

Thanks for addressing my previous feedback! I found some corner cases this round that I believe we should try to address, and raised some questions where I wasn't sure whether something could be a problem or not.

PS. Tests seem to be failing.

}
}

stats.HungJobIDs = append(stats.HungJobIDs, job.ID)
Member

If you do refactor, stats could track both hung jobs and terminated jobs, given that an attempt to terminate could fail.
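
Something like this hypothetical shape, for example:

```go
package unhanger

import "github.com/google/uuid"

// Hypothetical shape only; the actual stats struct may differ. Terminated
// jobs are tracked separately from hung jobs since termination can fail.
type Stats struct {
	HungJobIDs       []uuid.UUID // jobs detected as hung during this run
	TerminatedJobIDs []uuid.UUID // jobs successfully marked as failed
	Error            error       // overall error for the run, if any
}
```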

// the stream so that we can control when to terminate the process.
killCtx, kill := context.WithCancel(context.Background())
defer kill()

// Ensure processes are eventually cleaned up on graceful
// cancellation or disconnect.
go func() {
<-stream.Context().Done()
<-ctx.Done()
Member

I need to ask a (perhaps stupid) question since we're relying on exit/hung timeouts to be predictable.

Do we have any way to ensure that this specific context is cancelled in the event that heartbeats or updates are hanging/failing/timing out? Let's say network conditions are such that the stream doesn't die and this stream context remains open, but provisioner heartbeats to coderd are not coming through (perhaps stream writes simply hang).

Or, let's say it takes 3 minutes longer for this context to be cancelled than the hang detector expects. We would then be waiting 3 + 3 minutes, and thus could still be canceling (SIGINT) the terraform apply for a minute after the job has been marked as terminated.

Member Author

I added a 30-second timeout to updates, and failed heartbeats will cause the stream context to be canceled, which should result in graceful cancellation starting immediately.
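
Roughly this (illustrative sketch; the send callback stands in for the real update RPC):

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// updateWithTimeout is illustrative only: it bounds a single update send with
// a 30-second timeout so a hung stream write cannot block forever.
func updateWithTimeout(ctx context.Context, send func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	return send(ctx)
}

func main() {
	err := updateWithTimeout(context.Background(), func(ctx context.Context) error {
		// A well-behaved send returns promptly; a hung one would be cut off
		// when ctx expires after 30 seconds.
		return nil
	})
	fmt.Println(err) // <nil>
}
```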

}

func (d *Detector) run(t time.Time) Stats {
ctx, cancel := context.WithTimeout(d.ctx, 5*time.Minute)
Member

In the event that we had, say, 100 hung jobs, a per-job transaction (see other comment) would also help ensure that this context is never too short to at least do some work (and eventually mark everything).

Member Author

That would be nice, but having multiple transactions while holding a lock is not possible with the in-memory DB. I tried doing multiple transactions at first but hit this issue. Adding support for multiple transactions to the in-memory DB seems like a bad idea because we don't have table/row locks like PSQL does.

Member Author

I might just add a limit of 10 jobs per run so we don't time out and risk rolling back.

@deansheather deansheather requested a review from mafredri June 21, 2023 19:12
Member

@mafredri mafredri left a comment

Nice work! I see some test is still unhappy, but other than that I think this looks good to go!

@deansheather deansheather enabled auto-merge (squash) June 25, 2023 13:09
@deansheather deansheather merged commit 98a5ae7 into main Jun 25, 2023
@deansheather deansheather deleted the dean/hang-detector branch June 25, 2023 13:17
@github-actions github-actions bot locked and limited conversation to collaborators Jun 25, 2023