Skip to content

fix: use background context for inmem provisionerd #11545

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 10, 2024

Conversation

spikecurtis
Copy link
Contributor

This test case fails with an error log, showing "context canceled" when trying to send an acquired job to an in-mem provisionerd.

https://github.com/coder/coder/runs/20331469006

In this case, we don't want to supress this error, since it could mean that we acquired a job, locked it in the database, then failed to send it to a provisioner.
(We also don't want to mark the job as failed because we don't know whether the job made it to the provisionerd or not --- in the failed test you can see that the job is actually processed just fine).

The reason we got context canceled is because the API was shutting down --- we don't want provisionerdserver to abruptly stop processing job stuff as the API shuts down as this will leave jobs in a bad state. This PR fixes up the use of contexts with provisionerdserver and the associated drpc service calls.

Copy link
Contributor Author

spikecurtis commented Jan 10, 2024

In this stack: 🛠️ Enhanced Provisioning with Background Context

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @spikecurtis and the rest of your teammates on Graphite Graphite

@spikecurtis spikecurtis changed the title fix: use background context for inmem provisionerd serve fix: use background context for inmem provisionerd serve to avoid orphaned jobs on shutdown Jan 10, 2024
@spikecurtis spikecurtis changed the title fix: use background context for inmem provisionerd serve to avoid orphaned jobs on shutdown fix: use background context for inmem provisionerd Jan 10, 2024
@@ -1371,10 +1371,10 @@ func newProvisionerDaemon(
connector[string(database.ProvisionerTypeTerraform)] = sdkproto.NewDRPCProvisionerClient(terraformClient)
}

return provisionerd.New(func(ctx context.Context) (proto.DRPCProvisionerDaemonClient, error) {
return provisionerd.New(func(dialCtx context.Context) (proto.DRPCProvisionerDaemonClient, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I don't think this rename is required as I feel this is the default way we should think about contexts, unless explicitly specified otherwise. But there's also no harm so feel free.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not clear from here that this function argument is dialer, so I thought it would be a nice reminder.

// drop a job that we locked in the database but never passed to the provisionerd. The
// provisionerd is local, in-mem, so there isn't a danger of losing contact with it and
// having a dead connection we don't know the status of.
err := server.Serve(context.Background(), serverSession)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check: will this result in provisioner jobs not being cancelled on shutdown? And is that really what we want?

Let's say a provisioner job (terraform) is running and it will take 5 minutes, I imagine we would like to ensure that job is cancelled gracefully, so we only wait for the cleanup but not the full 5 minutes. Is that what will happen or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in cli/server.go we explicitly close the provisionerDaemons before closing the API. Closing the provisioner daemon fails the active job. That's not exactly graceful, but at least it leaves the job in an understandable state.

@spikecurtis spikecurtis merged commit dfe8efc into main Jan 10, 2024
@spikecurtis spikecurtis deleted the spike/inmem-serve-context branch January 10, 2024 11:29
@github-actions github-actions bot locked and limited conversation to collaborators Jan 10, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants