Skip to content

fix: Allow terraform provisions to be gracefully cancelled #3526

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 18, 2022

Conversation

mafredri
Copy link
Member

@mafredri mafredri commented Aug 16, 2022

This change allows terraform commands to be gracefully cancelled on
Unix-like platforms by signaling interrupt on provision cancellation.

One implementation detail to note is that we do not necessarily kill a
running terraform command immediately even if the stream is closed. The
reason for this is to allow for graceful cancellation even in such an
event. Currently the timeout is set to 5 minutes by default.

Related: #2683

The above issue may be partially or fully fixed by this change.

This change allows terraform commands to be gracefully cancelled on
Unix-like platforms by signaling interrupt on provision cancellation.

One implementation detail to note is that we do not necessarily kill a
running terraform command immediately even if the stream is closed. The
reason for this is to allow for graceful cancellation even in such an
event. Currently the timeout is set to one minute, which was chosen
arbitrarily.

Also note that the `force` flag was added to `provisioner.proto` and is
handled in the `terraform` package, however, it is not used by the
`provisionerd/runner`. The reason is that the `runner` would need to be
refactored. Currently the "force stop" context is used as a base for
streams, and this would mean that in a force stop event, the stream
would close and we can't send out our force stop cancellation message.

Related: #2683

The above issue may be partially of fully fixed by this change.
@mafredri mafredri self-assigned this Aug 16, 2022
@kylecarbs
Copy link
Member

Should we remove force? If it's unused, it seems like there's little harm.

Could we add some tests for cancelation too? I'm not entirely sure how we would but just asking.

@mafredri
Copy link
Member Author

mafredri commented Aug 16, 2022

Should we remove force? If it's unused, it seems like there's little harm.

From proto? I'm open to do both approaches (keep it or remove it).

This is how it'd be used:

diff --git provisionerd/runner/runner.go provisionerd/runner/runner.go
index f26808c0..d9f68583 100644
--- provisionerd/runner/runner.go
+++ provisionerd/runner/runner.go
@@ -619,6 +619,13 @@ func (r *Runner) runTemplateImportProvision(values []*sdkproto.ParameterValue, m
        go func() {
                select {
                case <-r.notStopped.Done():
+                       _ = stream.Send(&sdkproto.Provision_Request{
+                               Type: &sdkproto.Provision_Request_Cancel{
+                                       Cancel: &sdkproto.Provision_Cancel{
+                                               Force: true,
+                                       },
+                               },
+                       })
                        return
                case <-r.notCanceled.Done():
                        _ = stream.Send(&sdkproto.Provision_Request{
@@ -773,6 +780,13 @@ func (r *Runner) runWorkspaceBuild() (
        go func() {
                select {
                case <-r.notStopped.Done():
+                       _ = stream.Send(&sdkproto.Provision_Request{
+                               Type: &sdkproto.Provision_Request_Cancel{
+                                       Cancel: &sdkproto.Provision_Cancel{
+                                               Force: true,
+                                       },
+                               },
+                       })
                        return
                case <-r.notCanceled.Done():
                        _ = stream.Send(&sdkproto.Provision_Request{

But like I mentioned in the PR description, needs a bit of refactor in the runner.

Could we add some tests for cancelation too? I'm not entirely sure how we would but just asking.

It's a bit tricky, but I can try to add some. Essentially we'd need to fake a terraform binary that we can tweak the behavior for in our test.

Copy link
Member

@kylecarbs kylecarbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm down to remove Force in entirety if we don't see a great use-case for now!

@mafredri
Copy link
Member Author

mafredri commented Aug 16, 2022

I'm down to remove Force in entirety if we don't see a great use-case for now!

The only use-case currently, as I see it, is to ensure (r *Runner) Fail() (or ForceStop) actually kills the job. Right now calling (r *Runner) Fail() would result in the stream disconnecting and triggering the 1 minute (or 5 minute, if we bump it) timeout after which terraform gets killed (if it's still running).

Comment on lines +117 to +124
return stream.Send(&proto.Provision_Response{
Type: &proto.Provision_Response_Complete{
Complete: &proto.Provision_Complete{
Error: err.Error(),
},
},
})
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kylecarbs yay or nay on this change? I'm thinking it'd be OK to return provision complete on cancellation, similar to plan and apply below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with it!

@mafredri
Copy link
Member Author

Cancellation tests have been added for init and apply.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants