Workspace goes to failed state and cannot be started, stopped or deleted after cancel-start->stop->start actions #2683

Closed
nadzeyav opened this issue Jun 27, 2022 · 4 comments
nadzeyav commented Jun 27, 2022

Problem

The workspace goes into a failed state and cannot be started, stopped, or deleted after the sequence cancel-start -> stop -> autostart.

I ran into an issue where I can no longer work with my workspace. It fails with the error: Error: Error creating instance: googleapi: Error 409: The resource 'projects/coder-dogfood/zones/europe-west4-b/instances/coder-nadzeya-nadzeya-test-autostart' already exists

Steps to reproduce:

coder version:
Coder v0.7.4-devel+bbbd5241

  1. Create a workspace (in my case, from the windows-gcp template on dev.coder.com)
  2. Stop it
  3. Start it, but cancel the start while it is in progress
  4. Stop it
  5. Try to start it again

Start fails with
Error: Error creating instance: googleapi: Error 409: The resource 'projects/coder-dogfood/zones/us-central1-a/instances/coder-nadzeya-nadzeya-windows' already exists, alreadyExists

After that, no other actions can be performed on the workspace.

Sample event order: (image)

Definition of Done

If a start is cancelled, a user should be able to start/stop a workspace without problems

@misskniss

@bpmct can we get a couple of us together and see:

  • is this still happening?
  • what behavior do we expect in these cases?

@mafredri mafredri assigned mafredri and unassigned bpmct Aug 16, 2022
@mafredri

I'm tackling this issue from the perspective that we currently terminate the Terraform process immediately on a cancel request, which does not give Terraform a chance to clean up resources.

This will be changed so that cancellation signals an interrupt and we then wait for Terraform to clean up. A second, forceful cancellation request terminates Terraform and may leave resources behind, but it can be necessary in cases where Terraform is stuck cleaning up.

mafredri added a commit that referenced this issue Aug 16, 2022
This change allows terraform commands to be gracefully cancelled on
Unix-like platforms by signaling interrupt on provision cancellation.

One implementation detail to note is that we do not necessarily kill a
running terraform command immediately even if the stream is closed. The
reason for this is to allow for graceful cancellation even in such an
event. Currently the timeout is set to one minute, which was chosen
arbitrarily.

Also note that the `force` flag was added to `provisioner.proto` and is
handled in the `terraform` package, however, it is not used by the
`provisionerd/runner`. The reason is that the `runner` would need to be
refactored. Currently the "force stop" context is used as a base for
streams, and this would mean that in a force stop event, the stream
would close and we can't send out our force stop cancellation message.

Related: #2683

The above issue may be partially or fully fixed by this change.
mafredri added a commit that referenced this issue Aug 18, 2022
* fix: Allow terraform provisions to be gracefully cancelled

This change allows terraform commands to be gracefully cancelled on
Unix-like platforms by signaling interrupt on provision cancellation.

One implementation detail to note is that we do not necessarily kill a
running terraform command immediately even if the stream is closed. The
reason for this is to allow for graceful cancellation even in such an
event. Currently the timeout is set to 5 minutes by default.

Related: #2683

The above issue may be partially or fully fixed by this change.

* fix: Remove incorrect minimumTerraformVersion variable

* Allow init to return provision complete response
@mafredri

With #3526 merged, I'm hoping this issue is fixed but I'll leave this ticket open for now in case there are more reports of it (it'll be part of the next release, the one after v0.8.5).

ammario commented Aug 18, 2022

Let's reopen if there is a new report.

@ammario ammario closed this as completed Aug 18, 2022