
Warn (or prevent) volumes from being deleted unintentionally #3386


Closed
mafredri opened this issue Aug 5, 2022 · 15 comments
Labels: api (Area: HTTP API), stale (This issue is like stale bread.)

@mafredri
Member

mafredri commented Aug 5, 2022

As a developer, I expect my Coder volume to stay around until my workspace is deleted, so that I can trust my half-finished work is not lost (barring exceptional circumstances).

Scenario 1: Currently there is one way for a volume to be deleted unintentionally (with #3000 there are two):

  1. A template author updates a template and changes the volume name format
  2. The user sees their workspace is outdated and goes ahead and updates
  3. The workspace is re-created, during start the old volume will be deleted (no longer referenced in the TF template) and a new one will be created in its place

Scenario 2: When #3000 is implemented, our current users will run into this scenario:

  1. The user renames their workspace
  2. The user (or autostart/stop) starts or stops the workspace
  3. The volume will be deleted and a new one created in its place

To avoid Scenario 2 (#3000), template authors should move to using data.coder_workspace.me.id instead of data.coder_workspace.me.name, which will prevent volume deletion due to a name change. However, this migration path means that all existing Coder workspaces will lose their volumes, either via the former case (rename in template + update) or via a workspace rename.
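A minimal sketch of the id-based naming, using the Docker provider as an example (resource names here are illustrative):

```hcl
data "coder_workspace" "me" {}

resource "docker_volume" "home_volume" {
  # Keyed on the immutable workspace ID, so renaming the workspace
  # (or tweaking the name format) does not force volume replacement.
  name = "coder-${data.coder_workspace.me.id}-home"
}
```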

There are a few ways we could dampen this impact:

  1. We could detect when a template push would cause this and warn the operator
  2. We could detect when a workspace update would cause this and warn the user
  3. We could detect when a rename would cause this and warn the user
  4. We could consider workspace volume names immutable after creation (i.e. we compute the volume name and store it separately, re-use it for the lifetime of the workspace)

It could be argued that renaming volumes in the template is an operator error and we shouldn't care about it. On the flip side, operator error or not, it's something we could have guarded against to prevent a potential loss in developer productivity.

I think option 4 would be best, as it lets us guard against data loss in all cases. It's also the one requiring the fewest changes to our code to handle the edge cases.

At minimum, we should implement 1 and 2 so that operators and users are aware of the impact. It'd be great to have 3 as well, but it's not mandatory if we assume all our users will migrate to using IDs.


I think this is something of a feature-bug, so I've labeled it as such. Depending on solutions it will also require CLI and frontend work (but I've omitted those tags for now).

@mafredri
Member Author

mafredri commented Aug 9, 2022

An alternative idea to solve this would be to introduce the concept of "workspace features". This could be a relatively flexible way to introduce new features without risk of breakage.

If we take Scenario 1 for example:

  1. Operator updates a template with new name for volumes
  2. Workspace no longer supports the update workspace feature because the volume name would change
  3. The user is asked to delete and re-create the workspace (essentially what would happen otherwise as well)
  4. The new workspace supports the update workspace feature again

If we take Scenario 2:

  1. User tries to rename workspace
  2. User receives error that template does not support the workspace rename feature (because the volume name would change)
  3. User is asked to delete/create
  4. The new workspace would also not support the workspace rename feature (template needs updating)

Alternatively, if we implement immutable volume names, then Scenario 2 would look like:

  1. User tries to rename workspace
  2. Workspace does not support the rename workspace feature because it does not have an immutable volume name (has not been stopped/started since Coder upgrade)
  3. User is asked to stop/start the workspace so that the workspace rename feature can be enabled

@kylecarbs
Member

kylecarbs commented Aug 9, 2022

I wonder if the names expose a larger problem with our usage of Terraform.

There isn't a sure-fire way to ensure a persistent volume isn't deleted unless we plan, and prompt the user to confirm the action. That'd be my preferred route - when you update a template, explicitly show the user how this will affect the infrastructure.

@mafredri
Member Author

mafredri commented Aug 9, 2022

I think that could be an acceptable solution to the issue. I do think the deletion of a volume needs to be prominent, though; if we train users that the output of "plan" (e.g. update workspace) is usually safe, eventually they will start ignoring it (this is my hunch, anyway) and won't notice the potentially destructive action.

How would planning work for the different states of a workspace? For instance, do we plan start or stop for a workspace depending on its state? What if we only plan stop, but start has the destructive action? Can we plan both?

@spikecurtis
Contributor

Coder templates are a powerful abstraction that means that ordinary users don't usually need to care what is happening in the infrastructure. They press Start and get a workspace. In complex scenarios, end users will likely be unable to fully understand infrastructure level information because it is deliberately hidden from them by their platform teams.

I also think we should try to not give template authors a loaded gun that they can shoot themselves in the foot with. In my mind, this means we either shouldn't pass in any unstable names/identifiers at all (i.e. deprecate coder_workspace.me.name), or make it a hard error if they use unstable IDs as resource names.

@mafredri
Member Author

mafredri commented Aug 10, 2022

I agree with that take @spikecurtis; what Terraform does (or the resulting state) is only really meaningful to operators. I see no reason that an end user would even want to understand what is going on (in most cases), and this would somewhat take away from the usefulness of showing plan output.

I also think we should try to not give template authors a loaded gun that they can shoot themselves in the foot with.

Disallowing name entirely would help, for sure, but I see the loaded gun in template authors being able to customize the volume name at all. Let's say they go through a refactor and want to change coder-[owner_id]-[id]-root to coder-[owner_id]-[id]-homevol, for whatever reason. The result would be the same (data loss). Both are stable names, but they can be changed, which is the problem.

If we don't make it hard (or rather, extremely hard) to lose data -- data will be lost. Ultimately it won't be the template author's fault, it will be "Coder lost my/our data".


I talked a bit with @Emyrk yesterday and he pointed out that perhaps workspace parameters (#802) could somehow be utilized for this. Perhaps a volume name could be stored as a workspace parameter, with the template format only utilized on workspace creation. It could still be editable on a per-workspace basis, but it could be treated specially (i.e. warning: data loss imminent).

@bpmct
Member

bpmct commented Aug 23, 2022

Terraform's prevent_destroy meta-argument was created as a "catch-all" to prevent an important resource from being destroyed. There is also the ignore_changes meta-argument that ensures the resource will not update even if there were changes to a variable/data source/output.

I agree with a more comprehensive approach that lets a template admin indicate "this resource should not be deleted" instead of entirely relying on the admin to use best practices with parameters and naming conventions (use uid instead of username, etc.). We might not be able to predict all the ways a terraform apply may delete a workspace, and unexpected destruction is very harmful.

Leveraging prevent_destroy in examples would be a Terraform-native approach; however, we'd have to do some magic to ensure the resource can, in fact, be deleted upon workspace deletion (see hashicorp/terraform#22544).
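A sketch of what that could look like in a template (the lifecycle block is standard Terraform; whether Coder rewrites it away on workspace deletion is exactly the open question above):

```hcl
resource "docker_volume" "home_volume" {
  name = "coder-${data.coder_workspace.me.id}-home"

  lifecycle {
    # Makes any plan that would destroy this volume fail with an error
    # instead of applying. Note: this also blocks legitimate workspace
    # deletion unless the block is removed (e.g. on-the-fly) first.
    prevent_destroy = true
  }
}
```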

We also have a way to apply arbitrary Coder-specific data to resources with coder_metadata. Perhaps it could warn the user (#3384) if a delete is about to occur (or prevent it altogether) if we add a destruction = (allow/warn/deny) property for resources. This feels a bit silly to do via metadata since it affects the lifecycle of the resource. It could get fancy, however: a warn_message property could communicate the severity to the user: This change will delete all the files in your workspace

We could consider workspace volume names immutable after creation (i.e. we compute the volume name and store it separately, re-use it for the lifetime of the workspace)

I really like this solution for naming-related changes! I know other changes might also destroy a resource before re-building. For example, setting a Kubernetes volume size to a lower-than-previous value.

Scenario 2: When #3000 is implemented, our current users will run into this scenario:

The user renames their workspace
The user (or autostart/stop) starts or stops the workspace
The volume will be deleted and a new one created in its place

This seems like a good use case for Terraform's ignore_changes meta-argument, to ensure the volume name does not change even if there were changes to the data source. I do think we'll need to have other stopgaps (see above) to ensure other changes don't result in data loss.
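A minimal sketch of that approach, again assuming the Docker provider: the name is computed from the (mutable) workspace name at creation, after which changes to it no longer force replacement:

```hcl
resource "docker_volume" "home_volume" {
  name = "coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}-home"

  lifecycle {
    # The name is evaluated once at creation; subsequent workspace
    # renames no longer trigger a destroy-and-recreate of the volume.
    ignore_changes = [name]
  }
}
```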

@bpmct bpmct added this to the EE milestone Aug 23, 2022
@mafredri
Member Author

mafredri commented Aug 24, 2022

Thanks for your research @bpmct! The ignore_changes option looks like a great fit for protecting the user's volume from deletion due to name changes.

It might still be problematic in the rename case, though. Consider that a workspace is named work and its volume work-homevol. The user renames the workspace to old (the volume is still work-homevol). The user now creates a new workspace named work and, depending on how the Terraform provider is implemented, it might either fail, inherit, or delete the existing work-homevol. The workspaces may then end up fighting back and forth over the same volume resource.

Another option discussed with @deansheather would be to use immutable variables, say we had a new resource named coder_immutable_value:

resource "coder_immutable_value" "home_volume_name" {
  value = "coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}-home"
}

resource "docker_volume" "home_volume" {
  name = coder_immutable_value.home_volume_name.value
}

Both options still require our templates to set best practices, and template authors to follow them (which is unfortunate).


I was hoping we'd be able to leverage the output from Terraform (coder state pull ws) to peek at the name, and "make it immutable" (e.g. by modifying the template on-the-fly). But the more I think about it, we simply can't know all use-cases. Even if we added all known volume resources (digitalocean_volume, etc.) to be checked, some volumes may be ephemeral by intent.

This could work fine with metadata attached to the volume though, informing Coder that it must be protected.

@bpmct
Member

bpmct commented Aug 24, 2022

I think #3663 partially addresses this, but I think we should have a more declarative way to "protect" resources (prevent_destroy or via metadata).

@bpmct
Member

bpmct commented Aug 24, 2022

How would coder_immutable_value work if the workspace is renamed and another workspace tries to create a volume with that name? I see the problem with ignore_changes but may be missing how this addresses it.

@mafredri
Member Author

I think #3663 partially addresses this, but I think we should have a more declarative way to "protect" resources (prevent_destroy or via metadata).

Yeah, it'd definitely be an improvement, but I'm not very happy about the fact that operator error can still result in data loss (e.g. renaming the volume name format, adding a prefix, etc.).

How would coder_immutable_value work if the workspace is renamed and another workspace tries to create a volume with that name? I see the problem with ignore_changes but may be missing how this addresses it.

I'm not sure how we'd specifically do that, I just have some vague ideas. For instance, maybe we could add a scope = global_unique to the immutable variable (or name the resource similarly). Maybe we could then use terraform plan to determine whether the value being created is globally unique or not?

@mafredri
Member Author

I've been thinking about a potential solution for this that allows guiding template authors and informing users. This solution would introduce a new resource, coder_protected_volume. The motivation behind it is the following:

  • This resource can tell us which resource(s) represents the important user data
  • It can be used to automagically protect the volume resource via lifecycle changes
  • It can tell us what the unique constraint is (e.g. volume name) so that conflicting workspaces can't be created
    • This is important so that we never try to adopt an existing resource belonging to another workspace (use-case e.g. renames / bad templates)
  • If this resource is missing
    • We can inform the template author that there are no protected volumes and this can result in data loss for users
    • We can inform users that their workspace doesn't have any protected volumes and that it can result in data loss e.g. when the template is updated or the workspace is renamed
resource "docker_volume" "home_volume" {
  name = "coder-${data.coder_workspace.me.owner}-${data.coder_workspace.me.name}-home"

  # Added automagically by coder_protected_volume or enforced via checks
  # when validating template or provisioning workspace.
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [name]
  }
}

resource "coder_protected_volume" "home_volume" {
  volume_id = docker_volume.home_volume.id
  unique    = ["name"]  # Optional? Some resources may not require this.
}

Since prevent_destroy cannot use variables, we'll need to change this value on-the-fly if the workspace is being deleted. It might not be strictly necessary, but it's a nice safety net.

I believe this use-case (protecting user data) could be unique enough to warrant the addition of the explicit coder_protected_volume resource, but another option would be a more generic approach, via:

resource "coder_protected_resource" "home_volume" {
  resource_id = docker_volume.home_volume.id
  type        = "volume"  # Supported types e.g. [volume, generic]
  unique      = ["name"]
}

Questions I don't (yet) have the answer to:

  • Will we be able to make the appropriate pre/post-processing to the Terraform template?
  • Can we do the id link/traversal during terraform plan even though IDs are generally known only after apply?
  • There may be cases where the unique argument can only be inspected after apply (e.g. using inputs from another provider); how do we handle these? Or are they rare enough that we don't care / only surface a warning?

Thoughts @bpmct @kylecarbs?

@bpmct
Member

bpmct commented Sep 22, 2022

I believe this use-case (protecting user data) could be unique enough to warrant the addition of the explicit coder_protected_volume resource, but another option would be a more generic approach.

I'd be in favor of a more generic resource that doesn't mention "volume" since I may want to protect an entire VM, dedicated machine, API key instead of a volume 👍🏼


coder_protected_resource

What do you think about using coder_metadata which we already have to apply arbitrary values to workspaces?

resource "coder_metadata" "volume" {
  resource_id = docker_volume.homedir.id
  protected   = true
  unique      = ["name"]
}

type        = "volume"  # Supported types e.g. [volume, generic]

Would the type be used in order to detect/warn the admin if they are creating a workspace without a protected volume?


  # Added automagically by coder_protected_volume or enforced via checks
  # when validating template or provisioning workspace.
  lifecycle {
    prevent_destroy = true
    ignore_changes  = [name]
    item {
      key   = "whatever"
      value = "here"
    }
  }

What is the rationale for introducing a new resource type vs. authors using generic Terraform lifecycle blocks in the template? This may be misguided, but my rationale is that using familiar TF concepts would be easier for template authors to understand. Can parsing generic lifecycle blocks give us all the information we need (detect protected resources, detect unique variables, etc.) in order to warn authors and prevent actions?

My understanding is that the resource coder_protected_(resource/volume) would automatically inject the lifecycle values during terraform apply. If we introduced 0 new resource types and added lifecycle blocks to default templates with proper comments, could we add similar logic to remove the lifecycle values when a terraform destroy occurs (template deletion)?

Without a unique resource type, is it difficult for Coder to recognize which values are unique and warn admins if their templates aren't protected? I don't know how we parse the Terraform, so I genuinely don't know.


There may be cases where the unique argument can only be inspected after apply (e.g. using inputs form another provider), how do we handle these? Or are they rare enough that we don't care / only surface a warning?

I think it's fine to surface a warning here. My main concern is data loss, which I think prevent_destroy handles.

@mafredri
Member Author

What do you think about using coder_metadata which we already have to apply arbitrary values to workspaces?

It can probably work, but I wanted to differentiate between these to keep metadata cleaner and to distinguish between "attaching information" vs. "creating a constraint".

Would the type be used in order to detect/warn the admin if they are creating a workspace without a protected volume?

Yes, exactly so (and also the user). We can't easily know whether a resource is a volume or not otherwise (we could program in knowledge of common providers, but that would have its limitations).

What is the rationale for introducing a new resource type vs authors using generic Terraform lifecycle blocks in the template?

The rationale is that if an author adds a lifecycle block, we know nothing more about that resource than that it has a lifecycle block. We still need to know if it's a volume and appropriately protected.

Regarding putting variables or item {} inside the lifecycle block, I don't think that would be standard/valid Terraform? Are you suggesting we parse the TF file and detect/remove the custom block?

If I'm not entirely mistaken, manual parsing of a .tf file is a bit difficult (I think Terraform doesn't make these methods importable?). So the premise of my suggestion is that the resulting Terraform template has to be successfully parseable by the terraform command, so that we can then modify the resulting plan. But I may have some poor assumptions here about what is and isn't possible.

Without a unique resource type, is it difficult for Coder to recognize which values are unique + warn admins if their templates aren't protected? I don't know we parse the Terraform, so genuinely don't know.

To my understanding, yes. The problem is that we don't know the meaning of arguments/values for different providers. For instance, the Docker provider will delete and recreate volumes if name changes. But it could just as well be named volname; Coder can't know its purpose or effect. So without hinting at what creates uniqueness, we can't protect against resource conflicts caused by ignore_changes.

mafredri added a commit that referenced this issue Nov 8, 2022
mafredri added a commit that referenced this issue Nov 9, 2022
* fix: Use immutable names for volumes in example templates

This contributes towards #3000, #3386

Related #3409

* Add lifecycle and labels
@github-actions

This issue is becoming stale. In order to keep the tracker readable and actionable, I'm going to close this issue in 7 days if there isn't more activity.

@github-actions github-actions bot added the stale This issue is like stale bread. label Nov 28, 2022
@github-actions github-actions bot closed this as not planned Dec 6, 2022
@julianpoy

I'd really love to see this implemented. Several issues point to this one, but it's closed as not planned due to going stale :(

5 participants