What to do when workspace is state-bricked? #2256

Closed
ammario opened this issue Jun 10, 2022 · 18 comments
Labels
api (Area: HTTP API), site (Area: frontend dashboard)

Comments

@ammario
Member

ammario commented Jun 10, 2022

So, my workspace state has become misaligned with infrastructure. It looks like this in the logs:

14:02:06.474 Plan: 1 to add, 0 to change, 0 to destroy.
14:02:06.873 kubernetes_deployment.coder[0]: Creating...
14:02:06.919 kubernetes_deployment.coder[0]: Creation errored after 0s
14:02:06.937 Error: Failed to create deployment: deployments.apps "coder-ammario-ab" already exists

I'm now dead in the water until an infrastructure admin deletes that deployment or repulls my state. What can we do to let the user reconcile broken state? This kind of downtime didn't exist in v1, and is a big threat to OSS usability IMO.


Can we have Terraform force delete all ephemeral infrastructure? This seems like it would reconcile the state.

@ammario added the bug label on Jun 10, 2022
@ketang
Contributor

ketang commented Jun 10, 2022

Misaligned how?

I wonder if we can extend the template format to include an rm -r equivalent. I think it would have to be on top of Terraform rather than using Terraform; I infer what you're describing in terms of misalignment is that the state file doesn't match reality.

Resource tracking for auditing and also for quotas would be helpful in this case.

@ammario
Member Author

ammario commented Jun 10, 2022

@ketang if the terraform command is terminated mid-way through, the infrastructure can be deleted or created without the state being updated. Of course, there's no way to guarantee that the terraform command will ever complete, so there's no way to avoid some kind of explicit reconciliation.
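To make the failure mode above concrete, here is a minimal sketch of how an interrupted apply can leave state and infrastructure out of step. It is a toy model, not Terraform or Coder code: `FakeCluster`, `apply`, and the `interrupt_before_save` flag are hypothetical names invented for illustration.

```python
class AlreadyExists(Exception):
    """Raised when the provider refuses to create a resource that is already there."""


class FakeCluster:
    """Stands in for real infrastructure (hypothetical, for illustration only)."""

    def __init__(self):
        self.deployments = set()

    def create(self, name):
        if name in self.deployments:
            raise AlreadyExists(f'deployments.apps "{name}" already exists')
        self.deployments.add(name)


def apply(cluster, state, name, interrupt_before_save=False):
    """Create `name` unless the state already tracks it, then record it in state."""
    if name in state:
        return "no changes"
    cluster.create(name)
    if interrupt_before_save:
        # The process was killed here: the resource exists, but the state never learned of it.
        return "interrupted"
    state.add(name)
    return "created"


cluster, state = FakeCluster(), set()

# First build is terminated after the resource is created but before state is saved.
first = apply(cluster, state, "coder-ammario-ab", interrupt_before_save=True)

# Second build plans "1 to add" because the state is empty, and then collides.
try:
    second = apply(cluster, state, "coder-ammario-ab")
except AlreadyExists as err:
    second = f"Error: Failed to create deployment: {err}"

print(first)
print(second)
```

Every later build in this model fails the same way until something either deletes the stray resource or teaches the state about it, which is the reconciliation step being discussed.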

@sharkymark
Contributor

@ammario a 'Re-pull state' button?

@ammario
Member Author

ammario commented Jun 10, 2022

@mtm20176 terraform import is supported on the most common resources, but not all, so that isn't a sufficient solution. I think telling terraform to try to delete everything, even if it thinks nothing exists, is the best way to do a "hard reset". As long as this skips persistent resources, there won't be any data loss.

@ketang
Contributor

ketang commented Jun 10, 2022

How do we featurize this "hard reset?" Also could resources attached to an incompletely-built workspace have orphaned identifiers that cannot be retrieved or reconstructed?

@ammario
Member Author

ammario commented Jun 10, 2022

> How do we featurize this "hard reset?" Also could resources attached to an incompletely-built workspace have orphaned identifiers that cannot be retrieved or reconstructed?

Yes, it's possible that we leak orphaned resources. One idea I had was attaching the build serial number to the end of each resource. That would help identify orphans for periodic cleanup in large installations. It's a later problem though. We could also create a tool that does this for people automatically. A nice enterprise feature.


Kyle proposed a solution whereby you can direct the build to use the last successful state. In 95% of cases, this should allow terraform to assume ownership of the disassociated resources and recover.
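A minimal sketch of the "use the last successful state" idea, assuming the server keeps one state blob per build along with a status. The `Build` record, its status strings, and `last_successful_state` are hypothetical and are not taken from Coder's codebase.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Build:
    number: int
    status: str   # hypothetical labels: "succeeded", "failed", "canceled"
    state: bytes  # state blob stored after this build


def last_successful_state(builds: List[Build]) -> Optional[bytes]:
    """Return the state of the newest build that succeeded, or None if there is none."""
    for build in sorted(builds, key=lambda b: b.number, reverse=True):
        if build.status == "succeeded":
            return build.state
    return None


history = [
    Build(1, "succeeded", b'{"serial": 1}'),
    Build(2, "succeeded", b'{"serial": 2}'),
    Build(3, "failed", b'{"serial": 3}'),
]

# The failed build 3 is skipped, so the next build would start from build 2's state.
recovered = last_successful_state(history)
print(recovered)
```

This only covers picking the state to start from; whether the provisioner can then recover depends on the resources involved.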

@ketang
Contributor

ketang commented Jun 10, 2022

Do most cloud systems support some form of arbitrary resource tagging? If so we could tag them with <workspace id> + <sequence number>.
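As a rough sketch of how such tags could be used for cleanup, assume every resource carries a `workspace_id` tag and a `build_sequence` tag. Both tag names and the `find_orphans` helper are hypothetical; the inventory below is made-up sample data.

```python
from typing import Dict, List


def find_orphans(
    resources: Dict[str, Dict[str, object]],
    workspace_id: str,
    current_sequence: int,
) -> List[str]:
    """List resources tagged for this workspace whose sequence is not the current one."""
    return sorted(
        name
        for name, tags in resources.items()
        if tags.get("workspace_id") == workspace_id
        and tags.get("build_sequence") != current_sequence
    )


# Made-up inventory: resource name -> tags read back from the cloud provider.
inventory = {
    "vm-a": {"workspace_id": "ws-1", "build_sequence": 3},
    "vm-b": {"workspace_id": "ws-1", "build_sequence": 4},
    "vm-c": {"workspace_id": "ws-2", "build_sequence": 1},
    "untagged": {},
}

orphans = find_orphans(inventory, "ws-1", current_sequence=4)
print(orphans)
```

Keeping the workspace ID and sequence number as separate tags avoids having to parse a combined string, which matters if workspace IDs themselves contain separators.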

@ammario
Member Author

ammario commented Jun 10, 2022

> Do most cloud systems support some form of arbitrary resource tagging? If so we could tag them with <workspace id> + <sequence number>.

The problem with this approach is that the persistent disk has a stable identifier (it can't be renamed), so it's possible that the new resources you're trying to create are competing with the old ones for that disk.

@ketang
Contributor

ketang commented Jun 11, 2022

Mmm okay. I wonder if there's a way to use sequence numbers here and have a field for the last successful batch of persistent resources. If you try to create one and something fails, you increment the number for the next try. If creation succeeds, you leave the number alone. I think this should allow for reattachment (another flavor of success) while also making it obvious which resources need to be purged, either manually or automatically. We should be able to do this with one sequence number for all the persistent resources associated with a single workspace.
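One possible reading of this scheme, sketched under stated assumptions: a single counter per workspace, bumped only after a failed attempt, plus a field recording the last sequence that succeeded. `WorkspaceSequence`, `record_attempt`, and `purgeable` are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class WorkspaceSequence:
    current: int = 0                       # sequence used for the next create attempt
    last_successful: Optional[int] = None  # sequence of the last batch that built cleanly


def record_attempt(seq: WorkspaceSequence, succeeded: bool) -> WorkspaceSequence:
    """On success, keep the number and remember it; on failure, move to the next number."""
    if succeeded:
        seq.last_successful = seq.current
    else:
        seq.current += 1
    return seq


def purgeable(seq: WorkspaceSequence, seen_sequences: Iterable[int]) -> List[int]:
    """Sequences found on real resources that do not belong to the last good batch."""
    return sorted(s for s in set(seen_sequences) if s != seq.last_successful)


seq = WorkspaceSequence()
for outcome in (False, False, True):  # two failed attempts, then a success
    record_attempt(seq, outcome)

print(seq.current, seq.last_successful)
print(purgeable(seq, [0, 1, 2]))
```

A later successful rebuild leaves `current` unchanged, which is what would allow reattachment to the same batch of persistent resources.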

@ammario
Member Author

ammario commented Jun 12, 2022

> Mmm okay. I wonder if there's a way to use sequence numbers here and have a field for the last successful batch of persistent resources. If you try to create one and something fails, you increment the number for the next try. If creation succeeds, you leave the number alone. I think this should allow for reattachment (another flavor of success) while also making it obvious which resources need to be purged, either manually or automatically. We should be able to do this with one sequence number for all the persistent resources associated with a single workspace.

There is probably a solution there. It would add another coderism to the way people configure Terraform, so I think we should see how far we get with @kylecarbs's last-state-reuse idea first.

@ketang
Contributor

ketang commented Jun 13, 2022

Ah, sorry, I meant for those two things to be used together. If the last-state-reuse idea works 95% of the time, then 95% of the time customers won't see our idiosyncrasy, while the other 5% of the time they'll appreciate it (I hope).

Also, how Terraform-specific is the state reuse? I guess it works for any IaC that has rich local state. If it's weak or nonexistent, though, that'll call for some way to inspect current state or just completely ignore it. Or track it ourselves, I guess.

@misskniss added the needs grooming, site (Area: frontend dashboard), and api (Area: HTTP API) labels on Jun 14, 2022
@ammario
Member Author

ammario commented Jun 15, 2022

> Also, how Terraform-specific is the state reuse? I guess it works for any IaC that has rich local state. If it's weak or nonexistent, though, that'll call for some way to inspect current state or just completely ignore it. Or track it ourselves, I guess.

The other major declarative one (Pulumi) builds many of its providers on top of Terraform providers, so it is likely resolved in the same way. I'm not sure how an imperative configuration management tool like Ansible or Chef would fit into our model in general.

If we built a Docker provider, we could append the version number on the provisioner side so it's out of mind for the configurer. If we built a Kubernetes provider, the declarative interface saves us and this issue disappears entirely.

@ketang
Contributor

ketang commented Jun 15, 2022

> I'm not sure how an imperative configuration management tool like Ansible or Chef would fit into our model in general.

We're going to have to support them.

@presleyp
Contributor

Current workaround, where YOUR_WORKSPACE_NAME is a value like presleyp/my-workspace and SOME_TEXT_FILE is a value like my-data.txt:

coder state pull YOUR_WORKSPACE_NAME > SOME_TEXT_FILE
coder state push YOUR_WORKSPACE_NAME SOME_TEXT_FILE

I'm told this sometimes fixes it and sometimes doesn't. cc @bpmct

@ammario
Member Author

ammario commented Jul 25, 2022

An admin can always intervene and modify the state. I'm interested in finding out how a user can resolve this on their own, avoiding downtime.

@ammario
Member Author

ammario commented Jul 25, 2022

> Current workaround, given variables YOUR_WORKSPACE_NAME, with a value like presleyp/my-workspace, and SOME_TEXT_FILE, with a value like my-data.txt.
>
> coder state pull YOUR_WORKSPACE_NAME > SOME_TEXT_FILE
> coder state push YOUR_WORKSPACE_NAME SOME_TEXT_FILE
>
> I'm told this sometimes fixes it and sometimes doesn't. cc @bpmct

Since these commands don't modify state I don't understand how this would affect behavior. It's repushing the same exact state that was already in the database.
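One way to check this claim on a real workspace would be to fingerprint the state before and after the round trip. The sketch below assumes the pulled file is a Terraform JSON state document; the sample strings are made up and `state_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def state_fingerprint(raw: str) -> str:
    """Hash a JSON state document in a key-order-independent, whitespace-independent way."""
    doc = json.loads(raw)
    canonical = json.dumps(doc, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up examples standing in for the file contents before and after the round trip.
pulled = '{"version": 4, "serial": 7, "lineage": "example", "resources": []}'
pushed = '{"serial": 7, "resources": [], "version": 4, "lineage": "example"}'

unchanged = state_fingerprint(pulled) == state_fingerprint(pushed)
print(unchanged)  # True: same content, only the key order differs
```

If the fingerprints match, any change in behavior would have to come from a side effect of the push itself rather than from the state contents.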

@kylecarbs
Member

#3526 helps with this. Should we add docs to improve this?

@kylecarbs
Member

Seems like this is solved with #3844 and some cancellation work that was done!

6 participants