What to do when workspace is state-bricked? #2256
Misaligned how? I wonder if we can extend the template format to include an

Resource tracking for auditing and also for quotas would be helpful in this case.

@ketang if the terraform command is terminated mid-way through, infrastructure can be created or deleted without the state being updated. Of course, there's no way to guarantee that the terraform command will ever complete, so there's no way to avoid some kind of explicit reconciliation.

@ammario a 'Re-pull state' button?

@mtm20176 terraform state import is supported on the most common resources, but not all, so that isn't a sufficient solution. I think telling terraform to try to delete everything, even if it thinks the resources already exist, is the best way to do a "hard reset". As long as this avoids persistent resources, there won't be any data loss.
How do we featurize this "hard reset?" Also could resources attached to an incompletely-built workspace have orphaned identifiers that cannot be retrieved or reconstructed? |
Yes, it's possible that we leak orphaned resources. One idea I had was attaching the build serial number to the end of each resource. That would help identify orphans for periodic cleanup in large installations. It's a later problem, though. We could also create a tool that does this for people automatically: a nice enterprise feature.

Kyle proposed a solution whereby you can direct the build to use the last successful state. In 95% of cases, this should allow terraform to assume ownership of the disassociated resources and recover.
Do most cloud systems support some form of arbitrary resource tagging? If so, we could tag them with
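Most major clouds do support free-form tags or labels. As a sketch only (the `coder_workspace_id` and `coder_build_number` inputs are hypothetical, not an existing Coder feature), the tagging idea could look like:

```hcl
# Hypothetical sketch: tag each resource with the workspace and build that
# created it, so orphans from interrupted builds can be found and swept later.
resource "aws_instance" "workspace" {
  ami           = var.ami_id
  instance_type = "t3.medium"

  tags = {
    coder_workspace_id = var.coder_workspace_id # assumed input
    coder_build_number = var.coder_build_number # assumed input
  }
}
```

A periodic sweeper could then list resources whose build number never corresponded to a successful build and delete them.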
The problem with this approach is that the persistent disk has a stable identifier (it can't be renamed), so it's possible that the new resources you're trying to create are competing with the old ones for that disk.
Mmm okay. I wonder if there's a way to use sequence numbers here and have a field for the last successful batch of persistent resources. If you try to create one and something fails, you increment the number for the next try. If creation succeeds, you leave the number alone. I think this should allow for reattachment (another flavor of success) while also making it obvious which resources need to be purged, either manually or automatically. We should be able to do this with one sequence number for all the persistent resources associated with a single workspace.
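A minimal sketch of that sequence-number idea, assuming a hypothetical `persistent_seq` input that the build system increments only after a failed creation attempt (all names here are illustrative):

```hcl
# Hypothetical sketch: the persistent disk's stable name embeds a sequence
# number. On success the number is unchanged, so the next build reattaches
# the same disk; after a failure the number is bumped, so the retry creates
# a fresh disk instead of competing with the half-created one for its name.
variable "persistent_seq" {
  type    = number
  default = 0
}

resource "google_compute_disk" "home" {
  name = "coder-${var.workspace_name}-home-${var.persistent_seq}"
  size = 50
  zone = var.zone
}
```

Disks left behind at older sequence numbers are then exactly the ones eligible for purging, manually or by an automatic sweeper.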
There is probably a solution there. It would add another coderism to the way people configure Terraform, so I think we should see how far we get with @kylecarbs's last-state-reuse idea first.
Ah, sorry, I meant for those two things to be used together. If the last-state-reuse idea works 95% of the time, then 95% of the time customers won't see our idiosyncrasy, while the other 5% of the time they'll appreciate it (I hope). Also, how Terraform-specific is the state reuse? I guess it works for any IaC that has rich local state. If it's weak or nonexistent, though, that'll call for some way to inspect current state or just completely ignore it. Or track it ourselves, I guess.
The other major declarative one (Pulumi) is backed by Terraform, so likely resolved in the same way. I'm not sure how an imperative configuration management tool like Ansible or Chef would fit into our model in general. If we built a Docker provider, we could append the version number on the provisioner side so it's out of mind for the configurer. If we built a Kubernetes provider, the declarative interface saves us and this issue disappears entirely.
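For the Docker case, the version-number suffix might look like the following (illustrative only; `build_number` is an assumed input that a provisioner would inject, not an existing provider attribute):

```hcl
# Hypothetical sketch: the provisioner appends the build number so a
# container left over from a cancelled build never collides with the name
# the next build wants to use.
resource "docker_container" "workspace" {
  name  = "coder-${var.workspace_name}-${var.build_number}"
  image = var.image
}
```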
We're going to have to support them. |
Current workaround, given variables YOUR_WORKSPACE_NAME, with a value like
I'm told this sometimes fixes it and sometimes doesn't. cc @bpmct |
An admin can always intervene and modify the state. I'm interested in finding out how a user can resolve this on their own, avoiding downtime. |
Since these commands don't modify state, I don't understand how this would affect behavior. It's re-pushing the exact same state that was already in the database.
#3526 helps with this. Should we add docs to improve this? |
Seems like this is solved with #3844 and some cancellation work that was done! |
So, my workspace state has become misaligned with infrastructure. It looks like this in the logs:
I'm now dead in the water until an infrastructure admin deletes that deployment or re-pulls my state. What can we do to let the user reconcile broken state on their own? This kind of downtime didn't exist in v1, and is a big threat to OSS usability IMO.
Can we have Terraform force delete all ephemeral infrastructure? This seems like it would reconcile the state.