Description
Problem
If a prebuilt workspace's template uses ignore_changes as recommended in the docs, its agent may not reconnect after a workspace template upgrade. For example:
resource "docker_container" "workspace" {
  lifecycle {
    ignore_changes = all
  }
  count      = data.coder_workspace.me.start_count
  entrypoint = ["sh", "-c", coder_agent.main.init_script]
  env        = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]
  ...
}
Details
A template upgrade kicks off a start build. start builds set coder_workspace.start_count to 1, which is used in the count attribute of compute resources (see the example above). If the workspace is already started, Terraform will attempt to update in place any resource that already has count=1.
A start build causes the coder_agent to be recreated, which generates a new auth token. Normally, without ignore_changes, the env attribute above would be modified, since the token value changes. env is immutable (i.e. defined as ForceNew), so Terraform sees any change to this attribute as drift from the original and forces a replacement. The docker_container would therefore be recreated, and the agent would start afresh and connect to the control plane with the new token.
With ignore_changes, however, changes to these attributes are ignored so that prebuilds can work. The template update therefore has no real effect on the workspace's compute resource, but the coder_agent's token still changes. The running agent keeps using the previous token, while the control plane only accepts the new one, so the agent can no longer connect on behalf of the workspace.
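To make the chain above easier to follow, here is a minimal, self-contained sketch of a template that I would expect to hit this situation. It is an illustration, not the exact template from the report: the provider sources, the image, and the container name are assumptions I filled in for completeness, and the prebuild preset configuration is left out entirely.

```hcl
terraform {
  required_providers {
    coder = {
      source = "coder/coder" # assumed provider source
    }
    docker = {
      source = "kreuzwerker/docker" # assumed provider source
    }
  }
}

data "coder_workspace" "me" {}

# Recreated on every start build, which generates a new auth token.
resource "coder_agent" "main" {
  os   = "linux"
  arch = "amd64"
}

resource "docker_container" "workspace" {
  # start_count is 1 on start builds, so an already running workspace
  # keeps count = 1 and this resource is a candidate for an in-place update.
  count = data.coder_workspace.me.start_count

  name  = "coder-${data.coder_workspace.me.id}" # hypothetical name
  image = "codercom/enterprise-base:ubuntu"     # hypothetical image

  entrypoint = ["sh", "-c", coder_agent.main.init_script]

  # env is ForceNew: without ignore_changes, the new token value here
  # would force the container to be replaced.
  env = ["CODER_AGENT_TOKEN=${coder_agent.main.token}"]

  lifecycle {
    # Suppresses the env diff, so the container (and the old token
    # baked into it) survives the template upgrade.
    ignore_changes = all
  }
}
```

With this configuration, the container keeps running with the token from the build that created it, while coder_agent.main.token has already moved on to a new value.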
Workaround
Manually restarting the workspace will allow the agent to reconnect successfully.
Proposed Solution
Template updates should not be start builds, but rather a logical restart (i.e. successive stop and start builds), in order to guarantee the behaviour customers expect. This should apply to claimed prebuilt workspaces and regular workspaces alike, so that the compute resource is always created anew. I fear the current start-only mechanism is working by accident: Terraform drift takes care of destroying and recreating the resource, like a stop + start would.