Description
Problem
TL;DR: provisionerd can hang when it gets a 403 (for example, after a license change) instead of exiting with an error so that it can try reconnecting later.
A developer modified the license on dev.coder.com to test out an issue.
After their testing, they were unable to restore the original license and accidentally replaced it with a different one that did not include the entitlement for external provisioners.
This caused the external provisioner deployments to disconnect with status 403 as expected:
{"ts":"2025-02-26T12:47:10.029756766Z","level":"ERROR","msg":"not authorized to dial coderd","caller":"/home/runner/work/coder/coder/provisionerd/provisionerd.go:240","func":"github.com/coder/coder/v2/provisionerd.(*Server).connect","fields":{"error":"GET https://dev.coder.com/api/v2/organizations/default/provisionerdaemons/serve?id=2d9b6b15-03d0-464d-a6f5-48f64e836bec\u0026name=coder-provisioner-tagged-57bf5899-4ph9k\u0026provisioner=terraform\u0026version=1.3\u0026version=1.3: unexpected status code 403: External provisioner daemons is an Enterprise feature. Contact sales!"}}
{"ts":"2025-02-26T12:47:10.029848836Z","level":"DEBUG","msg":"connect loop exited","caller":"/home/runner/work/coder/coder/provisionerd/provisionerd.go:241","func":"github.com/coder/coder/v2/provisionerd.(*Server).connect"}
The license was later replaced.
As we regularly build and deploy main to dogfood, one of the provisioner deployments was restarted and reconnected successfully. However, the corresponding restart step was not being run for the second provisioner deployment (corrected here: coder/coder#16716).
The provisioner deployment that was not restarted by CI was left 'hanging' in that state instead of exiting with an error. Users of the dogfood deployment encountered "pending" workspace builds. After manually restarting the deployment, the provisioner reconnected successfully and builds proceeded.