Feature Request: Improve Workspace Recovery if backup/restore only a limited subset #17260
bjornrobertsson
started this conversation in
Feature Requests
Replies: 1 comment
-
I don't think there is value in restoring running workspaces (agents). But yes, backing up the persistent storage of all workspaces and the Coder DB itself with Velero looks promising. So I am interested in a use case where we don't have to make any changes in Coder and Velero can work independently. We can probably write an integration guide on how to configure Velero to perform backups of the Coder DB and Coder workspaces.
-
Understanding the Coder Agent Authentication Issue with Velero Backup/Restore
When a Workspace and its resources are restored, e.g. from a Velero backup, the Workspace is running but is not associated correctly and has lost its connection to coderd.
Velero supports different backup methods, e.g. Full:
Based on the guidance, Coder only suggests backup of the PostgreSQL database: https://coder.com/docs/admin/infrastructure/validated-architectures#disaster-recovery
But for instance-related recovery, a smaller subset would include only Workspaces, for example when you want to restore only ONE Workspace (reasons vary, but a partial backup is logical when addressing a limited problem):
Status seen in Workspace pod
The issue occurs when restoring a Coder workspace with Velero but is likely to happen with any partial restore.
Expecting to be able to restart the Workspace is reasonable, but in the restored state the 'Retry' button is not enabled.
Root Cause Analysis (of the error and Coder GitHub code)
The key files involved in agent authentication:
coderd/workspaceagents.go - Handles agent authentication and RPC
agent/agent.go - The agent connection logic
provisionersdk/proto/provisioner.go - Defines the provisioning job states
Claude's Suggestions:
I recommend modifying the coderd/workspaceagents.go file to enable a "recovery mode" for agents that have been restored from backup:
Alternative Approach
Another potentially more secure approach would be to add a recovery endpoint specifically for restored agents:
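To make the idea of a dedicated recovery endpoint more concrete, here is a minimal sketch in Go. It is not Coder's actual code or API: the recoveryStore type, the recoverRequest fields, and the recoverHandler function are all hypothetical names invented for illustration, and the in-memory map stands in for whatever the Coder database would hold. The assumption is that an admin issues a one-time recovery token per agent, and that the endpoint accepts it exactly once.

```go
package main

import (
	"crypto/subtle"
	"encoding/json"
	"net/http"
	"sync"
)

// recoveryStore is a hypothetical in-memory stand-in for the Coder database.
// It maps an agent ID to a one-time recovery token issued by an admin.
type recoveryStore struct {
	mu     sync.Mutex
	tokens map[string]string
}

// consume reports whether token matches the one stored for agentID.
// On success the token is deleted so it cannot be replayed.
func (s *recoveryStore) consume(agentID, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	want, ok := s.tokens[agentID]
	if !ok || subtle.ConstantTimeCompare([]byte(want), []byte(token)) != 1 {
		return false
	}
	delete(s.tokens, agentID)
	return true
}

// recoverRequest is a hypothetical request body for the recovery endpoint.
type recoverRequest struct {
	AgentID       string `json:"agent_id"`
	RecoveryToken string `json:"recovery_token"`
}

// recoverHandler returns a handler that accepts a POST with a recoverRequest
// and answers 204 only if the one-time token is valid.
func recoverHandler(store *recoveryStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var req recoverRequest
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if !store.consume(req.AgentID, req.RecoveryToken) {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		// A real implementation would re-associate the restored agent
		// with its workspace here; this sketch only acknowledges success.
		w.WriteHeader(http.StatusNoContent)
	}
}
```

The handler could be mounted with http.Handle on any path. Keeping recovery on a separate, single-use code path means the normal agent authentication logic stays untouched, which is presumably why this approach is described as potentially more secure.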
Implementation Notes
This solution allows legitimate restored agents to reconnect while maintaining security. The agent would need to attempt the normal authentication flow first, and if that fails, try the recovery mechanism.