Description
Background
Customers' workspace builds frequently fail because of flaky, transient issues. This costs significant developer time, as most users expect to open the dashboard to an already-started workspace (triggered by autostart). When the autostart build fails, they gain no benefit from the scheduling tool and must wait while the workspace builds again, sometimes succeeding only after multiple attempts.
In many cases, workspace builds take 10 to 30 minutes. Because almost all templates depend on external services that cannot guarantee 100% uptime, this issue can easily cost hours of developer time each week.
From the developer's perspective, there is an idle period during which the control plane could be retrying the workspace build, giving the transient issue time to resolve on its own. Admins and users could later audit the failed builds via asynchronous Notifications.
Many customers have requested an automated retry system to reduce developer time lost to this issue.
The problem is most impactful on failed autostart, but we should consider automated retries for all builds if some templates take an hour or longer to build. A developer may kick off the start job and expect to come back to a ready environment. By not restricting auto-retries to automated builds, we would let template admins minimize idle time in that case as well. This could also help with build cleanup.
Proposal
We add a "Retry Failed Builds" option for templates that can apply to all builds or only to automated starts. It would be opt-in, configured in template settings by setting the maximum number of retries to attempt.
When enabled, the control plane will automatically retry a failed start build up to N times.
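The behavior proposed above could be sketched roughly as follows. This is a minimal illustration only, not the control plane's actual implementation: `RetryScope`, `RetryPolicy`, and `run_start_build` are hypothetical names, and the sketch assumes that a maximum of 0 retries means the option is off and that retries happen immediately with no delay between attempts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Tuple


class RetryScope(Enum):
    """Hypothetical values for which builds the option covers."""
    ALL_BUILDS = "all_builds"
    AUTOSTART_ONLY = "autostart_only"


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical per-template setting; max_retries=0 keeps the feature off (opt-in)."""
    max_retries: int = 0
    scope: RetryScope = RetryScope.AUTOSTART_ONLY

    def applies_to(self, automated: bool) -> bool:
        # Retries are disabled unless an admin has set a positive maximum.
        if self.max_retries <= 0:
            return False
        # Automated starts are always covered; manual builds only with ALL_BUILDS.
        return automated or self.scope is RetryScope.ALL_BUILDS


def run_start_build(
    build: Callable[[], bool], policy: RetryPolicy, automated: bool
) -> Tuple[bool, int]:
    """Run the build once, then retry up to policy.max_retries times on failure.

    Returns (succeeded, number_of_failed_attempts). The failed attempts are
    what admins and users would later audit via notifications.
    """
    retries_allowed = policy.max_retries if policy.applies_to(automated) else 0
    failed_attempts = 0
    for _ in range(1 + retries_allowed):
        if build():
            return True, failed_attempts
        failed_attempts += 1
    return False, failed_attempts
```

With a policy of `RetryPolicy(max_retries=3)`, an autostart build that fails twice and then succeeds would return `(True, 2)`, while a manually started build under the same policy would not be retried. A real implementation would probably also wait between attempts so the transient issue has time to clear; that detail is outside this sketch.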