Description
Problem
While reading #15429, @mafredri noticed that the current dormancy FailureTTL logic does not account for a workspace whose previous 'Stop' transition failed.
Here is the current logic:
// isEligibleForFailedStop returns true if the workspace is eligible to be stopped
// due to a failed build.
func isEligibleForFailedStop(build database.WorkspaceBuild, job database.ProvisionerJob, templateSchedule schedule.TemplateScheduleOptions, currentTick time.Time) bool {
	// If the template has specified a failure TTL.
	return templateSchedule.FailureTTL > 0 &&
		// And the job resulted in failure.
		job.JobStatus == database.ProvisionerJobStatusFailed &&
		// And the failed build was a start transition.
		build.Transition == database.WorkspaceTransitionStart &&
		// And sufficient time has elapsed since the job has completed.
		job.CompletedAt.Valid &&
		currentTick.Sub(job.CompletedAt.Time) > templateSchedule.FailureTTL
}
From the above, any workspace whose latest provisioner job failed and whose latest transition was not 'start' is not eligible for a FailureTTL stop.
I was able to replicate this behaviour by creating a template that would always fail to stop correctly:
resource "null_resource" "dontstopmenow" {
  count = data.coder_workspace.me.start_count == 0 ? 1 : 0
  provisioner "local-exec" {
    command = "/bin/false" # I'm having a good time
  }
}
I did notice that workspace deletion still kicked in, though I had set fairly aggressive values for FailureTTL, time until dormant, and dormant deletion.
This does mean that, for a workspace in a failed stop state, whatever resources the stop transition failed to destroy will continue to exist until the delete rolls around. (And if the template remains broken, the delete may simply have to orphan resources.)
Proposed Solution
Remove the check on the latest build transition. Any workspace build that failed more than FailureTTL ago should be stopped if the template schedule FailureTTL is set.
However, this may mean 'infinite' attempts to stop the workspace if the template is simply broken.
We may need to place a cap on the maximum consecutive attempts to automatically stop a failed workspace.