fix: don't create autostart workspace builds with no available provisioners #19067

cstyan · 2025-07-29T01:00:34Z

This should fix #17941

With just the test addition the test will fail since the workspace build is triggered, we attempt to move the workspace into the start state: require.Equal(t, codersdk.WorkspaceTransitionStop, workspace.LatestBuild.Transition)

    lifecycle_executor_test.go:1718: 
                Error Trace:    /home/coder/coder/coderd/autobuild/lifecycle_executor_test.go:1718
                Error:          Not equal: 
                                expected: "stop"
                                actual  : "start"
                            
                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,2 +1,2 @@
                                -(codersdk.WorkspaceTransition) (len=4) "stop"
                                +(codersdk.WorkspaceTransition) (len=5) "start"
                                 
                Test:           TestExecutorAutostartSkipsWhenNoProvisionersAvailable

With the 2nd commit we skip the calls to wsbuilder.New and builder.Build and so the transition is never scheduled.

coder@automatic-builds-test:~/coder/coderd/autobuild$ go test -run TestExecutorAutostartSkipsWhenNoProvisionersAvailable
PASS
ok      github.com/coder/coder/v2/coderd/autobuild      1.134s
coder@automatic-builds-test:~/coder/coderd/autobuild$ go test -run TestExecutorAutostartSkipsWhenNoProvisionersAvailable -v | grep autostart
    t.go:106: 2025-07-28 22:11:09.405 [warn]  coderd.autobuild: skipping autostart - no available provisioners  workspace_id=219f4901-dfd0-4cdc-8cf2-258861adfd18  workspace_name=dazzling-jemison3-89l

However, many tests don't fully setup provisioner or wait for the provisioner + DB to be fully available, so the third commit adds a somewhat smelly way to skip the provisioner check in tests that don't need to test that functionality.

deansheather · 2025-07-31T02:47:00Z

coderd/autobuild/lifecycle_executor.go

+	}
+
+	// Check if any provisioners are active (not stale)
+	now := dbtime.Now()


This should be using the passed in time from runOnce rather than getting it's own time

deansheather · 2025-07-31T02:47:38Z

coderd/autobuild/lifecycle_executor.go

+	if testing.Testing() {
+		staleInterval = TestingStaleInterval
+	}


If you use the passed in time from runOnce I don't think changing the interval would be necessary, you could write the test in a way that the default interval works fine

deansheather · 2025-07-31T02:49:21Z

coderd/autobuild/lifecycle_executor.go

@@ -52,6 +56,9 @@ type Executor struct {
 	experiments           codersdk.Experiments

 	metrics executorMetrics
+
+	// Skip provisioner availability check (should only be true for *some* tests)


It says it should only be true for some tests, but it seems it's true for almost every test using coderdtest. How many tests are actually broken by this change? If it's only a few, I'd rather we just fix those tests (e.g. by adding provisioners) rather than add a skip flag.

provisioner is not available Signed-off-by: Callum Styan <callumstyan@gmail.com>

transition so that we don't accumulate autostart jobs that will eventually have to be cleaned up by the reaper routine Signed-off-by: Callum Styan <callumstyan@gmail.com>

DB/provisioner to be available, so add a way to skip the check Signed-off-by: Callum Styan <callumstyan@gmail.com>

Signed-off-by: Callum Styan <callumstyan@gmail.com>

…ndling This commit addresses the requirements to: 1. Comment out provisionerCloser.Close() in TestExecutorAutostartSkipsWhenNoProvisionersAvailable 2. Remove the need for SkipProvisionerCheck by refactoring tests Changes: - Remove SkipProvisionerCheck field and logic from Executor - Remove SkipProvisionerCheck option from coderdtest.Options - Add proper provisioner daemon tags to match template requirements - Add helper functions for waiting for provisioner availability - Update tests to use real provisioner availability checks - Fix timing issues in provisioner staleness tests - Comment out provisioner close call to test proper behavior The test now correctly validates that provisioners remain available when not explicitly closed, and uses the real hasAvailableProvisioners logic instead of bypassing it. Signed-off-by: Callum Styan <callumstyan@gmail.com>

autobuild path have a valid provisioner available and pass an appropriate time.Time to the channel for triggering the build based on whether the provisioner is supposed to be valid or stale Signed-off-by: Callum Styan <callumstyan@gmail.com>

cstyan · 2025-08-07T20:33:33Z

Okay @deansheather thanks for your patience, I think this is finally ready for another review. Took quite a bit of digging with the help of Blink to find all the bits and pieces that had to be updated to get the tests working without the skipProvisionerCheck flag. I've done my own run through of the changes twice now but I may have still missed some small stuff, nit pick away 👍

The TL; DR of the changes is that we in tests which deal with autobuilds we need:

to create the test in a way where we also return the handle to the database instance
ensure the test creates a provisioner with the relevant tags for the workspace that will be used within that test case (in most cases this is just the default owner: "", scope: organization but there is a mechanism for using specific tags in at least one of the tests)
based on whether the autobuild should happen or not, we need to ensure we can retrieve the relevant provisioner and sest the scheduled tick time for the autobuild appropriately (should the provisioner be valid, or should it be stale)

coderd/autobuild/lifecycle_executor.go

coderd/autobuild/lifecycle_executor_test.go

deansheather · 2025-08-08T02:52:11Z

enterprise/coderd/workspaces_test.go

-		ticker <- build.Job.CompletedAt.Add(failureTTL * 2)
+		tickTime := build.Job.CompletedAt.Add(failureTTL * 2)
+
+		p, err := coderdtest.GetProvisionerForWorkspace(t, db, time.Now(), ws.OrganizationID, database.ProvisionerJob{})


provisionerdserver.StaleInterval is 90s, so is this necessary? Most tests have a timeout of <30s.

I'm not sure what this comment was referring to honestly.

In a lot of the autobuild tests, you query the provisionerd and force update it to be current with coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime) right before it ticks, even if the test isn't doing anything funky with the timings. Since the stale interval is 90s, I don't think most tests need this change to be made to them no? Unless there's another reason, like our fake provisioner daemons not setting times correctly?

coderd/coderdtest/coderdtest.go

TestExecutorAutostartSkipsWhenNoProvisionersAvailable Signed-off-by: Callum Styan <callumstyan@gmail.com>

Signed-off-by: Callum Styan <callumstyan@gmail.com>

any before that Signed-off-by: Callum Styan <callumstyan@gmail.com>

Signed-off-by: Callum Styan <callumstyan@gmail.com>

deansheather · 2025-08-12T03:02:38Z

coderd/autobuild/lifecycle_executor_test.go

+	p, err := coderdtest.GetProvisionerForTags(db, time.Now(), workspace.OrganizationID, map[string]string{})
+	require.NoError(t, err)
 	// When: the autobuild executor ticks after the scheduled time
 	go func() {
-		tickCh <- sched.Next(workspace.LatestBuild.CreatedAt)
+		tickTime := sched.Next(workspace.LatestBuild.CreatedAt)
+		coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime)
+		tickCh <- tickTime


W.R.T the last comment from the previous review, this is what I'm referring to. Is this change (and other similar changes) necessary?

Yes, AFAICT it is. Otherwise the test is flakey, it would only pass if you run the test within StaleInterval seconds of the top of any hour.

When one of these tests starts, the provisioner has a LastSeenAt time of time.Now() (sometimes + 1 second or so), but sched.Next(workspace.LatestBuild.CreatedAt) will return a value that is the top of the hour.

For example: right now if I comment out the UpdateProvisionerLastSeenAt line here, in hasValidProvisioner the pd.LastSeenAt time is 2025-08-12 16:34:02.908042 +0000 UTC but t from the tickCh is 2025-08-12 17:00:00 +0000 UTC. So in a real running deployment of Coder that provisioner would be considered stale.

This is the exact thing we were trying to solve for with the addition of hasValidProvisioner. So any test case which utilized the autobuild functionality but needed to continue passing needed to update the LastSeenAt value. In the test I have added, which should see the provisioner as stale and then assert that there were no autobuild transitions initiated, we ensure the LastSeenAt time is far enough in the past that the provisioner is stale.

coderd/autobuild/lifecycle_executor_test.go

deansheather · 2025-08-12T03:33:43Z

coderd/autobuild/lifecycle_executor_test.go

+	// Ensure the provisioner is stale
+	staleTime := sched.Next(workspace.LatestBuild.CreatedAt).Add((-1 * provisionerdserver.StaleInterval) + -10*time.Second)
+	coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, staleTime)
+
+	// Trigger autobuild
+	tickCh <- sched.Next(workspace.LatestBuild.CreatedAt)


This is racey since the provisioner daemon is still running and can update it's stats. It'd be less prone to racing if the provisioner daemon was closed.

db, pubsub := dbtestutil.New(t) client, _, api := coderdtest.NewWithAPI(t, &coderdtest.Options{ Database: db, Pubsub: pubsub, IncludeProvisionerDaemon: false, }) daemon1Closer := coderdtest.NewTaggedProvisionerDaemon(t, api, "name", tags) t.Cleanup(func() { _ = daemon1Closer.Close() }) // create the workspace daemon1Closer.Close() // <-- this is the important line, prevents the race I described next := sched.Next(workspace.LatestBuild.CreatedAt) staleTime := next.Add(-1 * provisionerdserver.StaleInterval - 10 *time.Second) coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, staleTime) tickCh <- next // Assertions... // Create a new provisioner daemon with the same tags like you just did before, then ensure it works

made these changes, lmk if this test still looks off to you

deansheather · 2025-08-12T03:37:32Z

coderd/coderdtest/coderdtest.go

+func MustWaitForAnyProvisioner(t *testing.T, db database.Store) {
+	t.Helper()
+	ctx := ctxWithProvisionerPermissions(testutil.Context(t, testutil.WaitShort))
+	require.Eventually(t, func() bool {
+		daemons, err := db.GetProvisionerDaemons(ctx)
+		return err == nil && len(daemons) > 0
+	}, testutil.WaitShort, testutil.IntervalFast)
+}


Is this function necessary? In all of your tests there is an admin client there also, so you could just use the WithClient variant of this function instead.

I don't have much insight into why, but yes the tests which are currently using MustWaitForAnyProvisioner fail if they're changed to MustWaitForAnyProvisionerWithClient. I can do a bit more digging on that today.

I decided to change things so that we just have MustWaitForAnyProvisioner, but we could have gone either way.

The test which was using MustWaitForAnyProvisioner was calling CreateFirstUser after the MustWaitForAnyProvisioner call, so the client did not have an admin session token yet and would 401. The test which was calling MustWaitForAnyProvisionerWithClient was just calling corderdtest.New instead of coderdtest.NewWithDatabase.

coderd/coderdtest/coderdtest.go

deansheather · 2025-08-12T03:38:54Z

enterprise/coderd/workspaces_test.go

+		p, err := coderdtest.GetProvisionerForTags(db, time.Now(), ws.OrganizationID, nil)
+		require.NoError(t, err)
+		coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime)
+		ticker <- tickTime


Also related to the comment from my previous review, is this necessary?

Yes, for the same reason as above.

new test Signed-off-by: Callum Styan <callumstyan@gmail.com>

Signed-off-by: Callum Styan <callumstyan@gmail.com>

github-actions bot assigned cstyan Jul 29, 2025

cstyan changed the title ~~fix: don't create autostart workspace builds if there is no available provisioners for the job~~ fix: don't create autostart workspace builds with no available provisioners Jul 29, 2025

deansheather self-requested a review July 30, 2025 17:04

deansheather reviewed Jul 31, 2025

View reviewed changes

cstyan added 5 commits August 5, 2025 19:54

add test that fails since we still trigger autobuilds even if a

314bb66

provisioner is not available Signed-off-by: Callum Styan <callumstyan@gmail.com>

add check for available provisioners before we try to do a workspace

310511e

transition so that we don't accumulate autostart jobs that will eventually have to be cleaned up by the reaper routine Signed-off-by: Callum Styan <callumstyan@gmail.com>

fix provisioner check for tests which don't fully set up/wait for the

1256877

DB/provisioner to be available, so add a way to skip the check Signed-off-by: Callum Styan <callumstyan@gmail.com>

use the runOnce time

26f5001

Signed-off-by: Callum Styan <callumstyan@gmail.com>

cstyan force-pushed the callum-autostart-no-provisioner branch from 9971794 to 78a8f5e Compare August 5, 2025 20:01

deansheather reviewed Aug 8, 2025

View reviewed changes

cstyan added 7 commits August 11, 2025 19:49

significantly simplify

0bc4cce

TestExecutorAutostartSkipsWhenNoProvisionersAvailable Signed-off-by: Callum Styan <callumstyan@gmail.com>

more refactor and clean up

9a3f534

Signed-off-by: Callum Styan <callumstyan@gmail.com>

fix lint

a98ef1d

Signed-off-by: Callum Styan <callumstyan@gmail.com>

hasValidProvisioner only needs tags

e48c724

Signed-off-by: Callum Styan <callumstyan@gmail.com>

more cleanup of function params

883c748

Signed-off-by: Callum Styan <callumstyan@gmail.com>

already waiting for the workspaces provisioner, don't need to wait for

3c90c6d

any before that Signed-off-by: Callum Styan <callumstyan@gmail.com>

fix print statement missed in earlier commit

5a6493c

Signed-off-by: Callum Styan <callumstyan@gmail.com>

deansheather reviewed Aug 12, 2025

View reviewed changes

cstyan added 2 commits August 12, 2025 17:30

address more review feedback; mostly around provisioner closers in the

c5f0d41

new test Signed-off-by: Callum Styan <callumstyan@gmail.com>

remove MustWaitForAnyProvisionerWithClient

224acfc

Signed-off-by: Callum Styan <callumstyan@gmail.com>

fix: don't create autostart workspace builds with no available provisioners #19067

Are you sure you want to change the base?

fix: don't create autostart workspace builds with no available provisioners #19067

Conversation

cstyan commented Jul 29, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cstyan commented Aug 7, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!