Skip to content

fix: don't create autostart workspace builds with no available provisioners #19067

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

cstyan
Copy link
Contributor

@cstyan cstyan commented Jul 29, 2025

This should fix #17941

With just the test addition the test will fail since the workspace build is triggered, we attempt to move the workspace into the start state: require.Equal(t, codersdk.WorkspaceTransitionStop, workspace.LatestBuild.Transition)

    lifecycle_executor_test.go:1718: 
                Error Trace:    /home/coder/coder/coderd/autobuild/lifecycle_executor_test.go:1718
                Error:          Not equal: 
                                expected: "stop"
                                actual  : "start"
                            
                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,2 +1,2 @@
                                -(codersdk.WorkspaceTransition) (len=4) "stop"
                                +(codersdk.WorkspaceTransition) (len=5) "start"
                                 
                Test:           TestExecutorAutostartSkipsWhenNoProvisionersAvailable

With the 2nd commit we skip the calls to wsbuilder.New and builder.Build and so the transition is never scheduled.

coder@automatic-builds-test:~/coder/coderd/autobuild$ go test -run TestExecutorAutostartSkipsWhenNoProvisionersAvailable
PASS
ok      github.com/coder/coder/v2/coderd/autobuild      1.134s
coder@automatic-builds-test:~/coder/coderd/autobuild$ go test -run TestExecutorAutostartSkipsWhenNoProvisionersAvailable -v | grep autostart
    t.go:106: 2025-07-28 22:11:09.405 [warn]  coderd.autobuild: skipping autostart - no available provisioners  workspace_id=219f4901-dfd0-4cdc-8cf2-258861adfd18  workspace_name=dazzling-jemison3-89l

However, many tests don't fully setup provisioner or wait for the provisioner + DB to be fully available, so the third commit adds a somewhat smelly way to skip the provisioner check in tests that don't need to test that functionality.

@cstyan cstyan changed the title fix: don't create autostart workspace builds if there is no available provisioners for the job fix: don't create autostart workspace builds with no available provisioners Jul 29, 2025
@deansheather deansheather self-requested a review July 30, 2025 17:04
}

// Check if any provisioners are active (not stale)
now := dbtime.Now()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be using the passed in time from runOnce rather than getting it's own time

Comment on lines 148 to 141
if testing.Testing() {
staleInterval = TestingStaleInterval
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use the passed in time from runOnce I don't think changing the interval would be necessary, you could write the test in a way that the default interval works fine

@@ -52,6 +56,9 @@ type Executor struct {
experiments codersdk.Experiments

metrics executorMetrics

// Skip provisioner availability check (should only be true for *some* tests)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It says it should only be true for some tests, but it seems it's true for almost every test using coderdtest. How many tests are actually broken by this change? If it's only a few, I'd rather we just fix those tests (e.g. by adding provisioners) rather than add a skip flag.

cstyan added 5 commits August 5, 2025 19:54
provisioner is not available

Signed-off-by: Callum Styan <callumstyan@gmail.com>
transition so that we don't accumulate autostart jobs that will
eventually have to be cleaned up by the reaper routine

Signed-off-by: Callum Styan <callumstyan@gmail.com>
DB/provisioner to be available, so add a way to skip the check

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
…ndling

This commit addresses the requirements to:
1. Comment out provisionerCloser.Close() in TestExecutorAutostartSkipsWhenNoProvisionersAvailable
2. Remove the need for SkipProvisionerCheck by refactoring tests

Changes:
- Remove SkipProvisionerCheck field and logic from Executor
- Remove SkipProvisionerCheck option from coderdtest.Options
- Add proper provisioner daemon tags to match template requirements
- Add helper functions for waiting for provisioner availability
- Update tests to use real provisioner availability checks
- Fix timing issues in provisioner staleness tests
- Comment out provisioner close call to test proper behavior

The test now correctly validates that provisioners remain available
when not explicitly closed, and uses the real hasAvailableProvisioners
logic instead of bypassing it.

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan cstyan force-pushed the callum-autostart-no-provisioner branch from 9971794 to 78a8f5e Compare August 5, 2025 20:01
autobuild path have a valid provisioner available and pass an
appropriate time.Time to the channel for triggering the build based on
whether the provisioner is supposed to be valid or stale

Signed-off-by: Callum Styan <callumstyan@gmail.com>
@cstyan
Copy link
Contributor Author

cstyan commented Aug 7, 2025

Okay @deansheather thanks for your patience, I think this is finally ready for another review. Took quite a bit of digging with the help of Blink to find all the bits and pieces that had to be updated to get the tests working without the skipProvisionerCheck flag. I've done my own run through of the changes twice now but I may have still missed some small stuff, nit pick away 👍

The TL; DR of the changes is that we in tests which deal with autobuilds we need:

  • to create the test in a way where we also return the handle to the database instance
  • ensure the test creates a provisioner with the relevant tags for the workspace that will be used within that test case (in most cases this is just the default owner: "", scope: organization but there is a mechanism for using specific tags in at least one of the tests)
  • based on whether the autobuild should happen or not, we need to ensure we can retrieve the relevant provisioner and sest the scheduled tick time for the autobuild appropriately (should the provisioner be valid, or should it be stale)

ticker <- build.Job.CompletedAt.Add(failureTTL * 2)
tickTime := build.Job.CompletedAt.Add(failureTTL * 2)

p, err := coderdtest.GetProvisionerForWorkspace(t, db, time.Now(), ws.OrganizationID, database.ProvisionerJob{})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

provisionerdserver.StaleInterval is 90s, so is this necessary? Most tests have a timeout of <30s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what this comment was referring to honestly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a lot of the autobuild tests, you query the provisionerd and force update it to be current with coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime) right before it ticks, even if the test isn't doing anything funky with the timings. Since the stale interval is 90s, I don't think most tests need this change to be made to them no? Unless there's another reason, like our fake provisioner daemons not setting times correctly?

cstyan added 7 commits August 11, 2025 19:49
TestExecutorAutostartSkipsWhenNoProvisionersAvailable

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
any before that

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Comment on lines +59 to +65
p, err := coderdtest.GetProvisionerForTags(db, time.Now(), workspace.OrganizationID, map[string]string{})
require.NoError(t, err)
// When: the autobuild executor ticks after the scheduled time
go func() {
tickCh <- sched.Next(workspace.LatestBuild.CreatedAt)
tickTime := sched.Next(workspace.LatestBuild.CreatedAt)
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime)
tickCh <- tickTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

W.R.T the last comment from the previous review, this is what I'm referring to. Is this change (and other similar changes) necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, AFAICT it is. Otherwise the test is flakey, it would only pass if you run the test within StaleInterval seconds of the top of any hour.

When one of these tests starts, the provisioner has a LastSeenAt time of time.Now() (sometimes + 1 second or so), but sched.Next(workspace.LatestBuild.CreatedAt) will return a value that is the top of the hour.

For example: right now if I comment out the UpdateProvisionerLastSeenAt line here, in hasValidProvisioner the pd.LastSeenAt time is 2025-08-12 16:34:02.908042 +0000 UTC but t from the tickCh is 2025-08-12 17:00:00 +0000 UTC. So in a real running deployment of Coder that provisioner would be considered stale.

This is the exact thing we were trying to solve for with the addition of hasValidProvisioner. So any test case which utilized the autobuild functionality but needed to continue passing needed to update the LastSeenAt value. In the test I have added, which should see the provisioner as stale and then assert that there were no autobuild transitions initiated, we ensure the LastSeenAt time is far enough in the past that the provisioner is stale.

Comment on lines +1703 to +1708
// Ensure the provisioner is stale
staleTime := sched.Next(workspace.LatestBuild.CreatedAt).Add((-1 * provisionerdserver.StaleInterval) + -10*time.Second)
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, staleTime)

// Trigger autobuild
tickCh <- sched.Next(workspace.LatestBuild.CreatedAt)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is racey since the provisioner daemon is still running and can update it's stats. It'd be less prone to racing if the provisioner daemon was closed.

db, pubsub := dbtestutil.New(t)
client, _, api := coderdtest.NewWithAPI(t, &coderdtest.Options{
  Database: db,
  Pubsub: pubsub,
  IncludeProvisionerDaemon: false,
})

daemon1Closer := coderdtest.NewTaggedProvisionerDaemon(t, api, "name", tags)
t.Cleanup(func() {
  _ = daemon1Closer.Close()
})

// create the workspace

daemon1Closer.Close() // <-- this is the important line, prevents the race I described

next := sched.Next(workspace.LatestBuild.CreatedAt)
staleTime := next.Add(-1 * provisionerdserver.StaleInterval - 10 *time.Second)
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, staleTime)
tickCh <- next

// Assertions...

// Create a new provisioner daemon with the same tags like you just did before, then ensure it works

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made these changes, lmk if this test still looks off to you

Comment on lines +1646 to +1653
func MustWaitForAnyProvisioner(t *testing.T, db database.Store) {
t.Helper()
ctx := ctxWithProvisionerPermissions(testutil.Context(t, testutil.WaitShort))
require.Eventually(t, func() bool {
daemons, err := db.GetProvisionerDaemons(ctx)
return err == nil && len(daemons) > 0
}, testutil.WaitShort, testutil.IntervalFast)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this function necessary? In all of your tests there is an admin client there also, so you could just use the WithClient variant of this function instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have much insight into why, but yes the tests which are currently using MustWaitForAnyProvisioner fail if they're changed to MustWaitForAnyProvisionerWithClient. I can do a bit more digging on that today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to change things so that we just have MustWaitForAnyProvisioner, but we could have gone either way.

The test which was using MustWaitForAnyProvisioner was calling CreateFirstUser after the MustWaitForAnyProvisioner call, so the client did not have an admin session token yet and would 401. The test which was calling MustWaitForAnyProvisionerWithClient was just calling corderdtest.New instead of coderdtest.NewWithDatabase.

Comment on lines +649 to +652
p, err := coderdtest.GetProvisionerForTags(db, time.Now(), ws.OrganizationID, nil)
require.NoError(t, err)
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime)
ticker <- tickTime
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also related to the comment from my previous review, is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, for the same reason as above.

cstyan added 2 commits August 12, 2025 17:30
new test

Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Automatic builds should not be created if there are no provisioners with matching tags
2 participants