-
Notifications
You must be signed in to change notification settings - Fork 961
fix: don't create autostart workspace builds with no available provisioners #19067
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
} | ||
|
||
// Check if any provisioners are active (not stale) | ||
now := dbtime.Now() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be using the passed in time from runOnce
rather than getting it's own time
if testing.Testing() { | ||
staleInterval = TestingStaleInterval | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use the passed in time from runOnce
I don't think changing the interval would be necessary, you could write the test in a way that the default interval works fine
@@ -52,6 +56,9 @@ type Executor struct { | |||
experiments codersdk.Experiments | |||
|
|||
metrics executorMetrics | |||
|
|||
// Skip provisioner availability check (should only be true for *some* tests) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It says it should only be true for some tests, but it seems it's true for almost every test using coderdtest
. How many tests are actually broken by this change? If it's only a few, I'd rather we just fix those tests (e.g. by adding provisioners) rather than add a skip flag.
provisioner is not available Signed-off-by: Callum Styan <callumstyan@gmail.com>
transition so that we don't accumulate autostart jobs that will eventually have to be cleaned up by the reaper routine Signed-off-by: Callum Styan <callumstyan@gmail.com>
DB/provisioner to be available, so add a way to skip the check Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
…ndling This commit addresses the requirements to: 1. Comment out provisionerCloser.Close() in TestExecutorAutostartSkipsWhenNoProvisionersAvailable 2. Remove the need for SkipProvisionerCheck by refactoring tests Changes: - Remove SkipProvisionerCheck field and logic from Executor - Remove SkipProvisionerCheck option from coderdtest.Options - Add proper provisioner daemon tags to match template requirements - Add helper functions for waiting for provisioner availability - Update tests to use real provisioner availability checks - Fix timing issues in provisioner staleness tests - Comment out provisioner close call to test proper behavior The test now correctly validates that provisioners remain available when not explicitly closed, and uses the real hasAvailableProvisioners logic instead of bypassing it. Signed-off-by: Callum Styan <callumstyan@gmail.com>
9971794
to
78a8f5e
Compare
autobuild path have a valid provisioner available and pass an appropriate time.Time to the channel for triggering the build based on whether the provisioner is supposed to be valid or stale Signed-off-by: Callum Styan <callumstyan@gmail.com>
Okay @deansheather thanks for your patience, I think this is finally ready for another review. Took quite a bit of digging with the help of Blink to find all the bits and pieces that had to be updated to get the tests working without the The TL; DR of the changes is that we in tests which deal with autobuilds we need:
|
enterprise/coderd/workspaces_test.go
Outdated
ticker <- build.Job.CompletedAt.Add(failureTTL * 2) | ||
tickTime := build.Job.CompletedAt.Add(failureTTL * 2) | ||
|
||
p, err := coderdtest.GetProvisionerForWorkspace(t, db, time.Now(), ws.OrganizationID, database.ProvisionerJob{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
provisionerdserver.StaleInterval
is 90s, so is this necessary? Most tests have a timeout of <30s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what this comment was referring to honestly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a lot of the autobuild tests, you query the provisionerd and force update it to be current with coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime)
right before it ticks, even if the test isn't doing anything funky with the timings. Since the stale interval is 90s, I don't think most tests need this change to be made to them no? Unless there's another reason, like our fake provisioner daemons not setting times correctly?
TestExecutorAutostartSkipsWhenNoProvisionersAvailable Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
any before that Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
p, err := coderdtest.GetProvisionerForTags(db, time.Now(), workspace.OrganizationID, map[string]string{}) | ||
require.NoError(t, err) | ||
// When: the autobuild executor ticks after the scheduled time | ||
go func() { | ||
tickCh <- sched.Next(workspace.LatestBuild.CreatedAt) | ||
tickTime := sched.Next(workspace.LatestBuild.CreatedAt) | ||
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime) | ||
tickCh <- tickTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
W.R.T the last comment from the previous review, this is what I'm referring to. Is this change (and other similar changes) necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, AFAICT it is. Otherwise the test is flakey, it would only pass if you run the test within StaleInterval
seconds of the top of any hour.
When one of these tests starts, the provisioner has a LastSeenAt
time of time.Now()
(sometimes + 1 second or so), but sched.Next(workspace.LatestBuild.CreatedAt)
will return a value that is the top of the hour.
For example: right now if I comment out the UpdateProvisionerLastSeenAt
line here, in hasValidProvisioner
the pd.LastSeenAt
time is 2025-08-12 16:34:02.908042 +0000 UTC
but t
from the tickCh
is 2025-08-12 17:00:00 +0000 UTC
. So in a real running deployment of Coder that provisioner would be considered stale.
This is the exact thing we were trying to solve for with the addition of hasValidProvisioner
. So any test case which utilized the autobuild functionality but needed to continue passing needed to update the LastSeenAt
value. In the test I have added, which should see the provisioner as stale and then assert that there were no autobuild transitions initiated, we ensure the LastSeenAt
time is far enough in the past that the provisioner is stale.
// Ensure the provisioner is stale | ||
staleTime := sched.Next(workspace.LatestBuild.CreatedAt).Add((-1 * provisionerdserver.StaleInterval) + -10*time.Second) | ||
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, staleTime) | ||
|
||
// Trigger autobuild | ||
tickCh <- sched.Next(workspace.LatestBuild.CreatedAt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is racey since the provisioner daemon is still running and can update it's stats. It'd be less prone to racing if the provisioner daemon was closed.
db, pubsub := dbtestutil.New(t)
client, _, api := coderdtest.NewWithAPI(t, &coderdtest.Options{
Database: db,
Pubsub: pubsub,
IncludeProvisionerDaemon: false,
})
daemon1Closer := coderdtest.NewTaggedProvisionerDaemon(t, api, "name", tags)
t.Cleanup(func() {
_ = daemon1Closer.Close()
})
// create the workspace
daemon1Closer.Close() // <-- this is the important line, prevents the race I described
next := sched.Next(workspace.LatestBuild.CreatedAt)
staleTime := next.Add(-1 * provisionerdserver.StaleInterval - 10 *time.Second)
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, staleTime)
tickCh <- next
// Assertions...
// Create a new provisioner daemon with the same tags like you just did before, then ensure it works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made these changes, lmk if this test still looks off to you
func MustWaitForAnyProvisioner(t *testing.T, db database.Store) { | ||
t.Helper() | ||
ctx := ctxWithProvisionerPermissions(testutil.Context(t, testutil.WaitShort)) | ||
require.Eventually(t, func() bool { | ||
daemons, err := db.GetProvisionerDaemons(ctx) | ||
return err == nil && len(daemons) > 0 | ||
}, testutil.WaitShort, testutil.IntervalFast) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this function necessary? In all of your tests there is an admin client there also, so you could just use the WithClient variant of this function instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have much insight into why, but yes the tests which are currently using MustWaitForAnyProvisioner
fail if they're changed to MustWaitForAnyProvisionerWithClient
. I can do a bit more digging on that today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I decided to change things so that we just have MustWaitForAnyProvisioner
, but we could have gone either way.
The test which was using MustWaitForAnyProvisioner
was calling CreateFirstUser
after the MustWaitForAnyProvisioner
call, so the client did not have an admin session token yet and would 401. The test which was calling MustWaitForAnyProvisionerWithClient
was just calling corderdtest.New
instead of coderdtest.NewWithDatabase
.
p, err := coderdtest.GetProvisionerForTags(db, time.Now(), ws.OrganizationID, nil) | ||
require.NoError(t, err) | ||
coderdtest.UpdateProvisionerLastSeenAt(t, db, p.ID, tickTime) | ||
ticker <- tickTime |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also related to the comment from my previous review, is this necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for the same reason as above.
new test Signed-off-by: Callum Styan <callumstyan@gmail.com>
Signed-off-by: Callum Styan <callumstyan@gmail.com>
This should fix #17941
With just the test addition the test will fail since the workspace build is triggered, we attempt to move the workspace into the
start
state:require.Equal(t, codersdk.WorkspaceTransitionStop, workspace.LatestBuild.Transition)
With the 2nd commit we skip the calls to
wsbuilder.New
andbuilder.Build
and so the transition is never scheduled.However, many tests don't fully setup provisioner or wait for the provisioner + DB to be fully available, so the third commit adds a somewhat smelly way to skip the provisioner check in tests that don't need to test that functionality.