feat: add lifecycle.Executor to manage autostart and autostop #1183
Conversation
Codecov Report
@@ Coverage Diff @@
## main #1183 +/- ##
==========================================
- Coverage 67.08% 67.06% -0.02%
==========================================
Files 288 288
Lines 18773 19079 +306
Branches 241 241
==========================================
+ Hits 12593 12795 +202
- Misses 4906 4979 +73
- Partials 1274 1305 +31
Continue to review full report at Codecov.
How do we intend on making this distributed? eg. if I have 5 coderd replicas, what stops them from queueing multiple workspace builds at the same time?
Kyle and I discussed briefly in Slack -- essentially the "best" way to do this would be to set up a distributed leader lock and only have the 'leader' coderd run the lifecycle executor. But with the current implementation, which (ab?)uses one big transaction in which to perform all the autostarts, there shouldn't be any serious issues at small N. A further optimization would be to consolidate the entire logic of creating a new workspace build with the same parameters as the last build into a single query. That would be a big, scary query though, so the complexity likely isn't worth it right now. Probably the best approach right now is to add a unit test that ensures that two coderd instances running concurrently against the same database don't end up causing problems here.
Just want to mention that proper fault-tolerant leadership is typically solved with consensus algos like Raft or Paxos. Probably 100% overkill, but if we decide to have a "Leader" coderd, you do have to handle leaders failing, being slow, and being elected out. It's a fun problem that gets complicated super quickly if you 100% cannot tolerate 2 concurrent leaders. If you just do a pg lock and first-come-first-serve, that's going to work fine and lets pg determine who the "leader" is.
Yep, absolutely. Also, those consensus algos are typically used when there are multiple independent instances with separate persistent storage. Since all coderd replicas will be using the same database, just using a PG leader lock makes sense in this case.
Some questions and minor suggestions, but overall a really clean PR!
func mustProvisionWorkspace(t *testing.T, client *codersdk.Client) codersdk.Workspace {
	t.Helper()
Nice use of helper functions, kept the tests focused on the relevant parts! 😎
LGTM! Just a comment on the package naming, as I'd hate for this beautiful package to be eventually bloated due to naming conflicts!
var validTransition database.WorkspaceTransition
var sched *schedule.Schedule
switch latestBuild.Transition {
Because builds are idempotent, we shouldn't need to switch off the last transition. We should be able to just start or stop.
Even so, there's probably no harm in this thing making a best effort at being idempotent as well.
Ahh yes, but I mean going off the last transition could lead to unexpected results, if I'm reading correctly.
E.g. I'm in the stopped state but the workspace was updated; now my autostop wouldn't trigger to update my workspace. I'm not sure if this is a problem, but the behavior could be odd to customers.
So I wrote a test to see what happens right now in this state:
- Given: we have a stopped workspace with autostart enabled
- Also given: the TemplateVersion of the Template used by the workspace changes
In this case, we do trigger an autostart, but with the last template version used by the workspace.
As this is an automated process, I think this makes sense and we would want the user to manually update the template version if the workspace is outdated. In future, we may want to allow users to opt-in to automatically updating workspaces to the latest version.
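For concreteness, the "going off the last transition" behavior under discussion can be sketched with local stand-in types. These are illustrative, not coderd's actual database package:

```go
package main

import "fmt"

// Transition mirrors the start/stop values discussed above.
type Transition string

const (
	TransitionStart  Transition = "start"
	TransitionStop   Transition = "stop"
	TransitionDelete Transition = "delete"
)

// nextTransition derives the lifecycle action from the last build's
// transition: stop a running workspace, start a stopped one, and
// leave anything else (e.g. a deleted workspace) untouched.
func nextTransition(last Transition) (Transition, bool) {
	switch last {
	case TransitionStart:
		return TransitionStop, true
	case TransitionStop:
		return TransitionStart, true
	default:
		return "", false
	}
}

func main() {
	next, ok := nextTransition(TransitionStop)
	fmt.Println(next, ok)
}
```

The review point is visible in the default branch: any last transition other than start/stop means no action, which is also why a deleted workspace is never rebuilt.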
// TODO(cian): this function duplicates most of api.postWorkspaceBuilds. Refactor. |
I wouldn't be opposed to exposing a function like database.CreateWorkspaceBuild. I'm not a massive fan either, because it certainly conflicts with database.InsertWorkspaceBuild. 🤷
I'm holding off on extracting a function here because we only have two data points right now; I'd like to see at least one more before doing so.
I like that!
suggestion(if-minor): Just one suggestion: ticket > todo. Entirely up to you, won't hold up on that.
@vapurrmaid good call: #1401
	"github.com/stretchr/testify/require"
)

func Test_Executor_Autostart_OK(t *testing.T) {
We haven't used the _ idiom in other places in the codebase. I'm impartial as to whether it's good or not, but we could remove them for consistency in this case!
@@ -0,0 +1,219 @@
package executor |
What do you think about calling the lifecycle package autobuild or cronbuild? I'm concerned about calling it lifecycle, since that term could be interpreted very broadly.
autobuild is better and does what it says on the tin.
comment(in-support): autobuild seems intuitive/useful to me.
coderd/coderdtest/coderdtest.go
	slogtest.Make(t, nil).Named("lifecycle.executor").Leveled(slog.LevelDebug),
	options.LifecycleTicker,
)
go lifecycleExecutor.Run()
Since this Run function is always called in a goroutine, could we do that automatically for the caller in New()?
}

// TODO(cian): this function duplicates most of api.postWorkspaceBuilds. Refactor.
func doBuild(ctx context.Context, store database.Store, workspace database.Workspace, trans database.WorkspaceTransition, priorHistory database.WorkspaceBuild, priorJob database.ProvisionerJob) error {
smallest nit possible: We could drop the do here.
LGTM!
well done ✔️
cli/server.go
lifecyclePoller := time.NewTicker(time.Minute)
defer lifecyclePoller.Stop()
lifecycleExecutor := executor.New(cmd.Context(), options.Database, logger, lifecyclePoller.C)
go lifecycleExecutor.Run()
Comment(general): Your code looks great. I'm feeling a bit lost in understanding all that's happening in this "Start a Coder server" RunE, because it's growing quite large with some branching and goroutines. I'm wondering if that is just a me thing, or if the complexity here is growing large. To understand what I mean: this diff in isolation looks harmless, but in the grand scheme of initializing everything within this RunE, it's harder to really evaluate.
I'm under the impression we could extract some initialization steps to named functions and call them in a pipeline, but maybe that's just a me thing and not the generally preferred approach for Go + Cobra. Do you have any comments to that effect?
Regardless, this comment is not meant for your code/this PR, just a general observation.
Agreed, I'd be in favour of extracting some initialization logic to make things slightly more concise, just would want to be careful to avoid too many layers of indirection.
@@ -17,3 +17,5 @@ coverage/
out/
storybook-static/
test-results/

**/*.swp
Praise: Thank you for thinking of these files. We should add a matching comment to each of the three ignore files reminding us to consider adding entries in all three. It's easy to miss/forget.
What:

This PR adds a package lifecycle and an Executor implementation that attempts to schedule a build of workspaces with autostart configured.
- lifecycle.Executor takes a chan time.Time in its constructor (e.g. time.Tick(time.Minute)).
- Whenever a value is received from this channel, it executes one iteration of looping through the workspaces and triggering lifecycle operations.
- When the context passed to the executor is Done, it exits.
- Only workspaces that meet the following criteria will have a lifecycle operation applied to them:
  - The workspace has a valid, non-empty autostart or autostop schedule (either one).
  - The workspace's last build was successful.
- The following transitions will be applied depending on the current workspace state:
  - If the workspace is currently running, it will be stopped.
  - If the workspace is currently stopped, it will be started.
  - Otherwise, nothing will be done.
- Workspace builds will be created with the same parameters and template version as the last successful build.

What was done:
- Implemented autostop logic.
- Added unit tests to verify that workspaces in liminal states and/or soft-deleted workspaces do not get touched.
- Excluded workspaces without autostart or autostop enabled in the initial query.
- Moved provisionerJobStatus literally anywhere else.
- Wired up the AutostartExecutor to the real coderd.
- Renamed the autostart/lifecycle package to lifecycle/executor because that name was bugging me.

Possible improvements / "Future Work":
Closes #271.
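The eligibility rules in the description above can be condensed into a small predicate. This is a hypothetical restatement with stand-in types, not the actual coderd query or models:

```go
package main

import "fmt"

// Build carries the status of a workspace's most recent build.
type Build struct {
	JobStatus string // "succeeded", "running", "failed", ...
}

// Workspace holds just the fields the eligibility rules need.
type Workspace struct {
	AutostartSchedule string // cron-like spec; empty means disabled
	AutostopSchedule  string
	Deleted           bool
	LatestBuild       Build
}

// eligible reports whether the executor should consider a workspace
// on a tick: it must have some schedule, must not be soft-deleted,
// and its last build must have succeeded (skipping liminal states).
func eligible(w Workspace) bool {
	if w.Deleted {
		return false
	}
	if w.AutostartSchedule == "" && w.AutostopSchedule == "" {
		return false
	}
	return w.LatestBuild.JobStatus == "succeeded"
}

func main() {
	w := Workspace{
		AutostartSchedule: "CRON_TZ=UTC 0 9 * * 1-5",
		LatestBuild:       Build{JobStatus: "succeeded"},
	}
	fmt.Println(eligible(w))
}
```

Filtering on schedule presence in the initial query (as the checklist notes) keeps the per-tick loop from scanning workspaces that could never qualify.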