feat: add lifecycle.Executor to manage autostart and autostop #1183


Merged
johnstcn merged 28 commits into main from cj/gh-271/sched-autostart on May 11, 2022

Conversation

@johnstcn (Member) commented Apr 26, 2022

What:

This PR adds a package lifecycle and an Executor implementation that attempts to schedule a build of workspaces with autostart configured.

  • lifecycle.Executor takes a chan time.Time in its constructor (e.g. time.Tick(time.Minute))
  • Whenever a value is received from this channel, it runs one iteration over the workspaces, triggering lifecycle operations where due (see the sketch after this list).
  • When the context passed to the executor is Done, it exits.
  • Only workspaces that meet the following criteria will have a lifecycle operation applied to them:
    • Workspace has a valid, non-empty autostart or autostop schedule (either one is sufficient)
    • Workspace's last build was successful
  • The following transitions will be applied depending on the current workspace state:
    • If the workspace is currently running, it will be stopped.
    • If the workspace is currently stopped, it will be started.
    • Otherwise, nothing will be done.
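
A minimal sketch of that loop, with hypothetical field names and plain standard-library logging (the real implementation lives in coderd/autobuild/executor and also takes a database.Store):

```go
package executor

import (
	"context"
	"log"
	"time"
)

// Executor is a simplified sketch of the autostart/autostop loop described
// above; the real implementation also holds a database.Store.
type Executor struct {
	ctx  context.Context
	log  *log.Logger
	tick <-chan time.Time
}

// New takes the tick channel in its constructor, e.g. time.Tick(time.Minute).
func New(ctx context.Context, logger *log.Logger, tick <-chan time.Time) *Executor {
	return &Executor{ctx: ctx, log: logger, tick: tick}
}

// Run loops until ctx is done, running one iteration per received tick.
func (e *Executor) Run() {
	for {
		select {
		case <-e.ctx.Done():
			return
		case t := <-e.tick:
			if err := e.runOnce(t); err != nil {
				e.log.Printf("lifecycle iteration failed: %v", err)
			}
		}
	}
}

// runOnce is a placeholder for the real logic: query workspaces with a valid
// autostart/autostop schedule and a successful last build, then enqueue the
// appropriate start or stop transition.
func (e *Executor) runOnce(t time.Time) error {
	_ = t
	return nil
}
```

Per the discussion further down, the real per-tick work is wrapped in a single database transaction.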

What was done:

  • Implement autostop logic

  • Add unit tests to verify that workspaces in liminal states and/or soft-deleted workspaces do not get touched by this.

  • Exclude workspaces without autostart or autostop enabled in the initial query

  • Move provisionerJobStatus literally anywhere else.

  • Wire up the AutostartExecutor to the real coderd

  • Renamed the autostart/lifecycle package to lifecycle/executor because that name was bugging me.

Possible improvements / "Future Work":

  • Make Test_Executor_Run/AlreadyRunning not wait ten seconds somehow:
    • This requires some plumbing -- we'd need some way of hooking more directly into provisionerd, which doesn't seem easy right now.
  • DRY the workspace build triggering logic as it's essentially copy-pasted from the workspaces handler:
    • Leaving this for now as we only use this in two places. Once we need to do this elsewhere, we can revisit.
  • Add a unit test for racy dueling coderd instances:
    • Leaving this for now, as doing our periodic checks in a single transaction should be sufficient to ensure consistency.

Closes #271.

@johnstcn johnstcn self-assigned this Apr 26, 2022

codecov bot commented Apr 26, 2022

Codecov Report

Merging #1183 (7627372) into main (2df92e6) will decrease coverage by 0.01%.
The diff coverage is 74.10%.

```
@@            Coverage Diff             @@
##             main    #1183      +/-   ##
==========================================
- Coverage   67.08%   67.06%   -0.02%     
==========================================
  Files         288      288              
  Lines       18773    19079     +306     
  Branches      241      241              
==========================================
+ Hits        12593    12795     +202     
- Misses       4906     4979      +73     
- Partials     1274     1305      +31     
```
| Flag | Coverage Δ |
| --- | --- |
| unittest-go-macos-latest | 54.21% <65.62%> (+0.10%) ⬆️ |
| unittest-go-postgres- | 65.54% <71.42%> (-0.02%) ⬇️ |
| unittest-go-ubuntu-latest | 56.64% <65.62%> (+0.22%) ⬆️ |
| unittest-go-windows-2022 | 52.63% <65.62%> (+0.19%) ⬆️ |
| unittest-js | 74.24% <ø> (ø) |

| Impacted Files | Coverage Δ |
| --- | --- |
| cli/autostart.go | 75.29% <ø> (ø) |
| cli/autostop.go | 75.00% <ø> (ø) |
| coderd/autobuild/schedule/schedule.go | 90.90% <ø> (ø) |
| coderd/workspaces.go | 58.51% <ø> (ø) |
| coderd/database/queries.sql.go | 77.78% <61.29%> (-0.13%) ⬇️ |
| coderd/autobuild/executor/lifecycle_executor.go | 71.60% <71.60%> (ø) |
| cli/server.go | 58.27% <100.00%> (+0.78%) ⬆️ |
| coderd/coderdtest/coderdtest.go | 98.93% <100.00%> (+0.07%) ⬆️ |
| coderd/audit/backends/postgres.go | 38.46% <0.00%> (-30.77%) ⬇️ |
| codersdk/provisionerdaemons.go | 61.97% <0.00%> (-5.64%) ⬇️ |
| ... and 32 more | |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2df92e6...7627372.

@kylecarbs (Member) left a comment:

How do we intend on making this distributed? eg. if I have 5 coderd replicas, what stops them from queueing multiple workspace builds at the same time?

@johnstcn (Member, Author) replied:

> How do we intend on making this distributed? eg. if I have 5 coderd replicas, what stops them from queueing multiple workspace builds at the same time?

Kyle and I discussed briefly in Slack -- essentially the "best" way to do this would be to set up a distributed leader lock and only have the 'leader' coderd run the executor.

But with the current implementation that (ab?)uses one big transaction in which to perform all the autostarts, there shouldn't be any serious issues at small N.

A further optimization would essentially involve consolidating the entire logic of creating a new workspace build with the same parameters as the last build into a single query. This would be a big scary query though, so the complexity likely isn't worth it right now.

Probably the best approach right now is to add a unit test that ensures that two coderd instances running concurrently against the same database don't end up causing problems here.

@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch from a0a10f5 to 011ea51 on April 26, 2022 21:44
@Emyrk (Member) commented Apr 27, 2022

> Kyle and I discussed briefly in Slack -- essentially the "best" way to do this would be to set up a distributed leader lock and only have the 'leader' coderd run the executor.

Just want to mention proper fault tolerant leadership stuff is typically solved with consensus algos like Raft or Paxos. Probably 100% overkill, but if we decide to have a "Leader" coderd, you do have to handle leaders failing, being slow, and being elected out. It's a fun problem that gets complicated super quick if you 100% cannot tolerate 2 concurrent leaders.

If you just do a pg lock and first come first serve, that's going to work fine and let pg determine who the "leader" is.

@johnstcn (Member, Author) replied:

> Just want to mention proper fault tolerant leadership stuff is typically solved with consensus algos like Raft or Paxos. Probably 100% overkill, but if we decide to have a "Leader" coderd, you do have to handle leaders failing, being slow, and being elected out. It's a fun problem that gets complicated super quick if you 100% cannot tolerate 2 concurrent leaders.

Yep, absolutely. Also, those consensus algos typically are used when there are multiple independent instances with separate persistent storage. Since all coderd replicas will be using the same database, just using a PG leader lock makes sense in this case.
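
For illustration only, such a PG leader lock could be a session-level advisory lock. The sketch below is hypothetical — the lock key, connection handling, and the helper itself are not part of this PR:

```go
package leaderlock

import (
	"context"
	"database/sql"
	"fmt"

	// Postgres driver registration; any driver would do for this sketch.
	_ "github.com/lib/pq"
)

// autobuildLockKey is an arbitrary, application-chosen advisory lock key.
const autobuildLockKey = 271

// tryAcquireLeaderLock attempts a non-blocking, session-level advisory lock.
// The replica holding the lock would run the autobuild loop; the lock is
// released when the underlying connection closes (or via pg_advisory_unlock).
func tryAcquireLeaderLock(ctx context.Context, conn *sql.Conn) (bool, error) {
	var acquired bool
	err := conn.QueryRowContext(ctx,
		`SELECT pg_try_advisory_lock($1)`, autobuildLockKey,
	).Scan(&acquired)
	if err != nil {
		return false, fmt.Errorf("acquire advisory lock: %w", err)
	}
	return acquired, nil
}
```

Because the lock is tied to the database session, the connection must stay open for as long as that replica considers itself the leader.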

@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch 4 times, most recently from a8a50a4 to db03c74 on April 30, 2022 15:13
@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch 2 times, most recently from def317c to ed18e04 on May 10, 2022 07:28
@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch from b39fde2 to 364a27c on May 10, 2022 20:50
@johnstcn johnstcn marked this pull request as ready for review May 10, 2022 20:51
@johnstcn johnstcn requested a review from a team as a code owner May 10, 2022 20:51
@johnstcn johnstcn requested a review from a team May 10, 2022 20:51
@mafredri (Member) left a comment:

Some questions and minor suggestions, but overall a really clean PR!

```go
}

func mustProvisionWorkspace(t *testing.T, client *codersdk.Client) codersdk.Workspace {
	t.Helper()
```

Comment (Member):
Nice use of helper functions, kept the tests focused on the relevant parts! 😎

@kylecarbs (Member) left a comment:

LGTM! Just a comment on the package naming, as I'd hate for this beautiful package to be eventually bloated due to naming conflicts! ↔️↔️↔️


```go
var validTransition database.WorkspaceTransition
var sched *schedule.Schedule
switch latestBuild.Transition {
```

Comment (Member):
Because builds are idempotent, we shouldn't need to switch off the last transition. We should be able to just start or stop.

Reply (Member, Author):
Even so, it's probably no harm for this thing to make a best-effort at being idempotent as well.

Comment (Member):
Ahh yes, but I mean going off the last transition could lead to unexpected results if I'm reading correctly.

eg. I'm in the stopped state but the workspace was updated, now my autostop wouldn't trigger to update my workspace. I'm not sure if this is a problem, but the behavior could be odd to customers.

Reply (Member, Author):
So I wrote a test to see what happens right now in this state:

  • Given: we have a stopped workspace with autostart enabled
  • Also given: the TemplateVersion of the Template used by the workspace changes

In this case, we do trigger an autostart, but with the last template version used by the workspace.

As this is an automated process, I think this makes sense and we would want the user to manually update the template version if the workspace is outdated. In future, we may want to allow users to opt-in to automatically updating workspaces to the latest version.
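
To make the decision being discussed in this thread concrete, here is a self-contained sketch of it; the type and field names below are stand-ins for the coderd database and schedule packages, not the PR's actual code:

```go
package autobuild

// WorkspaceTransition mirrors the database enum used by the executor.
type WorkspaceTransition string

const (
	TransitionStart WorkspaceTransition = "start"
	TransitionStop  WorkspaceTransition = "stop"
)

// workspace holds only the schedule fields relevant to this sketch.
type workspace struct {
	AutostartSchedule string
	AutostopSchedule  string
}

// nextTransition switches off the last successful build's transition:
// running workspaces are candidates for autostop, stopped workspaces for
// autostart, and anything else (deletes, liminal states) is skipped.
func nextTransition(lastBuild WorkspaceTransition, ws workspace) (WorkspaceTransition, string, bool) {
	switch lastBuild {
	case TransitionStart:
		if ws.AutostopSchedule == "" {
			return "", "", false
		}
		return TransitionStop, ws.AutostopSchedule, true
	case TransitionStop:
		if ws.AutostartSchedule == "" {
			return "", "", false
		}
		return TransitionStart, ws.AutostartSchedule, true
	default:
		return "", "", false
	}
}
```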

```go
})
}

// TODO(cian): this function duplicates most of api.postWorkspaceBuilds. Refactor.
```

Comment (Member):
I wouldn't be opposed to exposing a function like: database.CreateWorkspaceBuild. I'm not a massive fan either, because it certainly conflicts with database.InsertWorkspaceBuild. 🤷

Reply (Member, Author):
I'm holding off on extracting a function here because we only have two data points right now; I'd like to see at least one more before doing so.

Comment (Member):
I like that!

Comment (Contributor):
suggestion(if-minor): Just one suggestion: ticket > todo. Entirely up to you, won't hold up on that.

Reply (Member, Author):
@vapurrmaid good call: #1401

```go
	"github.com/stretchr/testify/require"
)

func Test_Executor_Autostart_OK(t *testing.T) {
```

Comment (Member):
We haven't used the _ idiom in other places in the codebase. I'm impartial as to whether it's good or not, but we could remove them for consistency in this case!

```
@@ -0,0 +1,219 @@
package executor
```

Comment (Member):
What do you think about calling the lifecycle package autobuild or cronbuild? I'm concerned about calling it lifecycle, since that term could be interpreted very broadly.

Reply (Member, Author):
autobuild is better and does what it says on the tin.

@greyscaled (Contributor), May 11, 2022:
comment(in-support): autobuild seems intuitive/useful to me.

```go
	slogtest.Make(t, nil).Named("lifecycle.executor").Leveled(slog.LevelDebug),
	options.LifecycleTicker,
)
go lifecycleExecutor.Run()
```

Comment (Member):
Since this Run function is always called in a goroutine, could we do that automatically for the caller in New()?
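
A generic sketch of that suggestion (not this PR's code): the constructor owns the goroutine and hands back a stop function, so callers cannot forget the go statement:

```go
package autobuild

import (
	"context"
	"time"
)

// StartLoop runs runOnce on every tick until the returned stop function is
// called or the parent context is cancelled. Starting the goroutine here
// means callers never need to write `go executor.Run()` themselves.
func StartLoop(ctx context.Context, tick <-chan time.Time, runOnce func(time.Time)) (stop func()) {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case t := <-tick:
				runOnce(t)
			}
		}
	}()
	return cancel
}
```

The trade-off is that callers (and tests) lose the ability to drive the loop synchronously, which may be why an explicit Run() is kept here.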

```go
}

// TODO(cian): this function duplicates most of api.postWorkspaceBuilds. Refactor.
func doBuild(ctx context.Context, store database.Store, workspace database.Workspace, trans database.WorkspaceTransition, priorHistory database.WorkspaceBuild, priorJob database.ProvisionerJob) error {
```

Comment (Member):
smallest nit possible: we could drop the `do` here

@mafredri (Member) left a comment:
LGTM!

@greyscaled (Contributor) left a comment:
well done ✔️

cli/server.go (outdated):

```go
lifecyclePoller := time.NewTicker(time.Minute)
defer lifecyclePoller.Stop()
lifecycleExecutor := executor.New(cmd.Context(), options.Database, logger, lifecyclePoller.C)
go lifecycleExecutor.Run()
```

Comment (Contributor):
Comment(general): Your code looks great. I'm feeling a bit lost in understanding all that's happening in this "Start a Coder server" RunE because it's growing quite large with some branching and Goroutines. I'm wondering if that is just a me thing, or if maybe the complexity here is growing large. To understand what I mean, this diff in isolation looks harmless. In the grand scheme of initializing everything within this RunE, though, it's harder to really evaluate.

I'm under the impression we could extract out some initialization steps to named functions and call them in a pipeline, but maybe that's just a me thing and not the general preferred approach for Go + Cobra. Do you have any comments to that effect?

Regardless, this comment is not meant for your code/this PR, just a general observation.

Reply (Member, Author):
Agreed, I'd be in favour of extracting some initialization logic to make things slightly more concise; I'd just want to be careful to avoid too many layers of indirection.
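
A hedged sketch of the kind of extraction being discussed. The helper name, the cleanup-function shape, and the exact parameter types are assumptions inferred from the diff and file list above, not code from this PR:

```go
package cli

import (
	"context"
	"time"

	"cdr.dev/slog"

	"github.com/coder/coder/coderd/autobuild/executor"
	"github.com/coder/coder/coderd/database"
)

// startAutobuildExecutor wraps the wiring shown in the diff above so the
// server's RunE only calls one named helper. The returned cleanup stops the
// ticker; the executor itself exits when ctx is cancelled.
func startAutobuildExecutor(ctx context.Context, db database.Store, logger slog.Logger) (cleanup func()) {
	ticker := time.NewTicker(time.Minute)
	exec := executor.New(ctx, db, logger, ticker.C)
	go exec.Run()
	return ticker.Stop
}
```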



```
@@ -17,3 +17,5 @@ coverage/
out/
storybook-static/
test-results/

**/*.swp
```

Comment (Contributor):
Praise: Thank you for thinking of these files. We should ensure we add a matching comment to each of these 3 files reminding us to consider adding entries in all 3. It's easy to miss/forget.

@johnstcn johnstcn merged commit f4da5d4 into main May 11, 2022
@johnstcn johnstcn deleted the cj/gh-271/sched-autostart branch May 11, 2022 22:03
@misskniss misskniss added this to the V2 Beta milestone May 15, 2022
kylecarbs pushed a commit that referenced this pull request Jun 10, 2022
This PR adds a package lifecycle and an Executor implementation that attempts to schedule a build of workspaces with autostart configured.

- lifecycle.Executor takes a chan time.Time in its constructor (e.g. time.Tick(time.Minute))
- Whenever a value is received from this channel, it executes one iteration of looping through the workspaces and triggering lifecycle operations.
- When the context passed to the executor is Done, it exits.
- Only workspaces that meet the following criteria will have a lifecycle operation applied to them:
  - Workspace has a valid and non-empty autostart or autostop schedule (either)
  - Workspace's last build was successful
- The following transitions will be applied depending on the current workspace state:
  - If the workspace is currently running, it will be stopped.
  - If the workspace is currently stopped, it will be started.
  - Otherwise, nothing will be done.
- Workspace builds will be created with the same parameters and the same template version as the last successful build
Successfully merging this pull request may close these issues: General use scheduler for jobs like Auto ON/OFF (#271)