feat: add lifecycle.Executor to manage autostart and autostop #1183


Merged
johnstcn merged 28 commits into main from cj/gh-271/sched-autostart on May 11, 2022

Conversation

@johnstcn (Member) commented Apr 26, 2022

What:

This PR adds a package lifecycle and an Executor implementation that attempts to schedule a build of workspaces with autostart configured.

  • lifecycle.Executor takes a chan time.Time in its constructor (e.g. time.Tick(time.Minute))
  • Whenever a value is received from this channel, it runs one iteration over the workspaces, triggering lifecycle operations where due (see the sketch after this list).
  • When the context passed to the executor is Done, it exits.
  • Only workspaces that meet the following criteria will have a lifecycle operation applied to them:
    • Workspace has a valid, non-empty autostart or autostop schedule (either one is sufficient)
    • Workspace's last build was successful
  • The following transitions will be applied depending on the current workspace state:
    • If the workspace is currently running, it will be stopped.
    • If the workspace is currently stopped, it will be started.
    • Otherwise, nothing will be done.
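
A minimal sketch of that loop, with hypothetical field names and plain standard-library logging (the real implementation lives in coderd/autobuild/executor and also takes a database.Store):

```go
package executor

import (
	"context"
	"log"
	"time"
)

// Executor is a simplified sketch of the autostart/autostop loop described
// above; the real implementation also holds a database.Store.
type Executor struct {
	ctx  context.Context
	log  *log.Logger
	tick <-chan time.Time
}

// New takes the tick channel in its constructor, e.g. time.Tick(time.Minute).
func New(ctx context.Context, logger *log.Logger, tick <-chan time.Time) *Executor {
	return &Executor{ctx: ctx, log: logger, tick: tick}
}

// Run loops until ctx is done, running one iteration per received tick.
func (e *Executor) Run() {
	for {
		select {
		case <-e.ctx.Done():
			return
		case t := <-e.tick:
			if err := e.runOnce(t); err != nil {
				e.log.Printf("lifecycle iteration failed: %v", err)
			}
		}
	}
}

// runOnce is a placeholder for the real logic: query workspaces with a valid
// autostart/autostop schedule and a successful last build, then enqueue the
// appropriate start or stop transition.
func (e *Executor) runOnce(t time.Time) error {
	_ = t
	return nil
}
```

Per the discussion further down, the real per-tick work is wrapped in a single database transaction.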

What was done:

  • Implement autostop logic

  • Add unit tests to verify that workspaces in liminal states and/or soft-deleted workspaces do not get touched by this.

  • Exclude workspaces without autostart or autostop enabled in the initial query

  • Move provisionerJobStatus literally anywhere else.

  • Wire up the AutostartExecutor to the real coderd

  • Renamed the autostart/lifecycle package to lifecycle/executor because that name was bugging me.

Possible improvements / "Future Work":

  • Make Test_Executor_Run/AlreadyRunning not wait ten seconds somehow:
    • This requires some plumbing -- we'd need some way of hooking more directly into provisionerd, which doesn't seem easy right now.
  • DRY the workspace build triggering logic as it's essentially copy-pasted from the workspaces handler:
    • Leaving this for now as we only use this in two places. Once we need to do this elsewhere, we can revisit.
  • Add a unit test for racy dueling coderd instances:
    • Leaving this for now, as doing our periodic checks in a single transaction should be sufficient to ensure consistency.

Closes #271.

@johnstcn johnstcn self-assigned this Apr 26, 2022

codecov bot commented Apr 26, 2022

Codecov Report

Merging #1183 (7627372) into main (2df92e6) will decrease coverage by 0.01%.
The diff coverage is 74.10%.

```
@@            Coverage Diff             @@
##             main    #1183      +/-   ##
==========================================
- Coverage   67.08%   67.06%   -0.02%     
==========================================
  Files         288      288              
  Lines       18773    19079     +306     
  Branches      241      241              
==========================================
+ Hits        12593    12795     +202     
- Misses       4906     4979      +73     
- Partials     1274     1305      +31     
```
| Flag | Coverage Δ |
| --- | --- |
| unittest-go-macos-latest | 54.21% <65.62%> (+0.10%) ⬆️ |
| unittest-go-postgres- | 65.54% <71.42%> (-0.02%) ⬇️ |
| unittest-go-ubuntu-latest | 56.64% <65.62%> (+0.22%) ⬆️ |
| unittest-go-windows-2022 | 52.63% <65.62%> (+0.19%) ⬆️ |
| unittest-js | 74.24% <ø> (ø) |

| Impacted Files | Coverage Δ |
| --- | --- |
| cli/autostart.go | 75.29% <ø> (ø) |
| cli/autostop.go | 75.00% <ø> (ø) |
| coderd/autobuild/schedule/schedule.go | 90.90% <ø> (ø) |
| coderd/workspaces.go | 58.51% <ø> (ø) |
| coderd/database/queries.sql.go | 77.78% <61.29%> (-0.13%) ⬇️ |
| coderd/autobuild/executor/lifecycle_executor.go | 71.60% <71.60%> (ø) |
| cli/server.go | 58.27% <100.00%> (+0.78%) ⬆️ |
| coderd/coderdtest/coderdtest.go | 98.93% <100.00%> (+0.07%) ⬆️ |
| coderd/audit/backends/postgres.go | 38.46% <0.00%> (-30.77%) ⬇️ |
| codersdk/provisionerdaemons.go | 61.97% <0.00%> (-5.64%) ⬇️ |
| ... and 32 more | |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2df92e6...7627372.

@kylecarbs (Member) left a comment:

How do we intend on making this distributed? eg. if I have 5 coderd replicas, what stops them from queueing multiple workspace builds at the same time?

@johnstcn (Member, Author) replied:

> How do we intend on making this distributed? eg. if I have 5 coderd replicas, what stops them from queueing multiple workspace builds at the same time?

Kyle and I discussed briefly in Slack -- essentially the "best" way to do this would be to set up a distributed leader lock and only have the 'leader' coderd run the executor.

But with the current implementation that (ab?)uses one big transaction in which to perform all the autostarts, there shouldn't be any serious issues at small N.

A further optimization would essentially involve consolidating the entire logic of creating a new workspace build with the same parameters as the last build into a single query. This would be a big scary query though, so the complexity likely isn't worth it right now.

Probably the best approach right now is to add a unit test that ensures that two coderd instances running concurrently against the same database don't end up causing problems here.

@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch from a0a10f5 to 011ea51 on April 26, 2022 21:44
@Emyrk (Member) commented Apr 27, 2022

> Kyle and I discussed briefly in Slack -- essentially the "best" way to do this would be to set up a distributed leader lock and only have the 'leader' coderd run the executor.

Just want to mention proper fault tolerant leadership stuff is typically solved with consensus algos like Raft or Paxos. Probably 100% overkill, but if we decide to have a "Leader" coderd, you do have to handle leaders failing, being slow, and being elected out. It's a fun problem that gets complicated super quick if you 100% cannot tolerate 2 concurrent leaders.

If you just do a pg lock and first come first serve, that's going to work fine and let pg determine who the "leader" is.

@johnstcn (Member, Author) replied:

> Just want to mention proper fault tolerant leadership stuff is typically solved with consensus algos like Raft or Paxos. Probably 100% overkill, but if we decide to have a "Leader" coderd, you do have to handle leaders failing, being slow, and being elected out. It's a fun problem that gets complicated super quick if you 100% cannot tolerate 2 concurrent leaders.

Yep, absolutely. Also, those consensus algos typically are used when there are multiple independent instances with separate persistent storage. Since all coderd replicas will be using the same database, just using a PG leader lock makes sense in this case.
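
For illustration only, such a PG leader lock could be a session-level advisory lock. The sketch below is hypothetical — the lock key, connection handling, and the helper itself are not part of this PR:

```go
package leaderlock

import (
	"context"
	"database/sql"
	"fmt"

	// Postgres driver registration; any driver would do for this sketch.
	_ "github.com/lib/pq"
)

// autobuildLockKey is an arbitrary, application-chosen advisory lock key.
const autobuildLockKey = 271

// tryAcquireLeaderLock attempts a non-blocking, session-level advisory lock.
// The replica holding the lock would run the autobuild loop; the lock is
// released when the underlying connection closes (or via pg_advisory_unlock).
func tryAcquireLeaderLock(ctx context.Context, conn *sql.Conn) (bool, error) {
	var acquired bool
	err := conn.QueryRowContext(ctx,
		`SELECT pg_try_advisory_lock($1)`, autobuildLockKey,
	).Scan(&acquired)
	if err != nil {
		return false, fmt.Errorf("acquire advisory lock: %w", err)
	}
	return acquired, nil
}
```

Because the lock is tied to the database session, the connection must stay open for as long as that replica considers itself the leader.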

@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch 4 times, most recently from a8a50a4 to db03c74 on April 30, 2022 15:13
@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch 2 times, most recently from def317c to ed18e04 on May 10, 2022 07:28
@johnstcn johnstcn force-pushed the cj/gh-271/sched-autostart branch from b39fde2 to 364a27c on May 10, 2022 20:50
@johnstcn johnstcn marked this pull request as ready for review May 10, 2022 20:51
@johnstcn johnstcn requested a review from a team as a code owner May 10, 2022 20:51
@johnstcn johnstcn requested a review from a team May 10, 2022 20:51
@mafredri (Member) left a comment:

Some questions and minor suggestions, but overall a really clean PR!

```go
}

func mustProvisionWorkspace(t *testing.T, client *codersdk.Client) codersdk.Workspace {
	t.Helper()
```

Comment (Member):
Nice use of helper functions, kept the tests focused on the relevant parts! 😎

@kylecarbs (Member) left a comment:

LGTM! Just a comment on the package naming, as I'd hate for this beautiful package to be eventually bloated due to naming conflicts! ↔️↔️↔️


```go
var validTransition database.WorkspaceTransition
var sched *schedule.Schedule
switch latestBuild.Transition {
```

Comment (Member):
Because builds are idempotent, we shouldn't need to switch off the last transition. We should be able to just start or stop.

Reply (Member, Author):
Even so, it's probably no harm for this thing to make a best-effort at being idempotent as well.

Comment (Member):
Ahh yes, but I mean going off the last transition could lead to unexpected results if I'm reading correctly.

eg. I'm in the stopped state but the workspace was updated, now my autostop wouldn't trigger to update my workspace. I'm not sure if this is a problem, but the behavior could be odd to customers.

Reply (Member, Author):
So I wrote a test to see what happens right now in this state:

  • Given: we have a stopped workspace with autostart enabled
  • Also given: the TemplateVersion of the Template used by the workspace changes

In this case, we do trigger an autostart, but with the last template version used by the workspace.

As this is an automated process, I think this makes sense and we would want the user to manually update the template version if the workspace is outdated. In future, we may want to allow users to opt-in to automatically updating workspaces to the latest version.
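
To make the decision being discussed in this thread concrete, here is a self-contained sketch of it; the type and field names below are stand-ins for the coderd database and schedule packages, not the PR's actual code:

```go
package autobuild

// WorkspaceTransition mirrors the database enum used by the executor.
type WorkspaceTransition string

const (
	TransitionStart WorkspaceTransition = "start"
	TransitionStop  WorkspaceTransition = "stop"
)

// workspace holds only the schedule fields relevant to this sketch.
type workspace struct {
	AutostartSchedule string
	AutostopSchedule  string
}

// nextTransition switches off the last successful build's transition:
// running workspaces are candidates for autostop, stopped workspaces for
// autostart, and anything else (deletes, liminal states) is skipped.
func nextTransition(lastBuild WorkspaceTransition, ws workspace) (WorkspaceTransition, string, bool) {
	switch lastBuild {
	case TransitionStart:
		if ws.AutostopSchedule == "" {
			return "", "", false
		}
		return TransitionStop, ws.AutostopSchedule, true
	case TransitionStop:
		if ws.AutostartSchedule == "" {
			return "", "", false
		}
		return TransitionStart, ws.AutostartSchedule, true
	default:
		return "", "", false
	}
}
```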

```go
})
}

// TODO(cian): this function duplicates most of api.postWorkspaceBuilds. Refactor.
```

Comment (Member):
I wouldn't be opposed to exposing a function like: database.CreateWorkspaceBuild. I'm not a massive fan either, because it certainly conflicts with database.InsertWorkspaceBuild. 🤷

Reply (Member, Author):
I'm holding off on extracting a function here because we only have two data points right now; I'd like to see at least one more before doing so.

Comment (Member):
I like that!

Comment (Contributor):
suggestion(if-minor): Just one suggestion: ticket > todo. Entirely up to you, won't hold up on that.

Reply (Member, Author):
@vapurrmaid good call: #1401

```go
	"github.com/stretchr/testify/require"
)

func Test_Executor_Autostart_OK(t *testing.T) {
```

Comment (Member):
We haven't used the _ idiom in other places in the codebase. I'm impartial as to whether it's good or not, but we could remove them for consistency in this case!

```
@@ -0,0 +1,219 @@
package executor
```

Comment (Member):
What do you think about calling the lifecycle package autobuild or cronbuild? I'm concerned about calling it lifecycle, since that term could be interpreted very broadly.

Reply (Member, Author):
autobuild is better and does what it says on the tin.

@greyscaled (Contributor), May 11, 2022:
comment(in-support): autobuild seems intuitive/useful to me.

```go
	slogtest.Make(t, nil).Named("lifecycle.executor").Leveled(slog.LevelDebug),
	options.LifecycleTicker,
)
go lifecycleExecutor.Run()
```

Comment (Member):
Since this Run function is always called in a goroutine, could we do that automatically for the caller in New()?
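
A generic sketch of that suggestion (not this PR's code): the constructor owns the goroutine and hands back a stop function, so callers cannot forget the go statement:

```go
package autobuild

import (
	"context"
	"time"
)

// StartLoop runs runOnce on every tick until the returned stop function is
// called or the parent context is cancelled. Starting the goroutine here
// means callers never need to write `go executor.Run()` themselves.
func StartLoop(ctx context.Context, tick <-chan time.Time, runOnce func(time.Time)) (stop func()) {
	ctx, cancel := context.WithCancel(ctx)
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case t := <-tick:
				runOnce(t)
			}
		}
	}()
	return cancel
}
```

The trade-off is that callers (and tests) lose the ability to drive the loop synchronously, which may be why an explicit Run() is kept here.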

```go
}

// TODO(cian): this function duplicates most of api.postWorkspaceBuilds. Refactor.
func doBuild(ctx context.Context, store database.Store, workspace database.Workspace, trans database.WorkspaceTransition, priorHistory database.WorkspaceBuild, priorJob database.ProvisionerJob) error {
```

Comment (Member):
smallest nit possible: we could drop the `do` here

@mafredri (Member) left a comment:
LGTM!

@greyscaled (Contributor) left a comment:
well done ✔️

cli/server.go (outdated):

```go
lifecyclePoller := time.NewTicker(time.Minute)
defer lifecyclePoller.Stop()
lifecycleExecutor := executor.New(cmd.Context(), options.Database, logger, lifecyclePoller.C)
go lifecycleExecutor.Run()
```

Comment (Contributor):
Comment(general): Your code looks great. I'm feeling a bit lost in understanding all that's happening in this "Start a Coder server" RunE because it's growing quite large with some branching and Goroutines. I'm wondering if that is just a me thing, or if maybe the complexity here is growing large. To understand what I mean, this diff in isolation looks harmless. In the grand scheme of initializing everything within this RunE, though, it's harder to really evaluate.

I'm under the impression we could extract out some initialization steps to named functions and call them in a pipeline, but maybe that's just a me thing and not the general preferred approach for Go + Cobra. Do you have any comments to that effect?

Regardless, this comment is not meant for your code/this PR, just a general observation.

Reply (Member, Author):
Agreed, I'd be in favour of extracting some initialization logic to make things slightly more concise; I'd just want to be careful to avoid too many layers of indirection.
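
A hedged sketch of the kind of extraction being discussed. The helper name, the cleanup-function shape, and the exact parameter types are assumptions inferred from the diff and file list above, not code from this PR:

```go
package cli

import (
	"context"
	"time"

	"cdr.dev/slog"

	"github.com/coder/coder/coderd/autobuild/executor"
	"github.com/coder/coder/coderd/database"
)

// startAutobuildExecutor wraps the wiring shown in the diff above so the
// server's RunE only calls one named helper. The returned cleanup stops the
// ticker; the executor itself exits when ctx is cancelled.
func startAutobuildExecutor(ctx context.Context, db database.Store, logger slog.Logger) (cleanup func()) {
	ticker := time.NewTicker(time.Minute)
	exec := executor.New(ctx, db, logger, ticker.C)
	go exec.Run()
	return ticker.Stop
}
```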



```
@@ -17,3 +17,5 @@ coverage/
out/
storybook-static/
test-results/

**/*.swp
```

Comment (Contributor):
Praise: Thank you for thinking of these files. We should ensure we add a matching comment to each of these 3 files reminding us to consider adding entries in all 3. It's easy to miss/forget.

@johnstcn johnstcn merged commit f4da5d4 into main May 11, 2022
@johnstcn johnstcn deleted the cj/gh-271/sched-autostart branch May 11, 2022 22:03
@misskniss misskniss added this to the V2 Beta milestone May 15, 2022
kylecarbs pushed a commit that referenced this pull request Jun 10, 2022
This PR adds a package lifecycle and an Executor implementation that attempts to schedule a build of workspaces with autostart configured.

- lifecycle.Executor takes a chan time.Time in its constructor (e.g. time.Tick(time.Minute))
- Whenever a value is received from this channel, it executes one iteration of looping through the workspaces and triggering lifecycle operations.
- When the context passed to the executor is Done, it exits.
- Only workspaces that meet the following criteria will have a lifecycle operation applied to them:
  - Workspace has a valid and non-empty autostart or autostop schedule (either)
  - Workspace's last build was successful
- The following transitions will be applied depending on the current workspace state:
  - If the workspace is currently running, it will be stopped.
  - If the workspace is currently stopped, it will be started.
  - Otherwise, nothing will be done.
- Workspace builds will be created with the same parameters and the same template version as the last successful build
Successfully merging this pull request may close these issues: General use scheduler for jobs like Auto ON/OFF (#271)