Workspace Prebuilds #16969
dannykopping
started this conversation in
RFCs
Note
We invite your participation on this feature proposal. Please keep comments substantive. We'd especially love feedback on ways in which this feature may be useful to you and/or where you feel this RFC falls short.
Problem Statement
Customers often use public clouds for workspace provisioning, but these clouds can face resource constraints (especially for GPUs), causing slow builds. Startup scripts that clone large monorepos or install many dependencies also add delays. This poor first-touch experience - sometimes up to 15 minutes - hurts customers’ internal adoption and, subsequently, our sales.
We need a way to pre-provision workspaces so provisioning time is reduced to seconds.
User Stories
As a developer, I want to create workspaces near instantly, in order to start delivering value as soon as possible
As a developer, I want workspace creation to be fast, in order to have short-lived / ephemeral workspaces for quick experiments or code-reviews
As an operator, I want to provision workspaces preemptively so that developers can create workspaces within 60 seconds, to keep them in flow
As an operator, I am willing to trade off increased infrastructure spend to improve developers’ productivity, but I need to control this spend
As an operator, I want to view a template’s prebuilt workspaces for troubleshooting purposes
As an operator, I want my users to have a fast first experience with workspace provisioning, in order to reduce any inertia in their onboarding process
As an operator, I want metrics or other insights, in order to assess how prebuilds are being used
Requirements
Initial Functional Requirements
`coder_parameter` values to produce different prebuilt workspace “flavors” (see Workspace Presets #16304)
Initial Non-functional Requirements
Basic Flow
terraform apply
coderd
`terraform apply` is invoked again with new ownership metadata & parameters chosen in point 3 (“third pass”)
UX & Design
Integration with Workspace Presets
Workspace presets allow operators to specify sets of parameters to simplify workspace builds for users. In order to build a workspace, all required parameters MUST be provided.
If we piggy-back on Workspace Presets, we can use them to define which “flavors” of workspaces operators want to prebuild (e.g. small/medium/large, each with its own combination of parameters). Each preset can also have its own number of prebuilt instances; some presets might be more popular than others.
This has the nice property that presets can be used without prebuilds (i.e. `instances=0`), and enabling prebuilds is as simple as defining the number of instances.
Persistence
The above `coder_workspace_preset` resources will be captured during the template import process and inserted into the database. Each template version will have its own associated preset entries.
Prebuilds themselves can be stored in the `workspaces` table; they are workspaces after all. Prebuilds will be identified only by their ownership. If they are owned by the prebuilds user, then they are by definition a prebuild.
It’s important to note that presets are stored against a `template_version`.
Matching Logic
When a user requests a new workspace and a preset is chosen, the UUID of the chosen preset is used to compare against any available prebuilds which also use that preset UUID.
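The matching rule for this section can be sketched as follows. This is a minimal in-memory illustration, not the proposed implementation: the `Prebuild` class and `find_available_prebuild` function are hypothetical names, and the only facts taken from the RFC are that a prebuild is compared by preset UUID and is available only when its `lifecycle_state` is `ready`.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class Prebuild:
    """Hypothetical in-memory view of a prebuilt workspace."""
    workspace_id: UUID
    preset_id: UUID
    lifecycle_state: str  # e.g. "starting" or "ready"


def find_available_prebuild(prebuilds: list[Prebuild], preset_id: UUID) -> Optional[Prebuild]:
    """Return the first prebuild that uses the chosen preset and is ready."""
    for prebuild in prebuilds:
        if prebuild.preset_id == preset_id and prebuild.lifecycle_state == "ready":
            return prebuild
    # No match: the caller would fall back to a regular workspace build.
    return None
```

Returning `None` rather than raising keeps the fallback to a regular build an ordinary code path instead of an error.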
A prebuild will ONLY be considered available if its `lifecycle_state` is `ready`, and its preset UUID matches.
Invalidation
New workspaces always use the latest template version. Therefore when a new template version is promoted to the active version, all existing prebuilds must be destroyed.
The proposed usage above shows that an `invalidate_after_secs` attribute can be set. The use-case for this is workspaces which clone a monorepo: incremental updates (i.e. the delta between prebuilt state and current state) will work up to a certain point, but after a certain period of time it might be preferable to just build a new prebuild.
We could also expose an API to invalidate all prebuilds for a preset if operators need that degree of control; e.g. when a new AMI is built.
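The two invalidation triggers described above (a newly promoted template version, and an elapsed `invalidate_after_secs`) could be combined into a single staleness check. The sketch below is illustrative only: `PrebuildInfo`, `is_stale`, and the use of a creation timestamp as the starting point for the timer are assumptions, not part of the RFC.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class PrebuildInfo:
    """Hypothetical subset of the fields needed to decide invalidation."""
    template_version_id: str
    created_at: datetime


def is_stale(
    prebuild: PrebuildInfo,
    active_version_id: str,
    invalidate_after_secs: Optional[int],
    now: datetime,
) -> bool:
    """Return True if the prebuild should be destroyed and replaced."""
    # A prebuild built from a non-active template version is always stale.
    if prebuild.template_version_id != active_version_id:
        return True
    # Without an expiry configured, prebuilds on the active version live on.
    if invalidate_after_secs is None:
        return False
    # Assumption: the timer starts when the prebuild was created.
    return now - prebuild.created_at >= timedelta(seconds=invalidate_after_secs)
```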
Provisioning
A nice property of our current design is that if no prebuilds are available, a new workspace will be provisioned synchronously. Failing to build prebuilds will not block users; workspace creation will simply fall back to the existing behavior of provisioning workspace resources on demand (graceful degradation).
Reconciliation Loop
We will build a reconciliation loop which will reconcile all templates’ prebuilds.
This needs to be triggered under the following scenarios:
`coderd` startup
The control loop will invoke a reconciliation of template states on an interval, but can also be “nudged” when the above scenarios occur to reduce waiting time.
Once the number of desired vs actual prebuilds for the given template is determined, this mechanism will enqueue a number of provisioner jobs to either create new, or destroy outdated/extraneous, prebuilds to satisfy the desired count.
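The desired-vs-actual comparison can be sketched as a pure function that returns the work to enqueue. Everything here is hypothetical (`ReconcileActions`, `plan_reconciliation`, and the choice to destroy the oldest extraneous prebuilds first); handling of failed prebuilds is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass
class ReconcileActions:
    """Hypothetical result: how many prebuilds to create, which to destroy."""
    create: int
    destroy: list[str]


def plan_reconciliation(desired: int, running: list[str], outdated: list[str]) -> ReconcileActions:
    """Compare desired vs actual prebuild counts for one preset.

    running:  IDs of prebuilds on the active template version, oldest first.
    outdated: IDs of prebuilds built from older template versions.
    """
    destroy = list(outdated)  # outdated prebuilds are always destroyed
    surplus = len(running) - desired
    if surplus > 0:
        # Assumption: extraneous prebuilds are removed oldest first.
        destroy.extend(running[:surplus])
        return ReconcileActions(create=0, destroy=destroy)
    return ReconcileActions(create=-surplus, destroy=destroy)
```

Keeping the calculation free of side effects means the same function can serve both the interval-driven run and a “nudged” run.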
NOTE: Multiple `coderd` replicas could attempt to perform this reconciliation simultaneously. To prevent this, we need to take an advisory lock (per template) when performing the reconciliation.
Ownership
We will create a “Prebuild Owner” user and have it own all prebuilt workspaces.
We will build a mechanism to “claim” a prebuild.
Prebuilds are workspaces, except they are owned by the prebuilds user; in fact, this is all that defines a prebuild. Once a prebuild is matched, it will be atomically assigned to the requestor.
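To illustrate what “atomically assigned” means in practice, here is a small runnable sketch using SQLite from the Python standard library. It is not the query this RFC proposes: SQLite has no `SELECT ... FOR UPDATE SKIP LOCKED`, so the sketch swaps in a conditional `UPDATE` (compare-and-set on `owner_id`) to get the same guarantee that only one requestor can win a given prebuild. The table layout, the `PREBUILDS_USER` value, and `claim_prebuild` are assumptions for illustration.

```python
import sqlite3

PREBUILDS_USER = "prebuilds"  # hypothetical ID of the prebuilds user


def claim_prebuild(conn: sqlite3.Connection, workspace_id: str, new_owner: str) -> bool:
    """Reassign a prebuild to new_owner only if it is still unclaimed.

    The WHERE clause makes the ownership change a compare-and-set: a second
    claimant finds no row still owned by the prebuilds user and updates nothing.
    """
    with conn:  # commits on success, rolls back on an exception
        cursor = conn.execute(
            "UPDATE workspaces SET owner_id = ? WHERE id = ? AND owner_id = ?",
            (new_owner, workspace_id, PREBUILDS_USER),
        )
    return cursor.rowcount == 1
```

A `False` return means another requestor claimed the prebuild first, and the caller would try another prebuild or fall back to a regular build.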
No advisory lock is needed for this action; `SELECT ... FOR UPDATE SKIP LOCKED` will protect a prebuild from being eligible for assignment to multiple users simultaneously.
Build Phases
Each workspace will have 3 workspace builds (“phases”).
1st phase: provisioning of the prebuild itself. This will require us to stub out identity datasources (see Constraints).
2nd phase: workspace build following the workspace creation request. If an available prebuild is matched (see Matching Logic), the ownership (i.e. the `owner_id` field in the `workspaces` table) will be atomically changed to the initiator of the request.
3rd phase: prepare the workspace using the new ownership identity. We will invoke another `terraform apply` but now the identity datasources will have legitimate values injected, which may cause some resources to be modified (see Failure Modes). Once the build succeeds, we will need to reinitialise the agent on the prebuilt workspace with a new (updated) manifest. See Agent Reinitialization for more details.
`start` transition initiated, since we don’t really want to retry (i.e. `stop` → `start`) - as this would destroy and recreate all workspace resources, obviating the point of prebuilds
`start` is initiated on an already running workspace.
Failure Modes
Should the 1st phase fail, the Reconciliation Loop will leave these prebuilds in their failed state. We don’t want to provision potentially many additional resources by retrying, so an operator will either need to manually restart the prebuild (via normal workspace controls) or delete it; the latter case will cause a new prebuild to be provisioned.
The 2nd phase will occur atomically; if it fails for whatever reason, the prebuild will still be available for claiming later.
If the 3rd phase fails, the workspace build will need to be manually retried; at this point it is technically no longer a prebuild, and will not be under the purview of the Reconciliation Loop.
Conditionalized Templates & Startup Scripts
Operators may require a way to conditionalize how a template behaves when it’s provisioning a prebuild vs a regular build.
Currently we use a `start_count` value on the `coder_workspace` datasource to discriminate between a `start` and `stop` transition. Similarly, we will expose a `prebuild_count` attribute on the `coder_workspace` datasource (remember, a prebuild is a workspace) which will be set to `1` when building the prebuild in phase 1.
For example, a template admin could choose to only execute a script on the prebuild:
Startup scripts can also be defined in the `coder_agent` resource, and these cannot take advantage of the `count` technique above. To ameliorate this limitation, we will need to support a new `prebuild_startup_script` field. We don’t need to define a `prebuild_startup_script_behavior` equivalent because SSH, with which this behavior interacts, will be disabled.
Agent Reinitialization
The agent will need to reinitialize once it has been assigned a new identity (and possibly once some of its attributes, such as env or startup scripts, have been updated).
Once build phase 3 completes, the agent will need to be notified that its manifest has been updated. The agent API will be notified via pubsub (on a per-workspace channel), and will then push an update to the agent.
Once the agent receives its new manifest, it will use it to reinitialize itself.
Observability
We should expose Prometheus metrics for (with partitioning in brackets):
Autoscaling
Given that prebuilt instances will be consuming (potentially very expensive) cloud resources, operators will need a mechanism to scale them to 0 outside working hours.
For the initial phase, we will expose an autoscaling field under `coder_workspace_preset`:
The solution above is designed to mirror the autostart scheduling.
The crontab format will already be familiar to operators, and will be intuitive to understand. The design above allows operators to specify a default number of instances, and then to scale that number dynamically based on one or more schedules. This can either be used to start from 0 and scale up (as the example above demonstrates), or the inverse; whichever the operator prefers.
This design would also allow for validation at template import time, where we will detect scheduling conflicts (i.e. if multiple schedules overlap and produce different values).
This will require a simple ticker to evaluate when the current time matches the crontab expression of a schedule, and to trigger the appropriate reconciliation in the Reconciliation Loop.
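The schedule evaluation described above can be sketched as follows. This is a simplified illustration under several assumptions: schedules are modeled as hypothetical `(cron expression, instances)` pairs, the cron matcher supports only `*`, single numbers, ranges, and comma lists in a five-field expression, all five fields must match (standard cron treats day-of-month and day-of-week differently when both are restricted), and conflicts are detected for a single instant rather than at template import time.

```python
from datetime import datetime


def _field_matches(field: str, value: int) -> bool:
    """Match one cron field supporting '*', 'N', 'A-B' and comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if "-" in part:
            low, high = part.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, t: datetime) -> bool:
    """Check a five-field expression (minute hour day month weekday) against t."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = (t.weekday() + 1) % 7  # Python: Monday=0; cron: Sunday=0
    return (
        _field_matches(minute, t.minute)
        and _field_matches(hour, t.hour)
        and _field_matches(day, t.day)
        and _field_matches(month, t.month)
        and _field_matches(weekday, cron_weekday)
    )


def desired_instances(default: int, schedules: list[tuple[str, int]], t: datetime) -> int:
    """Return the instance count for time t, or the default if no schedule matches."""
    counts = {instances for expr, instances in schedules if cron_matches(expr, t)}
    if len(counts) > 1:
        raise ValueError("overlapping schedules produce different instance counts")
    return counts.pop() if counts else default
```

For example, with a default of 0 and a single schedule `("* 8-18 * * 1-5", 3)`, the function returns 3 during weekday working hours and 0 otherwise, which is the value the ticker would hand to the Reconciliation Loop.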
Constraints
`coder_workspace` and `coder_workspace_owner` (“identity”) data-sources, both of which rely on a workspace being owned by a user
`coder_parameter`s
In order to allow prebuilding of workspaces, we have to side-step constraint 2. Consider the following snippet from this template:
If we were to create a prebuilt workspace, what would we provide to the `data.coder_workspace_owner.me.name` and `data.coder_workspace.me.name` values? Changing this `name` attribute forces a replacement of the resource, and therefore makes the prebuild irrelevant.
To counteract this, we will:
Inject known “stub” values into the above data-sources before a real identity is associated with this workspace
`data.coder_workspace_owner.me.name`: `coder_prebuild_owner_${UUID}`
`data.coder_workspace.me.name`: `coder_prebuild_${UUID}`
Create/reuse a linter which can detect known-bad values for `name`, and show a warning to the template author
`name` is not the only attribute which can cause a replacement; each provider and each resource has its own behavior. Consequently, we will need to add provider-specific checks for other resource attributes to further assist template authors
To achieve this, we could either use `tfsec`’s custom checks, or query the plan file using JMESPath expressions.
Workarounds for existing templates:
Using `ignore_changes`:
The above would result in a replacement, but simply adding:
This will instruct Terraform to disregard changes to this attribute.
Onboarding
We added Workspace Build Timings, which provided insight into speed issues, but it didn’t offer any solutions to this particular problem.
We could use the timings graph to prompt users to try prebuilds.
Infrastructure Cost Concerns
Prebuilds will increase infrastructure spend, and we have to make that trade-off known to customers. Initially we can just highlight this in the documentation, but later we might want to provide a calculator to determine if prebuilds are worth the cost.