Bottom-line upfront:
As a user, I expect every aspect of workspace provisioning to show up in the timings panel. Currently two important aspects are missing: compute instance boot time and agent download time. Either could add seconds or even minutes to my workspace startup time, meaning my actual experience is worse than what is being measured.
Since v2.17, Coder has provided timing information for workspace builds.
We currently capture:
- terraform initialization, graphing, planning, and applying
- agent connection and startup script execution
As a reminder, this is how workspaces start up:
1. terraform applies the template, creating resources
2. some form of compute (VM, container) resource is provisioned and boots up
3. the compute instance runs the bootstrap script (e.g. bootstrap_linux.sh)
4. this script downloads the agent binary from coderd
5. the agent binary starts up and connects to coderd
Timings are not currently captured for steps 2 and 4 (3 is not captured either, but it's not worth measuring).
Both of these steps could introduce serious latency from a user's perspective, so we have to capture them.
Additionally, if we have this new information we can use it to enhance the determination of workspace states. We could introduce new states like BOOTING UP and AGENT DOWNLOADING, which would go a long way to helping users understand what's happening with their workspaces.
This view is not terribly helpful right now.
Implementation ideas:
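The agent binary that the bootstrap script downloads is bundled into the coder binary and served via a file handler (coder/site/site.go, line 114 in 2ace044).
We inject the workspace metadata into the environment of the compute instance (coder/provisioner/terraform/provision.go, lines 257 to 263 in 9520da3).
Based on this, we could pass along the value of CODER_WORKSPACE_BUILD_ID when downloading the agent, and track download attempts against this record. We need to use the build ID and not the workspace ID, since we need these timings on a per-build (technically per-provisioner-job) level like other timings.
Knowing when this request was made will allow us to calculate (not precisely, but close enough):
2: compute boot time = (first agent binary download attempt time) - (terraform apply end time)
4: agent download time = (agent connection start time) - (first agent binary download attempt time)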
The bootstrap script retries the agent binary download if it fails, so we need to account for retries in the timings. In both calculations, we should use the time of the first download attempt, since it is a good proxy for when the compute instance first booted, and measuring from it covers the full time taken to download the agent (including retries).
Along with the timings, we could add a query that returns the number of download attempts. This could be surfaced somewhere in the UI, perhaps in the tooltip of the download timings.
NOTE: It might not be worth measuring each individual download attempt. We'd either have to hook the file server, or send another request from the bootstrap script (or some extra metadata in each request) to capture when downloads failed. We can probably leave this out for now, since it's probably not that useful; it can be a future enhancement.
Additionally, if we have this new information, we can use it to enhance the determination of workspace states. We could introduce new states like BOOTING UP and AGENT DOWNLOADING, which would go a long way to helping users understand what's happening with their workspaces.
This would also contribute to #15423, specifically the new state information.