
Workspace timings are incomplete #16535

Open
dannykopping opened this issue Feb 12, 2025 · 1 comment
Labels: observability (issues related to observability: metrics, dashboards, alerts, opentelemetry)

Comments

@dannykopping
Contributor

dannykopping commented Feb 12, 2025

Bottom-line upfront:

As a user, I would expect all aspects of workspace provisioning to show up in the timings panel. Two important aspects are currently missing: compute instance boot time and agent download time. Either could add seconds or even minutes to workspace startup, so the experience a user actually has is worse than what is being measured.


Since v2.17, Coder has provided timing information for workspace builds.

We currently capture:

  • terraform initialization, graphing, planning, and applying
  • agent connection, and startup script execution

As a reminder, this is how workspaces start up:

  1. terraform applies the template, creating resources
  2. some form of compute (VM, container) resource is provisioned and boots up
  3. that compute executes the agent init script (example: bootstrap_linux.sh)
  4. this script downloads the agent binary from coderd
  5. the agent binary starts up and connects to coderd

Timings are not currently captured for steps 2 and 4 (3 is not captured either, but it's not worth measuring).

Both of these steps could introduce serious latency from a user's perspective, so we have to capture them.


Additionally, if we have this new information we can use it to enhance the determination of workspace states. We could introduce new states like BOOTING UP and AGENT DOWNLOADING, which would go a long way to helping users understand what's happening with their workspaces.
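As a rough illustration of how the new states could be derived, here is a minimal sketch. It assumes we know three facts about a build: whether terraform apply has finished, whether a download attempt has been seen, and whether the agent has started connecting. Only BOOTING UP and AGENT DOWNLOADING come from this proposal; the type, the field names, and the other two state names are hypothetical placeholders.

```go
package main

import "fmt"

// WorkspaceState is a hypothetical type used only for this sketch.
type WorkspaceState string

const (
	StateProvisioning     WorkspaceState = "PROVISIONING"      // placeholder name, not from the proposal
	StateBootingUp        WorkspaceState = "BOOTING UP"        // proposed in this issue
	StateAgentDownloading WorkspaceState = "AGENT DOWNLOADING" // proposed in this issue
	StateAgentConnecting  WorkspaceState = "AGENT CONNECTING"  // placeholder name, not from the proposal
)

// BuildProgress holds the facts assumed to be known about a build.
// The field names are illustrative only.
type BuildProgress struct {
	ApplyFinished     bool // terraform apply has ended
	DownloadAttempted bool // at least one agent binary download attempt was seen
	AgentConnecting   bool // the agent has started connecting to coderd
}

// DeriveState picks the most advanced stage the build has reached.
func DeriveState(p BuildProgress) WorkspaceState {
	switch {
	case p.AgentConnecting:
		return StateAgentConnecting
	case p.DownloadAttempted:
		return StateAgentDownloading
	case p.ApplyFinished:
		return StateBootingUp
	default:
		return StateProvisioning
	}
}

func main() {
	fmt.Println(DeriveState(BuildProgress{ApplyFinished: true}))
	fmt.Println(DeriveState(BuildProgress{ApplyFinished: true, DownloadAttempted: true}))
}
```

The cases are ordered from latest to earliest stage, so a build that has already attempted a download is never reported as still booting.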

Image

This view is not terribly helpful right now.


Implementation ideas:

The agent binary that the bootstrap script downloads is bundled into the coder binary and served via a file handler:

mux.Handle("/bin/", http.StripPrefix("/bin", http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {

We inject the workspace metadata into the environment of the compute instance:

"CODER_WORKSPACE_ID="+metadata.GetWorkspaceId(),
"CODER_WORKSPACE_OWNER_ID="+metadata.GetWorkspaceOwnerId(),
"CODER_WORKSPACE_OWNER_SESSION_TOKEN="+metadata.GetWorkspaceOwnerSessionToken(),
"CODER_WORKSPACE_TEMPLATE_ID="+metadata.GetTemplateId(),
"CODER_WORKSPACE_TEMPLATE_NAME="+metadata.GetTemplateName(),
"CODER_WORKSPACE_TEMPLATE_VERSION="+metadata.GetTemplateVersion(),
"CODER_WORKSPACE_BUILD_ID="+metadata.GetWorkspaceBuildId(),

Based on this, we could pass along the value of CODER_WORKSPACE_BUILD_ID when downloading the agent, and track download attempts against this record. We need to use the build ID and not the workspace ID since we need these timings on a per-build (technically per-provisioner-job) level like other timings.

Knowing when this request was made will allow us to calculate the following (approximately, but close enough):

2: compute boot time = (first agent binary download attempt time) - (terraform apply end time)
4: agent download time = (agent connection start time) - (first agent binary download attempt time)

The bootstrap script retries the agent binary download if it fails, so we need to account for retries in the timings. In both cases, we should use the time of the first download attempt, since it is a good proxy for when the compute instance first booted and it also covers the full time taken to download the agent (including retries).

Along with the timings, we could also have a query that returns the number of download attempts. This could be surfaced somewhere in the UI, perhaps in the tooltip of the download timings.

NOTE: It might not be worth measuring each individual download attempt. We'd either have to hook the file server, or send another request from the bootstrap script (or add some extra metadata to each request) to capture the times of failed downloads. We can leave this out for now since it's probably not that useful; it can be a future enhancement.

@dannykopping dannykopping added the observability Issues related to observability (metrics, dashboards, alerts, opentelemetry) label Feb 12, 2025
@matifali
Member

Additionally, if we have this new information, we can use it to enhance the determination of workspace states. We could introduce new states like BOOTING UP and AGENT DOWNLOADING, which would go a long way to helping users understand what's happening with their workspaces.

This would also contribute to #15423, specifically the new state information.
