Workspace timings are incomplete #16535

dannykopping · 2025-02-12T08:29:33Z

Bottom line up front:

As a user, I would expect every aspect of workspace provisioning to show up in the timings panel. Currently two important aspects are missing: compute instance boot time and agent download time. Either could add seconds or even minutes to workspace startup, so the user's actual experience is worse than what is being measured.

Since v2.17, Coder has provided timing information for workspace builds.

We currently capture:

- terraform initialization, graphing, planning, and applying
- agent connection and startup script execution

As a reminder, this is how workspaces start up:

1. terraform applies the template, creating resources
2. some form of compute (VM, container) resource is provisioned and boots up
3. that compute executes the agent init script (example: bootstrap_linux.sh)
4. this script downloads the agent binary from coderd
5. the agent binary starts up and connects to coderd

Timings are not currently captured for steps 2 and 4 (3 is not captured either, but it's not worth measuring).

Both of these steps could introduce serious latency from a user's perspective, so we have to capture them.

Additionally, if we have this new information we can use it to enhance the determination of workspace states. We could introduce new states like BOOTING UP and AGENT DOWNLOADING, which would go a long way to helping users understand what's happening with their workspaces.

This view is not terribly helpful right now.

Implementation ideas:

The agent binary that the bootstrap script downloads is bundled into the coder binary and served via a file handler:

coder/site/site.go, line 114 in 2ace044:

    mux.Handle("/bin/", http.StripPrefix("/bin", http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {

We inject the workspace metadata into the environment of the compute instance:

coder/provisioner/terraform/provision.go, lines 257 to 263 in 9520da3:

    "CODER_WORKSPACE_ID="+metadata.GetWorkspaceId(),
    "CODER_WORKSPACE_OWNER_ID="+metadata.GetWorkspaceOwnerId(),
    "CODER_WORKSPACE_OWNER_SESSION_TOKEN="+metadata.GetWorkspaceOwnerSessionToken(),
    "CODER_WORKSPACE_TEMPLATE_ID="+metadata.GetTemplateId(),
    "CODER_WORKSPACE_TEMPLATE_NAME="+metadata.GetTemplateName(),
    "CODER_WORKSPACE_TEMPLATE_VERSION="+metadata.GetTemplateVersion(),
    "CODER_WORKSPACE_BUILD_ID="+metadata.GetWorkspaceBuildId(),

Based on this, we could pass along the value of CODER_WORKSPACE_BUILD_ID when downloading the agent, and track download attempts against this record. We need to use the build ID and not the workspace ID since we need these timings on a per-build (technically per-provisioner-job) level like other timings.

Knowing when this request was made will allow us to calculate the following (not precisely, but closely enough):

2: compute boot time = (_first_ agent binary download attempt time) - (terraform apply end time)
4: agent download time = (agent connection start time) - (_first_ agent binary download attempt time)

The bootstrap script retries the agent binary download if it fails, so the timings need to account for retries. In both cases, we should use the time of the first download attempt: it is a good proxy for when the compute instance first booted, and measuring from it covers the full time taken to download the agent (including retries).

Along with the timings, we could also add a query that returns the number of download attempts, which could be surfaced somewhere in the UI, maybe even in the tooltip of the download timings.

NOTE: It might not be worth measuring each individual download attempt. We'd either have to hook the file server, or send another request from the bootstrap script (or some extra metadata in each request) to capture when each download failed. We can probably leave this out for now since it's probably not that useful; it can be a future enhancement.


matifali · 2025-02-12T08:39:41Z

Additionally, if we have this new information, we can use it to enhance the determination of workspace states. We could introduce new states like BOOTING UP and AGENT DOWNLOADING, which would go a long way to helping users understand what's happening with their workspaces.

This would also contribute to #15423, specifically the new state information.

dannykopping added the observability Issues related to observability (metrics, dashboards, alerts, opentelemetry) label Feb 12, 2025

mtojek assigned johnstcn Apr 9, 2025
