investigation: identify sources of workspace startup latency #22

dannykopping · 2024-07-23T11:02:31Z

From last week's All Hands, Ammar highlighted the need for shorter workspace startup times to unlock some new use-cases such as coder run.

First, I want to enumerate the phases of workspace creation, and visualize/capture how the time was spent at each stage (if we can get down to the provider / resource level, that would be ideal).
- Even if we optimize as much as we can, if a template uses some super slow cloud API it'll negatively affect startup time - so at least we should give users/operator insight into where the time is being spent
- We already have distributed traces in place in our provisioners, so we may just be able to use that data.
I want to investigate the use of tf-exec, instead of spawning new terraform processes in provisioner/terraform/executor.go. Given the uncertain state of Terraform, we can look at using OpenTofu's fork - however there is some question about whether this library will be maintained.
Lastly I want to look at ways we can tune terraform to run faster
- plugin cache: on my machine, I was able to shave 10s off a very basic Docker template by using the plugin cache; we already make use of this automatically but it only works on Linux and it appears to not be operational on dogfood at first glance, or least not used in the apply stage
- parallelism: terraform has a -parallelism flag, which is set to 10. I imagine that terraform will mostly be blocked on network requests, which means we can bump up the parallelism quite a lot. terraform is not a very CPU-intensive program, and therefore having 10s or even 100s of goroutines handling API requests should not impede performance but in fact improve it
  
  Provisioners use barely any CPU (CPU time in this graph is cumulative across all CPU modes) (source)

The text was updated successfully, but these errors were encountered:

dannykopping · 2024-08-07T13:01:04Z

A progress update:

I've knocked together https://github.com/coder/coder/compare/dk/provision-detailed-apply, which is a quick-and-dirty implementation to capture and store terraform plan & apply timings. This will help us determine which resources slow workspace builds down.

This does not cover Terraform init timings, which is another source of startup latency.

dannykopping · 2024-08-09T12:23:57Z

Findings

Here are my findings of the last couple days' research. I spent the first 2 days looking through the source to understand how all the pieces fit together and identify possible areas for improvement.

Provisioning Speed

Workspace provisioning comprises at least 3 distinct stages during which latency can creep in.

Initialization (terraform init: download of providers, modules)
Build (terraform apply)
Agent registration & setup

Our workspace builds on dogfood are inconsistent and often slow. This has been documented previously in coder/coder#13691, and also maps to my experience. We do implement terraform caching, but:

caches are local to each provisionerd/coderd pod
we deploy updates to dogfood often - restarting these pods & obliterating the cache
each provisioner has to build up its own cache
we only fully cache about half of workspace builds! ¹

We need an external cache which will not be lost on every deploy, and so do our customers (arguably the community at large needs it too).

To this end I quickly hacked up https://github.com/coder/terracache which is a TLS-intercepting and caching proxy which we can stand up alongside our coderd/provisionerd pods. It's just an in-memory cache for now, but it should speed up builds significantly.

Observability

Even if we make our builds as fast as possible, they're only as fast as the slowest provider's API.

When terraform apply runs, it can produce JSON logs of each step in the build process like follows:

{"@level":"info","@message":"docker_container.workspace[0]: Plan to create","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.111417+02:00","change":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"planned_change"}
{"@level":"info","@message":"docker_container.workspace[0]: Creating...","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.233404+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"apply_start"}
{"@level":"info","@message":"docker_container.workspace[0]: Creation errored after 0s","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.245458+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create","elapsed_seconds":0},"type":"apply_errored"}
{"@level":"error","@message":"Error: Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.256653+02:00","diagnostic":{"severity":"error","summary":"Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","detail":"","address":"docker_container.workspace[0]","range":{"filename":"main.tf","start":{"line":148,"column":41,"byte":4695},"end":{"line":148,"column":42,"byte":4696}},"snippet":{"context":"resource \"docker_container\" \"workspace\"","code":"resource \"docker_container\" \"workspace\" {","start_line":148,"highlight_start_offset":40,"highlight_end_offset":41,"values":[]}},"type":"diagnostic"}

Using these JSON logs, we could provider users and operators with a waterfall chart of all the stages in the terraform apply and how long each took.

I've implemented the basic framework for this feature (backend only, lol) in https://github.com/coder/coder/compare/dk/provision-detailed-apply. See #22 (comment) above for a view of the data in a new table named provisioner_job_timings.

We could also turn this data into per-provider Prometheus metrics so operators could see which providers are the slowest and consider alternatives if provisioning speed is very important to them.

Additionally, we could incorporate the timings from the other stages into this graph to visualize the full process including time spent waiting to be scheduled on a provisioner, etc.

¹

WITH no_cache_cte AS (
  SELECT COUNT(no_cache.*) AS no_cache FROM (
    SELECT job_id FROM provisioner_job_logs WHERE (
      output LIKE '%Downloading https://registry%'  -- modules
      OR output LIKE '%- Installing%'               -- providers
    )
    GROUP BY job_id
  ) AS no_cache
),
total_cte AS (
  SELECT COUNT(*) AS total FROM (
    SELECT job_id FROM provisioner_job_logs
    GROUP BY job_id
  ) AS all_entries
)
SELECT 
  (1 - (CAST(no_cache_cte.no_cache AS DECIMAL(10,2)) / CAST(total_cte.total AS DECIMAL(10,2)))) * 100 AS cached
FROM 
  no_cache_cte, 
  total_cte;

bpmct · 2024-08-09T12:33:35Z

I've noticed dogfood executes Terraform slower than other deployments (e.g. my local deployment). For example, it often hangs on "Terraform"

Any idea why?

dannykopping · 2024-08-09T12:34:31Z

Lemme know if the above comment doesn't answer that question for ya; looks like we posted at the same time @bpmct

bpmct · 2024-08-09T12:41:02Z

Ah whoops, reading now. terracache and the visualization looks extremely promising!

we deploy updates to dogfood often - restarting these pods & obliterating the cache

Have you explored whether it is also a resource/db utilization problem? Admittedly, this is not a perfect comparison (the dev build fails), but I notice in general there are more "hangs" from provisioning in dev (e.g. "at the Terraform" step).

Screen.Recording.2024-08-09.at.7.39.39.AM.mov

dannykopping · 2024-08-09T12:57:34Z

This could be for a number of reasons, which might be difficult to isolate.
Are you able to replicate the problem reliably?
It could be a resource contention problem, insufficient provisioners or compute resources assigned to them, a slow provider API, or something else.

In any case, we should see this problem exacerbated once we make provisioning faster, which would make it easier to solve if there is indeed an issue within our scope.

I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?

bpmct · 2024-08-09T15:26:51Z

Are you able to replicate the problem reliably?

Yeah, pretty much every build I do on dev feels slower than by builds on any other deployment.

I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?

🫡

johnstcn · 2024-08-27T13:13:46Z

coder/coder#14452 is required for us to deploy terracache to dogfood.

dannykopping added the spike Short investigation task label Jul 23, 2024

dannykopping self-assigned this Jul 23, 2024

dannykopping mentioned this issue Aug 15, 2024

feat: add provisioning timings to understand slow build times coder/coder#14274

Merged

dannykopping closed this as completed Aug 23, 2024

johnstcn mentioned this issue Oct 30, 2024

feature: expose provisioner job timings in prometheus coder/coder#15284

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

investigation: identify sources of workspace startup latency #22

investigation: identify sources of workspace startup latency #22

dannykopping commented Jul 23, 2024

dannykopping commented Aug 7, 2024

dannykopping commented Aug 9, 2024 •

edited

Loading

bpmct commented Aug 9, 2024 •

edited

Loading

dannykopping commented Aug 9, 2024

bpmct commented Aug 9, 2024

dannykopping commented Aug 9, 2024

bpmct commented Aug 9, 2024 •

edited

Loading

johnstcn commented Aug 27, 2024

investigation: identify sources of workspace startup latency #22

investigation: identify sources of workspace startup latency #22

Comments

dannykopping commented Jul 23, 2024

dannykopping commented Aug 7, 2024

dannykopping commented Aug 9, 2024 • edited Loading

Findings

Provisioning Speed

Observability

bpmct commented Aug 9, 2024 • edited Loading

dannykopping commented Aug 9, 2024

bpmct commented Aug 9, 2024

dannykopping commented Aug 9, 2024

bpmct commented Aug 9, 2024 • edited Loading

johnstcn commented Aug 27, 2024

dannykopping commented Aug 9, 2024 •

edited

Loading

bpmct commented Aug 9, 2024 •

edited

Loading

bpmct commented Aug 9, 2024 •

edited

Loading