
investigation: identify sources of workspace startup latency #22


Closed
dannykopping opened this issue Jul 23, 2024 · 8 comments

@dannykopping (Collaborator)

At last week's All Hands, Ammar highlighted the need for shorter workspace startup times to unlock new use-cases such as coder run.


  • First, I want to enumerate the phases of workspace creation, and visualize/capture how the time was spent at each stage (if we can get down to the provider / resource level, that would be ideal).
    • Even if we optimize as much as we can, a template that uses a very slow cloud API will still have slow startup times - so at the very least we should give users/operators insight into where the time is being spent
    • We already have distributed traces in place in our provisioners, so we may just be able to use that data.
  • I want to investigate using tf-exec (the terraform-exec library) instead of spawning new terraform processes in provisioner/terraform/executor.go; a rough sketch follows below this list. Given the uncertain state of Terraform, we can look at using OpenTofu's fork - however, there is some question about whether that library will be maintained.
  • Lastly, I want to look at ways we can tune terraform to run faster
    • plugin cache: on my machine, I was able to shave 10s off a very basic Docker template by using the plugin cache; we already make use of this automatically, but it only works on Linux, and at first glance it appears not to be operational on dogfood, or at least not used in the apply stage
    • parallelism: terraform has a -parallelism flag, which defaults to 10. Terraform will mostly be blocked on network requests, so we can likely raise the parallelism considerably; terraform is not a CPU-intensive program, so tens or even hundreds of goroutines handling API requests should improve performance rather than impede it
      (graph: provisioners use barely any CPU; CPU time in the graph is cumulative across all CPU modes) (source)
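
As a concrete starting point on the tf-exec point above, here's a rough sketch of driving terraform through the hashicorp/terraform-exec library instead of shelling out by hand, with a raised -parallelism value passed to apply. This is a sketch only: the working directory, binary path, and parallelism value are placeholders, not what coder actually uses.

package main

import (
	"context"
	"log"
	"os"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func main() {
	ctx := context.Background()

	// The template directory and terraform binary path are placeholders;
	// the provisioner would supply its own.
	tf, err := tfexec.NewTerraform("/path/to/template", "/usr/local/bin/terraform")
	if err != nil {
		log.Fatal(err)
	}
	tf.SetStdout(os.Stdout)
	tf.SetStderr(os.Stderr)

	if err := tf.Init(ctx); err != nil {
		log.Fatal(err)
	}

	// Terraform defaults to -parallelism=10; since most of the work blocks
	// on provider API calls, a higher value is worth benchmarking.
	if err := tf.Apply(ctx, tfexec.Parallelism(30)); err != nil {
		log.Fatal(err)
	}
}

As far as I can tell, the same Parallelism option is accepted by plan and destroy as well, so the tuning could live in one place if we go this route.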
@dannykopping dannykopping added the spike Short investigation task label Jul 23, 2024
@dannykopping dannykopping self-assigned this Jul 23, 2024
@dannykopping (Collaborator, author)

A progress update:

I've knocked together https://github.com/coder/coder/compare/dk/provision-detailed-apply, a quick-and-dirty implementation that captures and stores terraform plan & apply timings. This will help us determine which resources slow workspace builds down.


This does not cover terraform init timings, which are another source of startup latency.

@dannykopping (Collaborator, author) commented Aug 9, 2024

Findings

Here are my findings from the last couple of days' research. I spent the first two days reading through the source to understand how all the pieces fit together and to identify possible areas for improvement.

Provisioning Speed

Workspace provisioning comprises at least 3 distinct stages during which latency can creep in.

  • Initialization (terraform init: download of providers, modules)
  • Build (terraform apply)
  • Agent registration & setup

Our workspace builds on dogfood are inconsistent and often slow. This has been documented previously in coder/coder#13691, and matches my own experience. We do implement terraform caching, but:

  • caches are local to each provisionerd/coderd pod
  • we deploy updates to dogfood often - restarting these pods & obliterating the cache
  • each provisioner has to build up its own cache
  • we only fully cache about half of workspace builds! [1]

We need an external cache that will not be lost on every deploy, and so do our customers (arguably the community at large needs one too).

To this end I quickly hacked together https://github.com/coder/terracache, a TLS-intercepting caching proxy that we can stand up alongside our coderd/provisionerd pods. It's just an in-memory cache for now, but it should speed up builds significantly.
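
To make the idea concrete, here's a minimal sketch of the read-through caching behaviour only, assuming a plain-HTTP upstream and ignoring TLS interception, eviction, and coalescing of in-flight fetches entirely; the upstream URL and port are placeholders, and terracache itself is the real implementation.

package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// memCache is a trivial in-memory response cache keyed by request path.
type memCache struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

func (c *memCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	b, ok := c.entries[key]
	return b, ok
}

func (c *memCache) put(key string, b []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = b
}

func main() {
	cache := &memCache{entries: map[string][]byte{}}
	upstreamBase := "https://releases.hashicorp.com" // placeholder upstream

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		key := r.URL.Path
		if body, ok := cache.get(key); ok {
			w.Write(body) // cache hit: never touch the upstream
			return
		}
		resp, err := http.Get(upstreamBase + r.URL.Path) // cache miss: fetch once
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		cache.put(key, body)
		w.Write(body)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}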


Observability

Even if we make our builds as fast as possible, they're only as fast as the slowest provider's API.

When terraform apply runs, it can produce JSON logs of each step in the build process, like the following:

{"@level":"info","@message":"docker_container.workspace[0]: Plan to create","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.111417+02:00","change":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"planned_change"}
{"@level":"info","@message":"docker_container.workspace[0]: Creating...","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.233404+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"apply_start"}
{"@level":"info","@message":"docker_container.workspace[0]: Creation errored after 0s","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.245458+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create","elapsed_seconds":0},"type":"apply_errored"}
{"@level":"error","@message":"Error: Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.256653+02:00","diagnostic":{"severity":"error","summary":"Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","detail":"","address":"docker_container.workspace[0]","range":{"filename":"main.tf","start":{"line":148,"column":41,"byte":4695},"end":{"line":148,"column":42,"byte":4696}},"snippet":{"context":"resource \"docker_container\" \"workspace\"","code":"resource \"docker_container\" \"workspace\" {","start_line":148,"highlight_start_offset":40,"highlight_end_offset":41,"values":[]}},"type":"diagnostic"}

Using these JSON logs, we could provide users and operators with a waterfall chart of all the stages in the terraform apply and how long each took.
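
As a rough illustration, pairing apply_start with apply_complete/apply_errored events per resource address is enough to get per-resource durations out of that stream. The sketch below reads the JSON lines of terraform apply -json from stdin; the field names come from the log lines above, and error handling is kept minimal.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// tfLogLine holds just the fields we need from terraform's machine-readable UI.
type tfLogLine struct {
	Timestamp string `json:"@timestamp"`
	Type      string `json:"type"`
	Hook      struct {
		Resource struct {
			Addr string `json:"addr"`
		} `json:"resource"`
	} `json:"hook"`
}

func main() {
	starts := map[string]time.Time{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var line tfLogLine
		if err := json.Unmarshal(sc.Bytes(), &line); err != nil {
			continue // not a JSON log line; skip it
		}
		ts, err := time.Parse(time.RFC3339Nano, line.Timestamp)
		if err != nil {
			continue
		}
		addr := line.Hook.Resource.Addr
		switch line.Type {
		case "apply_start":
			starts[addr] = ts
		case "apply_complete", "apply_errored":
			if start, ok := starts[addr]; ok {
				// One line per resource: address, outcome, elapsed time.
				fmt.Printf("%s\t%s\t%s\n", addr, line.Type, ts.Sub(start))
			}
		}
	}
}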


I've implemented the basic framework for this feature (backend only, lol) in https://github.com/coder/coder/compare/dk/provision-detailed-apply. See #22 (comment) above for a view of the data in a new table named provisioner_job_timings.

We could also turn this data into per-provider Prometheus metrics, so operators can see which providers are slowest and consider alternatives if provisioning speed is important to them.
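
For example, here's a hedged sketch of what those metrics could look like with the Prometheus Go client; the metric name, labels, and buckets are placeholders rather than an agreed convention.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of per-resource apply durations, labelled by provider and
// resource type so slow providers stand out. All names here are illustrative.
var resourceApplyDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "coderd",
		Name:      "provisioner_resource_apply_duration_seconds",
		Help:      "Time terraform spent applying each resource.",
		Buckets:   []float64{1, 5, 10, 30, 60, 120, 300},
	},
	[]string{"provider", "resource_type"},
)

func main() {
	prometheus.MustRegister(resourceApplyDuration)

	// In practice this would be fed by the per-resource timings parsed from
	// the JSON apply logs; a hard-coded observation stands in for that here.
	resourceApplyDuration.WithLabelValues("docker", "docker_container").Observe(12.3)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}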

Additionally, we could incorporate the timings from the other stages into this graph to visualize the full process, including time spent waiting to be scheduled on a provisioner, etc.


[1] Query used to estimate the share of fully-cached builds:

-- Jobs whose logs contain provider/module download messages missed the cache;
-- everything else is treated as a fully-cached build.
WITH no_cache_cte AS (
  SELECT COUNT(no_cache.*) AS no_cache FROM (
    SELECT job_id FROM provisioner_job_logs WHERE (
      output LIKE '%Downloading https://registry%'  -- modules
      OR output LIKE '%- Installing%'               -- providers
    )
    GROUP BY job_id
  ) AS no_cache
),
total_cte AS (
  SELECT COUNT(*) AS total FROM (
    SELECT job_id FROM provisioner_job_logs
    GROUP BY job_id
  ) AS all_entries
)
SELECT
  (1 - (CAST(no_cache_cte.no_cache AS DECIMAL(10,2)) / CAST(total_cte.total AS DECIMAL(10,2)))) * 100 AS cached
FROM
  no_cache_cte,
  total_cte;

@bpmct (Member) commented Aug 9, 2024

I've noticed dogfood executes Terraform more slowly than other deployments (e.g. my local deployment). For example, it often hangs on the "Terraform" step.

Any idea why?

@dannykopping (Collaborator, author)

Lemme know if the above comment doesn't answer that question for ya; looks like we posted at the same time, @bpmct.

@bpmct (Member) commented Aug 9, 2024

Ah whoops, reading now. terracache and the visualization look extremely promising!

we deploy updates to dogfood often - restarting these pods & obliterating the cache

Have you explored whether it is also a resource/db utilization problem? Admittedly, this is not a perfect comparison (the dev build fails), but I notice in general there are more "hangs" from provisioning in dev (e.g. at the "Terraform" step).

(video: Screen.Recording.2024-08-09.at.7.39.39.AM.mov)

@dannykopping (Collaborator, author)

This could be for a number of reasons, which might be difficult to isolate.
Are you able to replicate the problem reliably?
It could be a resource contention problem, insufficient provisioners or compute resources assigned to them, a slow provider API, or something else.

In any case, this problem should become more pronounced (and therefore easier to isolate) once we make provisioning faster, if there is indeed an issue within our scope.

I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?

@bpmct (Member) commented Aug 9, 2024

Are you able to replicate the problem reliably?

Yeah, pretty much every build I do on dev feels slower than my builds on any other deployment.

I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?

🫡

@johnstcn (Member)

coder/coder#14452 is required for us to deploy terracache to dogfood.
