investigation: identify sources of workspace startup latency #22
A progress update: I've knocked together https://github.com/coder/coder/compare/dk/provision-detailed-apply, which is a quick-and-dirty implementation to capture and store terraform apply timings. This does not cover the Terraform plan stage yet.
Findings

Here are my findings of the last couple days' research. I spent the first 2 days looking through the source to understand how all the pieces fit together and identify possible areas for improvement.

Provisioning Speed

Workspace provisioning comprises at least 3 distinct stages during which latency can creep in.
Our workspace builds on dogfood are inconsistent and often slow. This has been documented previously in coder/coder#13691, and it matches my experience. We do implement terraform caching, but the cache is lost on every deploy.
We need an external cache which will not be lost on every deploy, and so do our customers (arguably the community at large needs one too). To this end I quickly hacked up https://github.com/coder/terracache, which is a TLS-intercepting caching proxy that we can stand up alongside our coderd/provisionerd pods. It's just an in-memory cache for now, but it should speed up builds significantly.
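To illustrate the caching mechanics, here is a minimal hand-rolled sketch, not terracache's actual code: terracache intercepts TLS, whereas this simplified version is a plain-HTTP reverse proxy that memoizes GET responses in memory. The upstream URL and listen port are assumptions for the example.

```go
// Sketch only: an in-memory caching reverse proxy. terracache's real
// implementation additionally intercepts TLS; this illustrates just the
// cache-on-first-fetch behavior.
package main

import (
	"bytes"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

func main() {
	upstream, err := url.Parse("https://releases.hashicorp.com") // assumed upstream
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	var mu sync.RWMutex
	cache := map[string][]byte{}

	// Copy successful GET responses into the in-memory cache on the way back.
	proxy.ModifyResponse = func(resp *http.Response) error {
		if resp.Request.Method != http.MethodGet || resp.StatusCode != http.StatusOK {
			return nil
		}
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		resp.Body = io.NopCloser(bytes.NewReader(body)) // restore body for the client
		mu.Lock()
		cache[resp.Request.URL.Path] = body
		mu.Unlock()
		return nil
	}

	log.Fatal(http.ListenAndServe(":8080", http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		mu.RLock()
		body, ok := cache[r.URL.Path]
		mu.RUnlock()
		if ok && r.Method == http.MethodGet {
			w.Write(body) // cache hit: never touches the upstream
			return
		}
		proxy.ServeHTTP(w, r) // cache miss: forward; ModifyResponse stores the result
	})))
}
```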
{"@level":"info","@message":"docker_container.workspace[0]: Creating...","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.233404+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"apply_start"}
{"@level":"info","@message":"docker_container.workspace[0]: Creation errored after 0s","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.245458+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create","elapsed_seconds":0},"type":"apply_errored"}
{"@level":"error","@message":"Error: Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.256653+02:00","diagnostic":{"severity":"error","summary":"Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","detail":"","address":"docker_container.workspace[0]","range":{"filename":"main.tf","start":{"line":148,"column":41,"byte":4695},"end":{"line":148,"column":42,"byte":4696}},"snippet":{"context":"resource \"docker_container\" \"workspace\"","code":"resource \"docker_container\" \"workspace\" {","start_line":148,"highlight_start_offset":40,"highlight_end_offset":41,"values":[]}},"type":"diagnostic"} Using these JSON logs, we could provider users and operators with a waterfall chart of all the stages in the I've implemented the basic framework for this feature (backend only, lol) in https://github.com/coder/coder/compare/dk/provision-detailed-apply. See #22 (comment) above for a view of the data in a new table named We could also turn this data into per-provider Prometheus metrics so operators could see which providers are the slowest and consider alternatives if provisioning speed is very important to them. Additionally, we could incorporate the timings from the other stages into this graph to visualize the full process including time spent waiting to be scheduled on a provisioner, etc. 1 WITH no_cache_cte AS (
SELECT COUNT(no_cache.*) AS no_cache FROM (
SELECT job_id FROM provisioner_job_logs WHERE (
output LIKE '%Downloading https://registry%' -- modules
OR output LIKE '%- Installing%' -- providers
)
GROUP BY job_id
) AS no_cache
),
total_cte AS (
SELECT COUNT(*) AS total FROM (
SELECT job_id FROM provisioner_job_logs
GROUP BY job_id
) AS all_entries
)
SELECT
(1 - (CAST(no_cache_cte.no_cache AS DECIMAL(10,2)) / CAST(total_cte.total AS DECIMAL(10,2)))) * 100 AS cached
FROM
no_cache_cte,
total_cte;
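To make the waterfall idea concrete, here is a rough sketch of deriving per-resource apply durations from the terraform -json event stream shown above. This is illustrative only, not the code on the dk/provision-detailed-apply branch:

```go
// Sketch: read `terraform apply -json` output on stdin and print per-resource
// apply durations using the apply_start / apply_complete / apply_errored events.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

type tfEvent struct {
	Type      string    `json:"type"`
	Timestamp time.Time `json:"@timestamp"` // RFC3339, as in the logs above
	Hook      struct {
		Resource struct {
			Addr string `json:"addr"`
		} `json:"resource"`
	} `json:"hook"`
}

func main() {
	starts := map[string]time.Time{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var ev tfEvent
		if err := json.Unmarshal(sc.Bytes(), &ev); err != nil {
			continue // skip any non-JSON lines
		}
		addr := ev.Hook.Resource.Addr
		switch ev.Type {
		case "apply_start":
			starts[addr] = ev.Timestamp
		case "apply_complete", "apply_errored":
			if start, ok := starts[addr]; ok {
				fmt.Printf("%-45s %s (%s)\n", addr, ev.Timestamp.Sub(start), ev.Type)
			}
		}
	}
}
```

Piping `terraform apply -auto-approve -json` through this prints one duration line per resource, which is essentially the raw data a waterfall chart (or per-provider Prometheus histogram) needs.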
I've noticed dogfood executes Terraform slower than other deployments (e.g. my local deployment). For example, it often hangs at the "Terraform" step. Any idea why?
Lemme know if the above comment doesn't answer that question for ya; looks like we posted at the same time @bpmct
Ah whoops, reading now. terracache and the visualization look extremely promising!
Have you explored whether it is also a resource/db utilization problem? Admittedly, this is not a perfect comparison (the dev build fails), but I notice in general there are more "hangs" from provisioning in dev (e.g. at the "Terraform" step).

Screen.Recording.2024-08-09.at.7.39.39.AM.mov
This could be for a number of reasons, which might be difficult to isolate. In any case, we should see this problem exacerbated once we make provisioning faster, which would make it easier to solve if there is indeed an issue within our scope. I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?
Yeah, pretty much every build I do on dev feels slower than my builds on any other deployment.
🫡
coder/coder#14452 is required for us to deploy terracache.
From last week's All Hands, Ammar highlighted the need for shorter workspace startup times to unlock some new use-cases such as coder run.

Some ideas:

- Use tf-exec instead of spawning new terraform processes in provisioner/terraform/executor.go (see the sketch after this list). Given the uncertain state of Terraform, we can look at using OpenTofu's fork of the library; however, there is some question about whether it will be maintained.
- Get terraform to run a faster apply stage: terraform has a -parallelism flag, which is set to 10. I imagine that terraform will mostly be blocked on network requests, which means we can bump up the parallelism quite a lot; terraform is not a very CPU-intensive program, and therefore having 10s or even 100s of goroutines handling API requests should not impede performance but in fact improve it.

[Graph: Provisioners use barely any CPU (CPU time in this graph is cumulative across all CPU modes) (source)]
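As a rough sketch of how both ideas could combine, here is what driving terraform through the hashicorp/terraform-exec library with raised parallelism might look like. This is not how executor.go currently works; the working directory, binary path, and parallelism value are assumptions for illustration:

```go
// Sketch: use terraform-exec instead of hand-rolled process management, and
// raise apply parallelism above the default of 10. tfexec still shells out to
// a terraform (or OpenTofu) binary, but provides a typed API around it.
package main

import (
	"context"
	"log"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func main() {
	ctx := context.Background()

	// Paths are illustrative assumptions, not coder's actual configuration.
	tf, err := tfexec.NewTerraform("/path/to/workspace", "/usr/local/bin/terraform")
	if err != nil {
		log.Fatal(err)
	}

	if err := tf.Init(ctx); err != nil {
		log.Fatal(err)
	}

	// Since apply is mostly blocked on provider API calls rather than CPU,
	// a higher parallelism value is plausibly safe; 30 is an arbitrary example.
	if err := tf.Apply(ctx, tfexec.Parallelism(30)); err != nil {
		log.Fatal(err)
	}
}
```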