
investigation: identify sources of workspace startup latency #22


Closed
dannykopping opened this issue Jul 23, 2024 · 8 comments

@dannykopping (Collaborator)

At last week's All Hands, Ammar highlighted the need for shorter workspace startup times to unlock new use-cases such as coder run.


  • First, I want to enumerate the phases of workspace creation, and visualize/capture how the time was spent at each stage (if we can get down to the provider / resource level, that would be ideal).
    • Even if we optimize as much as we can, a template that uses a very slow cloud API will still have slow startup times - so at the very least we should give users/operators insight into where the time is being spent
    • We already have distributed traces in place in our provisioners, so we may just be able to use that data.
  • I want to investigate using tf-exec (the terraform-exec library) instead of spawning new terraform processes in provisioner/terraform/executor.go; a rough sketch follows below this list. Given the uncertain state of Terraform, we can look at using OpenTofu's fork - however, there is some question about whether that library will be maintained.
  • Lastly, I want to look at ways we can tune terraform to run faster
    • plugin cache: on my machine, I was able to shave 10s off a very basic Docker template by using the plugin cache; we already make use of this automatically, but it only works on Linux, and at first glance it appears not to be operational on dogfood, or at least not used in the apply stage
    • parallelism: terraform has a -parallelism flag, which defaults to 10. Terraform will mostly be blocked on network requests, so we can likely raise the parallelism considerably; terraform is not a CPU-intensive program, so tens or even hundreds of goroutines handling API requests should improve performance rather than impede it
      (graph: provisioners use barely any CPU; CPU time in the graph is cumulative across all CPU modes) (source)
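
As a concrete starting point on the tf-exec point above, here's a rough sketch of driving terraform through the hashicorp/terraform-exec library instead of shelling out by hand, with a raised -parallelism value passed to apply. This is a sketch only: the working directory, binary path, and parallelism value are placeholders, not what coder actually uses.

package main

import (
	"context"
	"log"
	"os"

	"github.com/hashicorp/terraform-exec/tfexec"
)

func main() {
	ctx := context.Background()

	// The template directory and terraform binary path are placeholders;
	// the provisioner would supply its own.
	tf, err := tfexec.NewTerraform("/path/to/template", "/usr/local/bin/terraform")
	if err != nil {
		log.Fatal(err)
	}
	tf.SetStdout(os.Stdout)
	tf.SetStderr(os.Stderr)

	if err := tf.Init(ctx); err != nil {
		log.Fatal(err)
	}

	// Terraform defaults to -parallelism=10; since most of the work blocks
	// on provider API calls, a higher value is worth benchmarking.
	if err := tf.Apply(ctx, tfexec.Parallelism(30)); err != nil {
		log.Fatal(err)
	}
}

As far as I can tell, the same Parallelism option is accepted by plan and destroy as well, so the tuning could live in one place if we go this route.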
@dannykopping dannykopping added the spike Short investigation task label Jul 23, 2024
@dannykopping dannykopping self-assigned this Jul 23, 2024
@dannykopping (Collaborator, author)

A progress update:

I've knocked together https://github.com/coder/coder/compare/dk/provision-detailed-apply, a quick-and-dirty implementation that captures and stores terraform plan & apply timings. This will help us determine which resources slow workspace builds down.


This does not cover terraform init timings, which are another source of startup latency.

@dannykopping (Collaborator, author) commented Aug 9, 2024

Findings

Here are my findings from the last couple of days' research. I spent the first two days reading through the source to understand how all the pieces fit together and to identify possible areas for improvement.

Provisioning Speed

Workspace provisioning comprises at least 3 distinct stages during which latency can creep in.

  • Initialization (terraform init: download of providers, modules)
  • Build (terraform apply)
  • Agent registration & setup

Our workspace builds on dogfood are inconsistent and often slow. This has been documented previously in coder/coder#13691, and matches my own experience. We do implement terraform caching, but:

  • caches are local to each provisionerd/coderd pod
  • we deploy updates to dogfood often - restarting these pods & obliterating the cache
  • each provisioner has to build up its own cache
  • we only fully cache about half of workspace builds! [1]

We need an external cache that will not be lost on every deploy, and so do our customers (arguably the community at large needs one too).

To this end I quickly hacked together https://github.com/coder/terracache, a TLS-intercepting caching proxy that we can stand up alongside our coderd/provisionerd pods. It's just an in-memory cache for now, but it should speed up builds significantly.
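
To make the idea concrete, here's a minimal sketch of the read-through caching behaviour only, assuming a plain-HTTP upstream and ignoring TLS interception, eviction, and coalescing of in-flight fetches entirely; the upstream URL and port are placeholders, and terracache itself is the real implementation.

package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// memCache is a trivial in-memory response cache keyed by request path.
type memCache struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

func (c *memCache) get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	b, ok := c.entries[key]
	return b, ok
}

func (c *memCache) put(key string, b []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[key] = b
}

func main() {
	cache := &memCache{entries: map[string][]byte{}}
	upstreamBase := "https://releases.hashicorp.com" // placeholder upstream

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		key := r.URL.Path
		if body, ok := cache.get(key); ok {
			w.Write(body) // cache hit: never touch the upstream
			return
		}
		resp, err := http.Get(upstreamBase + r.URL.Path) // cache miss: fetch once
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		cache.put(key, body)
		w.Write(body)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}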


Observability

Even if we make our builds as fast as possible, they're only as fast as the slowest provider's API.

When terraform apply runs, it can produce JSON logs of each step in the build process, like the following:

{"@level":"info","@message":"docker_container.workspace[0]: Plan to create","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.111417+02:00","change":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"planned_change"}
{"@level":"info","@message":"docker_container.workspace[0]: Creating...","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.233404+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create"},"type":"apply_start"}
{"@level":"info","@message":"docker_container.workspace[0]: Creation errored after 0s","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.245458+02:00","hook":{"resource":{"addr":"docker_container.workspace[0]","module":"","resource":"docker_container.workspace[0]","implied_provider":"docker","resource_type":"docker_container","resource_name":"workspace","resource_key":0},"action":"create","elapsed_seconds":0},"type":"apply_errored"}
{"@level":"error","@message":"Error: Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","@module":"terraform.ui","@timestamp":"2024-08-06T17:16:24.256653+02:00","diagnostic":{"severity":"error","summary":"Unable to create container: Error response from daemon: Conflict. The container name \"/coder-default-default\" is already in use by container \"2a0e1298adb8fe63d2ed0cb9ab0a1c85a135043518696866c89d767ce9b9a28f\". You have to remove (or rename) that container to be able to reuse that name.","detail":"","address":"docker_container.workspace[0]","range":{"filename":"main.tf","start":{"line":148,"column":41,"byte":4695},"end":{"line":148,"column":42,"byte":4696}},"snippet":{"context":"resource \"docker_container\" \"workspace\"","code":"resource \"docker_container\" \"workspace\" {","start_line":148,"highlight_start_offset":40,"highlight_end_offset":41,"values":[]}},"type":"diagnostic"}

Using these JSON logs, we could provide users and operators with a waterfall chart of all the stages in the terraform apply and how long each took.
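
As a rough illustration, pairing apply_start with apply_complete/apply_errored events per resource address is enough to get per-resource durations out of that stream. The sketch below reads the JSON lines of terraform apply -json from stdin; the field names come from the log lines above, and error handling is kept minimal.

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// tfLogLine holds just the fields we need from terraform's machine-readable UI.
type tfLogLine struct {
	Timestamp string `json:"@timestamp"`
	Type      string `json:"type"`
	Hook      struct {
		Resource struct {
			Addr string `json:"addr"`
		} `json:"resource"`
	} `json:"hook"`
}

func main() {
	starts := map[string]time.Time{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		var line tfLogLine
		if err := json.Unmarshal(sc.Bytes(), &line); err != nil {
			continue // not a JSON log line; skip it
		}
		ts, err := time.Parse(time.RFC3339Nano, line.Timestamp)
		if err != nil {
			continue
		}
		addr := line.Hook.Resource.Addr
		switch line.Type {
		case "apply_start":
			starts[addr] = ts
		case "apply_complete", "apply_errored":
			if start, ok := starts[addr]; ok {
				// One line per resource: address, outcome, elapsed time.
				fmt.Printf("%s\t%s\t%s\n", addr, line.Type, ts.Sub(start))
			}
		}
	}
}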


I've implemented the basic framework for this feature (backend only, lol) in https://github.com/coder/coder/compare/dk/provision-detailed-apply. See #22 (comment) above for a view of the data in a new table named provisioner_job_timings.

We could also turn this data into per-provider Prometheus metrics, so operators can see which providers are slowest and consider alternatives if provisioning speed is important to them.
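
For example, here's a hedged sketch of what those metrics could look like with the Prometheus Go client; the metric name, labels, and buckets are placeholders rather than an agreed convention.

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Histogram of per-resource apply durations, labelled by provider and
// resource type so slow providers stand out. All names here are illustrative.
var resourceApplyDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "coderd",
		Name:      "provisioner_resource_apply_duration_seconds",
		Help:      "Time terraform spent applying each resource.",
		Buckets:   []float64{1, 5, 10, 30, 60, 120, 300},
	},
	[]string{"provider", "resource_type"},
)

func main() {
	prometheus.MustRegister(resourceApplyDuration)

	// In practice this would be fed by the per-resource timings parsed from
	// the JSON apply logs; a hard-coded observation stands in for that here.
	resourceApplyDuration.WithLabelValues("docker", "docker_container").Observe(12.3)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}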

Additionally, we could incorporate the timings from the other stages into this graph to visualize the full process, including time spent waiting to be scheduled on a provisioner, etc.


[1] Query used to estimate the share of fully-cached builds:

-- Jobs whose logs contain provider/module download messages missed the cache;
-- everything else is treated as a fully-cached build.
WITH no_cache_cte AS (
  SELECT COUNT(no_cache.*) AS no_cache FROM (
    SELECT job_id FROM provisioner_job_logs WHERE (
      output LIKE '%Downloading https://registry%'  -- modules
      OR output LIKE '%- Installing%'               -- providers
    )
    GROUP BY job_id
  ) AS no_cache
),
total_cte AS (
  SELECT COUNT(*) AS total FROM (
    SELECT job_id FROM provisioner_job_logs
    GROUP BY job_id
  ) AS all_entries
)
SELECT
  (1 - (CAST(no_cache_cte.no_cache AS DECIMAL(10,2)) / CAST(total_cte.total AS DECIMAL(10,2)))) * 100 AS cached
FROM
  no_cache_cte,
  total_cte;

@bpmct (Member) commented Aug 9, 2024

I've noticed dogfood executes Terraform more slowly than other deployments (e.g. my local deployment). For example, it often hangs on the "Terraform" step.

Any idea why?

@dannykopping (Collaborator, author)

Lemme know if the above comment doesn't answer that question for ya; looks like we posted at the same time, @bpmct.

@bpmct (Member) commented Aug 9, 2024

Ah whoops, reading now. terracache and the visualization look extremely promising!

we deploy updates to dogfood often - restarting these pods & obliterating the cache

Have you explored whether it is also a resource/db utilization problem? Admittedly, this is not a perfect comparison (the dev build fails), but I notice in general there are more "hangs" from provisioning in dev (e.g. at the "Terraform" step).

(video: Screen.Recording.2024-08-09.at.7.39.39.AM.mov)

@dannykopping (Collaborator, author)

This could be for a number of reasons, which might be difficult to isolate.
Are you able to replicate the problem reliably?
It could be a resource contention problem, insufficient provisioners or compute resources assigned to them, a slow provider API, or something else.

In any case, this problem should become more pronounced (and therefore easier to isolate) once we make provisioning faster, if there is indeed an issue within our scope.

I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?

@bpmct (Member) commented Aug 9, 2024

Are you able to replicate the problem reliably?

Yeah, pretty much every build I do on dev feels slower than my builds on any other deployment.

I'll look at getting terracache into dogfood next week, and we can touch base. Sound good?

🫡

@johnstcn (Member)

coder/coder#14452 is required for us to deploy terracache to dogfood.
