chore: cherry pick for release 2.22 #17842

Merged
merged 11 commits into release/2.22 from cherry-pick-r2.22 on May 15, 2025

Conversation

stirby
Collaborator

@stirby stirby commented May 15, 2025

Incomplete, generating changelog.

ethanndickson and others added 10 commits May 14, 2025 18:51
…7572)

Closes coder/vscode-coder#447
Closes coder/jetbrains-coder#543
Closes coder/coder-jetbrains-toolbox#21

This PR adds Coder Connect support to `coder ssh --stdio`.

When connecting to a workspace, if `--force-new-tunnel` is not passed, the CLI will first do a DNS lookup for `<agent>.<workspace>.<owner>.<hostname-suffix>`. If an IP address is returned and it's within the Coder service prefix, the CLI will not create a new tailnet connection to the workspace; instead, it dials the SSH server running on port 22 on the workspace directly over TCP.
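
For illustration, here's a minimal sketch of that detection logic in Go. The hostname format follows the description above, but the service prefix value and the function name are assumptions for illustration, not the actual CLI implementation:

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// Hypothetical placeholder value; the real CLI knows the actual
// Coder service prefix.
var coderServicePrefix = netip.MustParsePrefix("fd00:626f:6272::/48")

// useCoderConnect reports whether the workspace hostname resolves to an
// address inside the Coder service prefix, in which case the SSH server
// on port 22 can be dialed directly over TCP.
func useCoderConnect(agent, workspace, owner, suffix string) (netip.Addr, bool) {
	host := fmt.Sprintf("%s.%s.%s.%s", agent, workspace, owner, suffix)
	ips, err := net.LookupIP(host)
	if err != nil {
		// No Coder Connect: fall back to creating a tailnet connection.
		return netip.Addr{}, false
	}
	for _, ip := range ips {
		addr, ok := netip.AddrFromSlice(ip)
		if ok && coderServicePrefix.Contains(addr.Unmap()) {
			return addr, true
		}
	}
	return netip.Addr{}, false
}

func main() {
	// Hypothetical names for illustration only.
	if addr, ok := useCoderConnect("main", "dev", "alice", "coder.example.com"); ok {
		fmt.Printf("dialing %s on port 22 directly over TCP\n", addr)
	} else {
		fmt.Println("falling back to a new tailnet connection")
	}
}
```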

This allows IDE extensions to use the Coder Connect tunnel, without requiring any modifications to the extensions themselves.

Additionally, `using_coder_connect` is added to the `sshNetworkStats` file, which the VS Code extension (and maybe the JetBrains one?) will be able to read and use to indicate to the user that they are using Coder Connect.

One advantage of this approach is that running `coder ssh --stdio` on an offline workspace with Coder Connect enabled will have the CLI wait for the workspace to build, the agent to connect (and optionally, for the startup scripts to finish), before finally connecting using the Coder Connect tunnel.

As a result, `coder ssh --stdio` has the overhead of looking up the workspace and agent, and checking if they are running. On my device, this meant `coder ssh --stdio <workspace>` was approximately a second slower than just connecting to the workspace directly using `ssh <workspace>.coder` (I would assume anyone serious about their Coder Connect usage would know to just do the latter anyway).

To ensure this doesn't come at a significant performance cost, I've also benchmarked this PR.

<details>
<summary>Benchmark</summary>

## Methodology
All tests were completed on `dev.coder.com`, where a Linux workspace running in AWS `us-west1` was created.
The machine running Coder Desktop (the 'client') was a Windows VM running in the same AWS region and VPC as the workspace.

To test the performance of specifically the SSH connection, a port was forwarded between the client and workspace using:
```
ssh -p 22 -L7001:localhost:7001 <host>
```
where `host` was either an alias for an SSH ProxyCommand that called `coder ssh`, or a Coder Connect hostname.

For latency, [`tcping`](https://www.elifulkerson.com/projects/tcping.php) was used against the forwarded port:
```
tcping -n 100 localhost 7001
```

For throughput, [`iperf3`](https://iperf.fr/iperf-download.php) was used:
```
iperf3 -c localhost -p 7001
```
where an `iperf3` server was running on the workspace on port 7001.

## Test Cases

### Testcase 1: `coder ssh` `ProxyCommand` that bicopies from Coder Connect
This case tests the implementation in this PR, such that we can write a config like:
```
Host codercliconnect
    ProxyCommand /path/to/coder ssh --stdio workspace
```
With Coder Connect enabled, `ssh -p 22 -L7001:localhost:7001 codercliconnect` will use the Coder Connect tunnel. The results were as follows:

**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 788.20 Mbits/sec
- Minimum average throughput: 731 Mbits/sec
- Maximum average throughput: 871 Mbits/sec
- Standard Deviation: 38.88 Mbits/sec

**Latency, 100 RTTs:**
- Average: 0.369ms
- Minimum: 0.290ms
- Maximum: 0.473ms

### Testcase 2: `ssh` dialing Coder Connect directly without a `ProxyCommand`

This is what we assume to be the 'best' way to use Coder Connect.

**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 789.50 Mbits/sec
- Minimum average throughput: 708 Mbits/sec
- Maximum average throughput: 839 Mbits/sec
- Standard Deviation: 39.98 Mbits/sec

**Latency, 100 RTTs:**
- Average: 0.369ms
- Minimum: 0.267ms
- Maximum: 0.440ms

### Testcase 3: `coder ssh` `ProxyCommand` that creates its own tailnet connection in-process

This is what normally happens when you run `coder ssh`:

**Throughput, 10 tests, back to back:**
- Average throughput across all tests: 610.20 Mbits/sec
- Minimum average throughput: 569 Mbits/sec
- Maximum average throughput: 664 Mbits/sec
- Standard Deviation: 27.29 Mbits/sec

**Latency, 100 RTTs:**
- Average: 0.335ms
- Minimum: 0.262ms
- Maximum: 0.452ms

## Analysis

Performing a two-tailed, unpaired t-test against the throughput of testcases 1 and 2, we find a P value of `0.9450`. This suggests the difference between the data sets is not statistically significant. In other words, if there were no true difference between the two setups, we would expect to observe a throughput difference at least this large about 94.5% of the time.

## Conclusion

From the t-test, and by comparison to the status quo (regular `coder ssh`, which uses gVisor and is noticeably slower), I think it's safe to say any impact on throughput or latency from the `ProxyCommand` performing a bicopy against Coder Connect is negligible. Users are very unlikely to run into performance issues as a result of using Coder Connect via `coder ssh`, as implemented in this PR.

Less scientifically, I ran these same tests on my home network with my Sydney workspace, and both throughput and latency were consistent across testcases 1 and 2.

</details>

(cherry picked from commit 53ba361)
…17628)

The regular network info file creation code also calls `MkdirAll`.

Wasn't picked up in manual testing as I already had the `/net` folder in
my VSCode.

Wasn't picked up in automated testing because we use an in-memory FS,
which for some reason does this implicitly.
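
A minimal sketch of the shape of the fix, with hypothetical paths (not the CLI's actual network info code):

```go
package main

import (
	"os"
	"path/filepath"
)

// writeNetworkInfo creates the parent directory before writing, since a
// real filesystem (unlike the in-memory FS used in tests) will not
// create it implicitly.
func writeNetworkInfo(path string, data []byte) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o700); err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

func main() {
	// Hypothetical location for illustration only.
	_ = writeNetworkInfo("/tmp/coder/net/stats.json", []byte("{}"))
}
```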

(cherry picked from commit c7fc7b9)
Closes coder/internal#563

The [Coder Connect
tunnel](https://github.com/coder/coder/blob/main/vpn/tunnel.go) receives
workspace state from the Coder server over a [dRPC
stream.](https://github.com/coder/coder/blob/114ba4593b2a82dfd41cdcb7fd6eb70d866e7b86/tailnet/controllers.go#L1029)
When first connecting to this stream, the current state of the user's
workspaces is received, with subsequent messages being diffs on top of
that state.

However, if the client disconnects from this stream, such as when the
user's device is suspended, and then reconnects later, the tunnel has no
way to distinguish the message containing the entire initial state from
an ordinary diff, so that state is incorrectly applied as a diff.

In practice:
- Tunnel connects, receives a workspace update containing all the
existing workspaces & agents.
- Tunnel loses connection, but isn't completely stopped.
- All the user's workspaces are restarted, producing a new set of
agents.
- Tunnel regains connection, and receives a workspace update containing
all the existing workspaces & agents.
- This initial update is incorrectly applied as a diff, with the
Tunnel's state containing both the old & new agents.

This PR introduces a solution in which tunnelUpdater, when created,
sends a FreshState flag with the WorkspaceUpdate type. The vpn tunnel
handles this flag as follows (see the sketch after this list):
- Preserve existing Agents
- Remove current Agents in the tunnel that are not present in the
WorkspaceUpdate
- Remove unreferenced Workspaces
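
A rough sketch of that handling; the types and field names here are illustrative stand-ins, not the actual `vpn` package API:

```go
package vpnsketch

type Agent struct{ Name string }
type Workspace struct{ Name string }

// WorkspaceUpdate mirrors the idea described above: FreshState marks a
// full snapshot rather than a diff.
type WorkspaceUpdate struct {
	FreshState bool
	Workspaces map[string]Workspace
	Agents     map[string]Agent
}

type tunnelState struct {
	workspaces map[string]Workspace
	agents     map[string]Agent
}

func (s *tunnelState) apply(u WorkspaceUpdate) {
	if u.FreshState {
		// Snapshot semantics: preserve agents the update still
		// references, drop the ones it no longer knows about...
		for id := range s.agents {
			if _, ok := u.Agents[id]; !ok {
				delete(s.agents, id)
			}
		}
		// ...and drop workspaces that are no longer referenced.
		for id := range s.workspaces {
			if _, ok := u.Workspaces[id]; !ok {
				delete(s.workspaces, id)
			}
		}
	}
	// The update's contents are then applied on top, as with a diff.
	for id, ws := range u.Workspaces {
		s.workspaces[id] = ws
	}
	for id, a := range u.Agents {
		s.agents[id] = a
	}
}
```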

(cherry picked from commit 5f516ed)
Fixes accidental omission from #17527

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
(cherry picked from commit 6936a7b)
Don't specify the template version for a delete transition, because the
prebuilt workspace may have been created using an older template
version.
If the template version isn't explicitly set, the builder will
automatically use the version from the last workspace build - which is
the desired behavior.
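
As a hypothetical sketch of the fallback being relied on here (names are illustrative, not the actual builder API):

```go
package buildsketch

import "github.com/google/uuid"

type WorkspaceBuild struct {
	TemplateVersionID uuid.UUID
}

// resolveVersionID mirrors the builder behavior described above: with
// no explicit version, fall back to the last workspace build's version,
// so deleting a prebuilt workspace reuses the version it was built with.
func resolveVersionID(requested uuid.UUID, lastBuild WorkspaceBuild) uuid.UUID {
	if requested != uuid.Nil {
		return requested
	}
	return lastBuild.TemplateVersionID
}
```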

(cherry picked from commit ef11d4f)
PR contains:
- fix for claiming & deleting prebuilds with immutable params
- unit test for claiming scenario
- unit test for deletion scenario

The parameter resolver was failing when deleting or claiming prebuilds:
a value for a previously used immutable parameter was provided to the
resolver, and even though the value was unchanged (it comes from the
preset), the resolver rejected it as a mutation. The resolver was
missing a check that the old value differs from the new value; if the
values match, there's no mutation of an immutable parameter (see the
sketch below).
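
A sketch of the missing check; the resolver types here are hypothetical stand-ins, not the actual `codersdk` API:

```go
package paramsketch

import "fmt"

type Parameter struct {
	Name      string
	Immutable bool
}

// validateImmutable rejects changes to immutable parameters, but treats
// a re-supplied, unchanged value (e.g. one coming from a preset during
// a prebuild claim or delete) as a no-op rather than a mutation.
func validateImmutable(p Parameter, oldValue, newValue string, hasOld bool) error {
	if !p.Immutable || !hasOld {
		return nil
	}
	if oldValue == newValue {
		return nil // unchanged: not a mutation of an immutable parameter
	}
	return fmt.Errorf("parameter %q is immutable and cannot be changed", p.Name)
}
```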

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
(cherry picked from commit 98e5611)
Currently we don't have a way to get insight into Postgres connections
being exhausted.

By using Prometheus' [`DBStats`
collector](https://github.com/prometheus/client_golang/blob/main/prometheus/collectors/dbstats_collector.go),
we get some insight out of the box.

```
# HELP go_sql_idle_connections The number of idle connections.
# TYPE go_sql_idle_connections gauge
go_sql_idle_connections{db_name="coder"} 1
# HELP go_sql_in_use_connections The number of connections currently in use.
# TYPE go_sql_in_use_connections gauge
go_sql_in_use_connections{db_name="coder"} 2
# HELP go_sql_max_idle_closed_total The total number of connections closed due to SetMaxIdleConns.
# TYPE go_sql_max_idle_closed_total counter
go_sql_max_idle_closed_total{db_name="coder"} 112
# HELP go_sql_max_idle_time_closed_total The total number of connections closed due to SetConnMaxIdleTime.
# TYPE go_sql_max_idle_time_closed_total counter
go_sql_max_idle_time_closed_total{db_name="coder"} 0
# HELP go_sql_max_lifetime_closed_total The total number of connections closed due to SetConnMaxLifetime.
# TYPE go_sql_max_lifetime_closed_total counter
go_sql_max_lifetime_closed_total{db_name="coder"} 0
# HELP go_sql_max_open_connections Maximum number of open connections to the database.
# TYPE go_sql_max_open_connections gauge
go_sql_max_open_connections{db_name="coder"} 10
# HELP go_sql_open_connections The number of established connections both in use and idle.
# TYPE go_sql_open_connections gauge
go_sql_open_connections{db_name="coder"} 3
# HELP go_sql_wait_count_total The total number of connections waited for.
# TYPE go_sql_wait_count_total counter
go_sql_wait_count_total{db_name="coder"} 28
# HELP go_sql_wait_duration_seconds_total The total time blocked waiting for a new connection.
# TYPE go_sql_wait_duration_seconds_total counter
go_sql_wait_duration_seconds_total{db_name="coder"} 0.086936235
```

`go_sql_wait_count_total` is the metric I'm most interested in gaining,
but the others are also very useful.
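
For reference, a standalone sketch of wiring this up with `client_golang` (the DSN and port are placeholders; this is not how coderd wires it internally):

```go
package main

import (
	"database/sql"
	"log"
	"net/http"

	_ "github.com/lib/pq"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/collectors"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Placeholder connection string for illustration.
	db, err := sql.Open("postgres", "postgres://localhost/coder?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	reg := prometheus.NewRegistry()
	// Emits the go_sql_* metrics shown above, labelled db_name="coder".
	reg.MustRegister(collectors.NewDBStatsCollector(db, "coder"))
	http.Handle("/metrics", promhttp.HandlerFor(reg, promhttp.HandlerOpts{}))
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```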

Changing the prefix is easy (`prometheus.WrapRegistererWithPrefix`), but
getting rid of the `go_` segment is not quite so easy. I've kept the
changeset small for now.

**NOTE:** I imported a library to determine the database name from the
given conn string. It's [not as
simple](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING)
as one might hope. The database name is used for the `db_name` label.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
(cherry picked from commit c278662)
…n exhaustion (#17648)

Database transactions hold onto connections, and `pubsub.Publish` tries
to acquire a connection of its own. If the latter is called within a
transaction, this can lead to connection exhaustion.
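
The shape of the hazard and the safe ordering, as a sketch (the `Pubsub` interface and SQL here are stand-ins, not the actual coder packages):

```go
package pubsubsketch

import (
	"context"
	"database/sql"
)

type Pubsub interface {
	Publish(event string, message []byte) error
}

// updateAndNotify commits first and publishes after: Publish acquires
// its own pooled connection, so calling it while the transaction still
// holds one can exhaust the pool when enough transactions do so at once.
func updateAndNotify(ctx context.Context, db *sql.DB, ps Pubsub) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op once committed
	if _, err := tx.ExecContext(ctx, `UPDATE workspaces SET updated_at = now()`); err != nil {
		return err
	}
	if err := tx.Commit(); err != nil {
		return err
	}
	// Outside the transaction: our pooled connection has been released.
	return ps.Publish("workspace_updates", []byte("updated"))
}
```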

I plan two follow-ups to this PR:

1. Make connection counts tuneable

https://github.com/coder/coder/blob/main/cli/server.go#L2360-L2376

We will then be able to write tests showing how connection exhaustion
occurs.

2. Write a linter/ruleguard to prevent `pubsub.Publish` from being
called within a transaction.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
(cherry picked from commit a646478)
`Collect()` is called whenever the `/metrics` endpoint is hit to
retrieve metrics.

The queries used in prebuilds metrics collection are quite heavy, and we
want to avoid running them concurrently or too often, to keep db load
down.

Here I'm moving towards a background retrieval of the state required to
set the metrics, which gets invalidated every interval.

Also introduces `coderd_prebuilt_workspaces_metrics_last_updated` which
operators can use to determine when these metrics go stale.
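
A rough sketch of the pattern (hypothetical types, not the actual coderd collector): a background goroutine refreshes a snapshot on an interval, and `Collect()` only reads cached values, so scraping `/metrics` never triggers the heavy queries.

```go
package metricsketch

import (
	"context"
	"sync"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type cachedCollector struct {
	mu          sync.Mutex
	snapshot    []prometheus.Metric
	lastUpdated prometheus.Gauge
}

func newCachedCollector() *cachedCollector {
	return &cachedCollector{
		lastUpdated: prometheus.NewGauge(prometheus.GaugeOpts{
			Name: "coderd_prebuilt_workspaces_metrics_last_updated",
			Help: "Unix timestamp of the last prebuilds metrics refresh.",
		}),
	}
}

// run refreshes the snapshot in the background; the expensive DB
// queries happen inside fetch, on an interval, never during a scrape.
func (c *cachedCollector) run(ctx context.Context, interval time.Duration, fetch func(context.Context) []prometheus.Metric) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		metrics := fetch(ctx)
		c.mu.Lock()
		c.snapshot = metrics
		c.lastUpdated.Set(float64(time.Now().Unix()))
		c.mu.Unlock()
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// Collect serves only cached values, so hitting /metrics stays cheap;
// operators can alert on the last-updated gauge going stale.
func (c *cachedCollector) Collect(ch chan<- prometheus.Metric) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, m := range c.snapshot {
		ch <- m
	}
	ch <- c.lastUpdated
}

func (c *cachedCollector) Describe(ch chan<- *prometheus.Desc) {
	prometheus.DescribeByCollect(c, ch)
}
```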

See #17789 as well.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
(cherry picked from commit b2a1de9)
Avoids two sequential scans of massive tables (`workspace_builds`,
`provisioner_jobs`) and uses index scans instead. This new view largely
replicates our already optimized query `GetWorkspaces` to fetch the
latest build.

The original query and the new query were compared against the dogfood
database to ensure they return the exact same data in the exact same
order (minus the effect of the new `workspaces.deleted = false`
filter). Performance is massively improved even without that filter,
but it was added to improve things further.

Note: these query times are probably inflated due to high database load
on our dogfood environment that this intends to partially resolve.

Before: 2,139ms
([explain](https://explain.dalibo.com/plan/997e4fch241b46e6))

After: 33ms
([explain](https://explain.dalibo.com/plan/c888dc223870f181))

Co-authored-by: Cian Johnston <cian@coder.com>

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
(cherry picked from commit ef745c0)
@stirby stirby requested a review from Emyrk May 15, 2025 00:52
The changes in `coder/preview` necessitated the changes in
`codersdk/richparameters.go` & `provisioner/terraform/resources.go`.

---------

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
Co-authored-by: Steven Masley <stevenmasley@gmail.com>
(cherry picked from commit 3ee95f1)
@stirby stirby merged commit 3a5c2d7 into release/2.22 May 15, 2025
32 of 36 checks passed
@stirby stirby deleted the cherry-pick-r2.22 branch May 15, 2025 01:09
@github-actions github-actions bot locked and limited conversation to collaborators May 15, 2025