coder · johnstcn · Feb 12, 2024 · Feb 6, 2024 · Feb 6, 2024 · Feb 6, 2024
diff --git a/docs/manifest.json b/docs/manifest.json
@@ -292,6 +292,11 @@
           "title": "Port Forwarding",
           "description": "Learn how to forward ports in Coder",
           "path": "./networking/port-forwarding.md"
+        },
+        {
+          "title": "STUN and NAT",
+          "description": "Learn how Coder establishes direct connections",
+          "path": "./networking/stun.md"
         }
       ]
     },

diff --git a/docs/networking/index.md b/docs/networking/index.md
@@ -13,6 +13,49 @@ user <-> workspace connections are end-to-end encrypted.
 
 [Tailscale's open source](https://tailscale.com) backs our networking logic.
 
+## Requirements
+
+In order for clients and workspaces to be able to connect:
+
+- All clients and agents must be able to establish a connection to the Coder
+  server (`CODER_ACCESS_URL`) over HTTP/HTTPS.
+- Any reverse proxy or ingress between the Coder control plane and
+  clients/agents must support WebSockets.
+
+In order for clients to be able to establish direct connections:
+
+> **Note:** Direct connections via the web browser are not supported. To improve
+> latency for browser-based applications running inside Coder workspaces in
+> regions far from the Coder control plane, consider deploying one or more
+> [workspace proxies](../admin/workspace-proxies.md).
+
+- The client is connecting using the CLI (e.g. `coder ssh` or
+  `coder port-forward`). Note that the
+  [VS Code extension](https://marketplace.visualstudio.com/items?itemName=coder.coder-remote),
+  the [JetBrains plugin](https://plugins.jetbrains.com/plugin/19620-coder/), and
+  [`ssh coder.<workspace>`](../cli/config-ssh.md) all use the CLI to
+  establish a workspace connection.
+- Either the client or workspace agent is able to discover a reachable
+  `ip:port` of its counterpart. If the agent and client are able to
+  communicate with each other using their locally assigned IP addresses, then a
+  direct connection can be established immediately. Otherwise, the client and
+  agent will contact
+  [the configured STUN servers](../cli/server.md#--derp-server-stun-addresses) to
+  try to determine which `ip:port` can be used to communicate with their
+  counterpart. See [STUN and NAT](./stun.md) for more details on how this
+  process works.
+- All outbound UDP traffic must be allowed for both the client and the agent on
+  **all ports** to each other's respective networks.
+  - To establish a direct connection, both agent and client use STUN. This
+    involves sending UDP packets outbound on `udp/3478` to the configured
+    [STUN server](../cli/server.md#--derp-server-stun-addresses). If either the
+    agent or the client is unable to exchange UDP packets with a STUN
+    server, then direct connections will not be possible.
+  - Both agents and clients will then establish a
+    [WireGuard](https://www.wireguard.com/) tunnel and send UDP traffic on
+    ephemeral (high) ports. If a firewall between the client and the agent
+    blocks this UDP traffic, direct connections will not be possible.
+
 ## coder server
 
 Workspaces connect to the coder server via the server's external address, set
@@ -52,6 +95,12 @@ Direct connections are a straight line between the user and workspace, so there
 is no special geo-distribution configuration. To speed up direct connections,
 move the user and workspace closer together.
 
+Establishing a direct connection can be an involved process because both the
+client and workspace agent will likely be behind at least one level of NAT,
+meaning that we need to use STUN to learn the IP address and port under which
+the client and agent can both contact each other. See [STUN and NAT](./stun.md)
+for more information on how this process works.
+
 If a direct connection is not available (e.g. client or server is behind NAT),
 Coder will use a relayed connection. By default,
 [Coder uses Google's public STUN server](../cli/server.md#--derp-server-stun-addresses),

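Note on the `udp/3478` requirement added to `docs/networking/index.md` above: the STUN exchange it refers to is a single UDP request and response. The following is a minimal Python sketch of that exchange as defined in RFC 8489, not Coder's implementation (Coder relies on Tailscale's networking code). The function names are illustrative, only IPv4 is handled, and `query_stun` needs a reachable STUN server whose hostname you supply.

```python
import ipaddress
import os
import socket
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value defined by RFC 8489
BINDING_REQUEST = 0x0001       # STUN message type: Binding request
BINDING_SUCCESS = 0x0101       # STUN message type: Binding success response
XOR_MAPPED_ADDRESS = 0x0020    # attribute carrying the observed ip:port


def build_binding_request(transaction_id=None):
    """Return (packet, transaction_id) for a Binding request with no attributes."""
    transaction_id = transaction_id or os.urandom(12)
    header = struct.pack("!HHI", BINDING_REQUEST, 0, MAGIC_COOKIE)
    return header + transaction_id, transaction_id


def parse_binding_response(packet, transaction_id):
    """Return (ip, port) from an IPv4 XOR-MAPPED-ADDRESS attribute, or None."""
    if len(packet) < 20:
        return None
    msg_type, length, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type != BINDING_SUCCESS or cookie != MAGIC_COOKIE:
        return None
    if packet[8:20] != transaction_id:
        return None
    offset, end = 20, min(20 + length, len(packet))
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        value = packet[offset + 4:offset + 4 + attr_len]
        if attr_type == XOR_MAPPED_ADDRESS and len(value) >= 8 and value[1] == 0x01:
            xport, xaddr = struct.unpack("!HI", value[2:8])
            port = xport ^ (MAGIC_COOKIE >> 16)
            addr = ipaddress.IPv4Address(xaddr ^ MAGIC_COOKIE)
            return str(addr), port
        offset += 4 + attr_len + (-attr_len % 4)  # attributes are padded to 4 bytes
    return None


def query_stun(host, port=3478, timeout=2.0):
    """Ask a STUN server which public ip:port it sees; raises on timeout."""
    packet, transaction_id = build_binding_request()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(packet, (host, port))
        data, _ = sock.recvfrom(2048)
    return parse_binding_response(data, transaction_id)
```

If `query_stun` times out from either the client or the agent host, that side most likely cannot exchange UDP packets with the STUN server, which is the failure case the new requirement describes.
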
diff --git a/docs/networking/stun.md b/docs/networking/stun.md
@@ -0,0 +1,200 @@
+# STUN and NAT
+
+> [Session Traversal Utilities for NAT (STUN)](https://www.rfc-editor.org/rfc/rfc8489.html)
+> is a protocol used to assist applications in establishing peer-to-peer
+> communications across Network Address Translators (NATs) or firewalls.
+>
+> [Network Address Translation (NAT)](https://en.wikipedia.org/wiki/Network_address_translation)
+> is commonly used in private networks to allow multiple devices to share a
+> single public IP address. The vast majority of home and corporate internet
+> connections use at least one level of NAT.
+
+## Overview
+
+In order for one application to connect to another across a network, the
+connecting application needs to know the IP address and port under which the
+target application is reachable. If both applications reside on the same
+network, then they can most likely connect directly to each other. In the
+context of a Coder workspace agent and client, this is generally not the case,
+as both agent and client will most likely be running in different _private_
+networks (e.g. `192.168.1.0/24`). In this case, at least one of the two will
+need to know an IP address and port under which they can reach their
+counterpart.
+
+This problem is often referred to as NAT traversal, and Coder uses a standard
+protocol named STUN to address this.
+
+Inside a private network, packets from the agent or client will show up as
+having a source address such as `192.168.1.X:12345`. However, outside of this
+private network, the source address will show up differently (for example,
+`12.3.4.56:54321`). In order for the Coder client and agent to establish a
+direct connection with each other, one of them needs to know the `ip:port` pair
+under which its counterpart can be reached. Once communication succeeds in one
+direction, the receiver can inspect the source address of the received packet
+to determine the return address.
+
+At a high level, STUN works like this:
+
+> The description below glosses over much of the complexity of traversing
+> NATs. For a more in-depth technical explanation, see
+> [How NAT traversal works (tailscale.com)](https://tailscale.com/blog/how-nat-traversal-works).
+
+- **Discovery:** Both the client and agent will send UDP traffic to one or more
+  configured STUN servers. These STUN servers are generally located on the
+  public internet, and respond with the public IP address and port from which
+  the request came.
+- **Coordination:** The client and agent then exchange this information through
+  the Coder server. They will then construct packets that should be able to
+  traverse their counterpart's NATs successfully.
+- **NAT Traversal:** The client and agent then send these crafted packets to
+  their counterpart's public addresses. If all goes well, the NATs on the other
+  end should route these packets to the correct internal address.
+- **Connection:** Once the packets reach the other side, the recipient sends a
+  response back to the source `ip:port` from the packet. Again, the NATs should
+  recognize these responses as belonging to an ongoing communication, and
+  forward them accordingly.
+
+At this point, both the client and agent should be able to send traffic directly
+to each other.
+
+## Examples
+
+### 1. Direct connections without NAT or STUN
+
+In this example, both the client and agent are located on the network
+`192.168.21.0/24`. Assuming no firewalls are blocking packets in either
+direction, both client and agent are able to communicate directly with each
+other's locally assigned IP address.
+
+```mermaid
+flowchart LR
+    subgraph corpnet["Private Network\ne.g. Corp. LAN"]
+    A[Client Workstation\n192.168.21.47:38297]
+    C[Workspace Agent\n192.168.21.147:41563]
+    A <--> C
+    end
+```
+
+### 2. Direct connections with one layer of NAT
+
+In this example, client and agent are located on different networks and connect
+to each other over the public Internet. Both client and agent connect to a
+configured STUN server located on the public Internet to determine the public IP
+address and port on which they can be reached.
+
+```mermaid
+flowchart LR
+  subgraph homenet["Network A"]
+    client["Client workstation\n192.168.1.101:38297"]
+    homenat["NAT\n??.??.??.??:?????"]
+  end
+  subgraph internet["Public Internet"]
+    stun1["STUN server"]
+  end
+  subgraph corpnet["Network B"]
+    agent["Workspace agent\n10.21.43.241:56812"]
+    corpnat["NAT\n??.??.??.??:?????"]
+  end
+
+  client --- homenat
+  agent --- corpnat
+  corpnat -- "[I see 12.34.56.7:41563]" --> stun1
+  homenat -- "[I see 65.4.3.21:29187]" --> stun1
+```
+
+They then exchange this information through the Coder server, and can then
+communicate directly with each other through their respective NATs.
+
+```mermaid
+flowchart LR
+  subgraph homenet["Network A"]
+    client["Client workstation\n192.168.1.101:38297"]
+    homenat["NAT\n65.4.3.21:29187"]
+  end
+  subgraph corpnet["Network B"]
+    agent["Workspace agent\n10.21.43.241:56812"]
+    corpnat["NAT\n12.34.56.7:41563"]
+  end
+
+  subgraph internet["Public Internet"]
+  end
+
+  client -- "[12.34.56.7:41563]" --- homenat
+  agent -- "[65.4.3.21:29187]" --- corpnat
+  corpnat -- "[65.4.3.21:29187]" --> internet
+  homenat -- "[12.34.56.7:41563]" --> internet
+
+```
+
+### 3. Direct connections with VPN and NAT hairpinning
+
+In this example, the client workstation must use a VPN to connect to the
+corporate network. All traffic from the client will enter through the VPN entry
+node and exit at the VPN exit node inside the corporate network. Traffic from
+the client inside the corporate network will appear to be coming from the IP
+address of the VPN exit node `172.16.1.2`. Traffic from the client to the public
+internet will appear to have the public IP address of the corporate router
+`12.34.56.7`.
+
+The workspace agent is running on a Kubernetes cluster inside the corporate
+network, which is behind its own layer of NAT. To anyone inside the corporate
+network but outside the cluster network, its traffic will appear to be coming
+from `172.16.1.254`. However, services on the public Internet will see the
+agent's traffic as originating from the public IP address assigned to the
+corporate router. Additionally, the corporate router will most likely have a
+firewall configured to block traffic from the internet to the corporate
+network.
+
+If the client and agent both use the public STUN server, the addresses
+discovered by STUN will both be the public IP address of the corporate router.
+For a direct connection to work, the corporate router must route both of the
+following back into the corporate network:
+
+- Traffic sent from the client to the external IP of the corporate router back
+  to the cluster router, and
+- Traffic sent from the agent to the external IP of the corporate router to the
+  VPN exit node.
+
+This behaviour is known as "hairpinning", and may not be supported in all
+network configurations.
+
+If hairpinning is not supported, deploying an internal STUN server can aid in
+establishing direct connections between client and agent. When the agent and
+client query this internal STUN server, they will be able to determine the
+addresses on the corporate network from which their traffic appears to
+originate. Using these internal addresses is much more likely to result in a
+successful direct connection.
+
+```mermaid
+flowchart TD
+  subgraph homenet["Home Network"]
+    client["Client workstation\n192.168.1.101"]
+    homenat["Home Router/NAT\n65.4.3.21"]
+  end
+
+  subgraph internet["Public Internet"]
+    stun1["Public STUN"]
+    vpn1["VPN entry node"]
+  end
+
+  subgraph corpnet["Corp Network 172.16.1.0/24"]
+    corpnat["Corp Router/NAT\n172.16.1.1\n12.34.56.7"]
+    vpn2["VPN exit node\n172.16.1.2"]
+    stun2["Private STUN"]
+
+    subgraph cluster["Cluster Network 10.11.12.0/24"]
+      clusternat["Cluster Router/NAT\n10.11.12.1\n172.16.1.254"]
+      agent["Workspace agent\n10.11.12.34"]
+    end
+  end
+
+  vpn1 === vpn2
+  vpn2 --> stun2
+  client === homenat
+  homenat === vpn1
+  homenat x-.-x stun1
+  agent --- clusternat
+  clusternat --- corpnat
+  corpnat --> stun1
+  corpnat --> stun2
+```
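
Note on the Overview and example 2 in `docs/networking/stun.md`: the way a NAT rewrites source addresses, and why a packet sent to the STUN-discovered address reaches the private host, can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: the `Nat` class is hypothetical, it models only endpoint-independent mapping, and it ignores the inbound filtering and firewalls that real NATs apply. The addresses and ports are taken from the diagrams in example 2.

```python
import itertools


class Nat:
    """Toy NAT with endpoint-independent mapping (an assumption; real NATs vary)."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self._out = {}  # (private_ip, private_port) -> public_port
        self._in = {}   # public_port -> (private_ip, private_port)

    def outbound(self, src):
        """Rewrite a private source address to a public one, creating a mapping."""
        if src not in self._out:
            port = next(self._ports)
            self._out[src] = port
            self._in[port] = src
        return (self.public_ip, self._out[src])

    def inbound(self, dst):
        """Return the private address a packet sent to `dst` is forwarded to, or None."""
        ip, port = dst
        if ip != self.public_ip:
            return None
        return self._in.get(port)


home = Nat("65.4.3.21", first_port=29187)
corp = Nat("12.34.56.7", first_port=41563)
client = ("192.168.1.101", 38297)
agent = ("10.21.43.241", 56812)

# Discovery: each side sends a packet to the STUN server, which reports the
# source address it observed after NAT rewriting.
client_public = home.outbound(client)
agent_public = corp.outbound(agent)

# Coordination (exchange via the Coder server) is not modelled here.

# NAT traversal: a packet addressed to the counterpart's public ip:port is
# forwarded by that NAT to the private address behind it.
client_reaches = corp.inbound(agent_public)
agent_reaches = home.inbound(client_public)
```

In this model `client_reaches` is the agent's private address and `agent_reaches` is the client's. A port that no host has sent traffic from has no mapping, so `inbound` returns `None` for it, which is why each side must send outbound traffic first.
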