# Scale Coder

December 20, 2024

---

This best practice guide helps you prepare a low-scale Coder deployment so that
it can be scaled up to a high-scale deployment as use grows, and keep it
operating smoothly with a high number of active users and workspaces.

## Observability

Observability is one of the most important aspects of a scalable Coder
deployment.

It allows you to identify potential bottlenecks before they negatively affect
the end-user experience, and to empirically verify that modifications you make
to your deployment to increase capacity have their intended effects.

- Capture log output from Coder Server instances and external provisioner
  daemons and store them in a searchable log store.

  - For example: Loki, CloudWatch Logs, etc.

  - Retain logs for a minimum of thirty days, ideally ninety days. This allows
    you to look back to see when anomalous behaviors began.

- Metrics:

  - Capture infrastructure metrics like CPU, memory, open files, and network I/O
    for all Coder Server, external provisioner daemon, workspace proxy, and
    PostgreSQL instances.

  - Capture metrics from Coder Server and external provisioner daemons via
    Prometheus (see the example scrape configuration after this list).

    - On Coder Server:

      - Enable Prometheus metrics:

        ```yaml
        CODER_PROMETHEUS_ENABLE=true
        ```

      - Enable database metrics:

        ```yaml
        CODER_PROMETHEUS_COLLECT_DB_METRICS=true
        ```

      - Configure agent stats to avoid large cardinality:

        ```yaml
        CODER_PROMETHEUS_AGGREGATE_AGENT_STATS_BY=agent_name
        ```

      - To disable agent stats:

        ```yaml
        CODER_PROMETHEUS_COLLECT_AGENT_STATS=false
        ```

  - Retain metric time series for at least six months. This allows you to see
    performance trends relative to user growth.

  - Integrate metrics with an observability dashboard, for example, Grafana.

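The scrape configuration below is a minimal sketch of how Prometheus could
collect these metrics. The hostnames are placeholders, and the port assumes the
default `CODER_PROMETHEUS_ADDRESS` (`2112`), which listens on `127.0.0.1` unless
you bind it to a reachable interface; substitute your own targets or service
discovery.

```yaml
# Minimal Prometheus scrape config sketch. Targets are placeholders; the port
# assumes the default CODER_PROMETHEUS_ADDRESS (2112) bound to a reachable
# interface. Prefer service discovery over static targets on Kubernetes.
scrape_configs:
  - job_name: coder-server
    scrape_interval: 30s
    static_configs:
      - targets:
          - coder-server-1.example.com:2112
          - coder-server-2.example.com:2112
  - job_name: coder-provisioners
    scrape_interval: 30s
    static_configs:
      - targets:
          - provisioner-1.example.com:2112
          - provisioner-2.example.com:2112
```
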
### Key metrics

**CPU and Memory Utilization**

- Monitor utilization as a fraction of the available resources on the instance.
  Utilization will vary with use throughout the day and over the course of the
  week. Monitor the trends, paying special attention to the daily and weekly
  peak utilization. Use long-term trends to plan infrastructure upgrades.

**Tail latency of Coder Server API requests**

- Use the `coderd_api_request_latencies_seconds` metric.
- High tail latency can indicate that Coder Server or the PostgreSQL database is
  being starved for resources.

**Tail latency of database queries**

- Use the `coderd_db_query_latencies_seconds` metric.
- High tail latency can indicate that the PostgreSQL database is starved for
  resources.

Configure alerting based on these metrics to ensure you surface problems before
end users notice them.

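For example, the following Prometheus alert rules sketch tail-latency alerts on
the two metrics above. The thresholds and durations are illustrative, and the
`_bucket` series assume the metrics are exported as histograms; tune both to
your own baseline.

```yaml
# Sketch of latency alerts; thresholds are illustrative, not recommendations.
groups:
  - name: coder-latency
    rules:
      - alert: CoderAPIHighTailLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(coderd_api_request_latencies_seconds_bucket[5m])) by (le)) > 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile Coder API request latency is above 2s"
      - alert: CoderDBHighTailLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(coderd_db_query_latencies_seconds_bucket[5m])) by (le)) > 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile database query latency is above 500ms"
```
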
## Coder Server

### Locality

If increased availability of the Coder API is a concern, deploy at least three
instances. Spread the instances across nodes (e.g. via anti-affinity rules in
Kubernetes), and/or in different availability zones of the same geographic
region.

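On Kubernetes, a pod anti-affinity rule such as the sketch below keeps Coder
Server replicas on separate nodes. The label selector assumes the pods carry an
`app.kubernetes.io/name: coder` label; adjust it, and how you inject the
affinity (for example through your Helm values), to match your deployment.

```yaml
# Pod spec fragment (a sketch). Assumes Coder Server pods are labeled
# app.kubernetes.io/name=coder; change the selector to match your labels.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: coder
        topologyKey: kubernetes.io/hostname
```

To also spread replicas across availability zones, add a preferred rule with
`topology.kubernetes.io/zone` as the `topologyKey`.
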
Do not deploy in different geographic regions. Coder Servers need to be able to
communicate with one another directly with low latency, under 10ms. Note that
this applies to the availability of the Coder API; workspaces will not be fault
tolerant unless they are explicitly built that way at the template level.

Deploy Coder Server instances as geographically close to PostgreSQL as possible.
Low-latency communication (under 10ms) with Postgres is essential for Coder
Server's performance.

### Scaling

Coder Server can be scaled both vertically for bigger instances and horizontally
for more instances.

Aim to keep the number of Coder Server instances relatively small, preferably
under ten instances, and opt for vertical scale over horizontal scale after
meeting availability requirements.

Coder's
[validated architectures](../../admin/infrastructure/validated-architectures.md)
give specific sizing recommendations for various user scales. These are a useful
starting point, but very few deployments will remain stable at a predetermined
user level over the long term, so we recommend monitoring and adjusting
resources as your deployment grows.

We don't recommend that you autoscale the Coder Servers. Instead, scale the
deployment for peak weekly usage.

Although Coder Server persists no internal state, it operates as a proxy for end
users to their workspaces in two capacities:

1. As an HTTP proxy when they access workspace applications in their browser via
   the Coder Dashboard

1. As a DERP proxy when establishing tunneled connections via CLI tools
   (`coder ssh`, `coder port-forward`, etc.) and desktop IDEs.

Stopping a Coder Server instance will (momentarily) disconnect any users
currently connected through that instance. Adding a new instance is not
disruptive, but remove instances and perform upgrades during a maintenance
window to minimize disruption.

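On Kubernetes, a PodDisruptionBudget is one way to keep routine node maintenance
from removing too many instances at once outside your maintenance window. This
is a sketch and assumes the same `app.kubernetes.io/name: coder` pod label as
above.

```yaml
# Sketch of a PodDisruptionBudget for Coder Server. With this in place,
# voluntary disruptions such as node drains remove at most one instance at a
# time.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: coder-server
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: coder
```
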
## Provisioner daemons

### Locality

We recommend you disable provisioner daemons within your Coder Server:

```yaml
CODER_PROVISIONER_DAEMONS=0
```

Run one or more
[provisioner daemon deployments external to Coder Server](../../admin/provisioners.md).
This allows you to scale them independently of the Coder Server.

We recommend deploying provisioner daemons in the same cluster that hosts the
workspaces they will provision.

- This gives them a low-latency connection to the APIs they will use to
  provision workspaces, which can speed up builds.

- It allows provisioner daemons to use in-cluster mechanisms (for example
  Kubernetes service account tokens, AWS IAM roles, etc.) to authenticate with
  the infrastructure APIs.

- If you deploy workspaces in multiple clusters, run multiple provisioner daemon
  deployments and use template tags to select the correct set of provisioner
  daemons.

- Provisioner daemons need to be able to connect to Coder Server, but this need
  not be a low-latency connection.

Provisioner daemons make no direct connections to the PostgreSQL database, so
there's no need for locality to the Postgres database.

### Scaling

Each provisioner daemon instance can handle a single workspace build job at a
time. Therefore, the number of provisioner daemon instances within a tagged
deployment equals the maximum number of simultaneous builds your Coder
deployment can handle.

If users experience unacceptably long queues for workspace builds to start,
consider increasing the number of provisioner daemon instances in the affected
cluster.

You may wish to automatically scale the number of provisioner daemon instances
throughout the day to meet demand. If you stop instances with `SIGHUP`, they
will complete their current build job and exit. `SIGINT` will cancel the current
job, which will result in a failed build. Ensure your autoscaler waits long
enough for your build jobs to complete before forcibly killing the provisioner
daemon process.

If deploying in Kubernetes, we recommend a single provisioner daemon per pod. On
a virtual machine (VM), you can deploy multiple provisioner daemons, ensuring
each has a unique `CODER_CACHE_DIRECTORY` value.

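If your autoscaler runs provisioner daemons as Kubernetes pods, a generous
termination grace period is one way to give in-flight builds time to finish
before the kubelet force-kills the container. This is a sketch: the value is
illustrative, and you should verify how your provisioner version handles the
`SIGTERM` that Kubernetes sends relative to the `SIGHUP`/`SIGINT` behavior
described above.

```yaml
# Pod spec fragment (a sketch) for provisioner daemon pods. The grace period
# bounds how long Kubernetes waits after signaling the pod before sending
# SIGKILL; size it to comfortably exceed your longest expected build.
spec:
  terminationGracePeriodSeconds: 3600
```
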
Coder's
[validated architectures](../../admin/infrastructure/validated-architectures.md)
give specific sizing recommendations for various user scales. Since the
complexity of builds varies significantly depending on the workspace template,
consider this a starting point. Monitor queue times and build times to adjust
the number and size of your provisioner daemon instances.

## PostgreSQL

PostgreSQL is the primary persistence layer for all of Coder's deployment data.
We also use `LISTEN` and `NOTIFY` to coordinate between different instances of
Coder Server.

### Locality

Coder Server instances must have low-latency connections (under 10ms) to
PostgreSQL. If you use multiple PostgreSQL replicas in a clustered
configuration, these must also be low-latency with respect to one another.

### Scaling

Prefer scaling PostgreSQL vertically rather than horizontally for best
performance. Coder's
[validated architectures](../../admin/infrastructure/validated-architectures.md)
give specific sizing recommendations for various user scales.

## Workspace proxies

Workspace proxies proxy HTTP traffic from end users to workspaces for Coder apps
defined in the templates, and HTTP ports opened by the workspace. By default
they also include a DERP proxy.

### Locality

We recommend that each geographic cluster of workspaces has an associated
deployment of workspace proxies. This ensures that users always have a
near-optimal proxy path.

### Scaling

Workspace proxy load is determined by the amount of traffic the proxies handle.
We recommend you monitor CPU, memory, and network I/O utilization to decide when
to adjust the number of proxy instances.

We do not recommend autoscaling the workspace proxies because many applications
use long-lived connections such as WebSockets, which would be disrupted by
stopping the proxy. We recommend you scale for peak demand and scale down or
upgrade during a maintenance window.

## Workspaces

Workspaces represent the vast majority of resources in most Coder deployments.
Because they are defined by templates, there is no one-size-fits-all advice for
scaling.

### Hard and soft cluster limits

All Infrastructure as a Service (IaaS) clusters have limits on what can be
simultaneously provisioned. These could be hard limits, based on the physical
size of the cluster, especially in the case of a private cloud, or soft limits,
based on configured limits in your public cloud account.

It is important to be aware of these limits and monitor Coder workspace resource
utilization against them, so that a new influx of users doesn't encounter failed
builds. Monitoring these limits is outside the scope of Coder, but we recommend
that you set up dashboards and alerts for each kind of limited resource.

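For example, if your workspaces run on Kubernetes and you collect
kube-state-metrics, an alert along these lines can warn you as requested CPU
approaches the cluster's allocatable capacity. The metric names assume
kube-state-metrics v2 and the threshold is illustrative; build similar alerts
for memory, GPUs, IP addresses, volumes, or whatever resources your cluster
limits.

```yaml
# Sketch of a cluster-capacity alert (assumes kube-state-metrics v2 metrics).
groups:
  - name: cluster-capacity
    rules:
      - alert: ClusterCPURequestsNearCapacity
        expr: |
          sum(kube_pod_container_resource_requests{resource="cpu"})
            / sum(kube_node_status_allocatable{resource="cpu"}) > 0.8
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Requested CPU is above 80% of cluster allocatable capacity"
```
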
As you approach soft limits, you might be able to justify an increase to keep
growing.

As you approach hard limits, you will need to consider deploying to additional
clusters.

### Workspaces per node

Many development workloads are "spiky" in their CPU and memory requirements, for
example, peaking during build/test and then ebbing while editing code. This
creates an opportunity to use compute resources efficiently by packing multiple
workspaces onto a single node, which can give users a better experience (more
CPU and memory available during brief bursts) at lower cost.

However, weigh this against several trade-offs:

- There is a residual risk of "noisy neighbor" problems negatively affecting end
  users. The risk increases with the amount of CPU and memory oversubscription.

- If the shared nodes are a provisioned resource, for example, Kubernetes nodes
  running on VMs in a public cloud, then it can sometimes be a challenge to
  effectively autoscale down.

  - For example, if half the workspaces are stopped overnight, and there are ten
    workspaces per node, it's unlikely that all ten workspaces on the node are
    among the stopped ones.

  - You can mitigate this by lowering the number of workspaces per node, or by
    using autostop policies to stop more workspaces during off-peak hours.

- If you do overprovision workspaces onto nodes, keep them in a separate node
  pool and schedule Coder control plane components (Coder Server, PostgreSQL,
  workspace proxies) on a different node pool so that resource spikes from
  workspaces cannot affect them (see the sketch after this list).

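A common way to separate the pools on Kubernetes is to label each node pool and
pin pods with a `nodeSelector`, as in the sketch below. The label key and pool
names are illustrative; taints on the workspace pool (with matching tolerations
on workspace pods) add a stronger guarantee.

```yaml
# Sketch: pin pods to node pools via nodeSelector. Label key and pool names are
# illustrative; label your node pools first.
---
# Workspace pod template (in the workspace's Kubernetes spec):
nodeSelector:
  example.com/node-pool: workspaces
---
# Coder Server, PostgreSQL, and workspace proxy pod templates:
nodeSelector:
  example.com/node-pool: coder-control-plane
```
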
Coder customers have had success with both:

- One workspace per AWS VM
- Many workspaces per Kubernetes node for efficiency

### Cost control

- Use quotas to discourage users from creating many workspaces they don't need
  simultaneously.

- Label workspace cloud resources by user, team, organization, or your own
  labeling conventions to track usage at different granularities.

- Use autostop requirements to bring off-peak utilization down.

## Networking

Set up your network so that most users can get direct, peer-to-peer connections
to their workspaces. This drastically reduces the load on Coder Server and
workspace proxy instances.