
Commit a248f8f

add scaling best practice doc
1 parent bcb15aa commit a248f8f

3 files changed: +332 −8 lines changed

docs/manifest.json

Lines changed: 10 additions & 5 deletions
@@ -761,16 +761,21 @@
       "description": "Guides to help you make the most of your Coder experience",
       "path": "./tutorials/best-practices/index.md",
       "children": [
-        {
-          "title": "Security - best practices",
-          "description": "Make your Coder deployment more secure",
-          "path": "./tutorials/best-practices/security-best-practices.md"
-        },
         {
           "title": "Organizations - best practices",
           "description": "How to make the best use of Coder Organizations",
           "path": "./tutorials/best-practices/organizations.md"
         },
+        {
+          "title": "Scale Coder",
+          "description": "How to prepare a Coder deployment for scale",
+          "path": "./tutorials/best-practices/scale-coder.md"
+        },
+        {
+          "title": "Security - best practices",
+          "description": "Make your Coder deployment more secure",
+          "path": "./tutorials/best-practices/security-best-practices.md"
+        },
         {
           "title": "Speed up your workspaces",
           "description": "Speed up your Coder templates and workspaces",

docs/tutorials/best-practices/scale-coder.md

Lines changed: 321 additions & 0 deletions

@@ -0,0 +1,321 @@
# Scale Coder

December 20, 2024

---

This best practice guide helps you prepare a low-scale Coder deployment so that
it can be scaled up to a high-scale deployment as use grows, and keep it
operating smoothly with a high number of active users and workspaces.

## Observability

Observability is one of the most important aspects of a scalable Coder
deployment.

It lets you identify potential bottlenecks before they negatively affect the
end-user experience, and empirically verify that modifications you make to
your deployment to increase capacity have their intended effects.

- Capture log output from Coder Server instances and external provisioner
  daemons and store it in a searchable log store.

  - For example: Loki, CloudWatch Logs, etc.

  - Retain logs for a minimum of thirty days, ideally ninety days. This allows
    you to look back to see when anomalous behaviors began.

- Metrics:

  - Capture infrastructure metrics like CPU, memory, open files, and network
    I/O for all Coder Server, external provisioner daemon, workspace proxy,
    and PostgreSQL instances.

  - Capture metrics from Coder Server and external provisioner daemons via
    Prometheus.

    - On Coder Server:

      - Enable Prometheus metrics:

        ```yaml
        CODER_PROMETHEUS_ENABLE=true
        ```

      - Enable database metrics:

        ```yaml
        CODER_PROMETHEUS_COLLECT_DB_METRICS=true
        ```

      - Configure agent stats to avoid high cardinality:

        ```yaml
        CODER_PROMETHEUS_AGGREGATE_AGENT_STATS_BY=agent_name
        ```

      - To disable agent stats entirely:

        ```yaml
        CODER_PROMETHEUS_COLLECT_AGENT_STATS=false
        ```

  - Retain metric time series for at least six months. This allows you to see
    performance trends relative to user growth.

  - Integrate metrics with an observability dashboard, for example, Grafana.

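As a reference point, a minimal Prometheus scrape configuration for Coder
Server might look like the sketch below. It assumes metrics are exposed on
port 2112 and reachable by Prometheus (the listen address is controlled by
`CODER_PROMETHEUS_ADDRESS`), and the target hostname is a placeholder; adapt
both to your deployment.

```yaml
# Sketch of a Prometheus scrape job for Coder Server metrics.
# Assumes CODER_PROMETHEUS_ADDRESS exposes metrics on a reachable interface
# (for example 0.0.0.0:2112); the target hostname is a placeholder.
scrape_configs:
  - job_name: coder
    scrape_interval: 30s
    static_configs:
      - targets: ["coder.example.internal:2112"]
```
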
### Key metrics

**CPU and memory utilization**

- Monitor utilization as a fraction of the available resources on each
  instance. Utilization will vary with use throughout the day and over the
  course of the week. Monitor the trends, paying special attention to the
  daily and weekly peak utilization. Use long-term trends to plan
  infrastructure upgrades.

**Tail latency of Coder Server API requests**

- Use the `coderd_api_request_latencies_seconds` metric.
- High tail latency can indicate that Coder Server or the PostgreSQL database
  is being starved for resources.

**Tail latency of database queries**

- Use the `coderd_db_query_latencies_seconds` metric.
- High tail latency can indicate the PostgreSQL database is low on resources.

Configure alerting based on these metrics to ensure you surface problems before
end users notice them.

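As a sketch of what such an alert could look like, the Prometheus rule below
fires when the p99 API request latency stays above one second for five
minutes. It assumes the latency metric is a histogram exposed with the usual
`_bucket` suffix; adjust the threshold, duration, and labels to your
environment.

```yaml
# Hypothetical alerting rule; the threshold and metric layout are assumptions.
groups:
  - name: coder
    rules:
      - alert: CoderAPIHighTailLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(coderd_api_request_latencies_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Coder Server p99 API request latency is above 1s"
```
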
## Coder Server

### Locality

If increased availability of the Coder API is a concern, deploy at least three
instances. Spread the instances across nodes (for example, via anti-affinity
rules in Kubernetes), and/or in different availability zones of the same
geographic region.

Do not deploy in different geographic regions. Coder Servers need to be able to
communicate with one another directly with low latency, under 10ms. Note that
this is for the availability of the Coder API; workspaces will not be fault
tolerant unless they are explicitly built that way at the template level.

Deploy Coder Server instances as geographically close to PostgreSQL as
possible. Low-latency communication (under 10ms) with Postgres is essential
for Coder Server's performance.

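In Kubernetes, this spread is typically expressed with pod anti-affinity and
topology spread constraints. The fragment below is a sketch that assumes your
Coder Server pods carry an `app.kubernetes.io/name: coder` label; match the
selectors to the labels your deployment actually uses.

```yaml
# Sketch: spread Coder Server pods across nodes and availability zones.
# The pod labels are assumptions; use the labels your deployment applies.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: coder
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: coder
```
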
### Scaling

Coder Server can be scaled both vertically for bigger instances and
horizontally for more instances.

Aim to keep the number of Coder Server instances relatively small, preferably
under ten instances, and opt for vertical scale over horizontal scale after
meeting availability requirements.

Coder's
[validated architectures](../../admin/infrastructure/validated-architectures.md)
give specific sizing recommendations for various user scales. These are a
useful starting point, but very few deployments will remain stable at a
predetermined user level over the long term, so we recommend monitoring and
adjusting resources as the deployment grows.

We don't recommend that you autoscale the Coder Servers. Instead, scale the
deployment for peak weekly usage.

Although Coder Server persists no internal state, it operates as a proxy for
end users to their workspaces in two capacities:

1. As an HTTP proxy when they access workspace applications in their browser
   via the Coder Dashboard.

1. As a DERP proxy when establishing tunneled connections via CLI tools
   (`coder ssh`, `coder port-forward`, etc.) and desktop IDEs.

Stopping a Coder Server instance will (momentarily) disconnect any users
currently connected through that instance. Adding a new instance is not
disruptive, but remove instances and perform upgrades during a maintenance
window to minimize disruption.

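As a sketch, when deploying with Helm this translates to a fixed replica count
and vertical sizing in your values file. The key names follow the Coder
chart's usual layout but may differ between chart versions, and the resource
figures are placeholders rather than recommendations; take actual sizing from
the validated architectures.

```yaml
# Sketch of Helm values for a fixed-size Coder Server deployment.
# Key names and sizes are assumptions; verify against your chart's values.yaml.
coder:
  replicaCount: 3 # fixed horizontal scale for availability; no autoscaling
  resources:
    requests:
      cpu: "4"
      memory: 8Gi
    limits:
      cpu: "4"
      memory: 8Gi
```
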
## Provisioner daemons

### Locality

We recommend that you disable provisioner daemons within your Coder Server:

```yaml
CODER_PROVISIONER_DAEMONS=0
```

Instead, run one or more
[provisioner daemon deployments external to Coder Server](../../admin/provisioners.md).
This allows you to scale them independently of the Coder Server.

We recommend deploying provisioner daemons in the same cluster where the
workspaces they provision will be hosted.

- This gives them a low-latency connection to the APIs they will use to
  provision workspaces, which can speed up builds.

- It allows provisioner daemons to use in-cluster mechanisms (for example,
  Kubernetes service account tokens, AWS IAM roles, etc.) to authenticate with
  the infrastructure APIs.

- If you deploy workspaces in multiple clusters, run multiple provisioner
  daemon deployments and use template tags to select the correct set of
  provisioner daemons.

- Provisioner daemons need to be able to connect to Coder Server, but this
  does not need to be a low-latency connection.

Provisioner daemons make no direct connections to the PostgreSQL database, so
there's no need for locality to the Postgres database.

### Scaling

Each provisioner daemon instance can handle a single workspace build job at a
time. Therefore, the number of provisioner daemon instances within a tagged
deployment equals the maximum number of simultaneous builds your Coder
deployment can handle.

If users experience unacceptably long queues for workspace builds to start,
consider increasing the number of provisioner daemon instances in the affected
cluster.

You may wish to automatically scale the number of provisioner daemon instances
throughout the day to meet demand. If you stop instances with `SIGHUP`, they
will complete their current build job and exit. `SIGINT` will cancel the
current job, which will result in a failed build. Ensure your autoscaler waits
long enough for your build jobs to complete before forcibly killing the
provisioner daemon process.

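If the provisioners run in Kubernetes, one way to give in-flight builds time to
finish is a generous termination grace period on the provisioner pods, as in
the sketch below. The one-hour value is an arbitrary placeholder; choose a
value longer than your slowest expected build, and confirm how the shutdown
signal Kubernetes sends maps to a graceful stop in your provisioner image.

```yaml
# Sketch: give provisioner pods time to finish the current build before the
# kubelet force-kills them. The grace period value is a placeholder.
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 3600
```
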
If deploying in Kubernetes, we recommend a single provisioner daemon per pod.
On a virtual machine (VM), you can deploy multiple provisioner daemons,
ensuring each has a unique `CODER_CACHE_DIRECTORY` value.

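For example, a second provisioner daemon on the same VM could point at its own
cache directory (the path below is only an illustration):

```yaml
CODER_CACHE_DIRECTORY=/var/cache/coder-provisioner-2
```
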
Coder's
[validated architectures](../../admin/infrastructure/validated-architectures.md)
give specific sizing recommendations for various user scales. Since the
complexity of builds varies significantly depending on the workspace template,
consider this a starting point. Monitor queue times and build times to adjust
the number and size of your provisioner daemon instances.

## PostgreSQL

PostgreSQL is the primary persistence layer for all of Coder's deployment data.
We also use `LISTEN` and `NOTIFY` to coordinate between different instances of
Coder Server.

### Locality

Coder Server instances must have low-latency connections (under 10ms) to
PostgreSQL. If you use multiple PostgreSQL replicas in a clustered config,
these must also be low-latency with respect to one another.

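Coder Server reads its database connection string from
`CODER_PG_CONNECTION_URL`. Pointing it at a PostgreSQL instance in the same
zone or datacenter is the simplest way to keep this link under 10ms; the
hostname below is hypothetical.

```yaml
CODER_PG_CONNECTION_URL=postgres://coder:<password>@postgres.same-zone.internal:5432/coder?sslmode=require
```
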
### Scaling

Prefer scaling PostgreSQL vertically rather than horizontally for best
performance. Coder's
[validated architectures](../../admin/infrastructure/validated-architectures.md)
give specific sizing recommendations for various user scales.

## Workspace proxies

Workspace proxies relay HTTP traffic from end users to workspaces, both for
Coder apps defined in templates and for HTTP ports opened by the workspace. By
default they also include a DERP proxy.

### Locality

We recommend that each geographic cluster of workspaces have an associated
deployment of workspace proxies. This ensures that users always have a
near-optimal proxy path.

### Scaling

Workspace proxy load is determined by the amount of traffic the proxies
handle. We recommend you monitor CPU, memory, and network I/O utilization to
decide when to adjust the number of proxy instances.

We do not recommend autoscaling the workspace proxies, because many
applications use long-lived connections such as websockets, which would be
disrupted by stopping a proxy. Instead, scale for peak demand and scale down
or upgrade during a maintenance window.

## Workspaces

Workspaces represent the vast majority of resources in most Coder deployments.
Because they are defined by templates, there is no one-size-fits-all advice for
scaling.

### Hard and soft cluster limits

All Infrastructure as a Service (IaaS) clusters have limits on what can be
simultaneously provisioned. These could be hard limits, based on the physical
size of the cluster, especially in the case of a private cloud, or soft limits,
based on configured quotas in your public cloud account.

It is important to be aware of these limits and to monitor Coder workspace
resource utilization against them, so that a new influx of users doesn't
encounter failed builds. Monitoring these limits is outside the scope of Coder,
but we recommend that you set up dashboards and alerts for each kind of limited
resource.

As you approach soft limits, you might be able to justify an increase to keep
growing.

As you approach hard limits, you will need to consider deploying to additional
cluster(s).

### Workspaces per node

Many development workloads are "spiky" in their CPU and memory requirements,
for example, peaking during build/test and then ebbing while editing code. This
creates an opportunity to use compute resources efficiently by packing multiple
workspaces onto a single node, which can mean a better experience (more CPU and
memory available during brief bursts) and lower cost.

However, this approach needs to be weighed against several trade-offs.

- There is a residual probability of "noisy neighbor" problems negatively
  affecting end users, and it increases with the amount of CPU and memory
  oversubscription.

- If the shared nodes are a provisioned resource, for example, Kubernetes nodes
  running on VMs in a public cloud, it can sometimes be a challenge to
  effectively autoscale down.

  - For example, if half the workspaces are stopped overnight, and there are
    ten workspaces per node, it's unlikely that all ten workspaces on any given
    node are among the stopped ones.

  - You can mitigate this by lowering the number of workspaces per node, or by
    using autostop policies to stop more workspaces during off-peak hours.

- If you do overprovision workspaces onto nodes, keep them in a separate node
  pool and schedule the Coder control plane components (Coder Server,
  PostgreSQL, workspace proxies) on a different node pool so that workspace
  resource spikes do not affect them. A sketch of this separation follows this
  list.

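In Kubernetes, this separation is usually expressed with node labels, and
optionally taints and tolerations, on the workspace pods. The pool label and
taint below are hypothetical; use the labels your cluster actually applies to
its node pools.

```yaml
# Sketch: pin workspace pods to a dedicated node pool so that control plane
# components are isolated from workspace resource spikes.
# The label and taint names are hypothetical.
nodeSelector:
  pool: workspaces
tolerations:
  - key: "workspaces-only"
    operator: "Exists"
    effect: "NoSchedule"
```
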
Coder customers have had success with both approaches:

- One workspace per AWS VM
- Many workspaces per Kubernetes node for efficiency

### Cost control

- Use quotas to discourage users from creating many workspaces they don't need
  simultaneously.

- Label workspace cloud resources by user, team, organization, or your own
  labeling conventions to track usage at different granularities.

- Use autostop requirements to bring off-peak utilization down.

## Networking

Set up your network so that most users can get direct, peer-to-peer connections
to their workspaces. This drastically reduces the load on Coder Server and
workspace proxy instances.
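
Direct connections require that clients and workspaces can reach a STUN server
and that UDP is not blocked between them. As a sketch, the STUN addresses
advertised by Coder Server's built-in DERP server can be set as below; the
value shown is an example public STUN server, and corporate networks often
substitute their own.

```yaml
CODER_DERP_SERVER_STUN_ADDRESSES=stun.l.google.com:19302
```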

site/src/modules/workspaces/WorkspaceTiming/Chart/XAxis.tsx

Lines changed: 1 addition & 3 deletions
@@ -121,9 +121,7 @@ export const XGrid: FC<XGridProps> = ({ columns, ...htmlProps }) => {
 // A dashed line is used as a background image to create the grid.
 // Using it as a background simplifies replication along the Y axis.
-const dashedLine = (
-  color: string,
-) => `<svg width="2" height="446" viewBox="0 0 2 446" fill="none" xmlns="http://www.w3.org/2000/svg">
+const dashedLine = (color: string) => `<svg width="2" height="446" viewBox="0 0 2 446" fill="none" xmlns="http://www.w3.org/2000/svg">
<path fill-rule="evenodd" clip-rule="evenodd" d="M1.75 440.932L1.75 446L0.75 446L0.75 440.932L1.75 440.932ZM1.75 420.659L1.75 430.795L0.749999 430.795L0.749999 420.659L1.75 420.659ZM1.75 400.386L1.75 410.523L0.749998 410.523L0.749998 400.386L1.75 400.386ZM1.75 380.114L1.75 390.25L0.749998 390.25L0.749997 380.114L1.75 380.114ZM1.75 359.841L1.75 369.977L0.749997 369.977L0.749996 359.841L1.75 359.841ZM1.75 339.568L1.75 349.705L0.749996 349.705L0.749995 339.568L1.75 339.568ZM1.74999 319.295L1.74999 329.432L0.749995 329.432L0.749994 319.295L1.74999 319.295ZM1.74999 299.023L1.74999 309.159L0.749994 309.159L0.749994 299.023L1.74999 299.023ZM1.74999 278.75L1.74999 288.886L0.749993 288.886L0.749993 278.75L1.74999 278.75ZM1.74999 258.477L1.74999 268.614L0.749992 268.614L0.749992 258.477L1.74999 258.477ZM1.74999 238.204L1.74999 248.341L0.749991 248.341L0.749991 238.204L1.74999 238.204ZM1.74999 217.932L1.74999 228.068L0.74999 228.068L0.74999 217.932L1.74999 217.932ZM1.74999 197.659L1.74999 207.795L0.74999 207.795L0.749989 197.659L1.74999 197.659ZM1.74999 177.386L1.74999 187.523L0.749989 187.523L0.749988 177.386L1.74999 177.386ZM1.74999 157.114L1.74999 167.25L0.749988 167.25L0.749987 157.114L1.74999 157.114ZM1.74999 136.841L1.74999 146.977L0.749987 146.977L0.749986 136.841L1.74999 136.841ZM1.74999 116.568L1.74999 126.705L0.749986 126.705L0.749986 116.568L1.74999 116.568ZM1.74998 96.2955L1.74999 106.432L0.749985 106.432L0.749985 96.2955L1.74998 96.2955ZM1.74998 76.0228L1.74998 86.1591L0.749984 86.1591L0.749984 76.0228L1.74998 76.0228ZM1.74998 55.7501L1.74998 65.8864L0.749983 65.8864L0.749983 55.7501L1.74998 55.7501ZM1.74998 35.4774L1.74998 45.6137L0.749982 45.6137L0.749982 35.4774L1.74998 35.4774ZM1.74998 15.2047L1.74998 25.341L0.749982 25.341L0.749981 15.2047L1.74998 15.2047ZM1.74998 -4.37114e-08L1.74998 5.0683L0.749981 5.0683L0.749981 0L1.74998 -4.37114e-08Z" fill="${color}"/>
</svg>`;

0 commit comments
