# docs: provide hardware recommendations for reference architectures #12534
# Reference Architecture: up to 1,000 users

The 1,000 users architecture is designed to cover a wide range of workflows.
Examples of organizations that might use this architecture include medium-sized
tech startups, educational institutions, or small to mid-sized enterprises.

**Target load**: API: up to 180 RPS

**High Availability**: non-essential for small deployments

## Hardware recommendations

### Coderd nodes

| Users       | Node capacity       | Replicas | GCP             | AWS        | Azure             |
| ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- |
| Up to 1,000 | 2 vCPU, 8 GB memory | 2        | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` |

**Footnotes**:

- For small deployments (ca. 100 users, 10 concurrent workspace builds), it is
  acceptable to deploy provisioners on `coderd` nodes.

### Provisioner nodes

| Users       | Node capacity        | Replicas                      | GCP              | AWS          | Azure             |
| ----------- | -------------------- | ----------------------------- | ---------------- | ------------ | ----------------- |
| Up to 1,000 | 8 vCPU, 32 GB memory | 2 nodes, 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.

### Workspace nodes

| Users       | Node capacity        | Replicas                     | GCP              | AWS          | Azure             |
| ----------- | -------------------- | ---------------------------- | ---------------- | ------------ | ----------------- |
| Up to 1,000 | 8 vCPU, 32 GB memory | 64 nodes, 16 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- It is assumed that a workspace user needs 2 GB of memory to perform their
  workload.
- Maximum number of Kubernetes workspace pods per node: 256
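As a quick arithmetic check of the workspace table above, here is a sketch in
Python under the stated assumptions (2 GB per workspace user, 16 workspaces per
node); the constants and helper are illustrative, not a Coder tool:

```python
import math

# Assumptions taken from the table and footnotes above.
USERS = 1000
WORKSPACES_PER_NODE = 16      # well under the 256-pods-per-node ceiling
MEMORY_PER_WORKSPACE_GB = 2   # assumed per-user workspace memory
NODE_MEMORY_GB = 32           # e.g. t2d-standard-8 / t3.2xlarge

# Each node hosts 16 workspaces x 2 GB = 32 GB, which saturates node memory,
# so headroom in practice must come from limits tuning and burst behavior.
assert WORKSPACES_PER_NODE * MEMORY_PER_WORKSPACE_GB <= NODE_MEMORY_GB

nodes = math.ceil(USERS / WORKSPACES_PER_NODE)
print(nodes)  # 63, rounded up to the 64 replicas recommended above
```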
# Reference Architecture: up to 2,000 users

In the 2,000 users architecture, there is a moderate increase in traffic,
suggesting a growing user base or expanding operations. This setup is
well-suited for mid-sized companies experiencing growth or for universities
seeking to accommodate their expanding user populations.

Users can be distributed evenly between 2 regions or attached to different
clusters.

**Target load**: API: up to 300 RPS

**High Availability**: The mode is _disabled_, but administrators may consider
enabling it for deployment reliability.

## Hardware recommendations

### Coderd nodes

| Users       | Node capacity        | Replicas | GCP             | AWS         | Azure             |
| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- |
| Up to 2,000 | 4 vCPU, 16 GB memory | 2        | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` |

### Provisioner nodes

| Users       | Node capacity        | Replicas                      | GCP              | AWS          | Azure             |
| ----------- | -------------------- | ----------------------------- | ---------------- | ------------ | ----------------- |
| Up to 2,000 | 8 vCPU, 32 GB memory | 4 nodes, 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.
- It is not recommended to run provisioner daemons on `coderd` nodes.
- Consider separating provisioners into different namespaces to support
  zero-trust or multi-cloud deployments.

### Workspace nodes

| Users       | Node capacity        | Replicas                      | GCP              | AWS          | Azure             |
| ----------- | -------------------- | ----------------------------- | ---------------- | ------------ | ----------------- |
| Up to 2,000 | 8 vCPU, 32 GB memory | 128 nodes, 16 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- It is assumed that a workspace user needs 2 GB of memory to perform their
  workload.
- Maximum number of Kubernetes workspace pods per node: 256
- Nodes can be distributed in 2 regions, not necessarily evenly split,
  depending on developer team sizes.
# Reference Architecture: up to 3,000 users

The 3,000 users architecture targets large-scale enterprises, possibly with
on-premises networks and cloud deployments.

**Target load**: API: up to 550 RPS

**High Availability**: Typically, such scale requires a fully managed HA
PostgreSQL service and all Coder observability features enabled for
operational purposes.

## Hardware recommendations

### Coderd nodes

| Users       | Node capacity        | Replicas | GCP             | AWS         | Azure             |
| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- |
| Up to 3,000 | 8 vCPU, 32 GB memory | 4        | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` |

### Provisioner nodes

| Users       | Node capacity        | Replicas                      | GCP              | AWS          | Azure             |
| ----------- | -------------------- | ----------------------------- | ---------------- | ------------ | ----------------- |
| Up to 3,000 | 8 vCPU, 32 GB memory | 8 nodes, 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.
- It is strongly discouraged to run provisioner daemons on `coderd` nodes.
- Separate provisioners into different namespaces to support zero-trust or
  multi-cloud deployments.

### Workspace nodes

| Users       | Node capacity        | Replicas                      | GCP              | AWS          | Azure             |
| ----------- | -------------------- | ----------------------------- | ---------------- | ------------ | ----------------- |
| Up to 3,000 | 8 vCPU, 32 GB memory | 256 nodes, 12 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- It is assumed that a workspace user needs 2 GB of memory to perform their
  workload.
- Maximum number of Kubernetes workspace pods per node: 256
- As workspace nodes can be distributed between regions, on-premises networks,
  and cloud areas, consider using different namespaces to support zero-trust
  or multi-cloud deployments.
# Reference Architectures

This document provides prescriptive solutions and reference architectures to
support successful deployments of up to 2000 users and outlines at a
high-level …
Coder:

- Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU during dashboard
  tests.
- Median API request rate: 350 RPS during dashboard tests, 250 RPS during Web
  Terminal and workspace apps tests.
- 2000 agent API connections with latency: p90 at 60 ms, p95 at 220 ms.
- On average, 2400 WebSocket connections during dashboard tests.
Database:

- … metadata.
- Memory utilization averages at 40%.
- `write_ops_count` between 6.7 and 8.4 operations per second.
## Available reference architectures

- [Up to 1,000 users](1k-users.md)
- [Up to 2,000 users](2k-users.md)
- [Up to 3,000 users](3k-users.md)
## Hardware recommendations

### Control plane: coderd

To ensure stability and reliability of the Coder control plane, it's essential
to focus on node sizing, resource limits, and the number of replicas. We
recommend referencing public cloud providers such as AWS, GCP, and Azure for
guidance on optimal configurations. A reasonable approach involves using
scaling formulas based on factors like CPU, memory, and the number of users.

While the minimum requirements specify 1 CPU core and 2 GB of memory per
`coderd` replica, we recommend allocating additional resources to ensure
deployment stability.
#### CPU and memory usage

Memory consumption may increase when the optional Prometheus metrics
aggregator collects agent stats.

Enabling direct connections between users and workspace agents (apps or SSH
traffic) can help prevent an increase in CPU usage. It is recommended to keep
this option enabled unless there are compelling reasons to disable it.

Inactive users do not consume Coder resources.
#### Scaling formula

When determining scaling requirements, consider the following factors:

- `1 vCPU x 2 GB memory x 250 users`: A reasonable formula to determine
  resource allocation based on the number of users and their expected usage
  patterns.
- API latency/response time: Monitor API latency and response times to ensure
  optimal performance under varying loads.
- Average number of HTTP requests: Track the average number of HTTP requests
  to gauge system usage and identify potential bottlenecks.
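As a back-of-the-envelope illustration of the first formula, the Python sketch
below (a helper of our own; the function name and return shape are not part of
Coder) converts a user count into total `coderd` resources:

```python
import math

def coderd_resources(users: int, users_per_unit: int = 250) -> dict:
    """Apply the '1 vCPU x 2 GB memory x 250 users' rule of thumb.
    Illustrative only; real sizing should also account for usage patterns."""
    units = math.ceil(users / users_per_unit)
    return {"total_vcpus": units, "total_memory_gb": units * 2}

# 1,000 users -> 4 vCPU / 8 GB overall, split across replicas,
# e.g. 2 x (2 vCPU, 4 GB).
print(coderd_resources(1000))  # {'total_vcpus': 4, 'total_memory_gb': 8}
```

Note that the reference tables above allocate somewhat more memory per
replica, consistent with the advice to overprovision for stability.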
**HTTP API latency**

For a reliable Coder deployment dealing with medium to high loads, it's
important that API calls for workspace/template queries and workspace build
operations respond within 300 ms. However, API template insights calls, which
involve browsing workspace agent stats and user activity data, may require
more time.

Also, if the Coder deployment expects traffic from developers spread across
the globe, keep in mind that customer-facing latency might be higher because
of the distance between users and the load balancer.
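A rough client-side probe can complement proper monitoring when checking the
300 ms target. The sketch below is illustrative only: the deployment URL is a
placeholder, and it assumes an unauthenticated health endpoint such as
`/healthz` is reachable; authenticated API routes would additionally need a
session token header.

```python
import time
import urllib.request

CODER_URL = "https://coder.example.com"  # placeholder deployment URL

def p95_latency_ms(path: str, samples: int = 20) -> float:
    """Time repeated GET requests and return the p95 latency in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.monotonic()
        urllib.request.urlopen(f"{CODER_URL}{path}").read()
        timings.append((time.monotonic() - start) * 1000)
    timings.sort()
    return timings[int(0.95 * (len(timings) - 1))]

print(f"p95: {p95_latency_ms('/healthz'):.0f} ms")  # compare to the 300 ms target
```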
**Node Autoscaling**

We recommend disabling autoscaling for `coderd` nodes. Autoscaling can cause
interruptions for user connections; see [Autoscaling](../scale.md#autoscaling)
for more details.
### Control plane: provisionerd

Each provisioner can run a single concurrent workspace build. For example,
running 10 provisioner containers will allow 10 users to start workspaces at
the same time.

By default, the Coder server runs built-in provisioner daemons, but the
_Enterprise_ Coder release allows for running external provisioners to
separate the load caused by workspace provisioning from the `coderd` nodes.
#### Scaling formula

When determining scaling requirements, consider the following factors:

- `1 vCPU x 1 GB memory x 2 concurrent workspace builds`: A formula to
  determine resource allocation based on the number of concurrent workspace
  builds and the standard complexity of a Terraform template. _Rule of thumb_:
  the more provisioners are free/available, the more concurrent workspace
  builds can be performed.
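Under the same caveats as before (an illustrative Python helper, not a Coder
tool; actual usage varies heavily with Terraform template complexity), the
formula can be applied like this:

```python
import math

def provisioner_resources(concurrent_builds: int) -> dict:
    """Apply the '1 vCPU x 1 GB memory x 2 concurrent workspace builds'
    rule of thumb. One provisioner handles one build at a time, so the
    replica count equals the target build concurrency."""
    units = math.ceil(concurrent_builds / 2)
    return {
        "provisioner_replicas": concurrent_builds,
        "total_vcpus": units,
        "total_memory_gb": units,
    }

# 30 concurrent builds -> 30 provisioners at roughly 15 vCPU / 15 GB overall.
print(provisioner_resources(30))
```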
**Node Autoscaling**

Autoscaling provisioners is not an easy problem to solve unless you can
predict when the number of concurrent workspace builds will increase.

We recommend disabling autoscaling and adjusting the number of provisioners
to developer needs based on the workspace build queuing time.
### Data plane: Workspaces

To determine workspace resource limits and keep the best developer experience
for workspace users, administrators must be aware of a few assumptions.

- Workspace pods run on the same Kubernetes cluster, but possibly in a
  different namespace or node pool.
- Workspace limits (per workspace user):
  - Evaluate the workspace utilization pattern. For instance, regular web
    development does not require high CPU capacity all the time, but only
    during project builds or load tests.
  - Evaluate minimal limits for a single workspace. Include in the calculation
    the requirements for a Coder agent running in an idle workspace: 0.1 vCPU
    and 256 MB of memory. For instance, developers can choose between 0.5-8
    vCPUs and 1-16 GB of memory.
#### Scaling formula

When determining scaling requirements, consider the following factors:

- `1 vCPU x 2 GB memory x 1 workspace`: A formula to determine resource
  allocation based on the minimal requirements for an idle workspace with a
  running Coder agent and occasional CPU and memory bursts for building
  projects.
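To make the agent overhead explicit, the sketch below (again an illustrative
helper; in a real deployment these numbers become Kubernetes resource requests
and limits in your template) combines the per-user share with the idle agent
footprint:

```python
def workspace_requests(user_vcpus: float = 1.0, user_memory_gb: float = 2.0) -> dict:
    """Per-workspace resource requests: the user's share plus the idle
    Coder agent overhead (0.1 vCPU, 256 MB). Burst headroom for project
    builds is usually expressed via higher limits, not requests."""
    agent_vcpus, agent_memory_gb = 0.1, 0.25
    return {
        "cpu_request": user_vcpus + agent_vcpus,
        "memory_request_gb": user_memory_gb + agent_memory_gb,
    }

print(workspace_requests())  # {'cpu_request': 1.1, 'memory_request_gb': 2.25}
```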
**Node Autoscaling**

Workspace nodes can be set to operate in autoscaling mode to mitigate the risk
of prolonged high resource utilization.

One approach is to scale up workspace nodes when total CPU usage or memory
consumption reaches 80%. Another option is to scale based on metrics such as
the number of workspaces or active users. It's important to note that as new
users onboard, the autoscaling configuration should account for ongoing
workspaces.
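As a minimal sketch of the first approach (the threshold and inputs are
assumptions; in practice this decision is delegated to the Kubernetes cluster
autoscaler rather than hand-rolled):

```python
def should_scale_up(cpu_utilization: float, memory_utilization: float,
                    threshold: float = 0.80) -> bool:
    """Trigger a scale-up when aggregate CPU or memory utilization across
    the workspace node pool crosses the threshold."""
    return cpu_utilization >= threshold or memory_utilization >= threshold

print(should_scale_up(0.85, 0.60))  # True: CPU exceeds the 80% threshold
```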
Scaling down workspace nodes to zero is not recommended, as it will result in
longer wait times for workspace provisioning by users.
### Database

TODO:

- PostgreSQL database
- Measure and document the impact of `dbcrypt`