
docs: provide hardware recommendations for reference architectures #12534


Merged

merged 39 commits into from Mar 15, 2024

Changes from 16 commits
43 changes: 43 additions & 0 deletions docs/admin/architectures/1k-users.md
@@ -0,0 +1,43 @@
# Reference Architecture: up to 1,000 users

The 1,000 users architecture is designed to cover a wide range of workflows.
Examples of organizations that might adopt this architecture include
medium-sized tech startups, educational institutions, and small to mid-sized
enterprises.

**Target load**: API: up to 180 RPS

**High Availability**: non-essential for small deployments

## Hardware recommendations

### Coderd nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | ------------------- | -------- | --------------- | ---------- | ----------------- |
| Up to 1,000 | 2 vCPU, 8 GB memory | 2 | `n1-standard-2` | `t3.large` | `Standard_D2s_v3` |

**Footnotes**:

- For small deployments (approximately 100 users, 10 concurrent workspace
  builds), it is acceptable to deploy provisioners on `coderd` nodes.

### Provisioner nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
| Up to 1,000 | 8 vCPU, 32 GB memory | 2 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.

### Workspace nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | ----------------------- | ---------------- | ------------ | ----------------- |
| Up to 1,000 | 8 vCPU, 32 GB memory | 64 / 16 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- It is assumed that each workspace user needs 2 GB of memory to work
  effectively (see the cross-check sketch below)
- Maximum number of Kubernetes workspace pods per node: 256
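As a rough cross-check of the table above, the following sketch derives the
node count from the 2 GB-per-user assumption. It is an illustrative
calculation, not part of the official sizing; the helper name and the
parameters are ours.

```go
package main

import (
	"fmt"
	"math"
)

// workspaceNodes estimates how many workspace nodes are needed, assuming
// each workspace reserves perWorkspaceGB of memory on a node that offers
// nodeMemGB. The helper and its parameters are illustrative assumptions.
func workspaceNodes(users int, nodeMemGB, perWorkspaceGB float64) int {
	perNode := int(nodeMemGB / perWorkspaceGB) // 32 GB / 2 GB = 16 workspaces
	return int(math.Ceil(float64(users) / float64(perNode)))
}

func main() {
	// 1,000 users / 16 workspaces per node = 62.5, rounded up to 63.
	// The table above recommends 64 nodes, i.e. a small buffer on top.
	fmt.Println(workspaceNodes(1000, 32, 2)) // 63
}
```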
48 changes: 48 additions & 0 deletions docs/admin/architectures/2k-users.md
@@ -0,0 +1,48 @@
# Reference Architecture: up to 2,000 users

In the 2,000 users architecture, there is a moderate increase in traffic,
suggesting a growing user base or expanding operations. This setup is
well-suited for mid-sized companies experiencing growth or for universities
seeking to accommodate their expanding user populations.

Users can be evenly distributed between 2 regions or be attached to different
clusters.

**Target load**: API: up to 300 RPS

**High Availability**: The mode is _disabled_ for this architecture, but
administrators may consider enabling it for deployment reliability.

## Hardware recommendations

### Coderd nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- |
| Up to 2,000 | 4 vCPU, 16 GB memory | 2 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` |

### Provisioner nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
| Up to 2,000 | 8 vCPU, 32 GB memory | 4 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.
- It is not recommended to run provisioner daemons on `coderd` nodes.
- Consider separating provisioners into different namespaces to support
  zero-trust or multi-cloud deployments.

### Workspace nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
| Up to 2,000 | 8 vCPU, 32 GB memory | 128 / 16 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- It is assumed that each workspace user needs 2 GB of memory to work
  effectively
- Maximum number of Kubernetes workspace pods per node: 256
- Nodes can be distributed across 2 regions, not necessarily split evenly,
  depending on developer team sizes
45 changes: 45 additions & 0 deletions docs/admin/architectures/3k-users.md
@@ -0,0 +1,45 @@
# Reference Architecture: up to 3,000 users

The 3,000 users architecture targets large-scale enterprises, possibly with
on-premises network and cloud deployments.

**Target load**: API: up to 550 RPS

**High Availability**: Typically, this scale requires a fully-managed HA
PostgreSQL service and all Coder observability features enabled for
operational purposes.

## Hardware recommendations

### Coderd nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | -------- | --------------- | ----------- | ----------------- |
| Up to 3,000 | 8 vCPU, 32 GB memory | 4 | `n1-standard-4` | `t3.xlarge` | `Standard_D4s_v3` |

### Provisioner nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
| Up to 3,000 | 8 vCPU, 32 GB memory | 8 / 30 provisioners each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- An external provisioner is deployed as a Kubernetes pod.
- It is strongly discouraged to run provisioner daemons on `coderd` nodes.
- Separate provisioners into different namespaces to support zero-trust or
  multi-cloud deployments.

### Workspace nodes

| Users | Node capacity | Replicas | GCP | AWS | Azure |
| ----------- | -------------------- | ------------------------ | ---------------- | ------------ | ----------------- |
| Up to 3,000 | 8 vCPU, 32 GB memory | 256 / 12 workspaces each | `t2d-standard-8` | `t3.2xlarge` | `Standard_D8s_v3` |

**Footnotes**:

- It is assumed that each workspace user needs 2 GB of memory to work
  effectively
- Maximum number of Kubernetes workspace pods per node: 256
- As workspace nodes can be distributed across regions, on-premises networks,
  and cloud areas, consider using different namespaces to support zero-trust
  or multi-cloud deployments.
@@ -1,4 +1,4 @@
# Reference architectures
# Reference Architectures

This document provides prescriptive solutions and reference architectures to
support successful deployments of up to 2000 users and outlines at a high-level
@@ -156,8 +156,8 @@ Coder:

- Median CPU usage for _coderd_: 3 vCPU, peaking at 3.7 vCPU during dashboard
tests.
- Median API request rate: 350 req/s during dashboard tests, 250 req/s during
Web Terminal and workspace apps tests.
- Median API request rate: 350 RPS during dashboard tests, 250 RPS during Web
Terminal and workspace apps tests.
- 2000 agent API connections with latency: p90 at 60 ms, p95 at 220 ms.
- on average 2400 Web Socket connections during dashboard tests.

@@ -171,3 +171,141 @@ Database:
metadata.
- Memory utilization averages at 40%.
- `write_ops_count` between 6.7 and 8.4 operations per second.

## Available reference architectures

- [Up to 1,000 users](1k-users.md)
- [Up to 2,000 users](2k-users.md)
- [Up to 3,000 users](3k-users.md)

## Hardware recommendations

### Control plane: coderd

To ensure stability and reliability of the Coder control plane, it's essential
to focus on node sizing, resource limits, and the number of replicas. We
recommend referencing public cloud providers such as AWS, GCP, and Azure for
guidance on optimal configurations. A reasonable approach involves using scaling
formulas based on factors like CPU, memory, and the number of users.

While the minimum requirements specify 1 CPU core and 2 GB of memory per
`coderd` replica, it is recommended to allocate additional resources to ensure
deployment stability.

#### CPU and memory usage

Memory consumption may increase when agent stats collection by the optional
Prometheus metrics aggregator is enabled.

Enabling direct connections between users and workspace agents (apps or SSH
traffic) can help prevent an increase in CPU usage. It is recommended to keep
this option enabled unless there are compelling reasons to disable it.

Inactive users do not consume Coder resources.

#### Scaling formula

When determining scaling requirements, consider the following factors:

- `1 vCPU x 2 GB memory x 250 users`: A reasonable formula to determine
  resource allocation based on the number of users and their expected usage
  patterns (see the sketch after this list).

  > **Reviewer (Member):** I think this better matches our reference arch?
  >
  > Suggested change: `1 vCPU x 2 GB memory x 250 users` →
  > `0.5 vCPU x 2 GB memory x 250 users`
  >
  > **Author:** Thinking about the future, I would leave 1 vCPU, WDYT?
  >
  > **Reviewer:** We can probably argue for wiggle room here based on how
  > certain Terraform providers may be more CPU-intensive than others.
- API latency/response time: Monitor API latency and response times to ensure
optimal performance under varying loads.
- Average number of HTTP requests: Track the average number of HTTP requests to
gauge system usage and identify potential bottlenecks.
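
For illustration, here is a minimal sketch of how the per-250-users rule of
thumb could translate into totals. The function name is an assumption of ours,
not a Coder API, and the split across replicas is a separate decision.

```go
package main

import (
	"fmt"
	"math"
)

// coderdResources applies the rule of thumb above: one 1 vCPU / 2 GB
// "unit" per 250 users, rounded up. How the totals are split across
// replicas, and any extra memory headroom as in the reference tables,
// is a separate decision.
func coderdResources(users int) (vCPU, memGB int) {
	units := int(math.Ceil(float64(users) / 250.0))
	return units, units * 2
}

func main() {
	cpu, mem := coderdResources(1000)
	fmt.Printf("total: %d vCPU, %d GB memory\n", cpu, mem) // 4 vCPU, 8 GB
}
```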

**HTTP API latency**

For a reliable Coder deployment dealing with medium to high loads, it's
important that API calls for workspace/template queries and workspace build
operations respond within 300 ms. However, API template insights calls, which
involve browsing workspace agent stats and user activity data, may require more
time.

Also, if the Coder deployment expects traffic from developers spread across the
globe, keep in mind that customer-facing latency might be higher because of the
distance between users and the load balancer.
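
One simple way to keep an eye on the 300 ms budget is a timed probe against
the API. The sketch below is a hedged example rather than an official tool:
the `/api/v2/workspaces` path, the `Coder-Session-Token` header, and the
environment variable names are assumptions to adapt to your deployment.

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// Time a single authenticated API call and compare it against the
	// 300 ms budget discussed above. Endpoint and env vars are assumptions.
	req, err := http.NewRequest("GET", os.Getenv("CODER_URL")+"/api/v2/workspaces", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Coder-Session-Token", os.Getenv("CODER_SESSION_TOKEN"))

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	elapsed := time.Since(start)
	fmt.Printf("status=%d latency=%s over_budget=%t\n",
		resp.StatusCode, elapsed, elapsed > 300*time.Millisecond)
}
```

A single probe is noisy; in practice you would run it periodically and track
percentiles, as in the p90/p95 figures reported earlier in this document.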

**Node Autoscaling**

We recommend disabling autoscaling for `coderd` nodes, as autoscaling can
cause interruptions for user connections. See
[Autoscaling](../scale.md#autoscaling) for more details.

### Control plane: provisionerd

Each provisioner can run a single concurrent workspace build. For example,
running 10 provisioner containers will allow 10 users to start workspaces at the
same time.

By default, the Coder server runs built-in provisioner daemons, but the
_Enterprise_ Coder release allows running external provisioners to offload
workspace provisioning from the `coderd` nodes.

#### Scaling formula

When determining scaling requirements, consider the following factors:

- `1 vCPU x 1 GB memory x 2 concurrent workspace builds`: A formula to
  determine resource allocation based on the number of concurrent workspace
  builds and the standard complexity of a Terraform template (see the sketch
  below). _Rule of thumb_: the more provisioners are free/available, the more
  concurrent workspace builds can be performed.
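
Below is a minimal sketch of that formula under the stated assumption of
standard-complexity templates; the helper is illustrative. The node tables
above pack 30 provisioner daemons onto an 8 vCPU node, which implicitly
assumes that only a fraction of them are building at any given moment.

```go
package main

import (
	"fmt"
	"math"
)

// provisionerResources applies the rule of thumb above: every 2 concurrent
// workspace builds need roughly 1 vCPU and 1 GB of memory for Terraform
// templates of standard complexity. Heavier templates need more per build.
func provisionerResources(concurrentBuilds int) (vCPU, memGB float64) {
	units := math.Ceil(float64(concurrentBuilds) / 2.0)
	return units, units
}

func main() {
	// An 8 vCPU / 32 GB node can sustain roughly 16 truly concurrent
	// builds, even if more idle provisioner daemons are packed onto it.
	cpu, mem := provisionerResources(16)
	fmt.Printf("%.0f vCPU, %.0f GB memory\n", cpu, mem) // 8 vCPU, 8 GB
}
```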

**Node Autoscaling**

Autoscaling provisioners is not an easy problem to solve unless you can
predict when the number of concurrent workspace builds will increase.

We recommend disabling autoscaling and adjusting the number of provisioners to
developer needs based on the workspace build queuing time.

### Data plane: Workspaces

To determine workspace resource limits and keep the best developer experience
for workspace users, administrators must be aware of a few assumptions.

- Workspace pods run on the same Kubernetes cluster, but possibly in a
  different namespace or node pool.
- Workspace limits (per workspace user):
  - Evaluate the workspace utilization pattern. For instance, regular web
    development does not require high CPU capacity all the time, but only
    during project builds or load tests.
  - Evaluate minimal limits for a single workspace. Include in the calculation
    the requirements for the Coder agent running in an idle workspace: 0.1
    vCPU and 256 MB of memory. For instance, developers can choose between
    0.5-8 vCPUs and 1-16 GB of memory.

#### Scaling formula

When determining scaling requirements, consider the following factors:

- `1 vCPU x 2 GB memory x 1 workspace`: A formula to determine resource
  allocation based on the minimal requirements for an idle workspace with a
  running Coder agent, plus occasional CPU and memory bursts for building
  projects (see the sketch below).
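
The sketch below applies this formula to the node size used in the tables
above, honoring the 256 pods-per-node limit; the helper and the
oversubscription remark are illustrative assumptions.

```go
package main

import "fmt"

// workspacesPerNode estimates how many workspaces fit on one node when each
// workspace reserves perWsVCPU and perWsMemGB, capped at the 256
// pods-per-node Kubernetes limit mentioned earlier.
func workspacesPerNode(nodeVCPU, nodeMemGB, perWsVCPU, perWsMemGB float64) int {
	byCPU := int(nodeVCPU / perWsVCPU)
	byMem := int(nodeMemGB / perWsMemGB)
	n := byCPU
	if byMem < n {
		n = byMem
	}
	if n > 256 {
		n = 256
	}
	return n
}

func main() {
	// An 8 vCPU / 32 GB node fits 8 workspaces by CPU and 16 by memory,
	// so CPU binds under this formula. The reference tables reach 12-16
	// workspaces per node by oversubscribing CPU, since bursts are rare.
	fmt.Println(workspacesPerNode(8, 32, 1, 2)) // 8
}
```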

**Node Autoscaling**

Workspace nodes can be set to operate in autoscaling mode to mitigate the risk
of prolonged high resource utilization.

One approach is to scale up workspace nodes when total CPU usage or memory
consumption reaches 80%. Another option is to scale based on metrics such as the
number of workspaces or active users. It's important to note that as new users
onboard, the autoscaling configuration should account for ongoing workspaces.
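
As a toy illustration of that policy, the decision function below encodes the
80% thresholds and the pending-workspace concern. In practice this logic lives
in your cluster autoscaler configuration; all inputs here are assumptions.

```go
package main

import "fmt"

// shouldScaleUp sketches the policy above: add a workspace node when total
// CPU or memory utilization crosses 80%, or when pending workspaces exceed
// the free capacity on existing nodes. Thresholds are illustrative.
func shouldScaleUp(cpuUtil, memUtil float64, pendingWorkspaces, freeSlots int) bool {
	return cpuUtil >= 0.80 || memUtil >= 0.80 || pendingWorkspaces > freeSlots
}

func main() {
	fmt.Println(shouldScaleUp(0.55, 0.83, 0, 10)) // true: memory above 80%
}
```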

Scaling down workspace nodes to zero is not recommended, as it will result in
longer wait times for workspace provisioning by users.

### Database

TODO:

- PostgreSQL database
- Measure and document the impact of `dbcrypt`
22 changes: 21 additions & 1 deletion docs/manifest.json
@@ -375,10 +375,30 @@
},
{
"title": "Scaling Coder",
"description": "Reference architecture and load testing tools",
"description": "Learn how to use load testing tools",
"path": "./admin/scale.md",
"icon_path": "./images/icons/scale.svg"
},
{
"title": "Reference Architectures",
"description": "Learn about reference architectures for Coder",
"path": "./admin/architectures/index.md",
"icon_path": "./images/icons/scale.svg",
"children": [
{
"title": "Up to 1,000 users",
"path": "./admin/architectures/1k-users.md"
},
{
"title": "Up to 2,000 users",
"path": "./admin/architectures/2k-users.md"
},
{
"title": "Up to 3,000 users",
"path": "./admin/architectures/3k-users.md"
}
]
},
{
"title": "External Provisioners",
"description": "Run provisioners isolated from the Coder server",