
docs: scaling Coder #5206


Closed · wants to merge 11 commits
add plumbing for gke doc
bpmct committed Nov 30, 2022
commit 1cf65aa28432eb1d8f8cb94f59c78d45f3189a5b
Empty file removed: docs/admin/scale/docker.md
50 changes: 50 additions & 0 deletions docs/admin/scale/gke.md
@@ -0,0 +1,50 @@
# Scaling Coder on Google Kubernetes Engine (GKE)

This is a reference architecture for Coder on [Google Kubernetes Engine](#). We regularly load-test these environments with the standard [kubernetes example](https://github.com/coder/coder/tree/main/examples/templates/kubernetes) template.

> Performance and ideal node sizing depend on many factors, including the workspace image and the [workspace sizes](https://github.com/coder/coder/issues/3519) you wish to give developers. Use Coder's [scale testing utility](./index.md#scale-testing-utility) to test your own deployment.

## 50 users

### Cluster configuration

- **Autoscaling profile**: `optimize-utilization`

- **Node pools**
- Default
- **Operating system**: `Ubuntu with containerd`
- **Instance type**: `e2-highcpu-8`
- **Min nodes**: `1`
- **Max nodes**: `4`
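
The cluster configuration above can be approximated with a single `gcloud` invocation. This is an untested sketch: the cluster name, project, and zone are placeholders, and flag availability may vary by `gcloud` version.

```sh
# Placeholder names: coder-scale-test, my-project, us-central1-a.
# Flags mirror the 50-user settings listed above.
gcloud container clusters create coder-scale-test \
  --project my-project \
  --zone us-central1-a \
  --autoscaling-profile optimize-utilization \
  --image-type UBUNTU_CONTAINERD \
  --machine-type e2-highcpu-8 \
  --num-nodes 1 \
  --enable-autoscaling --min-nodes 1 --max-nodes 4
```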

### Coder settings

- **Replica count**: `1`
- **Provisioner daemons**: `30`
- **Template**: [kubernetes example](https://github.com/coder/coder/tree/main/examples/templates/kubernetes)
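
In the Coder Helm chart, these settings map roughly to the following values. The value names (`coder.replicaCount`, `coder.env`) are assumptions about the chart's schema and should be checked against the chart version you deploy; `CODER_PROVISIONER_DAEMONS` mirrors the server's `--provisioner-daemons` flag.

```yaml
# values.yaml sketch -- value names are assumptions, verify against your chart.
coder:
  replicaCount: 1
  env:
    # Number of provisioner daemons run inside each coderd replica.
    - name: CODER_PROVISIONER_DAEMONS
      value: "30"
```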

## 100 users
**Contributor:** index.md claims that we tested at 50, 100, and 1000 users, but this doc only has 50 and 100.

**Member Author:** Ah, yeah. Disregard the numbers and instance types for now; these are all placeholders to align on the format and the info we want to display.

For deployments with 100+ users, we recommend running the Coder server in a separate node pool using taints, tolerations, and node selectors.
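
A minimal sketch of what that isolation could look like, assuming an illustrative `dedicated=coder` label/taint convention (not an official Coder recommendation):

```yaml
# Pod scheduling constraints for the Coder server deployment.
# Assumes the server node pool was created with:
#   label: dedicated=coder
#   taint: dedicated=coder:NoSchedule
nodeSelector:
  dedicated: coder
tolerations:
  - key: dedicated
    operator: Equal
    value: coder
    effect: NoSchedule
```

Workspace pods lack the toleration and are repelled from the server pool, keeping coderd isolated from workspace load.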

### Cluster configuration
**Member:** If we're using Terraform or GDM to spawn these machines, can we share and link it here? It might make it easier for customers to bring up their own clusters.

**Member Author:** At this point we're doing it manually (or with internal Terraform configs that aren't ready for the public), but I agree we should eventually provide the Terraform config for many of these environments. I'd hate for Terraform to become a prerequisite for testing a specific environment (e.g. OpenShift, DigitalOcean), but it's highly beneficial for reproducibility and would let customers spin up clusters quickly.

- **Node pools**
- Coder server
- **Instance type**: `e2-highcpu-4`
- **Operating system**: `Ubuntu with containerd`
**Member:** The Ubuntu version is missing. We could also refer to the VM image.

**Member Author:** Totally agree about including versions and dates! GKE refers to this as `ubuntu_containerd`, which varies based on the Kubernetes version, so I'm using this as a mental note to include the cluster version too ;)

- **Autoscaling profile**: `optimize-utilization`
- **Min nodes**: `2`
- **Max nodes**: `4`
- Workspaces
- **Instance type**: `e2-highcpu-16`
- **Operating system**: `Ubuntu with containerd`
- **Autoscaling profile**: `optimize-utilization`
- **Min nodes**: `3`
- **Max nodes**: `10`
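
The workspaces pool above could be added to an existing cluster along these lines. Again a hedged, untested sketch: the pool and cluster names are placeholders, and note that the autoscaling profile is configured at the cluster level in GKE rather than per pool.

```sh
# Placeholder cluster name; flags mirror the workspaces pool listed above.
gcloud container node-pools create workspaces \
  --cluster coder-scale-test \
  --image-type UBUNTU_CONTAINERD \
  --machine-type e2-highcpu-16 \
  --enable-autoscaling --min-nodes 3 --max-nodes 10
```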

### Coder settings

- **Replica count**: `4`
- **Provisioner daemons**: `25`
- **Template**: [kubernetes example](https://github.com/coder/coder/tree/main/examples/templates/kubernetes)
12 changes: 6 additions & 6 deletions docs/admin/scale/index.md
@@ -2,12 +2,12 @@ We regularly scale-test Coder against various reference architectures. Additiona

## Reference Architectures

| Environment | Users | Workspaces | Last tested | Status |
| ----------------------------------------- | ----- | ---------- | ------------ | -------- |
| [Google Kubernetes Engine (GKE)](#) | 100 | 200 | Nov 29, 2022 | Complete |
| [AWS Elastic Kubernetes Service (EKS)](#) | 100 | 200 | Nov 29, 2022 | Complete |
| [Google Compute Engine + Docker](#) | 1000 | 200 | Nov 29, 2022 | Complete |
| [Google Compute Engine + VMs](#) | 1000 | 200 | Nov 29, 2022 | Complete |
| Environment | Users | Last tested | Status |
| ------------------------------------------------- | ------------- | ------------ | -------- |
| [Google Kubernetes Engine (GKE)](./gke.md) | 50, 100, 1000 | Nov 29, 2022 | Complete |
**Member:** It would be awesome if we could share more details here picturing the autoscaling behavior and the duration of the tests. BTW, this data could easily be turned into a set of blog posts about load testing.

**Member:** Also, some reference data about API latencies might be helpful for us, so we know whether Coder's performance improved or degraded over time.

**Contributor:** Yeah, "Complete" is a very weak statement. We could totally fail the test and still say the test is complete.

**Member Author:** This was inspired by GitLab's Performance and Stability page. I wasn't sure of the best way, in a table view, to show that we've validated a Coder deployment with n users, but agree that "Complete" isn't the best term. Perhaps "Validated"?

**Member Author:** The column could also be omitted and we could put a ✅ or ⌛ next to each user count. I'm not exactly sure of the best format for this information at the moment.

**Member:** I presume we could at least add a legend (green/yellow/red):

- green: everything went smoothly 👍 SLAs (do we have any?) not impacted, platform performance not degraded
- yellow: users can operate, but in a few cases we observed an SLA being violated, for instance due to high latency; we should describe specifically what went wrong
- red: total disaster; platform misbehaves, not usable, etc.

In general, it would be awesome if we could automatically raise a GitHub issue for every performance test run and discuss the results there. BTW, this is a good moment to build the SRE attitude in Coder :)

**Member:** The green/yellow/red system is a bit overkill for what we need right now. As we develop our tests and automation we can start using it, but we're nowhere near there yet. We also don't have any SLAs or criteria to base a yellow on.

**Member:** Sounds like an action item for me 👍
| [AWS Elastic Kubernetes Service (EKS)](./eks.md) | 50, 100, 1000 | Nov 29, 2022 | Complete |
**Contributor:** 🐛 [✖] ./eks.md → Status: 400

**Member Author:** Yep, I wanted to get a review on the GKE format before duplicating it. Ideally we'd have a way of generating these, though.
| [Google Compute Engine + Docker](./gce-docker.md) | 15, 50 | Nov 29, 2022 | Complete |
| [Google Compute Engine + VMs](./gce-vms.md) | 1000 | Nov 29, 2022 | Complete |

## Scale testing utility

9 changes: 8 additions & 1 deletion docs/manifest.json
@@ -257,7 +257,14 @@
"title": "Scaling Coder",
"description": "Reference architecture and load testing tools",
"icon_path": "./images/icons/scale.svg",
"path": "./admin/scale/index.md"
"path": "./admin/scale/index.md",
"children": [
{
"title": "GKE",
"description": "Learn how to scale Coder on GKE",
"path": "./admin/scale/gke.md"
}
]
},
{
"title": "Audit Logs",