docs: scaling Coder #5206
# Scaling Coder on Google Kubernetes Engine (GKE)

This is a reference architecture for Coder on [Google Kubernetes Engine](#). We regularly load test these environments with the standard [kubernetes example](https://github.com/coder/coder/tree/main/examples/templates/kubernetes) template.

> Performance and ideal node sizing depend on many factors, including the workspace image and the [workspace sizes](https://github.com/coder/coder/issues/3519) you wish to give developers. Use Coder's [scale testing utility](./index.md#scale-testing-utility) to test your own deployment.
## 50 users

### Cluster configuration

- **Autoscaling profile**: `optimize-utilization`
- **Node pools**
  - Default
    - **Operating system**: `Ubuntu with containerd`
    - **Instance type**: `e2-highcpu-8`
    - **Min nodes**: `1`
    - **Max nodes**: `4`

### Coder settings

- **Replica count**: `1`
- **Provisioner daemons**: `30`
- **Template**: [kubernetes example](https://github.com/coder/coder/tree/main/examples/templates/kubernetes)
- **Coder server limits**:
  - CPU: `2 cores`
  - RAM: `4 GB`
- **Coder server requests**:
  - CPU: `2 cores`
  - RAM: `4 GB`
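The limits and requests above map directly onto a standard Kubernetes container resource specification. A minimal sketch for the Coder server container (the exact placement of this block within your Helm values or manifest may differ):

```yaml
# Container resources matching the Coder server settings above:
# 2 CPU cores and 4 GB of RAM for both requests and limits.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```

Setting requests equal to limits gives the Coder server pod the `Guaranteed` QoS class, which helps it avoid eviction under node memory pressure.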
## 100 users

For deployments with 100+ users, we recommend running the Coder server in a separate node pool via taints, tolerations, and node selectors.

### Cluster configuration

- **Node pools**
  - Coder server
    - **Instance type**: `e2-highcpu-4`
    - **Operating system**: `Ubuntu with containerd`
    - **Autoscaling profile**: `optimize-utilization`
    - **Min nodes**: `2`
    - **Max nodes**: `4`
  - Workspaces
    - **Instance type**: `e2-highcpu-16`
    - **Operating system**: `Ubuntu with containerd`
    - **Autoscaling profile**: `optimize-utilization`
    - **Min nodes**: `3`
    - **Max nodes**: `10`
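Taints, tolerations, and node selectors are standard Kubernetes scheduling primitives. As an illustrative sketch (the `dedicated=coder-server` label and taint are hypothetical, assuming the Coder server node pool was created with them), the Coder server's pod spec could pin it to its own pool like this:

```yaml
# Hypothetical pod spec fragment for the Coder server deployment.
# Assumes the "Coder server" node pool is labeled and tainted with
# dedicated=coder-server, so workspaces cannot schedule onto it.
spec:
  nodeSelector:
    dedicated: coder-server
  tolerations:
    - key: dedicated
      operator: Equal
      value: coder-server
      effect: NoSchedule
```

The taint keeps workspace pods off the Coder server nodes; the node selector and matching toleration let the Coder server pods (and only them) schedule there.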
### Coder settings

- **Replica count**: `4`
- **Provisioner daemons**: `25`
- **Template**: [kubernetes example](https://github.com/coder/coder/tree/main/examples/templates/kubernetes)
- **Coder server limits**:
  - CPU: `4 cores`
  - RAM: `8 GB`
- **Coder server requests**:
  - CPU: `4 cores`
  - RAM: `8 GB`
---

We regularly scale-test Coder against various reference architectures. Additionally, we provide a [scale testing utility](#scale-testing-utility) which can be used in your own environment to give insight into how Coder scales with your deployment's specific templates, images, etc.

## Reference Architectures
| Environment                                        | Users         | Last tested  | Status   |
| -------------------------------------------------- | ------------- | ------------ | -------- |
| [Google Kubernetes Engine (GKE)](./gke.md)         | 50, 100, 1000 | Nov 29, 2022 | Complete |
| [AWS Elastic Kubernetes Service (EKS)](./eks.md)   | 50, 100, 1000 | Nov 29, 2022 | Complete |
| [Google Compute Engine + Docker](./gce-docker.md)  | 15, 50        | Nov 29, 2022 | Complete |
| [Google Compute Engine + VMs](./gce-vms.md)        | 1000          | Nov 29, 2022 | Complete |
## Scale testing utility

Since Coder's performance is highly dependent on the templates and workflows you support, we recommend using our scale testing utility against your own environments.

The following command will run the same scenario against your own Coder deployment. You can also specify a template name and any parameter values.
```sh
coder scaletest create-workspaces \
  --count 100 \
  --template "my-custom-template" \
  --parameter image="my-custom-image" \
  --run-command "sleep 2 && echo hello"

# Run `coder scaletest create-workspaces --help` for all usage
```
> To avoid outages and orphaned resources, we recommend running scale tests on a secondary "staging" environment.
The test does the following:

- create `n` workspaces
- establish an SSH connection to each workspace
- run `sleep 3 && echo hello` on each workspace via the web terminal
- close connections, attempt to delete all workspaces
- return results (e.g. `99 succeeded, 1 failed to connect`)
Workspace jobs run concurrently, meaning that the test will attempt to connect to each workspace as soon as it is provisioned instead of waiting for all 100 workspaces to be created.

## Troubleshooting

If a load test fails or if you are experiencing performance issues during day-to-day use, you can leverage Coder's [performance tracing](#) and [Prometheus metrics](../prometheus.md) to identify bottlenecks during scale tests. Additionally, you can use your existing cloud monitoring stack to measure load, view server logs, etc.
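As a minimal sketch of wiring those metrics into an existing monitoring stack, assuming Prometheus metrics are enabled on the Coder server and exposed on the default port `2112` (both assumptions; see the Prometheus doc linked above and verify against your deployment), a scrape configuration could look like:

```yaml
# Hypothetical Prometheus scrape config. The target hostname and
# port 2112 are assumptions based on Coder's default metrics address.
scrape_configs:
  - job_name: "coder"
    static_configs:
      - targets: ["coder.example.com:2112"]
```

With this in place, Coder's server metrics can be graphed alongside node- and cluster-level metrics during a scale test run.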