
docs: scaling Coder #5206


Closed · wants to merge 11 commits

Conversation

bpmct
Member

@bpmct bpmct commented Nov 30, 2022

No description provided.

@bpmct bpmct requested a review from deansheather November 30, 2022 21:38
bpmct and others added 7 commits November 30, 2022 13:46
Co-authored-by: Dean Sheather <dean@deansheather.com>
Co-authored-by: Dean Sheather <dean@deansheather.com>
Co-authored-by: Dean Sheather <dean@deansheather.com>
| Environment | Users | Last tested | Status |
| ------------------------------------------------- | ------------- | ------------ | -------- |
| [Google Kubernetes Engine (GKE)](./gke.md) | 50, 100, 1000 | Nov 29, 2022 | Complete |
| [AWS Elastic Kubernetes Service (EKS)](./eks.md) | 50, 100, 1000 | Nov 29, 2022 | Complete |
Contributor

🐛 [✖] ./eks.md → Status: 400

Member Author

Yep - I wanted to get a review on the GKE format before duplicating it. Ideally, we'd have a way of generating these, though.

Member

@mtojek mtojek left a comment

Thanks for drafting the document, @bpmct! My recommendation is to work on providing more numbers confirming our success (Completed), but it's a good starting point!

@@ -0,0 +1,36 @@
We regularly scale-test Coder against various reference architectures. Additionally, we provide a [scale testing utility](#scaletest-utility) which can be used in your own environment to give insight on how Coder scales with your deployment's specific templates, images, etc.
Member

We regularly scale-test Coder

As a customer, I'd like to reproduce Coder's results, but the doc doesn't mention which release version was used. It might be good to record the version and the date the load tests were performed.


For deployments with 100+ users, we recommend running the Coder server in a separate node pool via taints, tolerations, and nodeselectors.
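As a rough sketch of what that separation could look like, assuming a GKE node pool labeled for the Coder server and the Coder Helm chart (the node pool name, label, and values keys below are illustrative assumptions, not the tested configuration):

```sh
# Taint the dedicated node pool so only Coder server pods are scheduled onto it
# ("coder-server" is a placeholder node pool name).
kubectl taint nodes -l cloud.google.com/gke-nodepool=coder-server \
  dedicated=coder-server:NoSchedule

# Pin the Coder server Deployment to that pool via Helm values
# (key names are illustrative; check the chart's values.yaml).
cat <<'EOF' > coder-values.yaml
coder:
  nodeSelector:
    cloud.google.com/gke-nodepool: coder-server
  tolerations:
    - key: dedicated
      operator: "Equal"
      value: coder-server
      effect: NoSchedule
EOF

helm upgrade --install coder coder-v2/coder --namespace coder -f coder-values.yaml
```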

### Cluster configuration
Member

If we're using Terraform or GDM to spawn these machines, can we please share and link the config here? It would make it easier for customers to bring up their own clusters.

Member Author

At this point, we're doing it manually (or with internal Terraform configs that are not ready for the public), but I agree we should eventually provide the Terraform config for many of these environments.

I'd hate for Terraform to be a prerequisite for us to test a specific environment (e.g. OpenShift, DigitalOcean), but I agree that it's highly beneficial for reproducibility and would let customers quickly spin up clusters.

- **Node pools**
- Coder server
- **Instance type**: `e2-highcpu-4`
- **Operating system**: `Ubuntu with containerd`
Member

The Ubuntu version is missing. We could also refer to the VM image.

Member Author

Totally agree about including versions and dates! GKE refers to this as `ubuntu_containerd`, which varies based on the Kubernetes version, so I'm using this as a mental note to include the cluster version too ;)

[Screenshot: Screen Shot 2022-12-01 at 3 26 28 PM]
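For reproducibility, a hedged sketch of how such a node pool might be created with the gcloud CLI; the cluster name, node counts, and version placeholder are assumptions rather than the actual test setup:

```sh
# "coder-scale-test" and "coder-server" are placeholder names.
# --image-type UBUNTU_CONTAINERD selects the "Ubuntu with containerd" image,
# and pinning the cluster version keeps the run reproducible.
gcloud container clusters create coder-scale-test \
  --cluster-version "<kubernetes-version>" \
  --num-nodes 1

gcloud container node-pools create coder-server \
  --cluster coder-scale-test \
  --machine-type e2-highcpu-4 \
  --image-type UBUNTU_CONTAINERD \
  --num-nodes 2
```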


| Environment | Users | Last tested | Status |
| ------------------------------------------------- | ------------- | ------------ | -------- |
| [Google Kubernetes Engine (GKE)](./gke.md) | 50, 100, 1000 | Nov 29, 2022 | Complete |
Member

Complete

It would be awesome if we could share more details here describing the autoscaling behavior and the duration of the tests.

BTW, this data could easily be converted into a set of blog posts about load testing.

Member

Complete

Also, some reference data about API latencies might be helpful for us, so that we know whether Coder's performance has improved or regressed over time.

Contributor

Yeah, a test status of "Complete" is a very weak statement. Like, we could totally fail the test and still say, well, the test is complete.

Member Author

This was inspired by GitLab's Performance and Stability page. I wasn't sure of the best way, in a table view, to show that we've validated a Coder deployment with n users, but I agree that "Complete" isn't the best term.

Perhaps "Validated"?

Member Author

The column could also be omitted and we could put a ✅ or ⌛ next to each user count. I'm not exactly sure of the best format for this information at the moment.

Member

I presume that we could at least add a legend: green-yellow-red

- green: everything went smoothly 👍 SLAs (do we have any?) not impacted, platform performance not degraded
- yellow: users can operate, but in a few cases we observed an SLA being violated, for instance due to high latency. We should describe specifically what went wrong
- red: total disaster, platform misbehaves, not usable, etc.

In general, it would be awesome if we could automatically raise a GitHub issue for every performance test run and discuss the results there. BTW, this is a good moment to "build" the SRE attitude in Coder :)
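As a sketch of that automation idea (the title, label, and results file below are placeholders, not an existing workflow), a post-run CI step could file the results with the GitHub CLI:

```sh
# Open an issue with the latest scale test results so they can be discussed.
gh issue create \
  --title "Scale test results: $(date +%Y-%m-%d)" \
  --body-file ./scaletest-results.md \
  --label "scale-testing"
```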

Member

The green/yellow/red system is a bit overkill for what we need right now. As we build out our tests and automation we can start using it, but we're nowhere near there yet. We also don't have any SLAs or criteria to base a yellow on yet.

Member

Sounds like an action item for me 👍

- CPU: `2 cores`
- RAM: `4 GB`

## 100 users
Contributor

Index.md claims that we tested at 50, 100, and 1000 users, but this doc only covers 50 and 100.

Member Author

Ah yeah, we should disregard the numbers and instance types at the moment. These are all placeholder-ish, meant to align on the format and info we want to display.

@bpmct
Member Author

bpmct commented Dec 1, 2022

My recommendation is to work on providing more numbers confirming our success (Completed)

@mtojek - do you mean test cases beyond the number of workspaces, or benchmarks (e.g. time to complete)? Some examples would be helpful, even if you don't consider them a blocker to merging this first PR.

- return results (e.g. `99 succeeded, 1 failed to connect`)

```sh
coder scaletest create-workspaces \
  ...
```
Contributor

Is this what we want the command to be like, or what it currently is?

I just tried to look this up, and I couldn't find scaletest. What I did find was loadtest, but that takes a config file, not flags.

Member Author

@bpmct bpmct Dec 1, 2022

This PR is blocked by #5202. I'll be sure to update the schema to whatever it changes to prior to merging.

@mtojek
Member

mtojek commented Dec 2, 2022

@mtojek - do you mean test cases beyond the number of workspaces, or benchmarks (e.g. time to complete)? Some examples would be helpful, even if you don't consider them a blocker to merging this first PR.

I'm always in favor of quick iterations and having deliverables as soon as possible, so not a blocker at all!

As an enterprise DevOps persona, I would like to read about following aspects of load testing before considering the platform:

  1. What kind of scenarios have been exercised?
     - How many workspaces were created/edited/deleted?
     - How many templates were imported/deleted?
     - How many different user accounts were used?
     - Was the traffic generated from one machine or many machines?
     - Were load tests performed via tailnet or within a local network?
  2. Could you post any detailed test results?
     - Graphs would be the best option - number of Coder resources, API latency, CPU load, memory usage
     - Duration of tests, retries?
     - SLAs for API and crucial provisioning operations, p50, p90, p95

IMHO we need to prove that we know what we're doing and that we control the platform. It would look hollow if we just posted the word "Completed" without any interpretation of the results. We don't need to interpret them every time we run tests, but we should document the scenarios at least once.

As I said, we don't need to work on those items at the moment, but it would be great to prepare a roadmap plan for long-term scale tests.

Comment on lines +30 to +36
The test does the following:

- create `n` workspaces
- establish SSH connection to each workspace
- run `sleep 3 && echo hello` on each workspace via the web terminal
- close connections, attempt to delete all workspaces
- return results (e.g. `99 succeeded, 1 failed to connect`)
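Purely as an illustration of that flow (this is not the actual scaletest implementation), a shell loop against the Coder CLI could look like the sketch below; the workspace prefix, template name, and the assumption that `coder ssh` accepts a one-off remote command are all placeholders:

```sh
# Illustrative only - real runs should also record timing and per-step failures.
n=100
template="kubernetes"   # placeholder template name

for i in $(seq 1 "$n"); do
  coder create "scaletest-$i" --template "$template" --yes
done

for i in $(seq 1 "$n"); do
  # assumes the CLI accepts a one-off remote command after the workspace name
  coder ssh "scaletest-$i" -- sh -c 'sleep 3 && echo hello'
done

for i in $(seq 1 "$n"); do
  coder delete "scaletest-$i" --yes
done

echo "attempted $n workspaces"
```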
Member Author

I should document what test is run inside each environment/architecture, similar to GitLab's.

Member Author

We can also do this via graphs (e.g. "workspaces created", etc.).

@bpmct
Member Author

bpmct commented Dec 6, 2022

We can also add SQL sizing recommendations.

@github-actions

This Pull Request is becoming stale. In order to minimize WIP, prevent merge conflicts, and keep the tracker readable, I'm going to close this PR in 3 days if there isn't more activity.

@github-actions github-actions bot added the stale This issue is like stale bread. label Dec 14, 2022
@github-actions github-actions bot closed this Dec 18, 2022
@github-actions github-actions bot deleted the bpmct/scale-docs branch June 2, 2023 00:11