docs/tutorials/best-practices/scale-coder.md (38 additions, 20 deletions)
@@ -13,23 +13,30 @@ operating smoothly with a high number of active users and workspaces.

Observability is one of the most important aspects of a scalable Coder
deployment.

[Monitor your Coder deployment](../../admin/monitoring/index.md) with log output
and metrics to identify potential bottlenecks before they negatively affect the
end-user experience and measure the effects of modifications you make to your
deployment.

**Log output**

- Capture log output with Loki, CloudWatch Logs, and other tools from your Coder
  Server instances and external provisioner daemons, and store it in a
  searchable log store.

- Retain logs for a minimum of thirty days, ideally ninety days. This allows
  you to look back to see when anomalous behaviors began.
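
If you aggregate logs in Loki, retention is enforced through the compactor. The
fragment below is a minimal sketch under that assumption; the exact keys depend
on your Loki version and storage backend. In CloudWatch Logs, the equivalent is
the retention setting on each log group.

```yaml
# Loki configuration sketch (assumed deployment; verify against your Loki version).
compactor:
  retention_enabled: true   # let the compactor apply retention
limits_config:
  retention_period: 2160h   # 90 days, per the guidance above
```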

**Metrics**

- Capture infrastructure metrics like CPU, memory, open files, and network I/O
  for all Coder Server, external provisioner daemon, workspace proxy, and
  PostgreSQL instances.

### Capture Coder server metrics with Prometheus

To capture metrics from Coder Server and external provisioner daemons with
[Prometheus](../../admin/integrations/prometheus.md):
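
The linked integration guide covers the exact steps. As a minimal sketch,
assuming the metrics endpoint is enabled on each instance
(`CODER_PROMETHEUS_ENABLE=true`, which serves `/metrics` on port 2112 by
default) and that the target hostnames below are placeholders for your own
instances, a scrape job looks roughly like this:

```yaml
# prometheus.yml scrape job; target names are placeholders.
scrape_configs:
  - job_name: coder
    scrape_interval: 30s
    static_configs:
      - targets:
          - coder-server-0:2112        # Coder Server replicas
          - coder-server-1:2112
          - coder-provisioner-0:2112   # external provisioner daemons, if enabled
```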

Retain metric time series for at least six months. This allows you to see
performance trends relative to user growth.
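
How you retain metrics depends on where they are stored. On a plain Prometheus
server, retention is a startup flag rather than a configuration-file setting;
the service layout below is an assumption, but the flag itself is standard:

```yaml
# docker-compose fragment (hypothetical layout).
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=180d   # roughly six months
```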

For a more comprehensive overview, integrate metrics with an observability
dashboard, for example, [Grafana](../../admin/monitoring/index.md).

### Observability key metrics

Configure alerting based on these metrics to ensure you surface problems before
they affect the end-user experience.

**CPU and memory utilization**

- Monitor the utilization as a fraction of the available resources on the
  instance.

  Utilization will vary with use throughout the course of a day, week, and
  longer timelines. Monitor trends and pay special attention to the daily and
  weekly peak utilization. Use long-term trends to plan infrastructure upgrades.
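
One way to express utilization as a fraction is a pair of Prometheus recording
rules. The sketch below assumes node_exporter metrics are being scraped; adapt
the expressions to whatever infrastructure metrics you collect.

```yaml
groups:
  - name: infrastructure-utilization
    rules:
      # Fraction of CPU time spent doing work (0.0 to 1.0), per instance.
      - record: instance:cpu_utilization:ratio
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
      # Fraction of memory in use, per instance.
      - record: instance:memory_utilization:ratio
        expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
```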

**Tail latency of Coder Server API requests**

- High tail latency can indicate Coder Server or the PostgreSQL database is low
  on resources.

  Use the `coderd_api_request_latencies_seconds` metric.
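
To alert on tail latency, compute a high percentile over that metric. The rule
below is a sketch: it assumes `coderd_api_request_latencies_seconds` is exposed
as a Prometheus histogram, and the one-second threshold is only a starting
point to tune for your deployment.

```yaml
groups:
  - name: coder-api-latency
    rules:
      - alert: CoderAPITailLatencyHigh
        # p95 latency across Coder Server API requests over the last 5 minutes.
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(coderd_api_request_latencies_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Coder API p95 request latency is above 1s
```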
@@ -86,15 +99,20 @@ Configure alerting based on these metrics to ensure you surface problems before

### Locality

To ensure increased availability of the Coder API, deploy at least three
instances. Spread the instances across nodes with anti-affinity rules in
Kubernetes or in different availability zones of the same geographic region.
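
In Kubernetes, that spread is typically expressed as pod anti-affinity on the
Coder Server deployment. The fields below are standard Kubernetes; the
`app.kubernetes.io/name: coder` label is an assumption about how your pods are
labeled, so match it to your own deployment.

```yaml
# Pod template fragment for the Coder Server deployment.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app.kubernetes.io/name: coder   # assumed label
        # kubernetes.io/hostname spreads across nodes;
        # topology.kubernetes.io/zone spreads across availability zones.
        topologyKey: kubernetes.io/hostname
```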

Do not deploy in different geographic regions.

Coder Servers need to be able to communicate with one another directly with low
latency, under 10ms. Note that this is for the availability of the Coder API.
Workspaces are not fault tolerant unless they are explicitly built that way at
the template level.

Deploy Coder Server instances as geographically close to PostgreSQL as possible.
Low-latency communication (under 10ms) with Postgres is essential for Coder
Server's performance.