feat: add connection statistics for workspace agents #6469

kylecarbs · 2023-03-07T01:42:23Z

This adds a bar at the bottom of the dashboard (only visible to admins) with periodically updating statistics on workspaces.

After merge, I'll add warnings for the DERPForcedWebsocket property to indicate reduced throughput.

This made for some weird tracking... we want the point-in-time number of counts!

mafredri

Took a preliminary look at this, I'll finish my review tomorrow.

mafredri · 2023-03-07T17:24:51Z

agent/agent.go

 		select {
 		case a.connStatsChan <- stats:
+			// Only store the latest stat when it's successfully sent!
+			// Otherwise, it should be sent again on the next iteration.
+			a.latestStat.Store(stats)


Considering the previous comment about Tailscale resetting counts on every report, I'd think this current implementation will lose stats?

I imagine a safer way to update the stats would be something along the lines of:

	a.statMu.Lock()
	a.stats.RxBytes += ...
	select {
	case a.connStatsChan <- a.stats: // note: a copy assuming basic struct
		// sent
	default:
		// dropped this report
	}
	a.statMu.Unlock()

This way we're always incrementing the numbers, even from dropped reports.

Ahh, good point!

I wonder if we'd be better off blocking instead of dropping? It seems like that's fine from the Tailscale side, and then we don't really run any risks here.

I changed it to block instead. Let me know your thoughts! The loop retries anyways.

mafredri · 2023-03-07T17:29:40Z

coderd/coderd.go

@@ -405,15 +408,16 @@ func New(options *Options) *API {
 		r.Post("/csp/reports", api.logReportCSPViolations)

 		r.Get("/buildinfo", buildInfo)
+		r.Route("/deployment", func(r chi.Router) {
+			r.Use(apiKeyMiddleware)
+			r.Get("/config", api.deploymentConfig)


Breaking change? /config/deployment => /deployment/config.

Since only the dashboard relies on this endpoint, I wouldn't consider this breaking, but if you think it is, I'm fine with marking it with ! as a breaking change.

No strong preference, just being cautious. 😄

coderd/database/dbfake/databasefake.go

coderd/database/migrations/000102_workspace_agent_stats_types.up.sql

mafredri

I didn't look very closely at the frontend, but the backend looks mostly good. I still have some concerns about the agent stats reporting (see comments), but that's either a lack of understanding on my part or something we should fix. Approving nonetheless.

mafredri · 2023-03-08T16:16:55Z

cli/deployment/config.go

@@ -406,7 +406,7 @@ func newConfig() *codersdk.DeploymentConfig {
 			Usage:   "How frequently agent stats are recorded",
 			Flag:    "agent-stats-refresh-interval",
 			Hidden:  true,
-			Default: 10 * time.Minute,
+			Default: 30 * time.Second,


This is a pretty big change, I think it's OK but increases spamminess somewhat.

It definitely is, but I did some napkin math and I think it should be alright.

Even if a user has hundreds of workspaces, a few hundred more writes/minute shouldn't be a big deal. I suppose it might spam the logs, which I'll check and resolve before merge.

I'd hate for all of coderd to be spammed with stat logs in a large deployment ;p

mafredri · 2023-03-08T16:22:08Z

agent/agent.go

@@ -1270,10 +1267,16 @@ func (a *agent) startReportingConnectionStats(ctx context.Context) {
 		// Convert from microseconds to milliseconds.
 		stats.ConnectionMedianLatencyMS /= 1000

+		lastStat := a.latestStat.Load()
+		if lastStat != nil && reflect.DeepEqual(lastStat, stats) {


I guess this still confuses me a bit. If Tailscale stats aren't cumulative, isn't the only way this matches lastStat if there was no chatter (tx/rx) and the latency and sessions for SSH/Code/JetBrains stayed the same?

Since we're also doing network pings in the latency check, I think there is a non-zero chance of multiple reportStats calls running concurrently, essentially competing over Load()/Store() here?

Hmm, good points. I'll refactor this.

After looking at this again, it seems like this should be fine.

This will only match if there's no traffic, but that's arguably great because then we aren't spamming the database with nonsense. I don't want to do this in coderd, because we'd need to query for the last stat to ensure it's not the same.

reportStats is blocking, and so subsequent agent stat refreshes will wait before running again, so I don't think they'd compete.

Let me know if I'm overlooking something or didn't understand properly, I'm sick and my brain is stuffy right now ;p
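
As a rough sketch of the skip-if-unchanged check discussed in this thread: the `Stats` and `reporter` types and the `maybeReport` method below are hypothetical, and the sketch assumes the latest stat is kept as a pointer in an `atomic.Pointer` (Go 1.19+), which may differ from how the agent stores it.

```go
package main

import (
	"fmt"
	"reflect"
	"sync/atomic"
)

// Stats is a hypothetical, simplified stand-in for the agent's stats payload.
type Stats struct {
	RxBytes         int64
	TxBytes         int64
	SessionCountSSH int64
}

// reporter remembers the last stat it sent.
type reporter struct {
	latest atomic.Pointer[Stats]
	sent   int // number of reports actually sent
}

// maybeReport skips the report when stats is deeply equal to the previous
// one (for example, an idle agent) and reports whether a send happened.
func (r *reporter) maybeReport(stats *Stats) bool {
	last := r.latest.Load()
	if last != nil && reflect.DeepEqual(last, stats) {
		return false
	}
	r.sent++ // stand-in for the real, blocking send
	r.latest.Store(stats)
	return true
}

func main() {
	r := &reporter{}
	fmt.Println(r.maybeReport(&Stats{RxBytes: 10})) // first report is sent
	fmt.Println(r.maybeReport(&Stats{RxBytes: 10})) // identical, skipped
	fmt.Println(r.maybeReport(&Stats{RxBytes: 20})) // changed, sent
}
```

Note that the Load and Store here are two separate steps, so this is only safe if calls do not overlap, which is the point made above about reportStats blocking.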

mafredri · 2023-03-08T16:28:28Z

coderd/database/migrations/000107_workspace_agent_stats_connection_latency.down.sql

@@ -0,0 +1 @@
+ALTER TABLE	workspace_agent_stats ALTER COLUMN connection_median_latency_ms TYPE bigint;


Will this work with non-empty data? You could consider adding two fixtures in testdata/fixtures (000106_pre_workspace_agent_stats_connection_latency.up.sql and 000107_post_workspace_agent_stats_connection_latency.up.sql). In the former you add a row with bigint value and in the latter you add a row with float value. If tests pass then all is good. 👍🏻

mafredri · 2023-03-08T16:40:04Z

codersdk/deployment.go

+type DeploymentStats struct {
+	// AggregatedFrom is the time in which stats are aggregated from.
+	// This might be back in time a specific duration or interval.
+	AggregatedFrom time.Time `json:"aggregated_since" format:"date-time"`


Any reason to keep json/field out of sync?

Nope, just a mistake on my end. Good catch!

mafredri · 2023-03-08T16:43:30Z

codersdk/deployment.go

+	BuildingWorkspaces int64 `json:"building_workspaces"`
+	RunningWorkspaces  int64 `json:"running_workspaces"`
+	FailedWorkspaces   int64 `json:"failed_workspaces"`
+	StoppedWorkspaces  int64 `json:"stopped_workspaces"`


Could utilize nesting, e.g. workspaces.pending, session_count.vscode, etc. Matter of preference, so dealer's choice.

Agreed that's a bit easier to parse!

mafredri · 2023-03-08T16:44:43Z

codersdk/deployment.go

+	SessionCountReconnectingPTY int64 `json:"session_count_reconnecting_pty"`
+
+	WorkspaceRxBytes int64 `json:"workspace_rx_bytes"`
+	WorkspaceTxBytes int64 `json:"workspace_tx_bytes"`


These could be under workspaces.rx_bytes. Singular vs plural (workspace, workspaces) above is a bit confusing currently.
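
A nested layout along the lines suggested in these comments might look like the sketch below. The type names, the grouping, and the exact JSON keys are assumptions for illustration, not the structure that was merged.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WorkspaceStats groups per-state workspace counts and traffic totals.
// All type and key names in this sketch are hypothetical.
type WorkspaceStats struct {
	Pending  int64 `json:"pending"`
	Building int64 `json:"building"`
	Running  int64 `json:"running"`
	Failed   int64 `json:"failed"`
	Stopped  int64 `json:"stopped"`
	RxBytes  int64 `json:"rx_bytes"`
	TxBytes  int64 `json:"tx_bytes"`
}

// SessionCounts groups point-in-time session counts by kind.
type SessionCounts struct {
	VSCode          int64 `json:"vscode"`
	ReconnectingPTY int64 `json:"reconnecting_pty"`
}

// DeploymentStats nests the groups so that keys read as
// workspaces.running, session_count.vscode, and so on.
type DeploymentStats struct {
	Workspaces   WorkspaceStats `json:"workspaces"`
	SessionCount SessionCounts  `json:"session_count"`
}

// decode parses a nested JSON payload into DeploymentStats.
func decode(data []byte) (DeploymentStats, error) {
	var s DeploymentStats
	err := json.Unmarshal(data, &s)
	return s, err
}

func main() {
	s, err := decode([]byte(`{"workspaces":{"running":3,"rx_bytes":42},"session_count":{"vscode":2}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Workspaces.Running, s.Workspaces.RxBytes, s.SessionCount.VSCode)
}
```

Grouping everything under a single plural `workspaces` key would also sidestep the singular/plural inconsistency noted above.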

kylecarbs added 14 commits March 2, 2023 10:53

fix: don't make session counts cumulative

8f1f141

This made for some weird tracking... we want the point-in-time number of counts!

Add databasefake query for getting agent stats

ddf9841

Add deployment stats endpoint

28d6db5

The query... works?!?

29719a4

Fix aggregation query

09a2dad

Select from multiple tables instead

12a52b1

Fix continuous stats

a1804a9

Increase period of stat refreshes

93f013b

Add workspace counts to deployment stats

50260c3

fmt

d1bae99

Add a slight bit of responsiveness

9fe9d4c

Fix template version editor overflow

00ebe2e

Add refresh button

cd76533

Fix font family on button

506740b

kylecarbs requested review from ammario and bpmct March 7, 2023 01:42

kylecarbs self-assigned this Mar 7, 2023

kylecarbs added 4 commits March 7, 2023 03:20

Merge branch 'main' into exportstats

1924f58

Fix latest stat being reported

9f00ac5

Merge branch 'main' into exportstats

4b6992c

Revert agent conn stats

1af9f64

kylecarbs requested a review from mafredri March 7, 2023 16:09

kylecarbs added 3 commits March 7, 2023 16:11

Merge branch 'main' into exportstats

e3ca39f

Fix linting error

8ad39d6

Fix tests

0f06b23

kylecarbs force-pushed the exportstats branch from fbd3c9e to 3ad3508 Compare March 7, 2023 17:20

Fix gen

e87ba59

kylecarbs force-pushed the exportstats branch from 3ad3508 to e87ba59 Compare March 7, 2023 17:30

mafredri reviewed Mar 7, 2023

ammario removed their request for review March 7, 2023 17:41

kylecarbs added 3 commits March 7, 2023 17:59

Fix migrations

99d7d1a

Block on sending stat updates

37ad03f

Merge branch 'main' into exportstats

415d8b1

mafredri approved these changes Mar 8, 2023

kylecarbs added 2 commits March 8, 2023 17:31

Add test fixtures

0037a64

Merge branch 'main' into exportstats

3d70b2a

kylecarbs force-pushed the exportstats branch from 924235f to 568c16f Compare March 9, 2023 02:44

Fix response structure

d708210

kylecarbs force-pushed the exportstats branch from 568c16f to d708210 Compare March 9, 2023 02:51

make gen

c951d5a

kylecarbs merged commit 5304b4e into main Mar 9, 2023

kylecarbs deleted the exportstats branch March 9, 2023 03:05

github-actions bot locked and limited conversation to collaborators Mar 9, 2023
