chore: add prometheus monitoring of workspace traffic generation #7583
Conversation
Reviewed and left a few questions. I admit I didn't verify the infra code, but I assume you've already done that.
Regarding metrics, maybe you will benefit from the PR I'm working on now: https://github.com/coder/coder/blob/84b59ea2dbd62b3297204ba7a09bd84f4b29558f/agent/agentssh/metrics.go
```go
name = "workspace-traffic"
id   = strconv.Itoa(idx)

agentID   uuid.UUID
agentName string
```
I didn't go through the whole PR, but why is it required to specify the agent name? Isn't it always "main"?
My understanding is that the agent name maps directly to what's specified in the workspace Terraform, is this not the case?
> You don't need to run `coder login` yourself.

- To create workspaces, run `./coder_shim.sh scaletest create-workspaces --template="kubernetes" --count=N`
I believe that at some point you may also want to include rich parameters, as they cause more DB calls in total (worst-case scenario).
Yep, that would be a good test for load related to concurrent workspace builds.
For the moment though, I'm going to keep it simple with zero parameters.
I don't see anything blocking, so LGTM 👍
The traffic generation will keep running indefinitely until you delete the pod; this could also be considered a 'feature'.
Do we expect any weird errors when the operator stops the traffic generation?
```go
logger := slog.Make(sloghuman.Sink(io.Discard))
prometheusSrvClose := ServeHandler(ctx, logger, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), scaletestPrometheusAddress, "prometheus")
defer prometheusSrvClose()
```
I'm wondering if we need to add some graceful period before closing to make sure that all relevant metrics are scraped before the tool goes down.
Yeah, this would be heavily dependent on the Prometheus scrape interval. Simplest is probably to expose it as a parameter to be set by the test operator.
```go
n, err := w.ReadWriter.Write(p)
if err == nil {
	w.metrics.WriteLatencySeconds.WithLabelValues(w.labels...).Observe(time.Since(start).Seconds())
```
I'm curious if we need another metric (bytes?) to calculate the bandwidth. As far as I see, there are only `WriteLatencySeconds` and the total `BytesWritten`. I'm curious if we need a metric to calculate "bytes per write in time".
Given that bytes per tick and ticks per second are inputs to this, I think we should probably be OK. But good food for thought!
You calculate throughput as (count at time X) - (count at time Y). We should not be doing that math in coderd, IMO.
Good catch, will address in a follow-up
Prometheus itself will happily compute the time derivative of any metric, so a counter of bytes written is all we need.
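Concretely, that means the operator can let Prometheus do the subtraction with `rate()`. A sketch of the query, with a hypothetical metric name (substitute whatever the counter is actually registered as):

```promql
# bytes/second throughput over a 5-minute window (metric name is illustrative)
rate(coder_scaletest_bytes_written_total[5m])
```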
Part of #7599
Some things that immediately stand out: