chore: add prometheus monitoring of workspace traffic generation #7583

Merged
johnstcn merged 21 commits into main from cj/workspacetraffic-prometheus
May 26, 2023

Member

@johnstcn johnstcn commented May 17, 2023

Part of #7599

  • Exposes reads/writes from scaletest traffic generation (default: 0.0.0.0:21112)
  • Adds self-hosted prometheus with remote_write to loadtest terraform
  • Adds convenience script to run a traffic generation test
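
For readers unfamiliar with what such an endpoint serves, here is a minimal stdlib-only sketch. It is not the PR's implementation: the PR serves a Prometheus registry through `promhttp.HandlerFor`, whereas this sketch hand-writes the Prometheus text exposition format. The type `trafficCounters`, the metric names, and the `/metrics` path are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// trafficCounters holds hypothetical byte counters for generated traffic.
type trafficCounters struct {
	bytesRead    atomic.Uint64
	bytesWritten atomic.Uint64
}

// render returns the counters in Prometheus text exposition format.
// The metric names are illustrative, not the ones used in the PR.
func (c *trafficCounters) render() string {
	return fmt.Sprintf(
		"# TYPE example_bytes_read_total counter\nexample_bytes_read_total %d\n"+
			"# TYPE example_bytes_written_total counter\nexample_bytes_written_total %d\n",
		c.bytesRead.Load(), c.bytesWritten.Load())
}

// ServeHTTP lets a scraper fetch the current counter values.
func (c *trafficCounters) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	_, _ = w.Write([]byte(c.render()))
}

// serveMetrics blocks while serving the counters on addr.
// The "/metrics" path is an assumption; the PR only states the address.
func serveMetrics(addr string, c *trafficCounters) error {
	mux := http.NewServeMux()
	mux.Handle("/metrics", c)
	return http.ListenAndServe(addr, mux)
}
```

Calling `serveMetrics("0.0.0.0:21112", counters)` from a goroutine would expose the values on the default address mentioned above.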

Some things that immediately stand out:

  • I don't have a satisfactory test for the prometheus endpoint; I need to figure out how to slot a prometheus.Registry into the CLI test.
  • Running the traffic generation test requires you to know the loadtest name; it would be nicer if it could figure this out automatically.
  • The traffic generation will keep running indefinitely until you delete the pod; this could also be considered a 'feature'.

@johnstcn johnstcn self-assigned this May 17, 2023
@johnstcn johnstcn marked this pull request as ready for review May 17, 2023 15:08
@johnstcn johnstcn requested a review from spikecurtis May 17, 2023 15:09
@johnstcn johnstcn requested a review from mtojek May 17, 2023 22:14
Member

@mtojek mtojek left a comment

Reviewed and left a few questions. I admit that I didn't verify the infra code, but I suppose that you have already done it.

Regarding metrics, maybe you will benefit from the PR I'm working on now: https://github.com/coder/coder/blob/84b59ea2dbd62b3297204ba7a09bd84f4b29558f/agent/agentssh/metrics.go

name = "workspace-traffic"
id = strconv.Itoa(idx)
agentID uuid.UUID
agentName string
Member

I didn't go through the whole PR, but why is it required to specify the agent name? Isn't it always "main"?

Member Author

My understanding is that the agent name maps directly to what's specified in the workspace Terraform. Is this not the case?

> You don't need to run `coder login` yourself.

- To create workspaces, run `./coder_shim.sh scaletest create-workspaces --template="kubernetes" --count=N`
Member

I believe that at some point you may also want to include rich parameters, as they cause more DB calls in total (worst-case scenario).

Member Author

Yep, that would be a good test for load related to concurrent workspace builds.
For the moment though, I'm going to keep it simple with zero parameters.

@johnstcn johnstcn requested a review from mtojek May 25, 2023 15:06
Member

@mtojek mtojek left a comment

I don't see anything blocking, so LGTM 👍

> The traffic generation will keep running indefinitely until you delete the pod; this could also be considered a 'feature'.

Do we expect any weird errors when the operator stops the traffic generation?


logger := slog.Make(sloghuman.Sink(io.Discard))
prometheusSrvClose := ServeHandler(ctx, logger, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), scaletestPrometheusAddress, "prometheus")
defer prometheusSrvClose()
Member

I'm wondering if we need to add a grace period before closing, to make sure that all relevant metrics are scraped before the tool goes down.

Member Author

Yeah, this would be heavily dependent on the prometheus scrape interval. Simplest is probably to expose it as a parameter to be set by the test operator.

n, err := w.ReadWriter.Write(p)
if err == nil {
w.metrics.WriteLatencySeconds.WithLabelValues(w.labels...).Observe(time.Since(start).Seconds())
Member

I'm curious if we need another metric (bytes?) to calculate the bandwidth. As far as I can see, there are only WriteLatencySeconds and the total BytesWritten. Do we need a metric to calculate "bytes per write over time"?

Member Author

Given that bytes per tick and ticks per second are inputs to this, I think we should probably be OK. But good food for thought!

Member

You calculate throughput from (count at time X) - (count at time Y), divided by the elapsed time. We should not be doing that math in coderd, IMO.

Member Author

Good catch, will address in a follow-up

Contributor

Prometheus itself will happily compute the time derivative of any metric, so a counter of bytes written is all we need.

@johnstcn johnstcn merged commit 795050b into main May 26, 2023
@johnstcn johnstcn deleted the cj/workspacetraffic-prometheus branch May 26, 2023 12:53
@github-actions github-actions bot locked and limited conversation to collaborators May 26, 2023