chore: add prometheus monitoring of workspace traffic generation #7583
Conversation
Reviewed and left a few questions. I admit I didn't verify the infra code, but I assume you've already done that.
Regarding metrics, maybe you will benefit from the PR I'm working on now: https://github.com/coder/coder/blob/84b59ea2dbd62b3297204ba7a09bd84f4b29558f/agent/agentssh/metrics.go
```go
name = "workspace-traffic"
id   = strconv.Itoa(idx)

agentID   uuid.UUID
agentName string
```
I didn't go through the whole PR, but why is it required to specify the agent name? Isn't it always "main"?
My understanding is that the agent name maps directly to what's specified in the workspace Terraform, is this not the case?
> You don't need to run `coder login` yourself.

- To create workspaces, run `./coder_shim.sh scaletest create-workspaces --template="kubernetes" --count=N`
I believe that at some point you may also want to include rich parameters, as they cause more DB calls in total (worst-case scenario).
Yep, that would be a good test for load related to concurrent workspace builds.
For the moment though, I'm going to keep it simple with zero parameters.
I don't see anything blocking, so LGTM 👍
The traffic generation will keep running indefinitely until you delete the pod; this could also be considered a 'feature'.
Do we expect any weird errors when the operator stops the traffic generation?
```go
logger := slog.Make(sloghuman.Sink(io.Discard))
prometheusSrvClose := ServeHandler(ctx, logger, promhttp.HandlerFor(reg, promhttp.HandlerOpts{}), scaletestPrometheusAddress, "prometheus")
defer prometheusSrvClose()
```
I'm wondering if we need to add some graceful period before closing to make sure that all relevant metrics are scraped before the tool goes down.
Yeah, this would be heavily dependent on the Prometheus scrape interval. Simplest is probably to expose it as a parameter to be set by the test operator.
```go
n, err := w.ReadWriter.Write(p)
if err == nil {
	w.metrics.WriteLatencySeconds.WithLabelValues(w.labels...).Observe(time.Since(start).Seconds())
```
I'm curious if we need another metric (bytes?) to calculate the bandwidth. As far as I see, there are only `WriteLatencySeconds` and the total `BytesWritten`. I'm curious if we need a metric to calculate "bytes per write in time".
Given that bytes per tick and ticks per second are inputs to this, I think we should probably be OK. But good food for thought!
You calculate throughput as (count at time X) - (count at time Y). We should not be doing that math in coderd, IMO.
Good catch, will address in a follow-up
Prometheus itself will happily compute the time derivative of any metric, so a counter of bytes written is all we need.
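Concretely, that means the operator can let Prometheus do the subtraction with `rate()`. A sketch of the query, with a hypothetical metric name (substitute whatever the counter is actually registered as):

```promql
# bytes/second throughput over a 5-minute window (metric name is illustrative)
rate(coder_scaletest_bytes_written_total[5m])
```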
Part of #7599
Some things that immediately stand out: