feat: Implement aggregator for agent metrics #7259
Conversation
I had some worries with regard to the aggregator's main goroutine and its potential for lockup, but other than that, nothing major. Looks good; left some small suggestions for improvement too. 👍🏻
// messages. This is because a registry cannot have conflicting
// help messages for the same metric in a "gather". If our coder agents are
// on different versions, this is a possible scenario.
metricHelpForAgent = "Metric is forwarded from workspace agent connected to this instance of coderd."
metricHelpForAgent = "Metric is forwarded from workspace agent connected to this instance of coderd." | |
metricHelpForAgent = "Metrics are forwarded from workspace agents connected to this instance of coderd." |
Suggestion: I think it's preferable to use the plural form here.
Side-note: In an HA setup with multiple coderd instances, would metrics only be for those few agents connected to that specific coderd instance? Or is it global?
Suggestion: I think it's preferable to use the plural form here.
Fixed
Side-note: In an HA setup with multiple coderd instances, would metrics only be for those few agents connected to that specific coderd instance? Or is it global?
Only the connected ones, as metrics are not fetched from a database but just cached in memory. I think that applies to most Coder metrics, right?
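As a side illustration (a hypothetical helper, not the PR's code): because every forwarded agent metric reuses the one shared help string, descriptors built for agents on different versions can never disagree on help text within a single gather. The label names below are assumed from the snippet further down.

```go
package agentmetrics

import "github.com/prometheus/client_golang/prometheus"

// Assumed constant, mirroring the snippet above.
const metricHelpForAgent = "Metrics are forwarded from workspace agents connected to this instance of coderd."

// newAgentMetricDesc is a hypothetical helper: every forwarded agent metric
// reuses the same help string, so two agents running different versions can
// never contribute conflicting help texts for the same metric name.
func newAgentMetricDesc(metricName string) *prometheus.Desc {
	return prometheus.NewDesc(
		metricName,
		metricHelpForAgent,
		[]string{"username", "workspace_name", "agent_name"}, // assumed label names
		nil, // no constant labels
	)
}
```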
	continue
}
constMetric := prometheus.MustNewConstMetric(desc, valueType, m.Value, m.username, m.workspaceName, m.agentName)
inputCh <- constMetric
I'm not sure how this will be used, but I worry about the use of channels. Essentially the way this behaves now is that it's enough for one misbehaving consumer to lock up this entire goroutine since it's dependent on the collected metrics being consumed in a timely manner.
So essentially we can insert 128 entries here immediately, but if more is queued up then it's on the one who called Collect(ch) to ensure things progress smoothly. This could prevent update and cleanup.
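To make the concern concrete, here is a minimal sketch, with hypothetical names and under the assumption of a single select loop (not the PR's actual code): each Collect request is served by streaming individual metrics over a 128-buffered channel, so a stalled consumer blocks the loop and, with it, the update and cleanup branches.

```go
package agentmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// aggregator is a simplified stand-in for the PR's type; the fields and
// channel shapes here are assumptions, not the actual implementation.
type aggregator struct {
	updateCh  chan prometheus.Metric      // metrics pushed by agents
	collectCh chan chan prometheus.Metric // Collect asks for a stream of metrics
	cache     []prometheus.Metric
}

// run is the single main goroutine: updates, collection, and cleanup all
// compete for the same select loop.
func (a *aggregator) run() {
	cleanupTicker := time.NewTicker(time.Minute)
	defer cleanupTicker.Stop()
	for {
		select {
		case m := <-a.updateCh:
			a.cache = append(a.cache, m)
		case out := <-a.collectCh:
			for _, m := range a.cache {
				// Blocks once the 128-entry buffer is full and the Collect
				// caller stops reading, freezing updates and cleanup.
				out <- m
			}
			close(out)
		case <-cleanupTicker.C:
			a.cache = nil // expiry logic elided; cannot run while the send above is stuck
		}
	}
}

// Collect streams the cached metrics to the Prometheus gatherer.
func (a *aggregator) Collect(ch chan<- prometheus.Metric) {
	out := make(chan prometheus.Metric, 128) // the 128-entry buffer discussed above
	a.collectCh <- out
	for m := range out {
		ch <- m
	}
}

func (a *aggregator) Describe(_ chan<- *prometheus.Desc) {}
```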
So essentially we can insert 128 entries here immediately, but if more is queued up then it's on the one who called Collect(ch) to ensure things progress smoothly. This could prevent update and cleanup.
128 entries correspond to 128 agents submitting their metrics at the same time. I'm aware of the issue, but considering our scale, I don't think we should encounter this problem soon. Speaking of a potential solution, do you think we should introduce a "middle channel" to time out the Collect(ch) call if metrics are not streamed? I'm afraid that it may complicate the existing logic.
Hmm, I see. I think another option, instead of an intermediate channel, would be to send a slice over this channel rather than 128 individual items. Then Collect can process the slice instead of consuming the channel, and the goroutine can immediately continue its work. This way the logic doesn't need to change much at all. This ofc leads to increased memory usage when we exceed 128, but then again, if we do exceed it now it might not be ideal either.
I think another option, instead of an intermediate channel would be to send a slice over this channel instead of 128 items over the channel. Then Collect can process the slice instead of consuming the channel and the goroutine can immediately continue its work.
In this case Collect will need to know when the slice is ready to be "ranged over", otherwise it would be a data race.
I slightly modified your idea: I kept the channel, but I'm passing the entire slice at once, so the 128-item limit does not exist anymore. Please let me know your thoughts, @mafredri.
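For reference, a rough sketch of what the slice-over-channel variant could look like, under the same assumptions as the sketch above (not the PR's actual code): the main goroutine replies to each Collect request with one snapshot slice and immediately returns to its loop, so a slow scrape only stalls the caller of Collect.

```go
package agentmetrics

import "github.com/prometheus/client_golang/prometheus"

// Same hypothetical aggregator as above, but collectCh now carries one
// snapshot slice per Collect request instead of a stream of single metrics.
type aggregator struct {
	updateCh  chan prometheus.Metric
	collectCh chan chan []prometheus.Metric
	cache     []prometheus.Metric
}

func (a *aggregator) run() {
	for {
		select {
		case m := <-a.updateCh:
			a.cache = append(a.cache, m)
		case out := <-a.collectCh:
			snapshot := make([]prometheus.Metric, len(a.cache))
			copy(snapshot, a.cache)
			out <- snapshot // single send into a 1-buffered channel: never blocks
		}
	}
}

func (a *aggregator) Collect(ch chan<- prometheus.Metric) {
	out := make(chan []prometheus.Metric, 1) // buffered so the goroutine's reply cannot block
	a.collectCh <- out
	for _, m := range <-out {
		ch <- m // a slow scraper now only stalls Collect, not the aggregator loop
	}
}

func (a *aggregator) Describe(_ chan<- *prometheus.Desc) {}
```

The 1-buffered reply channel is what lets the goroutine hand off the snapshot without waiting for the scraper.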
Nice, and it actually looks like how I intended it 😄
I think we can continue as-is, or make the channel-slice change to avoid lockups in the main goroutine. What I might worry about if we do leave it as-is is how we will notice if it ever becomes a problem, but maybe it won't.
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Related: #6724
This PR implements the metrics aggregator to collect internal stats from the agent. Agent metrics are cached in the aggregator until they expire or their values are overwritten.
Changes:
- wgengine internal metrics, and filter unrelated ones
- errGroup to collect metrics in parallel (see the sketch after this list)
- log with logEntry
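As a rough illustration of the parallel-collection item above, here is a sketch using errgroup; the helper and types are hypothetical, not taken from the PR.

```go
package agentmetrics

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// metricSource is a hypothetical stand-in for one source of metric values.
type metricSource func(context.Context) ([]float64, error)

// collectInParallel fans out the metric sources with an errgroup and fails
// fast on the first error; the names and types here are illustrative only.
func collectInParallel(ctx context.Context, sources []metricSource) ([][]float64, error) {
	eg, ctx := errgroup.WithContext(ctx)
	results := make([][]float64, len(sources))
	for i, src := range sources {
		i, src := i, src // capture loop variables (needed before Go 1.22)
		eg.Go(func() error {
			values, err := src(ctx)
			if err != nil {
				return err
			}
			results[i] = values // each goroutine writes only its own slot
			return nil
		})
	}
	if err := eg.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}
```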