feat: Implement aggregator for agent metrics #7259
Conversation
I had some worries with regard to the aggregator's main goroutine and its potential for lockup, but other than that, nothing major. Looks good; left some small suggestions for improvement too. 👍🏻
// messages. This is because a registry cannot have conflicting
// help messages for the same metric in a "gather". If our coder agents are
// on different versions, this is a possible scenario.
metricHelpForAgent = "Metric is forwarded from workspace agent connected to this instance of coderd."
metricHelpForAgent = "Metric is forwarded from workspace agent connected to this instance of coderd." | |
metricHelpForAgent = "Metrics are forwarded from workspace agents connected to this instance of coderd." |
Suggestion: I think it's preferable to use the plural form here.
Side-note: In an HA setup with multiple coderd instances, would metrics only be for those few agents connected to that specific coderd instance? Or is it global?
Suggestion: I think it's preferable to use the plural form here.
Fixed
Side-note: In an HA setup with multiple coderd instances, would metrics only be for those few agents connected to that specific coderd instance? Or is it global?
Only the connected ones, as metrics are not fetched from a database but just cached in memory. I think that applies to most Coder metrics, right?
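As a side illustration (a hypothetical helper, not the PR's code): because every forwarded agent metric reuses the one shared help string, descriptors built for agents on different versions can never disagree on help text within a single gather. The label names below are assumed from the snippet further down.

```go
package agentmetrics

import "github.com/prometheus/client_golang/prometheus"

// Assumed constant, mirroring the snippet above.
const metricHelpForAgent = "Metrics are forwarded from workspace agents connected to this instance of coderd."

// newAgentMetricDesc is a hypothetical helper: every forwarded agent metric
// reuses the same help string, so two agents running different versions can
// never contribute conflicting help texts for the same metric name.
func newAgentMetricDesc(metricName string) *prometheus.Desc {
	return prometheus.NewDesc(
		metricName,
		metricHelpForAgent,
		[]string{"username", "workspace_name", "agent_name"}, // assumed label names
		nil, // no constant labels
	)
}
```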
	continue
}
constMetric := prometheus.MustNewConstMetric(desc, valueType, m.Value, m.username, m.workspaceName, m.agentName)
inputCh <- constMetric
I'm not sure how this will be used, but I worry about the use of channels. Essentially the way this behaves now is that it's enough for one misbehaving consumer to lock up this entire goroutine since it's dependent on the collected metrics being consumed in a timely manner.
So essentially we can insert 128 entries here immediately, but if more is queued up then it's on the one who called Collect(ch) to ensure things progress smoothly. This could prevent update and cleanup.
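To make the concern concrete, here is a minimal sketch, with hypothetical names and under the assumption of a single select loop (not the PR's actual code): each Collect request is served by streaming individual metrics over a 128-buffered channel, so a stalled consumer blocks the loop and, with it, the update and cleanup branches.

```go
package agentmetrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// aggregator is a simplified stand-in for the PR's type; the fields and
// channel shapes here are assumptions, not the actual implementation.
type aggregator struct {
	updateCh  chan prometheus.Metric      // metrics pushed by agents
	collectCh chan chan prometheus.Metric // Collect asks for a stream of metrics
	cache     []prometheus.Metric
}

// run is the single main goroutine: updates, collection, and cleanup all
// compete for the same select loop.
func (a *aggregator) run() {
	cleanupTicker := time.NewTicker(time.Minute)
	defer cleanupTicker.Stop()
	for {
		select {
		case m := <-a.updateCh:
			a.cache = append(a.cache, m)
		case out := <-a.collectCh:
			for _, m := range a.cache {
				// Blocks once the 128-entry buffer is full and the Collect
				// caller stops reading, freezing updates and cleanup.
				out <- m
			}
			close(out)
		case <-cleanupTicker.C:
			a.cache = nil // expiry logic elided; cannot run while the send above is stuck
		}
	}
}

// Collect streams the cached metrics to the Prometheus gatherer.
func (a *aggregator) Collect(ch chan<- prometheus.Metric) {
	out := make(chan prometheus.Metric, 128) // the 128-entry buffer discussed above
	a.collectCh <- out
	for m := range out {
		ch <- m
	}
}

func (a *aggregator) Describe(_ chan<- *prometheus.Desc) {}
```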
So essentially we can insert 128 entries here immediately, but if more is queued up then it's on the one who called Collect(ch) to ensure things progress smoothly. This could prevent update and cleanup.
128 entries correspond to 128 agents submitting their metrics at the same time. I'm aware of the issue, but considering our scale, I don't think we should encounter this problem soon. Speaking of a potential solution, do you think we should introduce a "middle channel" to time out the Collect(ch) call if metrics are not streamed? I'm afraid that it may complicate the existing logic.
Hmm, I see. I think another option, instead of an intermediate channel, would be to send a slice over this channel rather than 128 individual items. Then Collect can process the slice instead of consuming the channel, and the goroutine can immediately continue its work. This way the logic doesn't need to change much at all. This ofc leads to increased memory usage when we exceed 128, but then again, if we do exceed it now it might not be ideal either.
I think another option, instead of an intermediate channel would be to send a slice over this channel instead of 128 items over the channel. Then Collect can process the slice instead of consuming the channel and the goroutine can immediately continue its work.
In this case Collect will need to know when the slice is ready to be "ranged over", otherwise it would be a data race.
I slightly modified your idea: I kept the channel, but I'm passing the entire slice at once, so the 128-item limit does not exist anymore. Please let me know your thoughts, @mafredri.
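For reference, a rough sketch of what the slice-over-channel variant could look like, under the same assumptions as the sketch above (not the PR's actual code): the main goroutine replies to each Collect request with one snapshot slice and immediately returns to its loop, so a slow scrape only stalls the caller of Collect.

```go
package agentmetrics

import "github.com/prometheus/client_golang/prometheus"

// Same hypothetical aggregator as above, but collectCh now carries one
// snapshot slice per Collect request instead of a stream of single metrics.
type aggregator struct {
	updateCh  chan prometheus.Metric
	collectCh chan chan []prometheus.Metric
	cache     []prometheus.Metric
}

func (a *aggregator) run() {
	for {
		select {
		case m := <-a.updateCh:
			a.cache = append(a.cache, m)
		case out := <-a.collectCh:
			snapshot := make([]prometheus.Metric, len(a.cache))
			copy(snapshot, a.cache)
			out <- snapshot // single send into a 1-buffered channel: never blocks
		}
	}
}

func (a *aggregator) Collect(ch chan<- prometheus.Metric) {
	out := make(chan []prometheus.Metric, 1) // buffered so the goroutine's reply cannot block
	a.collectCh <- out
	for _, m := range <-out {
		ch <- m // a slow scraper now only stalls Collect, not the aggregator loop
	}
}

func (a *aggregator) Describe(_ chan<- *prometheus.Desc) {}
```

The 1-buffered reply channel is what lets the goroutine hand off the snapshot without waiting for the scraper.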
Nice, and it actually looks like how I intended it 😄
I think we can continue as-is, or make the channel-slice change to avoid lockups in the main goroutine. What I might worry about if we do leave it as-is is how we will notice if it ever becomes a problem, but maybe it won't.
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
Related: #6724
This PR implements the metrics aggregator to collect internal stats from the agent. Agent metrics are cached in the aggregator until they expire or their values are overwritten.
Changes:
- wgengine internal metrics, and filter unrelated ones
- errGroup to collect metrics in parallel (see the sketch after this list)
- log with logEntry
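As a rough illustration of the parallel-collection item above, here is a sketch using errgroup; the helper and types are hypothetical, not taken from the PR.

```go
package agentmetrics

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// metricSource is a hypothetical stand-in for one source of metric values.
type metricSource func(context.Context) ([]float64, error)

// collectInParallel fans out the metric sources with an errgroup and fails
// fast on the first error; the names and types here are illustrative only.
func collectInParallel(ctx context.Context, sources []metricSource) ([][]float64, error) {
	eg, ctx := errgroup.WithContext(ctx)
	results := make([][]float64, len(sources))
	for i, src := range sources {
		i, src := i, src // capture loop variables (needed before Go 1.22)
		eg.Go(func() error {
			values, err := src(ctx)
			if err != nil {
				return err
			}
			results[i] = values // each goroutine writes only its own slot
			return nil
		})
	}
	if err := eg.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}
```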