Skip to content

feat: Implement aggregator for agent metrics #7259

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 46 commits into from
Apr 27, 2023

Conversation

mtojek
Copy link
Member

@mtojek mtojek commented Apr 24, 2023

Related: #6724

This PR implements the metrics aggregator to collect internal stats from the agent. Agent metrics are cached in the aggregator until they expire or their values are overwritten.

Changes:

  • agent: report stats in every iteration. Stats will include metrics, and these are changing all the time
  • agent: report wgengine internal metrics, and filter unrelated ones
  • coderd: implement metrics aggregator that caches metrics and watches for expired ones
  • coderd: use errGroup to collect metrics in parallel
  • refactor: replace strings with metrics labels (workspace name, agent name, username)
  • refactor: rename log with logEntry

@mtojek mtojek self-assigned this Apr 24, 2023
@mtojek mtojek requested review from mafredri and coadler April 26, 2023 11:28
@mtojek mtojek marked this pull request as ready for review April 26, 2023 11:28
Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had some worries with regard to the aggregator main goroutine and potential for lockup, but other than that nothing major. Looks good, left some small suggestions for improvement too. 👍🏻

// messages. This is because a registry cannot have conflicting
// help messages for the same metric in a "gather". If our coder agents are
// on different versions, this is a possible scenario.
metricHelpForAgent = "Metric is forwarded from workspace agent connected to this instance of coderd."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
metricHelpForAgent = "Metric is forwarded from workspace agent connected to this instance of coderd."
metricHelpForAgent = "Metrics are forwarded from workspace agents connected to this instance of coderd."

Suggestion: I think it's preferable to use the plural form here.

Side-note: In a HA setup with multiple coderd's, would metrics only be for those few agents connected to that specific coderd instance? Or is it global?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: I think it's preferable to use the plural form here.

Fixed

Side-note: In a HA setup with multiple coderd's, would metrics only be for those few agents connected to that specific coderd instance? Or is it global?

Only connected ones as metrics are not fetched from a database but just cached in memory. I think that it applies to most of Coder metrics, right?

continue
}
constMetric := prometheus.MustNewConstMetric(desc, valueType, m.Value, m.username, m.workspaceName, m.agentName)
inputCh <- constMetric
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure how this will be used, but I worry about the use of channels. Essentially the way this behaves now is that it's enough for one misbehaving consumer to lock up this entire goroutine since it's dependent on the collected metrics being consumed in a timely manner.

So essentially we can insert 128 entries here immediately, but if more is queued up then it's on the one who called Collect(ch) to ensure things progress smoothly. This could prevent update and cleanup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So essentially we can insert 128 entries here immediately, but if more is queued up then it's on the one who called Collect(ch) to ensure things progress smoothly. This could prevent update and cleanup.

128 entries correspond to 128 agents submitting their metrics at the same time. I'm aware of the issue, but considering our scale, I don't think that we should encounter this problem soon. Speaking of a potential solution, do you think that we should introduce a "middle channel" to timeout the Collect(ch) if metrics are not streamed? I'm afraid that it may complicate the existing logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I see. I think another option, instead of an intermediate channel would be to send a slice over this channel instead of 128 items over the channel. Then Collect can process the slice instead of consuming the channel and the goroutine can immediately continue its work. This way the logic doesn't need to change much at all. This ofc leads to increased memory usage when we exceed 128, but then again, if we do exceed it now it might not be ideal either.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another option, instead of an intermediate channel would be to send a slice over this channel instead of 128 items over the channel. Then Collect can process the slice instead of consuming the channel and the goroutine can immediately continue its work.

In this case Collect will need to know when the slice is ready to be "ranged over", otherwise it would be a data race.

I slightly modified your idea. I kept the channel, but I'm passing the entire slice at once, so 128 items limit does not exist anymore. Please let me know your thoughts, @mafredri.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, and it actually looks like how I intended it 😄

Copy link
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can continue as-is, or make the channel slice change to avoid lockups in the main goroutine. What I might worry about if we do leave it as-is is how will we notice if it ever becomes a problem, but maybe it won't.

mtojek and others added 4 commits April 27, 2023 10:34
Co-authored-by: Mathias Fredriksson <mafredri@gmail.com>
* Add ssh tests for longoutput, orphan

Signed-off-by: Spike Curtis <spike@coder.com>

* PTY/SSH tests & improvements

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix some tests

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix linting

Signed-off-by: Spike Curtis <spike@coder.com>

* fmt

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix windows test

Signed-off-by: Spike Curtis <spike@coder.com>

* Windows copy test

Signed-off-by: Spike Curtis <spike@coder.com>

* WIP Windows pty handling

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix truncation tests

Signed-off-by: Spike Curtis <spike@coder.com>

* Appease linter/fmt

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix typo

Signed-off-by: Spike Curtis <spike@coder.com>

* Rework truncation test to not assume OS buffers

Signed-off-by: Spike Curtis <spike@coder.com>

* Disable orphan test on Windows --- uses sh

Signed-off-by: Spike Curtis <spike@coder.com>

* agent_test running SSH in pty use ptytest.Start

Signed-off-by: Spike Curtis <spike@coder.com>

* More detail about closing pseudoconsole on windows

Signed-off-by: Spike Curtis <spike@coder.com>

* Code review fixes

Signed-off-by: Spike Curtis <spike@coder.com>

* Rearrange ptytest method order

Signed-off-by: Spike Curtis <spike@coder.com>

* Protect pty.Resize on windows from races

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix windows bugs

Signed-off-by: Spike Curtis <spike@coder.com>

* PTY doesn't extend PTYCmd

Signed-off-by: Spike Curtis <spike@coder.com>

* Fix windows types

Signed-off-by: Spike Curtis <spike@coder.com>

---------

Signed-off-by: Spike Curtis <spike@coder.com>
Co-authored-by: Kyle Carberry <kyle@carberry.com>
* chore: Implement workspace proxy health check cron

At a given interval will check the reachability of workspace proxies.

* Proxyhealth is an enterprise feature
* Start proxyhealth go routine on enterprise coder
Kira-Pilot and others added 22 commits April 27, 2023 10:35
…der#7270)

Fixes an issue where API tokens belonging to a deleted user were
not invalidated:
- Adds a trigger to delete rows from the api_key stable when the
  column deleted is set to true in the users table.
- Adds a trigger to the api_keys table to ensure that new rows
  may not be added where user_id corresponds to a deleted user.
- Adds a migration to delete all API keys from deleted users.
- Adds tests + dbfake implementation for the above.
Signed-off-by: Spike Curtis <spike@coder.com>
* feat: add regions endpoint for proxies feature
* chore: add security advisories to docs

* Update docs/security/0001_user_apikeys_invalidation.md

Co-authored-by: Ammar Bandukwala <ammar@ammar.io>

---------

Co-authored-by: Ammar Bandukwala <ammar@ammar.io>
Signed-off-by: Spike Curtis <spike@coder.com>
* wip: add expiration warning

* Use GraceAt

* show expiration warning for trial accounts

* fix test

* only show license banner for users with deployment permission

---------

Co-authored-by: Marcin Tojek <marcin@coder.com>
* wip: license page

* WIP

* WIP

* wip

* wip

* wip

* wip

* wip

* wip

* Apply suggestions from code review

Co-authored-by: Ben Potter <ben@coder.com>

* wip: ui improvements

* wip: extract components

* wip: stories

* wip: stories

* fixes from PR reviews

* fix stories

* fix empty license page

* fix copy

* fix

* wip

* add golang test

---------

Co-authored-by: Ben Potter <ben@coder.com>
Co-authored-by: Muhammad Atif Ali <atif@coder.com>
@mtojek mtojek merged commit bb0a38b into coder:main Apr 27, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Apr 27, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.