-
Notifications
You must be signed in to change notification settings - Fork 894
feat: expose agent metrics via Prometheus endpoint #7011
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 1 commit
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
8d4e67d
WIP
mtojek da729e6
Merge branch 'main' into 6724-metrics
mtojek 9ad09b2
WIP
mtojek 440657c
WIP
mtojek 8764f89
Agents
mtojek 663b5d5
fix
mtojek 63aff5e
1min
mtojek 3905481
fix
mtojek f8d6f46
WIP
mtojek d487a77
Test
mtojek 7acbaf0
docs
mtojek 7418779
fmt
mtojek 3a8e4e6
Add timer to measure the metrics collection
mtojek b5d0581
Use CachedGaugeVec
mtojek e4d708b
Unit tests
mtojek e0669f0
Address PR comments
mtojek File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Agents
- Loading branch information
commit 8764f8975d75ebb45d9e5996fac9b7509c953e33
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this data already live on each workspace/agent?
I wonder if there is another way to do this. In v1 we implemented a custom
prometheus.Collector
to handle agent stats in a non-racy way.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to keep it aligned with other
prometheusmetrics
, and use a single source of metrics, the database. In this case, the information we present over Coderd API is consistent with Prometheus endpoint.Regarding the
prometheus.Collector
, I will take a look 👍 (as stated in the other comment).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the v1 implementation a bit complex for this use case. As I tried to separate metric collections apart from agent reporting logic, a collector like the one in v1 would be great if we have metrics coming from different parts of the application.
BTW It looks like the v1 collector doesn't support vectors, but here we depend mostly on them. Porting the collector would make it more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I didn't think we could port the v1 metrcis verbatim. How it works though is each agent uses prometheus to create their metrics. Those metrics get sent to the
coderd
they are connected to with their agent, and push their prom metrics. The aggregator then combines all those metrics together, labeling them for each workspace.The v1 collector does support labels, which is "vectors". Each unique label set is a single "metric" to the aggregator. So
coderd_agents_metric{favorite-number="7"}
andcoderd_agents_metric{favorite-number="1"}
are 2 differentprometheus.Metric
. This matches the prometheus design of labels:I liked the v1 design as it made it easier to add metrics from the agent, as I think we make our own payloads in v2. 🤷♂️
I was more pointing it out as making a
Collector
gives you a lot more freedom on how to manipulate theGather
part of the metrics.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is one approach, but unfortunately, we would miss metrics from disconnected agents.
As stated in the PR description, the idea behind this submission is to expose metric data we have already collected and stored in the database. If we have this data already stored in the database, why just don't use it :) A similar story would be with "agent stats" that are stored in a dedicated database table.
Let me know your thoughts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disconnected agent stats in v1 Prometheus are eventually "stale" and then removed. Since Prometheus doesn't need to be a perfect source of truth (a little lag eg 1min is ok imo).
I agree with you though to just expose what we currently have is the go to move. My initial hunch was to make a collector that has all the internal counters you are trying to track.
When the "Gather" func is called, the Collector returns the cached counters.
Every "update" perioud, new counts are created and incremented in an internal loop. When the counts are finished, then the Collector is locked, all counters updated, unlocked.
So the exposed counters are always "correct"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The idea is good for sure, it's just a matter of the capacity we have 👍
Yup, this is more or less what I've implemented on the gauge vector level: CachedGaugeVec. I just found it simpler compared with the concept of a collector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah that makes sense and looks good 👍