feat: fetch prebuilds metrics state in background #17792

Merged: 4 commits, May 13, 2025

Conversation

Contributor

@dannykopping dannykopping commented May 13, 2025

Collect() is called whenever the /metrics endpoint is hit to retrieve metrics.

The queries used for prebuilds metrics collection are quite heavy, and we want to avoid running them concurrently or too often, to keep database load down.

Here I'm moving towards a background retrieval of the state required to set the metrics, which gets invalidated every interval.

Also introduces coderd_prebuilt_workspaces_metrics_last_updated which operators can use to determine when these metrics go stale.

See #17789 as well.

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping force-pushed the dk/debounce-metrics-collection branch from 5da546e to e73dae6 Compare May 13, 2025 12:29
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping force-pushed the dk/debounce-metrics-collection branch from e73dae6 to fcbfb7f Compare May 13, 2025 12:29
@dannykopping dannykopping marked this pull request as ready for review May 13, 2025 12:35
@@ -55,20 +57,34 @@ var (
labels,
nil,
)
lastUpdateDesc = prometheus.NewDesc(
"coderd_prebuilt_workspaces_metrics_last_updated",
"The unix timestamp when the metrics related to prebuilt workspaces were last updated; these metrics are cached.",
Member

Is a unix timestamp easy to alert on? For example, can you write something like unix_now() - metric_value > 1000 in Grafana and similar tools? If not, it might be better if this were a duration since the last successful fetch instead.

Member

@johnstcn johnstcn May 13, 2025

+1 from me for duration since last successful fetch

Contributor Author

The idiomatic approach is to use unix timestamps, see prometheus_config_last_reload_success_timestamp_seconds.

Member

So I guess we have an existing metric for the coder server start timestamp?

Contributor Author

@dannykopping dannykopping May 13, 2025

I don't think so (or at least not one we export), but I think as long as this metric is updated relative to itself and up is taken into consideration, it should be useful.

Signed-off-by: Danny Kopping <dannykopping@gmail.com>
@dannykopping dannykopping merged commit b2a1de9 into main May 13, 2025
34 checks passed
@dannykopping dannykopping deleted the dk/debounce-metrics-collection branch May 13, 2025 18:27
@github-actions github-actions bot locked and limited conversation to collaborators May 13, 2025
3 participants