fix: fix metric for hard-limited presets #18045

evgeniy-scherbina · 2025-05-26T20:56:53Z

// Report a metric only if the preset uses the latest version of the template and the template is not deleted.
// This avoids conflicts between metrics from old and new template versions.
//
// NOTE: Multiple versions of a preset can exist with the same orgName, templateName, and presetName,
// because templates can have multiple versions — or deleted templates can share the same name.
//
// The safest approach is to report the metric only for the latest version of the preset.
// When a new template version is released, the metric for the new preset should overwrite
// the old value in Prometheus.
//
// However, there’s one edge case: if an admin creates a template, it becomes hard-limited,
// then deletes the template and never creates another with the same name,
// the old preset will continue to be reported as hard-limited —
// even though it’s deleted. This will persist until `coderd` is restarted.

ssncferreira

Just a couple of suggestions regarding the metrics 🙂 But if this blocks the PR, I am OK with addressing them in another PR.

ssncferreira · 2025-05-27T11:15:52Z

enterprise/coderd/prebuilds/reconcile.go

+	// then deletes the template and never creates another with the same name,
+	// the old preset will continue to be reported as hard-limited —
+	// even though it’s deleted. This will persist until `coderd` is restarted.
+	if ps.Preset.UsingActiveVersion && !ps.Preset.Deleted {


What if we added another label for the template version?
I don’t think it would increase the metric's cardinality, since we only have one active template version per preset at a time. When a preset switches to a new template version, the old metric time series would just become inactive and eventually be garbage collected by Prometheus.

That's how all metrics are done: https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L35

I think from the end user's point of view it makes sense. From their point of view there is only one version of a preset: the latest, non-deleted one. They don't really care about outdated or deleted versions of presets.

But I'm going to reimplement it significantly in a follow-up PR.

The main issue here is that I'm trying to do it at the ReconcilePreset level, but I think the right way is to do it at the ReconcileAll level:
gather information about all presets and then report accordingly.

I need to create a hash map:

// key = org-name-template-name-preset-name
map[key][]preset

Then, for every key in the map, I have to report either hard-limited or not.
I will report hard-limited only if []preset contains a latest, non-deleted preset that is hard-limited; otherwise I just remove the key from the map.
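The grouping described above could look roughly like the following. This is only a minimal sketch under stated assumptions, not the code from this PR or its follow-up: `presetSnapshot`, `presetKey`, and `hardLimitedKeys` are hypothetical names, and the fields merely mirror the `UsingActiveVersion` and `Deleted` flags seen in the diff.

```go
package main

import "fmt"

// presetSnapshot is a hypothetical, simplified view of one preset version.
type presetSnapshot struct {
	OrgName            string
	TemplateName       string
	PresetName         string
	UsingActiveVersion bool
	Deleted            bool
	HardLimited        bool
}

// presetKey is the label set shared by every version of the same preset.
type presetKey struct {
	OrgName      string
	TemplateName string
	PresetName   string
}

// hardLimitedKeys groups all preset versions by label set and keeps only the
// keys whose latest, non-deleted preset is hard-limited.
func hardLimitedKeys(presets []presetSnapshot) map[presetKey]bool {
	grouped := make(map[presetKey][]presetSnapshot)
	for _, p := range presets {
		key := presetKey{OrgName: p.OrgName, TemplateName: p.TemplateName, PresetName: p.PresetName}
		grouped[key] = append(grouped[key], p)
	}

	hardLimited := make(map[presetKey]bool)
	for key, versions := range grouped {
		for _, p := range versions {
			// Outdated and deleted versions never contribute to the metric.
			if p.UsingActiveVersion && !p.Deleted && p.HardLimited {
				hardLimited[key] = true
				break
			}
		}
	}
	return hardLimited
}

func main() {
	presets := []presetSnapshot{
		// Outdated version that was hard-limited, plus its healthy replacement.
		{OrgName: "org", TemplateName: "tpl", PresetName: "small", HardLimited: true},
		{OrgName: "org", TemplateName: "tpl", PresetName: "small", UsingActiveVersion: true},
		// Latest, non-deleted, hard-limited preset.
		{OrgName: "org", TemplateName: "tpl", PresetName: "large", UsingActiveVersion: true, HardLimited: true},
	}
	for key := range hardLimitedKeys(presets) {
		fmt.Printf("hard-limited: %s/%s/%s\n", key.OrgName, key.TemplateName, key.PresetName)
	}
}
```

Because the decision is made per key with every version in view, a deleted template with no replacement simply yields no entry, so nothing stale is left to report.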

If I understand correctly, https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L3 uses just the template name. I was referring to the template version ID, which is the only way to distinguish between multiple presets that use the same template.

So the full label set would be:
(preset_name, template_name, template_version_id, organization_name).

The downside of this approach is that Prometheus wouldn't know which version is the latest, so querying for the current status becomes harder; we would need extra logic to filter by the active version.
That trade-off might be worthwhile if we needed visibility into historical versions of a preset, but I don't think that's the case here.

So since we only care about the current status per preset, I agree that the current approach makes sense 👍

ssncferreira · 2025-05-27T11:23:30Z

enterprise/coderd/prebuilds/reconcile_test.go

@@ -1034,15 +1034,18 @@ func TestHardLimitedPresetShouldNotBlockDeletion(t *testing.T) {
 			require.Equal(t, database.WorkspaceTransitionDelete, workspaceBuilds[0].Transition)
 			require.Equal(t, database.WorkspaceTransitionStart, workspaceBuilds[1].Transition)

-			// Metric is deleted after preset became outdated.
+			// The metric is still set to 1, even though the preset has become outdated.
+			// This happens because the old value hasn't been overwritten by a newer preset yet.


nit: In trackHardLimitedStatus, when a preset is no longer hard-limited, we could set its entry in the isPresetHardLimited map to false instead of deleting it immediately.
Then, in the Collect function, we would report the metric with value 0 one last time and delete the entry from the map afterwards, so it's no longer reported in future scrapes.
This ensures that, while Prometheus still hasn’t cleaned up the time series, it holds the correct last value (0) rather than incorrectly keeping 1.
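A minimal sketch of that "report 0 once, then forget" idea, assuming a plain map keyed by a string. The names `trackHardLimitedStatus` and `isPresetHardLimited` come from the comment above; `hardLimitTracker`, `collect`, and the string key are hypothetical stand-ins, `collect` is not the real Prometheus `Collect` method, and locking is omitted for brevity.

```go
package main

import "fmt"

// hardLimitTracker is a hypothetical stand-in for the metrics collector state.
type hardLimitTracker struct {
	isPresetHardLimited map[string]bool
}

func newHardLimitTracker() *hardLimitTracker {
	return &hardLimitTracker{isPresetHardLimited: make(map[string]bool)}
}

// trackHardLimitedStatus records the latest status. A preset that stops being
// hard-limited is kept as false instead of being deleted right away.
func (t *hardLimitTracker) trackHardLimitedStatus(key string, hardLimited bool) {
	if !hardLimited {
		if _, tracked := t.isPresetHardLimited[key]; !tracked {
			return // never reported as hard-limited, so there is nothing to reset
		}
	}
	t.isPresetHardLimited[key] = hardLimited
}

// collect returns the values for one scrape. Entries reported as 0 are removed
// afterwards, so they are not reported again in later scrapes.
func (t *hardLimitTracker) collect() map[string]float64 {
	values := make(map[string]float64, len(t.isPresetHardLimited))
	for key, hardLimited := range t.isPresetHardLimited {
		if hardLimited {
			values[key] = 1
			continue
		}
		values[key] = 0
		delete(t.isPresetHardLimited, key) // deleting while ranging over a map is safe in Go
	}
	return values
}

func main() {
	tracker := newHardLimitTracker()
	tracker.trackHardLimitedStatus("org/tpl/small", true)
	fmt.Println(tracker.collect())
	tracker.trackHardLimitedStatus("org/tpl/small", false)
	fmt.Println(tracker.collect())
	fmt.Println(tracker.collect())
}
```

The sketch does not address the label collision between outdated, deleted, and current presets discussed in this thread; it only shows the final-zero behaviour for a single key.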

I think the only way to make it work correctly in 100% of use cases is the way I described above.
The issue is that, from the system's point of view, we have multiple presets:

presets from outdated template version

presets from deleted template version

actual latest and not deleted preset

But from the user's point of view, only one preset exists:

actual latest and not deleted preset

Let's say I report a metric for a preset from a deleted template version inside ReconcilePreset. It's not clear what to report. I want to report zero or remove it, but that can interfere with reporting from the actual latest and not deleted preset, because they have the same set of labels.

Alternatively, I can skip it (do nothing), but then there is a risk that no corresponding actual latest and not deleted preset exists, which means the outdated value will live in coderd memory until it is restarted, and an invalid value will be reported to Prometheus in the meantime.

The fundamental issue is that inside ReconcilePreset I just don't have enough information to make a correct decision.

I agree, that decision needs to happen with full context in ReconcileAll 👍


ssncferreira

LGTM 👍

fix: fix metric for hard-limited presets

690d8e8

github-actions bot assigned evgeniy-scherbina May 26, 2025

ssncferreira reviewed May 27, 2025


evgeniy-scherbina marked this pull request as ready for review May 27, 2025 12:20

Merge remote-tracking branch 'origin/main' into 17988-metric-for-hard…

e50516d

…-limited-presets

ssncferreira approved these changes May 27, 2025


evgeniy-scherbina merged commit e8c75eb into main May 27, 2025
34 checks passed

evgeniy-scherbina deleted the 17988-metric-for-hard-limited-presets branch May 27, 2025 14:07

github-actions bot locked and limited conversation to collaborators May 27, 2025
