Skip to content

fix: fix metric for hard-limited presets #18045

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 27, 2025

Conversation

evgeniy-scherbina
Copy link
Contributor

@evgeniy-scherbina evgeniy-scherbina commented May 26, 2025

// Report a metric only if the preset uses the latest version of the template and the template is not deleted.
// This avoids conflicts between metrics from old and new template versions.
//
// NOTE: Multiple versions of a preset can exist with the same orgName, templateName, and presetName,
// because templates can have multiple versions — or deleted templates can share the same name.
//
// The safest approach is to report the metric only for the latest version of the preset.
// When a new template version is released, the metric for the new preset should overwrite
// the old value in Prometheus.
//
// However, there’s one edge case: if an admin creates a template, it becomes hard-limited,
// then deletes the template and never creates another with the same name,
// the old preset will continue to be reported as hard-limited —
// even though it’s deleted. This will persist until `coderd` is restarted.

Copy link
Contributor

@ssncferreira ssncferreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of suggestions regarding the metrics 🙂 But if this blocks the PR I am ok with addressing them in another PR

// then deletes the template and never creates another with the same name,
// the old preset will continue to be reported as hard-limited —
// even though it’s deleted. This will persist until `coderd` is restarted.
if ps.Preset.UsingActiveVersion && !ps.Preset.Deleted {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we added another label for the template version?
I don’t think it would increase the metric's cardinality, since we only have one active template version per preset at a time. When a preset switches to a new template version, the old metric time series would just become inactive and eventually be garbage collected by Prometheus.

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how all metrics are done: https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L35

I think from end-user point of view - it make sense. From end-user point of view there is only one version of preset - latest and not deleted. They don't really care about some outdated/deleted version of presets.

But I'm going to reimplement it significantly in a next PR.


Main issue here that I'm trying to do it on ReconcilePreset level, but I think right way to do it is on ReconcileAll level.
Gather information about all presets and then report accordingly.

I need to create hashMap:

// key = org-name-template-name-preset-name
map[key][]preset

Then for every key in the map - I have to report either hardLimited or not.
I will report hard limited only in case if []preset contains latest and not deleted preset and it's hard-limited, otherwise I just remove it from the map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L3 uses just the template name. I was referring to the template version ID, which is the only way to distinguish between multiple presets that use the same template.

So the full label set would be:
(preset_name, template_name, template_version_id, organization_name).

The downside of this approach is that Prometheus wouldn't know which version is the latest, so querying for the current status becomes harder, we would need extra logic to filter by the active version.
That trade-off might be worthwhile if we needed visibility into historical versions of a preset, but I don't think that's the case here.

So since we only care about the current status per preset, I agree that the current approach makes sense 👍

@@ -1034,15 +1034,18 @@ func TestHardLimitedPresetShouldNotBlockDeletion(t *testing.T) {
require.Equal(t, database.WorkspaceTransitionDelete, workspaceBuilds[0].Transition)
require.Equal(t, database.WorkspaceTransitionStart, workspaceBuilds[1].Transition)

// Metric is deleted after preset became outdated.
// The metric is still set to 1, even though the preset has become outdated.
// This happens because the old value hasn't been overwritten by a newer preset yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In trackHardLimitedStatus, when a preset is no longer hard limited, instead of deleting the entry immediately from the isPresetHardLimited map, we just set it to false.
Then, in the Collect function, we report the metric with value 0 one last time, and after that we delete the entry from the map, so it's no longer reported in future scrapes.
This ensures that while Prometheus still hasn’t cleaned up the time series, it holds the correct last value (0) rather than incorrectly keeping 1.

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only way to make it work correctly in 100% use-cases is how I described it above.
The issue is from system point of view we have multiple presets:

  • presets from outdated template version
  • presets from deleted template version
  • actual latest and not deleted preset

But from user-point of view, only one preset exists:

  • actual latest and not deleted preset

Let's say I report a metric for preset from deleted template version inside ReconcilePreset. It's not clear what to report. I want to report zero or remove it, but it can interfere with reporting from actual latest and not deleted preset - because they have same set of labels.

Alternatively I can skip (do nothing) - but in this case there is a risk, that there is no corresponding actual latest and not deleted preset which means outdated value will live in coderd memory until it's restarted and correspondingly invalid value will be reported to prometheus.

Fundamental issue is inside ReconcilePreset - I just don't have enough information to make a correct decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that decision needs to happen with full context in ReconcileAll 👍

@evgeniy-scherbina evgeniy-scherbina marked this pull request as ready for review May 27, 2025 12:20
Copy link
Contributor

@ssncferreira ssncferreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@evgeniy-scherbina evgeniy-scherbina merged commit e8c75eb into main May 27, 2025
34 checks passed
@evgeniy-scherbina evgeniy-scherbina deleted the 17988-metric-for-hard-limited-presets branch May 27, 2025 14:07
@github-actions github-actions bot locked and limited conversation to collaborators May 27, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants