Skip to content

fix: fix metric for hard-limited presets #18045

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 27, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 16 additions & 8 deletions enterprise/coderd/prebuilds/reconcile.go
Original file line number Diff line number Diff line change
Expand Up @@ -361,15 +361,23 @@ func (c *StoreReconciler) ReconcilePreset(ctx context.Context, ps prebuilds.Pres
slog.F("preset_name", ps.Preset.Name),
)

// Report a preset as hard-limited only if all the following conditions are met:
// - The preset is marked as hard-limited
// - The preset is using the active version of its template, and the template has not been deleted
// Report a metric only if the preset uses the latest version of the template and the template is not deleted.
// This avoids conflicts between metrics from old and new template versions.
//
// The second condition is important because a hard-limited preset that has become outdated is no longer relevant.
// Its associated prebuilt workspaces were likely deleted, and it's not meaningful to continue reporting it
// as hard-limited to the admin.
reportAsHardLimited := ps.IsHardLimited && ps.Preset.UsingActiveVersion && !ps.Preset.Deleted
c.metrics.trackHardLimitedStatus(ps.Preset.OrganizationName, ps.Preset.TemplateName, ps.Preset.Name, reportAsHardLimited)
// NOTE: Multiple versions of a preset can exist with the same orgName, templateName, and presetName,
// because templates can have multiple versions — or deleted templates can share the same name.
//
// The safest approach is to report the metric only for the latest version of the preset.
// When a new template version is released, the metric for the new preset should overwrite
// the old value in Prometheus.
//
// However, there’s one edge case: if an admin creates a template, it becomes hard-limited,
// then deletes the template and never creates another with the same name,
// the old preset will continue to be reported as hard-limited —
// even though it’s deleted. This will persist until `coderd` is restarted.
if ps.Preset.UsingActiveVersion && !ps.Preset.Deleted {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we added another label for the template version?
I don’t think it would increase the metric's cardinality, since we only have one active template version per preset at a time. When a preset switches to a new template version, the old metric time series would just become inactive and eventually be garbage collected by Prometheus.

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how all metrics are done: https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L35

I think from end-user point of view - it make sense. From end-user point of view there is only one version of preset - latest and not deleted. They don't really care about some outdated/deleted version of presets.

But I'm going to reimplement it significantly in a next PR.


Main issue here that I'm trying to do it on ReconcilePreset level, but I think right way to do it is on ReconcileAll level.
Gather information about all presets and then report accordingly.

I need to create hashMap:

// key = org-name-template-name-preset-name
map[key][]preset

Then for every key in the map - I have to report either hardLimited or not.
I will report hard limited only in case if []preset contains latest and not deleted preset and it's hard-limited, otherwise I just remove it from the map.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L3 uses just the template name. I was referring to the template version ID, which is the only way to distinguish between multiple presets that use the same template.

So the full label set would be:
(preset_name, template_name, template_version_id, organization_name).

The downside of this approach is that Prometheus wouldn't know which version is the latest, so querying for the current status becomes harder, we would need extra logic to filter by the active version.
That trade-off might be worthwhile if we needed visibility into historical versions of a preset, but I don't think that's the case here.

So since we only care about the current status per preset, I agree that the current approach makes sense 👍

c.metrics.trackHardLimitedStatus(ps.Preset.OrganizationName, ps.Preset.TemplateName, ps.Preset.Name, ps.IsHardLimited)
}

// If the preset reached the hard failure limit for the first time during this iteration:
// - Mark it as hard-limited in the database
Expand Down
7 changes: 5 additions & 2 deletions enterprise/coderd/prebuilds/reconcile_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1034,15 +1034,18 @@ func TestHardLimitedPresetShouldNotBlockDeletion(t *testing.T) {
require.Equal(t, database.WorkspaceTransitionDelete, workspaceBuilds[0].Transition)
require.Equal(t, database.WorkspaceTransitionStart, workspaceBuilds[1].Transition)

// Metric is deleted after preset became outdated.
// The metric is still set to 1, even though the preset has become outdated.
// This happens because the old value hasn't been overwritten by a newer preset yet.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In trackHardLimitedStatus, when a preset is no longer hard limited, instead of deleting the entry immediately from the isPresetHardLimited map, we just set it to false.
Then, in the Collect function, we report the metric with value 0 one last time, and after that we delete the entry from the map, so it's no longer reported in future scrapes.
This ensures that while Prometheus still hasn’t cleaned up the time series, it holds the correct last value (0) rather than incorrectly keeping 1.

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina May 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the only way to make it work correctly in 100% use-cases is how I described it above.
The issue is from system point of view we have multiple presets:

  • presets from outdated template version
  • presets from deleted template version
  • actual latest and not deleted preset

But from user-point of view, only one preset exists:

  • actual latest and not deleted preset

Let's say I report a metric for preset from deleted template version inside ReconcilePreset. It's not clear what to report. I want to report zero or remove it, but it can interfere with reporting from actual latest and not deleted preset - because they have same set of labels.

Alternatively I can skip (do nothing) - but in this case there is a risk, that there is no corresponding actual latest and not deleted preset which means outdated value will live in coderd memory until it's restarted and correspondingly invalid value will be reported to prometheus.

Fundamental issue is inside ReconcilePreset - I just don't have enough information to make a correct decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, that decision needs to happen with full context in ReconcileAll 👍

mf, err = registry.Gather()
require.NoError(t, err)
metric = findMetric(mf, prebuilds.MetricPresetHardLimitedGauge, map[string]string{
"template_name": template.Name,
"preset_name": preset.Name,
"org_name": org.Name,
})
require.Nil(t, metric)
require.NotNil(t, metric)
require.NotNil(t, metric.GetGauge())
require.EqualValues(t, 1, metric.GetGauge().GetValue())
})
}
}
Expand Down
Loading