-
Notifications
You must be signed in to change notification settings - Fork 896
fix: fix metric for hard-limited presets #18045
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: fix metric for hard-limited presets #18045
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of suggestions regarding the metrics 🙂 But if this blocks the PR I am ok with addressing them in another PR
// then deletes the template and never creates another with the same name, | ||
// the old preset will continue to be reported as hard-limited — | ||
// even though it’s deleted. This will persist until `coderd` is restarted. | ||
if ps.Preset.UsingActiveVersion && !ps.Preset.Deleted { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we added another label for the template version?
I don’t think it would increase the metric's cardinality, since we only have one active template version per preset at a time. When a preset switches to a new template version, the old metric time series would just become inactive and eventually be garbage collected by Prometheus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how all metrics are done: https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L35
I think from end-user point of view - it make sense. From end-user point of view there is only one version of preset - latest and not deleted. They don't really care about some outdated/deleted version of presets.
But I'm going to reimplement it significantly in a next PR.
Main issue here that I'm trying to do it on ReconcilePreset level, but I think right way to do it is on ReconcileAll level.
Gather information about all presets and then report accordingly.
I need to create hashMap:
// key = org-name-template-name-preset-name
map[key][]preset
Then for every key in the map - I have to report either hardLimited or not.
I will report hard limited only in case if []preset
contains latest and not deleted preset and it's hard-limited, otherwise I just remove it from the map.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, https://github.com/coder/coder/blob/main/enterprise/coderd/prebuilds/metricscollector.go#L3 uses just the template name. I was referring to the template version ID, which is the only way to distinguish between multiple presets that use the same template.
So the full label set would be:
(preset_name
, template_name
, template_version_id
, organization_name
).
The downside of this approach is that Prometheus wouldn't know which version is the latest, so querying for the current status becomes harder, we would need extra logic to filter by the active version.
That trade-off might be worthwhile if we needed visibility into historical versions of a preset, but I don't think that's the case here.
So since we only care about the current status per preset, I agree that the current approach makes sense 👍
@@ -1034,15 +1034,18 @@ func TestHardLimitedPresetShouldNotBlockDeletion(t *testing.T) { | |||
require.Equal(t, database.WorkspaceTransitionDelete, workspaceBuilds[0].Transition) | |||
require.Equal(t, database.WorkspaceTransitionStart, workspaceBuilds[1].Transition) | |||
|
|||
// Metric is deleted after preset became outdated. | |||
// The metric is still set to 1, even though the preset has become outdated. | |||
// This happens because the old value hasn't been overwritten by a newer preset yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: In trackHardLimitedStatus
, when a preset is no longer hard limited, instead of deleting the entry immediately from the isPresetHardLimited
map, we just set it to false
.
Then, in the Collect
function, we report the metric with value 0
one last time, and after that we delete the entry from the map, so it's no longer reported in future scrapes.
This ensures that while Prometheus still hasn’t cleaned up the time series, it holds the correct last value (0
) rather than incorrectly keeping 1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the only way to make it work correctly in 100% use-cases is how I described it above.
The issue is from system point of view we have multiple presets:
- presets from outdated template version
- presets from deleted template version
- actual latest and not deleted preset
But from user-point of view, only one preset exists:
- actual latest and not deleted preset
Let's say I report a metric for preset from deleted template version
inside ReconcilePreset. It's not clear what to report. I want to report zero or remove it, but it can interfere with reporting from actual latest and not deleted preset
- because they have same set of labels.
Alternatively I can skip (do nothing) - but in this case there is a risk, that there is no corresponding actual latest and not deleted preset
which means outdated value will live in coderd memory until it's restarted and correspondingly invalid value will be reported to prometheus.
Fundamental issue is inside ReconcilePreset - I just don't have enough information to make a correct decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, that decision needs to happen with full context in ReconcileAll 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Uh oh!
There was an error while loading. Please reload this page.