Skip to content

fix: reduce cost of prebuild failure #17697

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 51 commits into from
May 21, 2025

Conversation

evgeniy-scherbina
Copy link
Contributor

@evgeniy-scherbina evgeniy-scherbina commented May 6, 2025

Relates to #17432

Part 1:

Notes:

  • GetPresetsAtFailureLimit SQL query is added, which is similar to GetPresetsBackoff, they use same CTEs: filtered_builds, time_sorted_builds, but they are still different.

  • Query is executed on every loop iteration. We can consider marking specific preset as permanently failed as an optimization to avoid executing query on every loop iteration. But I decided don't do it for now.

  • By default FailureHardLimit is set to 3.

  • FailureHardLimit is configurable. Setting it to zero - means that hard limit is disabled.

Part 2

Notes:

  • PrebuildFailureLimitReached notification is added.
  • Notification is sent to template admins.
  • Notification is sent only the first time, when hard limit is reached. But it will log.Warn on every loop iteration.
  • I introduced this enum:
CREATE TYPE prebuild_status AS ENUM (
  'normal',           -- Prebuilds are working as expected; this is the default, healthy state.
  'hard_limited',     -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore.
  'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried.
);

validation_failed not used in this PR, but I think it will be used in next one, so I wanted to save us an extra migration.

  • Notification looks like this:
image

Latest notification views:

image image

@evgeniy-scherbina evgeniy-scherbina force-pushed the 17432-limit-prebuild-failure-cost branch from 1e9c551 to fcb3e5d Compare May 9, 2025 21:40
@evgeniy-scherbina evgeniy-scherbina marked this pull request as ready for review May 9, 2025 21:40
Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Query looks good, we need more visibility of this problem and its remediation, though.

func (q *querier) GetPresetsAtFailureLimit(ctx context.Context, hardLimit int64) ([]database.GetPresetsAtFailureLimitRow, error) {
// GetPresetsAtFailureLimit returns a list of template version presets that have reached the hard failure limit.
// Request the same authorization permissions as GetPresetsBackoff, since the methods are similar.
if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourceTemplate.All()); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think "insights" is a weird auth context to use here.
Why did we choose this again @SasSwart?

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dannykopping
Basically we have 3 actions:

  • Read Template.All()
  • Use Template.All()
  • ViewInsights Template.All()

Read is for reading the content of template
Use is for creating workspaces
ViewInsights is for gathering stats & metrics across templates

Looks like GetPresetsBackoff falls into ViewInsights category - because we gathering statistics across all template presets, calculating number of failed builds, etc...

I had a discussion with Steven regarding this, but I'm open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repurposing other RBAC roles just because they overlap logically feels like a bad design choice.
Each feature should have its own RBAC policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can introduce new resource type here: https://github.com/coder/coder/blob/main/coderd/rbac/object_gen.go

Then use it like this:

if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourcePrebuilds.All()); err != nil {}

Optionally introduce new action type as well, and use it like this:

if err := q.authorizeContext(ctx, policy.ActionCollectStatistics, rbac.ResourcePrebuilds.All()); err != nil {}

But I think we don't want to introduce too many action types.

cc @Emyrk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do this in a follow-up, by the way 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have all the context for prebuilds. Can we throw a quick disc chat on the calendar?

@evgeniy-scherbina evgeniy-scherbina force-pushed the 17432-limit-prebuild-failure-cost branch from e88e301 to 0fdd096 Compare May 12, 2025 19:26
@evgeniy-scherbina evgeniy-scherbina changed the title fix: limit prebuild failure cost fix: reduce cost of prebuild failure May 14, 2025
@evgeniy-scherbina evgeniy-scherbina force-pushed the 17432-limit-prebuild-failure-cost branch from ab90612 to 5c60065 Compare May 15, 2025 18:37
Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far, but I don't see any metric added which we previously discussed.
Given the size of the PR, it can be a follow-up.

Copy link
Contributor

@dannykopping dannykopping left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@evgeniy-scherbina evgeniy-scherbina merged commit 53e8e9c into main May 21, 2025
37 of 39 checks passed
@evgeniy-scherbina evgeniy-scherbina deleted the 17432-limit-prebuild-failure-cost branch May 21, 2025 19:16
@github-actions github-actions bot locked and limited conversation to collaborators May 21, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants