Skip to content

fix: reduce cost of prebuild failure #17697

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 51 commits into from
May 21, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
fcb3e5d
fix: limit prebuild failure cost
evgeniy-scherbina May 6, 2025
701e2a8
refactor: CR's fixes
evgeniy-scherbina May 12, 2025
0fdd096
refactor: CR's fixes
evgeniy-scherbina May 12, 2025
13c0e28
refactor: CR's fixes
evgeniy-scherbina May 12, 2025
29e9cff
refactor: CR's fixes
evgeniy-scherbina May 12, 2025
1d51dfc
Merge remote-tracking branch 'origin/main' into 17432-limit-prebuild-…
evgeniy-scherbina May 14, 2025
5c60065
feat: add notification for prebuilds failure
evgeniy-scherbina May 14, 2025
b0248c3
test: fix tests
evgeniy-scherbina May 15, 2025
08b18b0
test: add a notification test
evgeniy-scherbina May 15, 2025
0fd347a
fix: add prebuild_status migration
evgeniy-scherbina May 16, 2025
08981a4
generate.sh
evgeniy-scherbina May 16, 2025
0e4600c
fix: using prebuild status
evgeniy-scherbina May 16, 2025
107879b
fix: migrations numbers
evgeniy-scherbina May 16, 2025
2dea5ab
make gen
evgeniy-scherbina May 16, 2025
e252093
make gen/golden-files
evgeniy-scherbina May 16, 2025
906ceb9
fix: fix dbauthz
evgeniy-scherbina May 16, 2025
3a0284f
Merge remote-tracking branch 'origin/main' into 17432-limit-prebuild-…
evgeniy-scherbina May 16, 2025
13fb1af
fix: minor fix
evgeniy-scherbina May 16, 2025
108ea0a
test: fix tests
evgeniy-scherbina May 19, 2025
7cd2799
test: fix dbmem
evgeniy-scherbina May 19, 2025
f9148e2
make gen/golden-files
evgeniy-scherbina May 19, 2025
315acfa
Remove outdated TODOs
evgeniy-scherbina May 19, 2025
1679ac9
fix: enhance notification template
evgeniy-scherbina May 19, 2025
d9c9d17
Merge remote-tracking branch 'origin/main' into 17432-limit-prebuild-…
evgeniy-scherbina May 19, 2025
c47ddaa
make gen/golden-files
evgeniy-scherbina May 19, 2025
c19f28a
refactor: minor improvements
evgeniy-scherbina May 19, 2025
b18730f
fix: minor fix in template
evgeniy-scherbina May 19, 2025
37173b0
fix: minor improvement to template
evgeniy-scherbina May 19, 2025
0e3cc40
fix: minor fixes
evgeniy-scherbina May 19, 2025
784adba
fix: fix TODO
evgeniy-scherbina May 19, 2025
50bd9b4
refactor: CR's fixes
evgeniy-scherbina May 20, 2025
8a33ac8
refactor: CR's fixes
evgeniy-scherbina May 20, 2025
9439fa1
refactor: use not null for prebuild_status
evgeniy-scherbina May 20, 2025
e667dad
refactor: fix dbauthz test
evgeniy-scherbina May 20, 2025
8a51702
refactor: use healthy enum option instead of normal
evgeniy-scherbina May 20, 2025
a793e18
refactor: minor refactoring in tests
evgeniy-scherbina May 20, 2025
a0fb69c
refactor: rename DB method
evgeniy-scherbina May 20, 2025
c6f209c
refactor: make fmt
evgeniy-scherbina May 20, 2025
9dccf3e
refactor: make gen + reorder methods
evgeniy-scherbina May 20, 2025
c19ae04
refactor: make fmt
evgeniy-scherbina May 20, 2025
4b145cc
refactor: CR's fixes
evgeniy-scherbina May 20, 2025
80f3677
refactor: make gen
evgeniy-scherbina May 20, 2025
2144d13
refactor: CR's fixes
evgeniy-scherbina May 20, 2025
79725b3
refactor: CR's fixes
evgeniy-scherbina May 20, 2025
7e8b4b6
refactor: improve comments for test
evgeniy-scherbina May 20, 2025
ab5acfb
refactor: CR's fixes
evgeniy-scherbina May 20, 2025
7c09465
refactor: CR's fixes
evgeniy-scherbina May 21, 2025
b7a34c5
refactor: improve notification template
evgeniy-scherbina May 21, 2025
e1e141d
refactor: CR's fixes
evgeniy-scherbina May 21, 2025
354aeb4
refactor: CR's fixes
evgeniy-scherbina May 21, 2025
7564178
Merge remote-tracking branch 'origin/main' into 17432-limit-prebuild-…
evgeniy-scherbina May 21, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions cli/testdata/server-config.yaml.golden
Original file line number Diff line number Diff line change
Expand Up @@ -704,3 +704,7 @@ workspace_prebuilds:
# backoff.
# (default: 1h0m0s, type: duration)
reconciliation_backoff_lookback_period: 1h0m0s
# Maximum number of consecutive failed prebuilds before a preset hits the hard
# limit; disabled when set to zero.
# (default: 3, type: int)
failure_hard_limit: 3
4 changes: 4 additions & 0 deletions coderd/apidoc/docs.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions coderd/apidoc/swagger.json

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

27 changes: 27 additions & 0 deletions coderd/database/dbauthz/dbauthz.go
Original file line number Diff line number Diff line change
Expand Up @@ -2226,6 +2226,15 @@ func (q *querier) GetPresetParametersByTemplateVersionID(ctx context.Context, ar
return q.db.GetPresetParametersByTemplateVersionID(ctx, args)
}

func (q *querier) GetPresetsAtFailureLimit(ctx context.Context, hardLimit int64) ([]database.GetPresetsAtFailureLimitRow, error) {
// GetPresetsAtFailureLimit returns a list of template version presets that have reached the hard failure limit.
// Request the same authorization permissions as GetPresetsBackoff, since the methods are similar.
if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourceTemplate.All()); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still think "insights" is a weird auth context to use here.
Why did we choose this again @SasSwart?

Copy link
Contributor Author

@evgeniy-scherbina evgeniy-scherbina May 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dannykopping
Basically we have 3 actions:

  • Read Template.All()
  • Use Template.All()
  • ViewInsights Template.All()

Read is for reading the content of template
Use is for creating workspaces
ViewInsights is for gathering stats & metrics across templates

Looks like GetPresetsBackoff falls into ViewInsights category - because we gathering statistics across all template presets, calculating number of failed builds, etc...

I had a discussion with Steven regarding this, but I'm open to suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repurposing other RBAC roles just because they overlap logically feels like a bad design choice.
Each feature should have its own RBAC policy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can introduce new resource type here: https://github.com/coder/coder/blob/main/coderd/rbac/object_gen.go

Then use it like this:

if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourcePrebuilds.All()); err != nil {}

Optionally introduce new action type as well, and use it like this:

if err := q.authorizeContext(ctx, policy.ActionCollectStatistics, rbac.ResourcePrebuilds.All()); err != nil {}

But I think we don't want to introduce too many action types.

cc @Emyrk

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do this in a follow-up, by the way 👍

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have all the context for prebuilds. Can we throw a quick disc chat on the calendar?

return nil, err
}
return q.db.GetPresetsAtFailureLimit(ctx, hardLimit)
}

func (q *querier) GetPresetsBackoff(ctx context.Context, lookback time.Time) ([]database.GetPresetsBackoffRow, error) {
// GetPresetsBackoff returns a list of template version presets along with metadata such as the number of failed prebuilds.
if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourceTemplate.All()); err != nil {
Expand Down Expand Up @@ -4201,6 +4210,24 @@ func (q *querier) UpdateOrganizationDeletedByID(ctx context.Context, arg databas
return deleteQ(q.log, q.auth, q.db.GetOrganizationByID, deleteF)(ctx, arg.ID)
}

func (q *querier) UpdatePresetPrebuildStatus(ctx context.Context, arg database.UpdatePresetPrebuildStatusParams) error {
preset, err := q.db.GetPresetByID(ctx, arg.PresetID)
if err != nil {
return err
}

object := rbac.ResourceTemplate.
WithID(preset.TemplateID.UUID).
InOrg(preset.OrganizationID)

err = q.authorizeContext(ctx, policy.ActionUpdate, object)
if err != nil {
return err
}

return q.db.UpdatePresetPrebuildStatus(ctx, arg)
}

func (q *querier) UpdateProvisionerDaemonLastSeenAt(ctx context.Context, arg database.UpdateProvisionerDaemonLastSeenAtParams) error {
if err := q.authorizeContext(ctx, policy.ActionUpdate, rbac.ResourceProvisionerDaemon); err != nil {
return err
Expand Down
31 changes: 31 additions & 0 deletions coderd/database/dbauthz/dbauthz_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -4924,6 +4924,11 @@ func (s *MethodTestSuite) TestPrebuilds() {
Asserts(rbac.ResourceWorkspace.All(), policy.ActionRead).
ErrorsWithInMemDB(dbmem.ErrUnimplemented)
}))
s.Run("GetPresetsAtFailureLimit", s.Subtest(func(_ database.Store, check *expects) {
check.Args(int64(0)).
Asserts(rbac.ResourceTemplate.All(), policy.ActionViewInsights).
ErrorsWithInMemDB(dbmem.ErrUnimplemented)
}))
s.Run("GetPresetsBackoff", s.Subtest(func(_ database.Store, check *expects) {
check.Args(time.Time{}).
Asserts(rbac.ResourceTemplate.All(), policy.ActionViewInsights).
Expand Down Expand Up @@ -4971,8 +4976,34 @@ func (s *MethodTestSuite) TestPrebuilds() {
},
InvalidateAfterSecs: preset.InvalidateAfterSecs,
OrganizationID: org.ID,
PrebuildStatus: database.PrebuildStatusHealthy,
})
}))
s.Run("UpdatePresetPrebuildStatus", s.Subtest(func(db database.Store, check *expects) {
org := dbgen.Organization(s.T(), db, database.Organization{})
user := dbgen.User(s.T(), db, database.User{})
template := dbgen.Template(s.T(), db, database.Template{
OrganizationID: org.ID,
CreatedBy: user.ID,
})
templateVersion := dbgen.TemplateVersion(s.T(), db, database.TemplateVersion{
TemplateID: uuid.NullUUID{
UUID: template.ID,
Valid: true,
},
OrganizationID: org.ID,
CreatedBy: user.ID,
})
preset := dbgen.Preset(s.T(), db, database.InsertPresetParams{
TemplateVersionID: templateVersion.ID,
})
req := database.UpdatePresetPrebuildStatusParams{
PresetID: preset.ID,
Status: database.PrebuildStatusHealthy,
}
check.Args(req).
Asserts(rbac.ResourceTemplate.WithID(template.ID).InOrg(org.ID), policy.ActionUpdate)
}))
}

func (s *MethodTestSuite) TestOAuth2ProviderApps() {
Expand Down
25 changes: 25 additions & 0 deletions coderd/database/dbmem/dbmem.go
Original file line number Diff line number Diff line change
Expand Up @@ -4287,6 +4287,7 @@ func (q *FakeQuerier) GetPresetByID(ctx context.Context, presetID uuid.UUID) (da
CreatedAt: preset.CreatedAt,
DesiredInstances: preset.DesiredInstances,
InvalidateAfterSecs: preset.InvalidateAfterSecs,
PrebuildStatus: preset.PrebuildStatus,
TemplateID: tv.TemplateID,
OrganizationID: tv.OrganizationID,
}, nil
Expand Down Expand Up @@ -4352,6 +4353,10 @@ func (q *FakeQuerier) GetPresetParametersByTemplateVersionID(_ context.Context,
return parameters, nil
}

func (q *FakeQuerier) GetPresetsAtFailureLimit(ctx context.Context, hardLimit int64) ([]database.GetPresetsAtFailureLimitRow, error) {
return nil, ErrUnimplemented
}

func (*FakeQuerier) GetPresetsBackoff(_ context.Context, _ time.Time) ([]database.GetPresetsBackoffRow, error) {
return nil, ErrUnimplemented
}
Expand Down Expand Up @@ -9089,6 +9094,7 @@ func (q *FakeQuerier) InsertPreset(_ context.Context, arg database.InsertPresetP
Int32: 0,
Valid: true,
},
PrebuildStatus: database.PrebuildStatusHealthy,
}
q.presets = append(q.presets, preset)
return preset, nil
Expand Down Expand Up @@ -10917,6 +10923,25 @@ func (q *FakeQuerier) UpdateOrganizationDeletedByID(_ context.Context, arg datab
return sql.ErrNoRows
}

func (q *FakeQuerier) UpdatePresetPrebuildStatus(ctx context.Context, arg database.UpdatePresetPrebuildStatusParams) error {
err := validateDatabaseType(arg)
if err != nil {
return err
}

q.mutex.RLock()
defer q.mutex.RUnlock()

for _, preset := range q.presets {
if preset.ID == arg.PresetID {
preset.PrebuildStatus = arg.Status
return nil
}
}

return xerrors.Errorf("preset %v does not exist", arg.PresetID)
}

func (q *FakeQuerier) UpdateProvisionerDaemonLastSeenAt(_ context.Context, arg database.UpdateProvisionerDaemonLastSeenAtParams) error {
err := validateDatabaseType(arg)
if err != nil {
Expand Down
14 changes: 14 additions & 0 deletions coderd/database/dbmetrics/querymetrics.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

29 changes: 29 additions & 0 deletions coderd/database/dbmock/dbmock.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

9 changes: 8 additions & 1 deletion coderd/database/dump.sql

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
DELETE FROM notification_templates WHERE id = '414d9331-c1fc-4761-b40c-d1f4702279eb';
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
INSERT INTO notification_templates
(id, name, title_template, body_template, "group", actions)
VALUES ('414d9331-c1fc-4761-b40c-d1f4702279eb',
'Prebuild Failure Limit Reached',
E'There is a problem creating prebuilt workspaces',
$$
The number of failed prebuild attempts has reached the hard limit for template **{{ .Labels.template }}** and preset **{{ .Labels.preset }}**.

To resume prebuilds, fix the underlying issue and upload a new template version.

Refer to the documentation for more details:
- [Troubleshooting templates](https://coder.com/docs/admin/templates/troubleshooting)
- [Troubleshooting of prebuilt workspaces](https://coder.com/docs/admin/templates/extending-templates/prebuilt-workspaces#administration-and-troubleshooting)
$$,
'Template Events',
'[
{
"label": "View failed prebuilt workspaces",
"url": "{{base_url}}/workspaces?filter=owner:prebuilds+status:failed+template:{{.Labels.template}}"
},
{
"label": "View template version",
"url": "{{base_url}}/templates/{{.Labels.org}}/{{.Labels.template}}/versions/{{.Labels.template_version}}"
}
]'::jsonb);
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
-- Remove the column from the table first (must happen before dropping the enum type)
ALTER TABLE template_version_presets DROP COLUMN prebuild_status;

-- Then drop the enum type
DROP TYPE prebuild_status;
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
CREATE TYPE prebuild_status AS ENUM (
'healthy', -- Prebuilds are working as expected; this is the default, healthy state.
'hard_limited', -- Prebuilds have failed repeatedly and hit the configured hard failure limit; won't be retried anymore.
'validation_failed' -- Prebuilds failed due to a non-retryable validation error (e.g. template misconfiguration); won't be retried.
);

ALTER TABLE template_version_presets ADD COLUMN prebuild_status prebuild_status NOT NULL DEFAULT 'healthy'::prebuild_status;
74 changes: 68 additions & 6 deletions coderd/database/models.go

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading
Loading