-
Notifications
You must be signed in to change notification settings - Fork 889
fix: reduce cost of prebuild failure #17697
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ab27056
to
3ef469a
Compare
1e9c551
to
fcb3e5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Query looks good, we need more visibility of this problem and its remediation, though.
func (q *querier) GetPresetsAtFailureLimit(ctx context.Context, hardLimit int64) ([]database.GetPresetsAtFailureLimitRow, error) { | ||
// GetPresetsAtFailureLimit returns a list of template version presets that have reached the hard failure limit. | ||
// Request the same authorization permissions as GetPresetsBackoff, since the methods are similar. | ||
if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourceTemplate.All()); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still think "insights" is a weird auth context to use here.
Why did we choose this again @SasSwart?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dannykopping
Basically we have 3 actions:
Read
Template.All()Use
Template.All()ViewInsights
Template.All()
Read
is for reading the content of template
Use
is for creating workspaces
ViewInsights
is for gathering stats & metrics across templates
Looks like GetPresetsBackoff
falls into ViewInsights
category - because we gathering statistics across all template presets, calculating number of failed builds, etc...
I had a discussion with Steven regarding this, but I'm open to suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repurposing other RBAC roles just because they overlap logically feels like a bad design choice.
Each feature should have its own RBAC policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can introduce new resource type
here: https://github.com/coder/coder/blob/main/coderd/rbac/object_gen.go
Then use it like this:
if err := q.authorizeContext(ctx, policy.ActionViewInsights, rbac.ResourcePrebuilds.All()); err != nil {}
Optionally introduce new action type
as well, and use it like this:
if err := q.authorizeContext(ctx, policy.ActionCollectStatistics, rbac.ResourcePrebuilds.All()); err != nil {}
But I think we don't want to introduce too many action types.
cc @Emyrk
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can do this in a follow-up, by the way 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have all the context for prebuilds. Can we throw a quick disc chat on the calendar?
e88e301
to
0fdd096
Compare
ab90612
to
5c60065
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good so far, but I don't see any metric added which we previously discussed.
Given the size of the PR, it can be a follow-up.
coderd/database/migrations/000328_prebuild_failure_limit_notification.up.sql
Outdated
Show resolved
Hide resolved
coderd/database/migrations/000328_prebuild_failure_limit_notification.up.sql
Outdated
Show resolved
Hide resolved
coderd/database/migrations/000328_prebuild_failure_limit_notification.up.sql
Show resolved
Hide resolved
coderd/database/migrations/000328_prebuild_failure_limit_notification.up.sql
Outdated
Show resolved
Hide resolved
coderd/database/migrations/000329_add_status_to_template_presets.up.sql
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM 👍
Relates to #17432
Part 1:
Notes:
GetPresetsAtFailureLimit
SQL query is added, which is similar toGetPresetsBackoff
, they use same CTEs:filtered_builds
,time_sorted_builds
, but they are still different.Query is executed on every loop iteration. We can consider marking specific preset as permanently failed as an optimization to avoid executing query on every loop iteration. But I decided don't do it for now.
By default
FailureHardLimit
is set to 3.FailureHardLimit
is configurable. Setting it to zero - means that hard limit is disabled.Part 2
Notes:
PrebuildFailureLimitReached
notification is added.log.Warn
on every loop iteration.validation_failed
not used in this PR, but I think it will be used in next one, so I wanted to save us an extra migration.Latest notification views: