-
Notifications
You must be signed in to change notification settings - Fork 887
feat: cancel stuck pending jobs #17803
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This reverts commit 4b252eb.
Does this also solve #12331? |
coderd/database/dbauthz/dbauthz.go
Outdated
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceProvisionerJobs); err != nil { | ||
return nil, err | ||
} | ||
return q.db.GetProvisionerJobsByIDsWithQueuePosition(ctx, ids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should really be scoped. If users are inside orgs, even if they are org-admins
they should not be able to read across org boundaries.
The organization boundaries have to be kept.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right now it's impossible to check this without extending the database that would link ProvisionerJobs to owners.
I'll do that in a separate PR.
@@ -503,7 +504,7 @@ func ReloadBuiltinRoles(opts *RoleOptions) { | |||
// the ability to create templates and provisioners has | |||
// a lot of overlap. | |||
ResourceProvisionerDaemon.Type: {policy.ActionCreate, policy.ActionRead, policy.ActionUpdate, policy.ActionDelete}, | |||
ResourceProvisionerJobs.Type: {policy.ActionRead}, | |||
ResourceProvisionerJobs.Type: {policy.ActionRead, policy.ActionUpdate, policy.ActionCreate}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't regular members need this permission too? To create the jobs for their workspaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They would, but I'm not sure I want to polute the RBAC with assigning members access to all provisioner jobs.
We currently don't check this at all. I'll introduce another PR that separetly focuses on solving this issue.
635218d
to
c03bfa3
Compare
coderd/database/dbauthz/dbauthz.go
Outdated
// TODO: Remove this once we have a proper rbac check for provisioner jobs. | ||
// Currently ProvisionerJobs are not associated with a user, so we can't | ||
// check for a user's permissions. We'd need to check for the associated workspace | ||
// and verify ownership through that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment should probably mention all the ways a job could be made and what the permission check should look like. I think the following is correct:
template_version_import
jobs require eithertemplate_version:create
on the org (if creating a new template) ortemplate:update
andtemplate_version:create
on the template (if pushing a new version)template_version_dry_run
jobs require permissions toworkspace:create
in the org as well astemplate_version:read
on the specific template versionworkspace_build
jobs requireworkspace:update
on the specific workspace as well astemplate_version:read
on the specific template version
I just don't want the comment to only say that workspace ownership needs to be checked, and for some poor soul to start working on this thinking it'll be simple when in reality it will probably be difficult.
Likely should have a ticket created to track this too. You might want to add a comment here in the code with the ticket number after you create it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a simple description with link to the issue to not duplicate information.
Name: "Job Reaper Detect Interval", | ||
Description: "Interval to poll for hung and pending jobs and automatically terminate them.", | ||
Flag: "job-hang-detector-interval", | ||
Env: "CODER_JOB_HANG_DETECTOR_INTERVAL", | ||
Hidden: true, | ||
Default: time.Minute.String(), | ||
Value: &c.JobHangDetectorInterval, | ||
Value: &c.JobReaperDetectorInterval, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up suggestion, non-blocking: there is now a disconnect between our internal naming ("job reaper") and the external flags ("hang detector"). We may wish to deprecate the old env in future to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As agreed - leaving as-is for now. The description & name reflect this, while changing Flag & Env will break existing setups.
|
||
// If the job was never started (pending), set the StartedAt time to the current | ||
// time so that the build duration is correct. | ||
if job.JobStatus == database.ProvisionerJobStatusPending { | ||
job.StartedAt = sql.NullTime{ | ||
Time: now, | ||
Valid: true, | ||
} | ||
} | ||
err = db.UpdateProvisionerJobWithCompleteWithStartedAtByID(ctx, database.UpdateProvisionerJobWithCompleteWithStartedAtByIDParams{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential follow-up: The fact that we need to worry about this implies to me that we need a separate manager for provisioner job transitions similar to the wsbuild
package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be, are there any more scenarios we'd like to validate and put in a new issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did up a table of all of the possible provisioner job states based on the four fields we use:
# | completed_at | error | canceled_at | started_at | status | comment |
---|---|---|---|---|---|---|
1 | NULL | NULL | NULL | NULL | pending | |
2 | NULL | NULL | NULL | timestamp | running | |
3 | NULL | NULL | timestamp | NULL | canceling | |
4 | NULL | NULL | timestamp | timestamp | canceling | |
5 | NULL | text | NULL | NULL | failed | |
6 | NULL | text | NULL | timestamp | failed | How can a job have failed if it never completed? |
7 | NULL | text | timestamp | NULL | failed | How can a job have failed it it never started? |
8 | NULL | text | timestamp | timestamp | failed | How can a job have failed if it never completed? |
9 | timestamp | NULL | NULL | NULL | succeeded | How can a job have succeeded if it never started? |
10 | timestamp | NULL | NULL | timestamp | succeeded | |
11 | timestamp | NULL | timestamp | NULL | canceled | |
12 | timestamp | NULL | timestamp | timestamp | canceled | |
13 | timestamp | text | NULL | NULL | failed | How can a job have failed if it never started? |
14 | timestamp | text | NULL | timestamp | failed | |
15 | timestamp | text | timestamp | NULL | failed | How can a job have failed if it never started? |
16 | timestamp | text | timestamp | timestamp | failed |
It's probably definitely out of scope of this PR, but I think there's a case to be made for drawing a state transition digram of provisioner jobs and figuring out which ones should never be possible. The ones I've commented on are my best guess at present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other nits, but I don't need to review again. Thanks for looking into this!
Closes: #16488