
feat: cancel stuck pending jobs #17803


Merged · 24 commits into main · May 20, 2025

Conversation

ibetitsmike (Contributor)

Closes: #16488

@ibetitsmike marked this pull request as ready for review on May 13, 2025 18:46
@matifali (Member)

Does this also solve #12331?

@deansheather (Member)

> Does this also solve #12331?

@matifali Partially, but it shouldn't be closed until there's also a way for users to cancel these jobs. See #17791

Comment on lines 2319 to 2322
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceProvisionerJobs); err != nil {
return nil, err
}
return q.db.GetProvisionerJobsByIDsWithQueuePosition(ctx, ids)
Member

These should really be scoped. If users are inside orgs, even org-admins should not be able to read across org boundaries.

Organization boundaries have to be kept.

Contributor Author

Right now it's impossible to check this without extending the database schema to link ProvisionerJobs to their owners.

I'll do that in a separate PR.
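
For illustration only, an org-scoped variant of the check above might look roughly like this. It assumes the returned rows expose an OrganizationID and that rbac.ResourceProvisionerJobs supports the usual InOrg scoping; both are assumptions, not code from this PR:

```go
// Sketch, not the merged code: authorize per organization instead of
// site-wide, so org-admins cannot read provisioner jobs across org boundaries.
func (q *querier) GetProvisionerJobsByIDsWithQueuePosition(ctx context.Context, ids []uuid.UUID) ([]database.GetProvisionerJobsByIDsWithQueuePositionRow, error) {
	jobs, err := q.db.GetProvisionerJobsByIDsWithQueuePosition(ctx, ids)
	if err != nil {
		return nil, err
	}
	for _, job := range jobs {
		// Scope the read check to the job's organization. The OrganizationID
		// field on the row is an assumption about the schema.
		obj := rbac.ResourceProvisionerJobs.InOrg(job.ProvisionerJob.OrganizationID)
		if err := q.authorizeContext(ctx, policy.ActionRead, obj); err != nil {
			return nil, err
		}
	}
	return jobs, nil
}
```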

```diff
@@ -503,7 +504,7 @@ func ReloadBuiltinRoles(opts *RoleOptions) {
 		// the ability to create templates and provisioners has
 		// a lot of overlap.
 		ResourceProvisionerDaemon.Type: {policy.ActionCreate, policy.ActionRead, policy.ActionUpdate, policy.ActionDelete},
-		ResourceProvisionerJobs.Type:   {policy.ActionRead},
+		ResourceProvisionerJobs.Type:   {policy.ActionRead, policy.ActionUpdate, policy.ActionCreate},
```
Member

Wouldn't regular members need this permission too? To create the jobs for their workspaces?

Contributor Author

They would, but I'm not sure I want to pollute the RBAC by assigning members access to all provisioner jobs.

We currently don't check this at all. I'll introduce another PR that separately focuses on solving this issue.
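
As a purely hypothetical alternative, the grant could live in the member role's user-scoped permission set rather than the org- or site-wide one, so members only ever touch their own jobs. This presumes jobs are linked to an owning user, which (per this thread) they currently are not:

```go
// Hypothetical sketch inside the member role definition: a user-scoped grant,
// only meaningful once provisioner jobs carry an owner.
User: Permissions(map[string][]policy.Action{
	// Members may read, create, and update only their own provisioner jobs.
	ResourceProvisionerJobs.Type: {policy.ActionRead, policy.ActionUpdate, policy.ActionCreate},
}),
```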

@ibetitsmike force-pushed the mike/16488-cancel-stuck-pending-jobs branch from 635218d to c03bfa3 on May 19, 2025 07:57
// TODO: Remove this once we have a proper rbac check for provisioner jobs.
// Currently ProvisionerJobs are not associated with a user, so we can't
// check for a user's permissions. We'd need to check for the associated workspace
// and verify ownership through that.
Member

The comment should probably mention all the ways a job could be made and what the permission check should look like. I think the following is correct:

  • template_version_import jobs require either template_version:create on the org (if creating a new template) or template:update and template_version:create on the template (if pushing a new version)
  • template_version_dry_run jobs require permissions to workspace:create in the org as well as template_version:read on the specific template version
  • workspace_build jobs require workspace:update on the specific workspace as well as template_version:read on the specific template version

I just don't want the comment to only say that workspace ownership needs to be checked, and for some poor soul to start working on this thinking it'll be simple when in reality it will probably be difficult.

Likely should have a ticket created to track this too. You might want to add a comment here in the code with the ticket number after you create it.
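
For illustration, the three rules above might translate into a per-job-type check along these lines. Every helper and field name here is an assumption, and resolving the workspace or template version from the job is exactly the missing link discussed in this thread:

```go
// Sketch of a per-job-type authorization check; all names are assumptions.
func (q *querier) authorizeProvisionerJob(ctx context.Context, job database.ProvisionerJob) error {
	switch job.Type {
	case database.ProvisionerJobTypeTemplateVersionImport:
		// New template: template_version:create on the org. Pushing a new
		// version of an existing template also needs template:update, which
		// this sketch elides.
		return q.authorizeContext(ctx, policy.ActionCreate,
			rbac.ResourceTemplate.InOrg(job.OrganizationID))
	case database.ProvisionerJobTypeTemplateVersionDryRun:
		// workspace:create in the org, plus template_version:read on the
		// specific version (elided here).
		return q.authorizeContext(ctx, policy.ActionCreate,
			rbac.ResourceWorkspace.InOrg(job.OrganizationID))
	case database.ProvisionerJobTypeWorkspaceBuild:
		// workspace:update on the specific workspace, plus template_version:read.
		// GetWorkspaceByJobID is hypothetical; linking a job back to its
		// workspace is the hard part called out above.
		workspace, err := q.db.GetWorkspaceByJobID(ctx, job.ID)
		if err != nil {
			return err
		}
		return q.authorizeContext(ctx, policy.ActionUpdate, workspace)
	}
	return xerrors.Errorf("unhandled provisioner job type %q", job.Type)
}
```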

Contributor Author

Left a simple description with a link to the issue to avoid duplicating information.

Comment on lines +1290 to +1296
Name: "Job Reaper Detect Interval",
Description: "Interval to poll for hung and pending jobs and automatically terminate them.",
Flag: "job-hang-detector-interval",
Env: "CODER_JOB_HANG_DETECTOR_INTERVAL",
Hidden: true,
Default: time.Minute.String(),
Value: &c.JobHangDetectorInterval,
Value: &c.JobReaperDetectorInterval,
Member

Follow-up suggestion, non-blocking: there is now a disconnect between our internal naming ("job reaper") and the external flags ("hang detector"). We may wish to deprecate the old env in the future to avoid confusion.

Contributor Author

As agreed, leaving as-is for now. The Name and Description reflect the new naming, while changing the Flag and Env would break existing setups.
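
If the old env is ever deprecated, one low-risk pattern is to keep honoring it while warning. A minimal sketch, assuming a hypothetical replacement name CODER_JOB_REAPER_DETECT_INTERVAL (this PR introduces no such variable):

```go
package main

import (
	"log"
	"os"
	"time"
)

// jobReaperIntervalFromEnv prefers the (hypothetical) new env var, falls back
// to the legacy one with a deprecation warning, and defaults to one minute.
func jobReaperIntervalFromEnv() string {
	if v, ok := os.LookupEnv("CODER_JOB_REAPER_DETECT_INTERVAL"); ok {
		return v
	}
	if v, ok := os.LookupEnv("CODER_JOB_HANG_DETECTOR_INTERVAL"); ok {
		log.Println("CODER_JOB_HANG_DETECTOR_INTERVAL is deprecated; use CODER_JOB_REAPER_DETECT_INTERVAL")
		return v
	}
	return time.Minute.String()
}

func main() {
	log.Println("job reaper interval:", jobReaperIntervalFromEnv())
}
```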

Comment on lines +311 to +320

// If the job was never started (pending), set the StartedAt time to the current
// time so that the build duration is correct.
if job.JobStatus == database.ProvisionerJobStatusPending {
job.StartedAt = sql.NullTime{
Time: now,
Valid: true,
}
}
err = db.UpdateProvisionerJobWithCompleteWithStartedAtByID(ctx, database.UpdateProvisionerJobWithCompleteWithStartedAtByIDParams{
Member

Potential follow-up: the fact that we need to worry about this implies to me that we need a separate manager for provisioner job transitions, similar to the wsbuild package.

Contributor Author

Could be. Are there any more scenarios we'd like to validate and put in a new issue?

@johnstcn (Member) · May 19, 2025

I drew up a table of all the possible provisioner job states based on the four fields we use:

| #  | completed_at | error | canceled_at | started_at | status    | comment |
|----|--------------|-------|-------------|------------|-----------|---------|
| 1  | NULL         | NULL  | NULL        | NULL       | pending   |         |
| 2  | NULL         | NULL  | NULL        | timestamp  | running   |         |
| 3  | NULL         | NULL  | timestamp   | NULL       | canceling |         |
| 4  | NULL         | NULL  | timestamp   | timestamp  | canceling |         |
| 5  | NULL         | text  | NULL        | NULL       | failed    |         |
| 6  | NULL         | text  | NULL        | timestamp  | failed    | How can a job have failed if it never completed? |
| 7  | NULL         | text  | timestamp   | NULL       | failed    | How can a job have failed if it never started? |
| 8  | NULL         | text  | timestamp   | timestamp  | failed    | How can a job have failed if it never completed? |
| 9  | timestamp    | NULL  | NULL        | NULL       | succeeded | How can a job have succeeded if it never started? |
| 10 | timestamp    | NULL  | NULL        | timestamp  | succeeded |         |
| 11 | timestamp    | NULL  | timestamp   | NULL       | canceled  |         |
| 12 | timestamp    | NULL  | timestamp   | timestamp  | canceled  |         |
| 13 | timestamp    | text  | NULL        | NULL       | failed    | How can a job have failed if it never started? |
| 14 | timestamp    | text  | NULL        | timestamp  | failed    |         |
| 15 | timestamp    | text  | timestamp   | NULL       | failed    | How can a job have failed if it never started? |
| 16 | timestamp    | text  | timestamp   | timestamp  | failed    |         |

It's probably out of scope for this PR, but I think there's a case to be made for drawing a state transition diagram of provisioner jobs and figuring out which states should never be possible. The ones I've commented on are my best guesses at present.
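
To make the flagged rows concrete, here is a toy predicate encoding which combinations from the table look implausible (rows 6–9, 13, and 15). It is illustrative only, not code from this PR:

```go
package main

import "fmt"

// jobFields mirrors the four provisioner_jobs columns from the table above,
// reduced to whether each value is set.
type jobFields struct {
	completedAt bool // completed_at is a timestamp
	hasError    bool // error is non-NULL text
	canceledAt  bool // canceled_at is a timestamp
	startedAt   bool // started_at is a timestamp
}

// plausible reports whether a combination avoids the flagged rows.
func plausible(f jobFields) bool {
	switch {
	case f.hasError && f.startedAt && !f.completedAt:
		return false // rows 6, 8: failed but never completed
	case f.hasError && !f.startedAt && (f.canceledAt || f.completedAt):
		return false // rows 7, 13, 15: failed but never started
	case f.completedAt && !f.hasError && !f.canceledAt && !f.startedAt:
		return false // row 9: succeeded but never started
	default:
		return true
	}
}

func main() {
	// Row 6: error set, started, never completed -> implausible.
	fmt.Println(plausible(jobFields{hasError: true, startedAt: true})) // false
	// Row 10: completed and started, no error -> plausible (succeeded).
	fmt.Println(plausible(jobFields{completedAt: true, startedAt: true})) // true
}
```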

@johnstcn (Member) left a comment

Some other nits, but I don't need to review again. Thanks for looking into this!

@ibetitsmike merged commit 769c9ee into main on May 20, 2025
35 of 37 checks passed
@ibetitsmike deleted the mike/16488-cancel-stuck-pending-jobs branch on May 20, 2025 13:22
@github-actions bot locked and limited the conversation to collaborators on May 20, 2025
Successfully merging this pull request may close these issues:

Permanently enqueued provisioner jobs affect the position in queue