provisionerd sends failed or complete last #2732
Conversation
Signed-off-by: Spike Curtis <spike@coder.com>
@@ -7,6 +7,6 @@ func IsInitProcess() bool {
	return false
}

func ForkReap(opt ...Option) error {
This trips up the linter on my box, so I fixed it. Missed by CI because CI linting happens on Linux.
@@ -130,6 +130,7 @@ func (api *API) provisionerJobLogs(rw http.ResponseWriter, r *http.Request, job
	for _, log := range logs {
		select {
		case bufferedLogs <- log:
			api.Logger.Debug(r.Context(), "subscribe buffered log", slog.F("job_id", job.ID), slog.F("stage", log.Stage))
Is this going to be noisy in our logs?
In Debug!
provisionerd/runner.go
Outdated
type messageSender interface {
	updateJob(ctx context.Context, in *proto.UpdateJobRequest) (*proto.UpdateJobResponse, error)
	failJob(ctx context.Context, in *proto.FailedJob) error
	completeJob(ctx context.Context, in *proto.CompletedJob) error
}
Could this just use the `DRPCProvisionerDaemonServer` interface instead? Provisionerd seems to be wrapping those, which is fine, but that's just additional functionality and not required to fulfill the interface. This could reduce misdirection.
That interface is larger, giving access to the `Conn` and to the ability to acquire a new job. We don't want the runner doing that.
The `provisionerd.Server` already has an `acquireJob()` method, and it doesn't have this signature, nor should it. Also, `DRPCProvisionerDaemonServer` has these as methods visible outside the package, and we don't want that.
I'm not sure creating a single interface that's used in a single place improves clarity around that. I feel like a comment explaining why `acquireJob` shouldn't be passed through would do just as well, because a consumer would just add it if they really wanted to call it anyway.
So, with the new structure of Runner in a separate package, we definitely need an interface here: *Runner can't use the *Server type because it would cause an import loop.
I'm pretty strongly against reusing the `DRPCProvisionerDaemonServer` interface for this purpose because it requires implementing and exposing the `AcquireJob()` method.

"Keep interfaces small" is generally considered a best practice. More important in this context, though, is that we don't want the Runner acquiring jobs. That's not the contract: the Server acquires the job, and the Runner updates/fails/completes it.

Moreover, the Server is presently designed with the assumption that there is only one acquired job at a time. If the Server exposes an external method that allows other things to acquire jobs, that assumption could easily be violated. We should not leave guns like that lying around for our future selves to get shot in the foot.
Some questions and minor stylistic comments, but in general this looks good to me! I'm deferring approval to someone more familiar with the previous logic, though.
Side-note: Being somewhat unfamiliar with provisionerd, it'd have been nice to have this as two PRs, one breaking out the functionality from one file, and another implementing the fix so it's clearer what's new and old code.
provisionerd/runner.go
Outdated
// are canceled but complete asynchronously (although they are prevented from further updating the job to the coder
// server). The provided context sets how long to keep trying to send the FailJob.
func (r *runner) fail(ctx context.Context, f *proto.FailedJob) error {
	f.JobId = r.job.JobId
Should this be protected by a mutex? It's also set in `setFail`.
`r.job` is never changed, and so doesn't need to be protected.
I was referring to `f.JobId`, which is re-assigned.
Ah, I just noticed that `f` won't actually be re-used between function invocations, so it's not a problem. Curious why this uses `r.job.JobId` and the other `r.job.GetJobId()`?
That's on the passed-in proto, and isn't protected by any mutex.
"this uses `r.job.JobId` and the other `r.job.GetJobId()`"

That's just me being inconsistent for no reason.
provisionerd/runner.go
Outdated
case <-r.forceStopContext.Done():
	return
case <-r.gracefulContext.Done():
	_ = stream.Send(&sdkproto.Provision_Request{
I'm not sure what's supposed to happen here. Since we're closing the stream on function exit (above), we may do `stream.Send` on a closed stream here. I'm guessing this should be OK, but a comment explaining it may be good.
@spikecurtis did you see this comment? (Might've gotten lost in the noise, won't object if you feel a comment is not needed.)
Sorry, I kinda mentally skipped over it because that's existing code, other than the names of the contexts.
But yeah, I think your analysis is right, the stream might be closed at the point this case fires, but sending the cancel is best effort anyway.
Signed-off-by: Spike Curtis <spike@coder.com>
Did a second pass on this and it looks good! I like the refactor into a separate package, turned out well. 👍🏻
(Approving, although I still think approval from someone more familiar with `provisionerd` might be good.)
I'm nervous about the lack of tests on the `runner` package. What do you think about moving those over from `provisionerd_test.go`?
CompleteJob(ctx context.Context, in *proto.CompletedJob) error
}

func NewRunner(
This could be called `New` to eliminate redundancy in the naming.
@kylecarbs From what I looked at, there's relatively high coverage via existing tests.
@mafredri I dislike us separating tests from the package at hand. We have a unique case with codersdk, but I think that should stay isolated.
We can't just move the tests in. It's a good idea, and I'm happy to do it, but I'm not convinced it's the best use of my (or anyone's, really) time right now. WDYT @kylecarbs?
Seems fine to me. I use VS Code's test gutters to check what is/isn't covered, and it works pretty well, so I'm sad this won't appear there. I think we should make an issue for it; then this looks good to me!
Issue is #2780
Substantially narrows the race window on #2603 but doesn't close it.
This PR refactors provisionerd to include a job runner that is designed to ensure that we get at most 1 FailedJob or CompletedJob message per job, and that we cease any updates for jobs once this is sent.
It uses unexported interfaces to distinguish between methods that can be called by things like the main provisionerd goroutines (things like canceling the current job), and methods internal to the runner. Similarly, we abstract the provisionerd server from the runner so that we ensure that sending messages about jobs happens with consistent code.