feat: reinitialize agents when a prebuilt workspace is claimed #17475
Conversation
// Complete the job, optionally triggering workspace agent reinitialization:
completedJob := proto.CompletedJob{
	JobId: job.ID.String(),
	Type: &proto.CompletedJob_WorkspaceBuild_{
		WorkspaceBuild: &proto.CompletedJob_WorkspaceBuild{},
	},
}
_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)
Why is this necessary?
Can you elaborate by defining 'this'?
Does CompleteJob need to occur for the test to work?
Ah, CompleteJob is the thing that publishes to the channel. It's the entire point of this test, yes. If CompleteJob ever stops publishing, this should fail.
}

go func() {
	<-ctx.Done()
How come we wait for this to complete before cancel-ing? To protect messages from being passed to workspaceClaims after we close the channel.
We expect the user to defer cancel() where they call this, but we can't enforce that. As a defensive measure against leaking the goroutine, the pubsub connection, and the channel, we call cancel() ourselves when the context expires.
coderd/prebuilds/claim.go
}

func StreamAgentReinitEvents(ctx context.Context, logger slog.Logger, rw http.ResponseWriter, r *http.Request, reinitEvents <-chan agentsdk.ReinitializationEvent) {
	sseSendEvent, sseSenderClosed, err := httpapi.ServerSentEventSender(rw, r)
We can fix this later, but I think this might be inappropriate at this layer. This seems like a detail of the HTTP API, and we aren't testing it directly, so I'd argue this code could go back to coderd/workspaceagents.go.
Message     string                 `json:"message"`
Reason      ReinitializationReason `json:"reason"`
WorkspaceID uuid.UUID
UserID      uuid.UUID
This bleeds the abstraction a bit because there's nothing prebuilds-specific in this func. The workspace's agent is what reinitializes, and the user should be irrelevant here. Do we need this?
@@ -294,7 +299,7 @@ message Metadata {
	string workspace_owner_login_type = 18;
	repeated Role workspace_owner_rbac_roles = 19;
	bool is_prebuild = 20;
	string running_workspace_agent_token = 21;
Self-review: it concerns me that running_workspace_agent_token made it into main. I'd love to undo that before this is released, but it might already be too late.
running_workspace_agent_token is defunct and meant to be replaced by running_agent_auth_tokens. The former only supports a single agent; the latter supports multiple agents. Nothing will break if the former stays, but it pollutes the API.
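To illustrate the difference (a hypothetical Go mirror of the two proto shapes, not the generated code): the single-string field cannot distinguish agents, while the repeated message can.

```go
package main

import "fmt"

// RunningAgentAuthToken mirrors the new repeated proto message: one token
// per agent, keyed by agent ID.
type RunningAgentAuthToken struct {
	AgentID string
	Token   string
}

// Metadata shows both shapes side by side (hypothetical mirror of the proto).
type Metadata struct {
	// Deprecated shape: only supports a single agent.
	RunningWorkspaceAgentToken string
	// Supersedes the field above: supports multiple agents.
	RunningAgentAuthTokens []RunningAgentAuthToken
}

// tokenFor picks the token for a given agent, which is impossible with the
// single-string field once a workspace has more than one agent.
func tokenFor(md Metadata, agentID string) (string, bool) {
	for _, t := range md.RunningAgentAuthTokens {
		if t.AgentID == agentID {
			return t.Token, true
		}
	}
	return "", false
}

func main() {
	md := Metadata{RunningAgentAuthTokens: []RunningAgentAuthToken{
		{AgentID: "a1", Token: "t1"},
		{AgentID: "a2", Token: "t2"},
	}}
	tok, ok := tokenFor(md, "a2")
	fmt.Println(tok, ok)
}
```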
message RunningAgentAuthToken {
	string agent_id = 1;
	string token = 2;
}
When adding a new field, you need to increment proto.CurrentVersion. We tend to add a comment as well explaining the versioning history; see provisionerd/proto/version.go.
tokens := metadata.GetRunningAgentAuthTokens()
if len(tokens) == 1 {
	env = append(env, provider.RunningAgentTokenEnvironmentVariable("")+"="+tokens[0].Token)
} else {
	// Not currently supported, but added for forward-compatibility
	for _, t := range tokens {
		// If there are multiple agents, provide all the tokens to terraform so that it can
		// choose the correct one for each agent ID.
		env = append(env, provider.RunningAgentTokenEnvironmentVariable(t.AgentId)+"="+t.Token)
	}
}
Shouldn't changes here also introduce tests in provisioner_test.go?
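As a sketch of what such a test could exercise (self-contained, with hypothetical helper names standing in for the provider package): both the single-agent branch and the multi-agent branch.

```go
package main

import "fmt"

type token struct{ AgentID, Token string }

// runningAgentTokenEnvVar stands in for the provider helper used by the
// snippet above (hypothetical name and variable format): an empty agent ID
// yields the legacy unsuffixed variable.
func runningAgentTokenEnvVar(agentID string) string {
	if agentID == "" {
		return "CODER_RUNNING_WORKSPACE_AGENT_TOKEN"
	}
	return "CODER_RUNNING_WORKSPACE_AGENT_TOKEN_" + agentID
}

// agentTokenEnv reproduces the branching under test: a single token uses the
// legacy unsuffixed variable, multiple tokens are keyed by agent ID.
func agentTokenEnv(tokens []token) []string {
	var env []string
	if len(tokens) == 1 {
		return append(env, runningAgentTokenEnvVar("")+"="+tokens[0].Token)
	}
	for _, t := range tokens {
		env = append(env, runningAgentTokenEnvVar(t.AgentID)+"="+t.Token)
	}
	return env
}

func main() {
	fmt.Println(agentTokenEnv([]token{{AgentID: "a1", Token: "t1"}}))
}
```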
defer srv.Close()

requestCtx := testutil.Context(t, testutil.WaitShort)
req, err := http.NewRequestWithContext(requestCtx, "GET", srv.URL, nil)
Why not use srv.Client() here?
// - prebuild claimed: a prebuilt workspace is claimed, so the agent must reinitialize.
func (c *Client) WaitForReinit(ctx context.Context) (*ReinitializationEvent, error) {
	// TODO: allow configuring httpclient
	c.SDK.HTTPClient.Timeout = time.Hour * 24
Does this essentially place an upper limit of 24 hours on the maximum lifetime of a prebuild?
This is being worked on currently; disregard for the moment.
coderd/database/queries/presets.sql
@@ -16,8 +18,9 @@ VALUES (

-- name: InsertPresetParameters :many
INSERT INTO
	template_version_preset_parameters (template_version_preset_id, name, value)
	template_version_preset_parameters (id, template_version_preset_id, name, value)
Why is it important to be able to control the ID of a template version preset?
When setting up test environments where one needs to assert based on the relationships between presets, their parameters, and other parts of the DB, we need to be able to express those relationships in the test by setting the IDs explicitly.
//
//	waiter := coderdtest.NewWorkspaceAgentWaiter(t, client, r.Workspace.ID)
//	waiter.WaitFor(coderdtest.AgentsReady)
type WaitForCriterium func(agent codersdk.WorkspaceAgent) bool
suggestion, nonblocking: rename to WaitForAgentFn for clarity
select {
case <-ctx.Done():
	logger.Warn(ctx, "agent shutting down", slog.Error(ctx.Err()), slog.Error(context.Cause(ctx)))
Warn seems alarmist; there are lots of legitimate reasons to shut down. Also, it's unexpected to include more than one call to slog.Error.
// - prebuild claimed: a prebuilt workspace is claimed, so the agent must reinitialize.
func (c *Client) WaitForReinit(ctx context.Context) (*ReinitializationEvent, error) {
	// TODO: allow configuring httpclient
	c.SDK.HTTPClient.Timeout = time.Hour * 24
Why is a 24 hour timeout desirable? A prebuilt workspace could remain up without a reinit for much longer than this, no?
Also, this is a quiet side effect on the client: it leaves the client in a different state than it started in.
if err != nil {
	return nil, xerrors.Errorf("unmarshal reinit response: %w", err)
}
select {
At this point you've got the event; no reason to check the context for cancelation again, just return it.
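A sketch of the suggested simplification (ReinitializationEvent here is a minimal stand-in for the SDK type, not the real one): once unmarshalling succeeds, return the event directly instead of selecting on ctx.Done() one more time.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReinitializationEvent is a minimal stand-in for the SDK type.
type ReinitializationEvent struct {
	Message string `json:"message"`
	Reason  string `json:"reason"`
}

// decodeReinit shows the shape of the suggestion: once the payload has been
// unmarshalled successfully, the event is already in hand, so return it
// directly; no extra context check is needed.
func decodeReinit(payload []byte) (*ReinitializationEvent, error) {
	var ev ReinitializationEvent
	if err := json.Unmarshal(payload, &ev); err != nil {
		return nil, fmt.Errorf("unmarshal reinit response: %w", err)
	}
	return &ev, nil
}

func main() {
	ev, err := decodeReinit([]byte(`{"message":"claimed","reason":"prebuild_claimed"}`))
	fmt.Println(ev.Reason, err)
}
```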
if err != nil {
	return nil, xerrors.Errorf("failed to read server-sent event: %w", err)
}
if sse.Type != codersdk.ServerSentEventTypeData {
Ignoring Ping seems OK, but probably shouldn't ignore errors.
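A self-contained sketch of that handling (the types here are hypothetical stand-ins for the codersdk SSE event types): skip pings, surface error events to the caller, and collect data events.

```go
package main

import "fmt"

// Minimal stand-ins for the SDK's server-sent event types.
type sseType string

const (
	sseData  sseType = "data"
	ssePing  sseType = "ping"
	sseError sseType = "error"
)

type sse struct {
	Type sseType
	Data string
}

// handleSSE sketches the suggested handling: pings are safely skipped, but
// error events are surfaced rather than silently dropped.
func handleSSE(events []sse) ([]string, error) {
	var out []string
	for _, e := range events {
		switch e.Type {
		case ssePing:
			continue // keepalive; nothing to do
		case sseError:
			return out, fmt.Errorf("server-sent error event: %s", e.Data)
		case sseData:
			out = append(out, e.Data)
		}
	}
	return out, nil
}

func main() {
	out, err := handleSSE([]sse{{Type: ssePing}, {Type: sseData, Data: "reinit"}})
	fmt.Println(out, err)
}
```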
go func() {
	for retrier := retry.New(100*time.Millisecond, 10*time.Second); retrier.Wait(ctx); {
		logger.Debug(ctx, "waiting for agent reinitialization instructions")
		reinitEvent, err := client.WaitForReinit(ctx)
We have this SSE endpoint that can stream multiple events, so why are we hanging up after the first event just to redial the endpoint?
var once sync.Once
cancel := func() {
	once.Do(func() {
		cancelSub()
I checked PGPubsub, and there isn't any guarantee that canceling the subscription means the callback is not in progress or about to be called once more. So it's not safe to close the channel.
Fortunately, for this particular application, the reader on the other side is not expecting the channel to be closed anyway.
default:
}

workspaceClaims := make(chan agentsdk.ReinitializationEvent, 1)
I can't see any reason this channel should be buffered.
@@ -718,7 +727,7 @@ func createWorkspace(
	Initiator(initiatorID).
	ActiveVersion().
	RichParameterValues(req.RichParameterValues).
	TemplateVersionPresetID(req.TemplateVersionPresetID)
	RunningAgentAuthTokens(agentTokensByAgentID)
agentTokensByAgentID never seems to be populated.
// reused. Context: the agent token is often used in immutable attributes of workspace resource (e.g. VM/container)
// to initialize the agent, so if that value changes it will necessitate a replacement of that resource, thus
// obviating the whole point of the prebuild.
RunningAgentAuthTokens map[uuid.UUID]string `json:"running_agent_auth_tokens"`
I'm very nervous about putting secret credentials into the job input, which is returned when you query the job, making it much more likely to leak. I'd feel much better if provisionerdserver queried them from the database based on the agent IDs, rather than duplicating the creds in the database.
@@ -1733,6 +1744,21 @@ func (s *server) CompleteJob(ctx context.Context, completed *proto.CompletedJob)
	if err != nil {
		return nil, xerrors.Errorf("update workspace: %w", err)
	}

	if input.PrebuildClaimedByUser != uuid.Nil {
It feels narrow and fragile that we need to pass this imperative direction to provisionerdserver to send a signal specifically for prebuilds. It seems like a generally useful signal to know that a build completed for a given workspace, so why not just always send it?
Even if we wanted to send the new owner ID in the message, there's no need to get it from the job input; we've already queried the workspace.
Signed-off-by: Danny Kopping <dannykopping@gmail.com>
This pull request allows Coder workspace agents to be reinitialized when a prebuilt workspace is claimed by a user. This facilitates the transfer of ownership between the anonymous prebuilds system user and the new owner of the workspace.
Only a single agent per prebuilt workspace is supported for now, but plumbing has already been done to facilitate a seamless transition to multi-agent support.