feat: reinitialize agents when a prebuilt workspace is claimed #17475


Open · wants to merge 28 commits into main from jjs/prebuilds-agent-reinit

Conversation

SasSwart (Contributor)

This pull request allows Coder workspace agents to be reinitialized when a prebuilt workspace is claimed by a user. This facilitates the transfer of ownership from the anonymous prebuilds system user to the new owner of the workspace.

Only a single agent per prebuilt workspace is supported for now, but the plumbing is already in place to allow a seamless transition to multi-agent support.

@SasSwart changed the title from "WIP: agent reinitialization" to "feat: reinitialize agents when a prebuilt workspace is claimed" on Apr 21, 2025
@SasSwart force-pushed the jjs/prebuilds-agent-reinit branch from 35e4bf8 to 18da76e on April 23, 2025 13:49
@evgeniy-scherbina force-pushed the yevhenii/512-claim-prebuild branch from fe569d4 to fcdbba8 on April 23, 2025 15:23
@SasSwart force-pushed the jjs/prebuilds-agent-reinit branch from cc25406 to 26dbc3a on April 24, 2025 12:34
Base automatically changed from yevhenii/512-claim-prebuild to main on April 24, 2025 13:39
@SasSwart force-pushed the jjs/prebuilds-agent-reinit branch from ec9ed29 to 362db7c on April 25, 2025 08:32
Comment on lines 1836 to 1845
// Complete the job, optionally triggering workspace agent reinitialization:

completedJob := proto.CompletedJob{
	JobId: job.ID.String(),
	Type: &proto.CompletedJob_WorkspaceBuild_{
		WorkspaceBuild: &proto.CompletedJob_WorkspaceBuild{},
	},
}
_, err = srv.CompleteJob(ctx, &completedJob)
require.NoError(t, err)
Contributor:

Why is this necessary?

Contributor Author:

Can you elaborate by defining 'this'?

Contributor:

Does CompleteJob need to occur for the test to work?

Contributor Author:

Ah, CompleteJob is the thing that publishes to the channel. It's the entire point of this test, yes. If CompleteJob ever stops publishing, this test should fail.

}

go func() {
	<-ctx.Done()
Contributor:

How come we wait for this to complete before canceling? To protect messages from being passed to workspaceClaims after we close the channel.

Contributor Author:

We expect the caller to defer cancel() where they call this, but we can't enforce that. As a defensive measure, to protect against leaking the goroutine, the pubsub connection, and the channel, we call cancel() ourselves when the context expires.

}

func StreamAgentReinitEvents(ctx context.Context, logger slog.Logger, rw http.ResponseWriter, r *http.Request, reinitEvents <-chan agentsdk.ReinitializationEvent) {
	sseSendEvent, sseSenderClosed, err := httpapi.ServerSentEventSender(rw, r)
Contributor:

We can fix this later, but I think this might be inappropriate at this layer.
This seems like a detail of the HTTP API, and we aren't testing this directly so I'd argue this code could go back to coderd/workspaceagents.go.

Message string `json:"message"`
Reason ReinitializationReason `json:"reason"`
WorkspaceID uuid.UUID
UserID uuid.UUID
Contributor:

This bleeds the abstraction a bit because there's nothing prebuilds-specific in this func. The workspace's agent is what reinitializes, and the user should be irrelevant here. Do we need this?

@SasSwart marked this pull request as ready for review on May 1, 2025 19:34
@@ -294,7 +299,7 @@ message Metadata {
string workspace_owner_login_type = 18;
repeated Role workspace_owner_rbac_roles = 19;
bool is_prebuild = 20;
string running_workspace_agent_token = 21;
Contributor Author:

Self-review: it concerns me that running_workspace_agent_token made it into main. I'd love to undo that before this is released, but it might already be too late.

Contributor Author:

running_workspace_agent_token is defunct and meant to be replaced by running_agent_auth_tokens.
The former only supports a single agent; the latter supports multiple agents.
Nothing will break if the former stays, but it pollutes the API.

Comment on lines +275 to +279
message RunningAgentAuthToken {
	string agent_id = 1;
	string token = 2;
}

Member:

When adding a new field, you need to increment proto.CurrentVersion. We tend to add a comment as well explaining the versioning history. See provisionerd/proto/version.go.

Comment on lines +276 to +286
tokens := metadata.GetRunningAgentAuthTokens()
if len(tokens) == 1 {
	env = append(env, provider.RunningAgentTokenEnvironmentVariable("")+"="+tokens[0].Token)
} else {
	// Not currently supported, but added for forward-compatibility.
	// If there are multiple agents, provide all the tokens to terraform so that it can
	// choose the correct one for each agent ID.
	for _, t := range tokens {
		env = append(env, provider.RunningAgentTokenEnvironmentVariable(t.AgentId)+"="+t.Token)
	}
}
Member:

Shouldn't changes here also introduce tests in provisioner_test.go?

defer srv.Close()

requestCtx := testutil.Context(t, testutil.WaitShort)
req, err := http.NewRequestWithContext(requestCtx, "GET", srv.URL, nil)
Member:

Why not use srv.Client() here?

// - prebuild claimed: a prebuilt workspace is claimed, so the agent must reinitialize.
func (c *Client) WaitForReinit(ctx context.Context) (*ReinitializationEvent, error) {
	// TODO: allow configuring httpclient
	c.SDK.HTTPClient.Timeout = time.Hour * 24
Member:

Does this essentially place an upper limit of 24 hours on the maximum lifetime of a prebuild?

Contributor:

This is being worked on currently; disregard for the moment.

@@ -16,8 +18,9 @@ VALUES (

-- name: InsertPresetParameters :many
INSERT INTO
-	template_version_preset_parameters (template_version_preset_id, name, value)
+	template_version_preset_parameters (id, template_version_preset_id, name, value)
Member:

Why is it important to be able to control the ID of a template version preset?

Contributor Author:

When setting up test environments where one needs to assert on the relationships between presets, their parameters, and other parts of the DB, we need to be able to express those relationships in the test by setting the IDs explicitly.

//
// waiter := coderdtest.NewWorkspaceAgentWaiter(t, client, r.Workspace.ID)
// waiter.WaitFor(coderdtest.AgentsReady)
type WaitForCriterium func(agent codersdk.WorkspaceAgent) bool
Member:

suggestion, nonblocking: rename to WaitForAgentFn for clarity


select {
case <-ctx.Done():
	logger.Warn(ctx, "agent shutting down", slog.Error(ctx.Err()), slog.Error(context.Cause(ctx)))
Contributor:

Warn seems alarmist; there are lots of legitimate reasons to shut down. Also, it's unexpected to include more than one call to slog.Error.

// - prebuild claimed: a prebuilt workspace is claimed, so the agent must reinitialize.
func (c *Client) WaitForReinit(ctx context.Context) (*ReinitializationEvent, error) {
	// TODO: allow configuring httpclient
	c.SDK.HTTPClient.Timeout = time.Hour * 24
Contributor:

Why is a 24 hour timeout desirable? A prebuilt workspace could remain up without a reinit for much longer than this, no?

Also, this is a quiet side effect on the client: it leaves the client in a different state than it started in.

if err != nil {
	return nil, xerrors.Errorf("unmarshal reinit response: %w", err)
}
select {
Contributor:

At this point you've got the event; no reason to check the context for cancelation again, just return it.

if err != nil {
	return nil, xerrors.Errorf("failed to read server-sent event: %w", err)
}
if sse.Type != codersdk.ServerSentEventTypeData {
Contributor:

Ignoring Ping seems OK, but probably shouldn't ignore errors.

go func() {
	for retrier := retry.New(100*time.Millisecond, 10*time.Second); retrier.Wait(ctx); {
		logger.Debug(ctx, "waiting for agent reinitialization instructions")
		reinitEvent, err := client.WaitForReinit(ctx)
Contributor:

We have this SSE endpoint that can stream multiple events, so why are we hanging up after the first event just to redial the endpoint?

var once sync.Once
cancel := func() {
	once.Do(func() {
		cancelSub()
Contributor:

I checked PGPubsub, and there is no guarantee that canceling the subscription prevents the callback from being in progress or from being called once more. So it's not safe to close the channel.

Fortunately, for this particular application, the reader on the other side is not expecting the channel to be closed anyway.

default:
}

workspaceClaims := make(chan agentsdk.ReinitializationEvent, 1)
Contributor:

I can't see any reason this channel should be buffered.

@@ -718,7 +727,7 @@ func createWorkspace(
Initiator(initiatorID).
ActiveVersion().
RichParameterValues(req.RichParameterValues).
TemplateVersionPresetID(req.TemplateVersionPresetID)
RunningAgentAuthTokens(agentTokensByAgentID)
Contributor:

agentTokensByAgentID never seems to be populated.

// reused. Context: the agent token is often used in immutable attributes of workspace resource (e.g. VM/container)
// to initialize the agent, so if that value changes it will necessitate a replacement of that resource, thus
// obviating the whole point of the prebuild.
RunningAgentAuthTokens map[uuid.UUID]string `json:"running_agent_auth_tokens"`
Contributor:

I'm very nervous about putting secret credentials into the job input, which is returned when you query the job, making it much more likely to leak. I'd feel much better if provisionerdserver queried them from the database based on the agent IDs, rather than duplicating the creds in the database.

@@ -1733,6 +1744,21 @@ func (s *server) CompleteJob(ctx context.Context, completed *proto.CompletedJob)
if err != nil {
return nil, xerrors.Errorf("update workspace: %w", err)
}

if input.PrebuildClaimedByUser != uuid.Nil {
Contributor:

It feels narrow and fragile that we need to pass this imperative direction to provisionerdserver to send a signal specifically for prebuilds. It seems like a generally useful signal to know that a build completed for a given workspace... why not just always send it?

Even if we wanted to send the new owner ID in the message, there is no need to get it from the job input, we've already queried the workspace.

Signed-off-by: Danny Kopping <dannykopping@gmail.com>