feat: reinitialize agents when a prebuilt workspace is claimed #17475


Open: wants to merge 28 commits into base: main

Changes from 1 commit (of 28)
c09c9b9
WIP: agent reinitialization
SasSwart Apr 21, 2025
476fe71
fix assignment to nil map
SasSwart Apr 21, 2025
8c8bca6
fix: ensure prebuilt workspace agent tokens are reused when a prebuil…
SasSwart Apr 23, 2025
7ce4eea
test agent reinitialization
SasSwart Apr 24, 2025
52ac64e
remove defunct metric
SasSwart Apr 24, 2025
362db7c
Remove todo
SasSwart Apr 25, 2025
dcc7379
test that we trigger workspace agent reinitialization under the right…
SasSwart Apr 28, 2025
ff66b3f
slight improvements to a test
SasSwart Apr 28, 2025
efff5d9
review notes to improve legibility
SasSwart Apr 28, 2025
cebd5db
add an integration test for prebuilt workspace agent reinitialization
SasSwart Apr 29, 2025
2679138
Merge remote-tracking branch 'origin/main' into jjs/prebuilds-agent-r…
SasSwart Apr 29, 2025
9feebef
enable the premium license in a prebuilds integration test
SasSwart Apr 29, 2025
b117b5c
encapsulate WaitForReinitLoop for easier testing
SasSwart Apr 30, 2025
a22b414
introduce unit testable abstraction layers
SasSwart Apr 30, 2025
9bbd2c7
test workspace claim pubsub
SasSwart May 1, 2025
5804201
add tests for agent reinitialization
SasSwart May 1, 2025
7e8dcee
review notes
SasSwart May 1, 2025
725f97b
Merge remote-tracking branch 'origin/main' into jjs/prebuilds-agent-r…
SasSwart May 1, 2025
a9b1567
make fmt lint
SasSwart May 1, 2025
21ee970
remove go mod replace
SasSwart May 1, 2025
e54d7e7
remove defunct logging
SasSwart May 1, 2025
2799858
update dependency on terraform-provider-coder
SasSwart May 2, 2025
1d93003
update dependency on terraform-provider-coder
SasSwart May 2, 2025
763fc12
go mod tidy
SasSwart May 2, 2025
0f879c7
make -B gen
SasSwart May 2, 2025
61784c9
dont require ids to InsertPresetParameters
SasSwart May 2, 2025
604eb27
dont require ids to InsertPresetParameters
SasSwart May 2, 2025
bf4d2cf
fix: set the running agent token
dannykopping May 2, 2025
review notes to improve legibility
SasSwart committed Apr 28, 2025
commit efff5d9e7383653f079d2f682cf003171b005c89
4 changes: 2 additions & 2 deletions cli/agent.go
@@ -68,7 +68,7 @@ func (r *RootCmd) workspaceAgent() *serpent.Command {
Handler: func(inv *serpent.Invocation) error {
ctx, cancel := context.WithCancelCause(inv.Context())
defer func() {
- cancel(xerrors.New("defer"))
+ cancel(xerrors.New("agent exited"))
}()

var (
@@ -335,7 +335,7 @@ func (r *RootCmd) workspaceAgent() *serpent.Command {
// TODO: timeout ok?
reinitCtx, reinitCancel := context.WithTimeout(context.Background(), time.Hour*24)
defer reinitCancel()
- reinitEvents := make(chan agentsdk.ReinitializationResponse)
+ reinitEvents := make(chan agentsdk.ReinitializationEvent)

go func() {
// Retry to wait for reinit, main context cancels the retrier.
6 changes: 3 additions & 3 deletions cli/agent_test.go
@@ -359,18 +359,18 @@ func TestAgent_Prebuild(t *testing.T) {

// Check that the agent is in a happy steady state
waiter := coderdtest.NewWorkspaceAgentWaiter(t, client, r.Workspace.ID)
- waiter.WaitFor(coderdtest.AgentReady)
+ waiter.WaitFor(coderdtest.AgentsReady)

// Trigger reinitialization
channel := agentsdk.PrebuildClaimedChannel(r.Workspace.ID)
err := pubsub.Publish(channel, []byte(user.UserID.String()))
require.NoError(t, err)

// Check that the agent reinitializes
- waiter.WaitFor(coderdtest.AgentNotReady)
+ waiter.WaitFor(coderdtest.AgentsNotReady)

// Check that reinitialization completed
- waiter.WaitFor(coderdtest.AgentReady)
+ waiter.WaitFor(coderdtest.AgentsReady)
}

func matchAgentWithVersion(rs []codersdk.WorkspaceResource) bool {
16 changes: 13 additions & 3 deletions coderd/coderdtest/coderdtest.go
@@ -1105,14 +1105,24 @@ func (w WorkspaceAgentWaiter) MatchResources(m func([]codersdk.WorkspaceResource
return w
}

+ // WaitForCriterium represents a boolean assertion to be made against each agent
+ // that a given WorkspaceAgentWaiter knows about. Each WaitForCriterium applies
+ // its check to a single agent, but is named in the plural because `func (w WorkspaceAgentWaiter) WaitFor`
+ // applies the check to all agents it is aware of. This ensures that the public API of the waiter
+ // reads correctly. For example:
+ //
+ //	waiter := coderdtest.NewWorkspaceAgentWaiter(t, client, r.Workspace.ID)
+ //	waiter.WaitFor(coderdtest.AgentsReady)
+ type WaitForCriterium func(agent codersdk.WorkspaceAgent) bool
Review comment (Member): suggestion, nonblocking: rename to WaitForAgentFn for clarity
- func AgentReady(agent codersdk.WorkspaceAgent) bool {
+ // AgentsReady checks that the latest lifecycle state of an agent is "Ready".
+ func AgentsReady(agent codersdk.WorkspaceAgent) bool {
return agent.LifecycleState == codersdk.WorkspaceAgentLifecycleReady
}

- func AgentNotReady(agent codersdk.WorkspaceAgent) bool {
- return !AgentReady(agent)
+ // AgentsNotReady checks that the latest lifecycle state of an agent is anything except "Ready".
+ func AgentsNotReady(agent codersdk.WorkspaceAgent) bool {
+ return !AgentsReady(agent)
}

func (w WorkspaceAgentWaiter) WaitFor(criteria ...WaitForCriterium) {
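The WaitForCriterium pattern in this diff (a per-agent predicate, applied by WaitFor across every agent the waiter knows about) can be sketched outside coderdtest. The types below are simplified stand-ins for the SDK types, not Coder's actual API, and the single-poll helper replaces the waiter's retry loop:

```go
package main

import "fmt"

// WorkspaceAgent is a simplified stand-in for codersdk.WorkspaceAgent.
type WorkspaceAgent struct {
	Name           string
	LifecycleState string
}

// WaitForCriterium is a per-agent predicate; WaitFor applies it to every agent.
type WaitForCriterium func(agent WorkspaceAgent) bool

// AgentsReady reports whether a single agent's lifecycle state is "ready".
func AgentsReady(agent WorkspaceAgent) bool {
	return agent.LifecycleState == "ready"
}

// AgentsNotReady is the negation of AgentsReady.
func AgentsNotReady(agent WorkspaceAgent) bool {
	return !AgentsReady(agent)
}

// allAgentsMatch mimics one iteration of the waiter's WaitFor: every
// criterion must hold for every agent (no retry loop in this sketch).
func allAgentsMatch(agents []WorkspaceAgent, criteria ...WaitForCriterium) bool {
	for _, agent := range agents {
		for _, c := range criteria {
			if !c(agent) {
				return false
			}
		}
	}
	return true
}

func main() {
	agents := []WorkspaceAgent{
		{Name: "main", LifecycleState: "ready"},
		{Name: "sidecar", LifecycleState: "starting"},
	}
	fmt.Println(allAgentsMatch(agents, AgentsReady)) // false: sidecar not ready
	agents[1].LifecycleState = "ready"
	fmt.Println(allAgentsMatch(agents, AgentsReady)) // true
}
```

The plural naming reads naturally at the call site (WaitFor(AgentsReady)) even though each predicate inspects one agent at a time, which is the point the doc comment in the diff is making.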
2 changes: 1 addition & 1 deletion coderd/provisionerdserver/provisionerdserver.go
@@ -1752,7 +1752,7 @@ func (s *server) CompleteJob(ctx context.Context, completed *proto.CompletedJob)

channel := agentsdk.PrebuildClaimedChannel(workspace.ID)
if err := s.Pubsub.Publish(channel, []byte(input.PrebuildClaimedByUser.String())); err != nil {
- s.Logger.Error(ctx, "failed to publish message to instruct prebuilt workspace agent reinitialization", slog.Error(err))
+ s.Logger.Error(ctx, "failed to trigger prebuilt workspace agent reinitialization", slog.Error(err))
}
}
case *proto.CompletedJob_TemplateDryRun_:
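The trigger path in CompleteJob (publish the claiming user's ID on a per-workspace channel; the agent-facing handler subscribes to it) can be sketched with a minimal in-memory pubsub. The Pubsub type and the channel-name format below are hypothetical stand-ins, not Coder's actual database-backed pubsub:

```go
package main

import (
	"fmt"
	"sync"
)

// Pubsub is a minimal in-memory stand-in for a database-backed pubsub.
type Pubsub struct {
	mu   sync.Mutex
	subs map[string][]chan []byte
}

func NewPubsub() *Pubsub {
	return &Pubsub{subs: make(map[string][]chan []byte)}
}

// Subscribe registers a buffered channel for an event name and returns it.
func (p *Pubsub) Subscribe(event string) <-chan []byte {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch := make(chan []byte, 1)
	p.subs[event] = append(p.subs[event], ch)
	return ch
}

// Publish delivers the message to every subscriber of the event.
func (p *Pubsub) Publish(event string, msg []byte) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, ch := range p.subs[event] {
		ch <- msg
	}
}

// prebuildClaimedChannel mirrors the per-workspace channel naming idea;
// the exact format string is an assumption.
func prebuildClaimedChannel(workspaceID string) string {
	return "prebuild_claimed_" + workspaceID
}

func main() {
	ps := NewPubsub()
	workspaceID := "ws-123"
	claims := ps.Subscribe(prebuildClaimedChannel(workspaceID))

	// CompleteJob side: the prebuilt workspace was claimed, notify listeners.
	ps.Publish(prebuildClaimedChannel(workspaceID), []byte("user-456"))

	fmt.Printf("reinitialize agents for claim by %s\n", <-claims)
}
```

Publishing is fire-and-forget here, which matches the diff: a failed publish is logged rather than failing job completion.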
23 changes: 12 additions & 11 deletions coderd/provisionerdserver/provisionerdserver_test.go
@@ -1771,20 +1771,21 @@ func TestCompleteJob(t *testing.T) {
t.Run(tc.name, func(t *testing.T) {
t.Parallel()

- // Setup starting state:
+ // GIVEN an enqueued provisioner job and its dependencies:

userID := uuid.New()

srv, db, ps, pd := setup(t, false, &overrides{})

buildID := uuid.New()
- scheduledJobInput := provisionerdserver.WorkspaceProvisionJob{
+ jobInput := provisionerdserver.WorkspaceProvisionJob{
WorkspaceBuildID: buildID,
}
if tc.shouldReinitializeAgent { // This is the key lever in the test
- scheduledJobInput.PrebuildClaimedByUser = userID
+ // GIVEN the enqueued provisioner job is for a workspace being claimed by a user:
+ jobInput.PrebuildClaimedByUser = userID
}
- input, err := json.Marshal(scheduledJobInput)
+ input, err := json.Marshal(jobInput)
require.NoError(t, err)

ctx := testutil.Context(t, testutil.WaitShort)
@@ -1821,19 +1822,17 @@
})
require.NoError(t, err)

- // Subscribe to workspace reinitialization:
+ // GIVEN something is listening to process workspace reinitialization:

- // If the job needs to reinitialize agents for the workspace,
- // check that the instruction to do so was enqueued
eventName := agentsdk.PrebuildClaimedChannel(workspace.ID)
- gotChan := make(chan []byte, 1)
+ reinitChan := make(chan []byte, 1)
cancel, err := ps.Subscribe(eventName, func(inner context.Context, userIDMessage []byte) {
- gotChan <- userIDMessage
+ reinitChan <- userIDMessage
})
require.NoError(t, err)
defer cancel()

- // Complete the job, optionally triggering workspace agent reinitialization:
+ // WHEN the job is completed

completedJob := proto.CompletedJob{
JobId: job.ID.String(),
@@ -1845,12 +1844,14 @@
require.NoError(t, err)

select {
- case userIDMessage := <-gotChan:
+ case userIDMessage := <-reinitChan:
+ // THEN workspace agent reinitialization instruction was received:
gotUserID, err := uuid.ParseBytes(userIDMessage)
require.NoError(t, err)
require.True(t, tc.shouldReinitializeAgent)
require.Equal(t, userID, gotUserID)
case <-ctx.Done():
+ // THEN workspace agent reinitialization instruction was not received.
require.False(t, tc.shouldReinitializeAgent)
}
})
3 changes: 1 addition & 2 deletions coderd/workspaceagents.go
@@ -1233,15 +1233,14 @@ func (api *API) workspaceAgentReinit(rw http.ResponseWriter, r *http.Request) {
Type: codersdk.ServerSentEventTypePing,
})

- // Expand with future use-cases for agent reinitialization.
for {
select {
case <-ctx.Done():
return
case user := <-prebuildClaims:
err = sseSendEvent(codersdk.ServerSentEvent{
Type: codersdk.ServerSentEventTypeData,
- Data: agentsdk.ReinitializationResponse{
+ Data: agentsdk.ReinitializationEvent{
Message: fmt.Sprintf("prebuild claimed by user: %s", user),
Reason: agentsdk.ReinitializeReasonPrebuildClaimed,
},
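The handler above pushes ReinitializationEvent values to the waiting agent as server-sent events. A rough sketch of the frame the agent receives, assuming the standard SSE wire format; the payload field names come from the diff, but the framing helper itself is hypothetical:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ReinitializationEvent mirrors the payload type introduced in this PR.
type ReinitializationEvent struct {
	Message string `json:"message"`
	Reason  string `json:"reason"`
}

// sseDataFrame renders one SSE "data" frame carrying the JSON payload.
// An SSE frame is "data: <payload>" followed by a blank line.
func sseDataFrame(ev ReinitializationEvent) (string, error) {
	b, err := json.Marshal(ev)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("data: %s\n\n", b), nil
}

func main() {
	frame, err := sseDataFrame(ReinitializationEvent{
		Message: "prebuild claimed by user: user-456",
		Reason:  "prebuild_claimed",
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(frame)
}
```

The ping events mentioned in the diff would use a different event type and carry no data; the agent-side loop skips them and only decodes data frames like this one.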
4 changes: 0 additions & 4 deletions coderd/workspaces.go
@@ -692,10 +692,6 @@ func createWorkspace(
if len(agents) > 1 {
return xerrors.Errorf("multiple agents are not yet supported in prebuilt workspaces")
Review comment (Contributor): It feels wrong to error here. While we don't support multiple agents, we shouldn't fail the call to createWorkspace, but rather prevent the ability to claim prebuilds. Right now I think this would be unrecoverable until the template is changed. Can you please add a test for this scenario to validate this behaviour? We could hard-fail this, though, on the template import side when an admin tries to use prebuilds with multiple agents (in a follow-up PR).
}
- // agentTokensByAgentID = make(map[uuid.UUID]string, len(agents))
- // for _, agent := range agents {
- //	agentTokensByAgentID[agent.ID] = agent.AuthToken.String()
- // }
}

// We have to refetch the workspace for the joined in fields.
10 changes: 5 additions & 5 deletions codersdk/agentsdk/agentsdk.go
@@ -694,7 +694,7 @@ const (
ReinitializeReasonPrebuildClaimed ReinitializationReason = "prebuild_claimed"
)

- type ReinitializationResponse struct {
+ type ReinitializationEvent struct {
Message string `json:"message"`
Reason ReinitializationReason `json:"reason"`
}
@@ -707,7 +707,7 @@ func PrebuildClaimedChannel(id uuid.UUID) string {
// - ping: ignored, keepalive
// - prebuild claimed: a prebuilt workspace is claimed, so the agent must reinitialize.
// NOTE: the caller is responsible for closing the events chan.
- func (c *Client) WaitForReinit(ctx context.Context, events chan<- ReinitializationResponse) error {
+ func (c *Client) WaitForReinit(ctx context.Context, events chan<- ReinitializationEvent) error {
// TODO: allow configuring httpclient
c.SDK.HTTPClient.Timeout = time.Hour * 24
Review comment (Member): Does this essentially place an upper limit of 24 hours on the maximum lifetime of a prebuild?

Review comment (Contributor): This is being worked on currently; disregard for the moment.

Review comment (Contributor): Why is a 24 hour timeout desirable? A prebuilt workspace could remain up without a reinit for much longer than this, no? Also, this is a quiet side effect to the client: leaving it in a different state than it started in.
@@ -737,19 +737,19 @@ func (c *Client) WaitForReinit(ctx context.Context, events chan<- Reinitializati
if sse.Type != codersdk.ServerSentEventTypeData {
Review comment (Contributor): Ignoring Ping seems OK, but probably shouldn't ignore errors.
continue
}
- var reinitResp ReinitializationResponse
+ var reinitEvent ReinitializationEvent
b, ok := sse.Data.([]byte)
if !ok {
return xerrors.Errorf("expected data as []byte, got %T", sse.Data)
}
- err = json.Unmarshal(b, &reinitResp)
+ err = json.Unmarshal(b, &reinitEvent)
if err != nil {
return xerrors.Errorf("unmarshal reinit response: %w", err)
}
select {
Review comment (Contributor): At this point you've got the event; no reason to check the context for cancelation again, just return it.
case <-ctx.Done():
return ctx.Err()
- case events <- reinitResp:
+ case events <- reinitEvent:
}
}
}
1 change: 1 addition & 0 deletions provisioner/terraform/provision.go
@@ -277,6 +277,7 @@ func provisionEnv(
if len(tokens) == 1 {
env = append(env, provider.RunningAgentTokenEnvironmentVariable("")+"="+tokens[0].Token)
} else {
+ // Not currently supported, but added for forward-compatibility
for _, t := range tokens {
// If there are multiple agents, provide all the tokens to terraform so that it can
// choose the correct one for each agent ID.
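The branch above puts a single token into one unqualified variable and, for multiple agents, would provide all tokens so terraform can pick the right one per agent ID. A sketch under that assumption; the variable names below are illustrative, not the provider's actual ones:

```go
package main

import "fmt"

// agentToken pairs an agent ID with its auth token.
type agentToken struct {
	AgentID string
	Token   string
}

// provisionTokenEnv mirrors the branch in provisionEnv: one token uses the
// unqualified variable; multiple tokens are each qualified by agent ID.
// "CODER_AGENT_TOKEN" is an illustrative name, not necessarily the
// provider's real environment variable.
func provisionTokenEnv(tokens []agentToken) []string {
	env := []string{}
	if len(tokens) == 1 {
		return append(env, "CODER_AGENT_TOKEN="+tokens[0].Token)
	}
	for _, t := range tokens {
		env = append(env, "CODER_AGENT_TOKEN_"+t.AgentID+"="+t.Token)
	}
	return env
}

func main() {
	fmt.Println(provisionTokenEnv([]agentToken{{AgentID: "a1", Token: "tok1"}}))
	fmt.Println(provisionTokenEnv([]agentToken{
		{AgentID: "a1", Token: "tok1"},
		{AgentID: "a2", Token: "tok2"},
	}))
}
```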