chore: refactor agent routines that use the v2 API #12223
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking.
Force-pushed from 7ecf10e to 2f05031.
This is a monster undertaking and has been due for a long time. Thanks for tackling it, and nice work so far!
if err != nil {
	return xerrors.Errorf("init script runner: %w", err)
}
err = a.trackGoroutine(func() {
I'm a bit worried about the complexity of what's going on here (muddy colored functions). We're inside a routine belonging to the api conn manager, but within it we're starting a goroutine that gets tracked at a higher level.
This is definitely due for a refactor, but it's probably fine to keep it like this for now.
Maybe we should separate out all the script stuff into its own service/loop that performs the appropriate execution based on agent events.
Yeah, I think separating the script stuff out along with the logging into a neat subcomponent would be useful. Part of the problem is that the "Manifest" is a grab bag of different things, which makes it harder to break it into well-encapsulated sub-components.
Not for this PR though. I want to finish off the v2 Agent API before we contemplate a v2.1!
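To illustrate the kind of subcomponent being discussed, here is a minimal sketch of an event-driven script service. All names here (scriptRunner, agentEvent, and so on) are hypothetical and not taken from the actual codebase; the point is only the shape of a loop that executes scripts in response to agent events and owns its own lifecycle, instead of being wired into the API connection routines.

```go
package main

import (
	"context"
	"fmt"
)

// agentEvent is a placeholder for whatever events the agent would publish
// (manifest received, connection ready, shutdown started, ...).
type agentEvent struct {
	Kind string
}

// scriptRunner owns its own loop; callers feed it events and cancel its
// context to stop it, rather than tracking its goroutine from elsewhere.
type scriptRunner struct {
	events <-chan agentEvent
}

func (s *scriptRunner) run(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case ev := <-s.events:
			// Decide which scripts to execute based on the event kind.
			fmt.Printf("running scripts for event %q\n", ev.Kind)
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())

	events := make(chan agentEvent) // unbuffered: send blocks until received
	sr := &scriptRunner{events: events}

	done := make(chan struct{})
	go func() {
		defer close(done)
		sr.run(ctx)
	}()

	events <- agentEvent{Kind: "manifest-received"}
	cancel()
	<-done
}
```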
if network == nil {
	// use the graceful context here, because creating the tailnet is not itself tied to the
	// agent API.
	network, err = a.createTailnet(a.gracefulCtx, manifest.AgentID, manifest.DERPMap, manifest.DERPForceWebSockets, manifest.DisableDirectConnections)
I feel like this context may be problematic during Close. If we close the agent before we've established a connection, then the connection can't establish, and we have some connection-dependent things in Close.
Why should we limit the tailnet to anything except "hardCtx"?
This function is also pretty confusing since it takes ctx as an argument but uses the graceful context here.
Yeah, this whole thing needs a bigger refactor like you suggested above. This routine doesn't directly use the drpc.Conn at all, so it could arguably be outside the apiConnRoutineManager, but it does rely on the manifest, and several other routines block on the network being available. Fixing that is more than I wanted to bite off.
In terms of the context, I don't really think it matters whether we use the gracefulCtx or the hardCtx. As soon as we cancel the gracefulCtx, we send a disconnect to the coordinator and clients will start disconnecting their tunnels.
If we close before we've ever established a tailnet, then I don't think anything in Close() can block on it. We'd never have any SSH sessions, and we would never have connected to the Coordinator.
lifecycleWaitLoop:
	for {
		select {
		case <-ctx.Done():
		case <-a.hardCtx.Done():
An alternative to using hardCtx here would be to close a.lifeCycleReported once the loop exits (as closing hardCtx would signal the loop to close). This ensures we don't leave the routine behind.
I guess it would be fine to use the channel close to indicate a timeout, but I don't want to wait for the channel to close in the success case, since that's a chicken-and-egg problem (we wait until the final state is reported to close the hardCtx in the success case).
Since it only enforces that the goroutine exits in the failure case, I don't think it's a worthwhile refactor.
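For context on the trade-off being discussed, a rough sketch with hypothetical names (reportLifecycle, states, reported) of the "close a channel when the loop exits" pattern: the reporting loop closes a channel on exit, and shutdown waits for either that signal (success) or the hard context (timeout). This is a generic illustration, not the agent's actual lifecycle code.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// reportLifecycle closes `reported` when it exits, so callers can tell that
// the goroutine is really gone rather than merely signalled to stop.
func reportLifecycle(hardCtx context.Context, states <-chan string, reported chan<- struct{}) {
	defer close(reported)
	for {
		select {
		case <-hardCtx.Done():
			return
		case s := <-states:
			fmt.Println("reporting lifecycle state:", s)
			if s == "off" {
				return // final state delivered
			}
		}
	}
}

func main() {
	hardCtx, hardCancel := context.WithTimeout(context.Background(), time.Second)
	defer hardCancel()

	states := make(chan string, 1)
	reported := make(chan struct{})
	go reportLifecycle(hardCtx, states, reported)

	states <- "off" // final state; the loop reports it and exits

	select {
	case <-reported: // success path: the goroutine has exited
	case <-hardCtx.Done(): // failure path: timed out waiting for it
	}
}
```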
			break lifecycleWaitLoop
		}
	}
}
// Wait for graceful disconnect from the Coordinator RPC
select {
case <-a.hardCtx.Done():
While always checking the context is a good way to ensure we actually do exit, it creates a lot of paths that short-circuit execution, potentially leaving things behind. The state is also harder to reason about as there are multiple branches.
That is to say, I think it would be OK to skip hardCtx here? I guess the only danger is if coordination.Close() hangs indefinitely or is very slow.
This change might also suggest it'd be sensible to handle hardCtx in runCoordinator (instead of just return <-errCh).
That arm of the select is mostly about logging when we hit a timeout.
I take your point from earlier discussions that relying solely on context to tear down goroutines can cause us to exit before finalizer actions have run. Here we are explicitly checking the status of a finalizer.
We're definitely not doing a complete job of tracking goroutines in the agent, but I'm not convinced that should be a goal unto itself. Tracking and waiting for errant goroutines doesn't actually prevent leaks. If there is a leak, in testing, it causes the test to hang, whereas goleak fails much faster. In production, it will cause shutdown to hang so the cluster manager has to force-stop the agent.
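The goleak mentioned here is go.uber.org/goleak. The usual way to get the fast failure described is to verify leaks in TestMain, roughly as below; this is a generic sketch, not the agent's actual test setup.

```go
package agent_test

import (
	"testing"

	"go.uber.org/goleak"
)

// TestMain makes every test in the package fail quickly if it leaves a
// goroutine running, instead of hanging until the overall test timeout.
func TestMain(m *testing.M) {
	goleak.VerifyTestMain(m)
}
```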
Fair 👍🏻
Force-pushed from 5cf1bb4 to f9ca6b2.
Force-pushed from f9ca6b2 to d9b360c.
Looking good, thanks for making the proposed changes!
// Wait for graceful disconnect from the Coordinator RPC
select {
case <-a.hardCtx.Done():
	a.logger.Warn(context.Background(), "timed out waiting for Coordinator RPC disconnect")
Suggested change:
- a.logger.Warn(context.Background(), "timed out waiting for Coordinator RPC disconnect")
+ a.logger.Warn(context.Background(), "timed out waiting for coordinator RPC disconnect")
In anticipation of needing the LogSender to run on a context that doesn't get immediately canceled when you Close() the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection. This handles controlling two contexts: one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.
the agent, I've undertaken a little refactor to manage the goroutines that get run against the Tailnet and Agent API connection.This handles controlling two contexts, one that gets canceled right away at the start of graceful shutdown, and another that stays up to allow graceful shutdown to complete.