feat: add agent acks to in-memory coordinator #12786
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking.
Force-pushed from 1f1e8f4 to c066f24
tailnet/coordinator.go
Outdated
// potentially be smarter to only send an ACK once per client,
// but there's nothing currently stopping clients from reusing
// IDs.
for _, peer := range resp.GetPeerUpdates() {
I worry this is too superficial --- here we are only acknowledging the fact that we received a peer update, not that it was programmed into wireguard, which is what is actually needed for the handshake to complete.
I guess this is an OK start in that it cuts the Coordinator's propagation delay out of the race condition, but it still leaves the race there. I realize that you have yet to add support to the PGCoordinator, which is where we suspect the real problems are, so we will need to test this out and confirm that missed handshakes are substantially reduced. We can embed the ack deeper into tailnet in a later PR if we are still missing handshakes.
Yeah, completely eliminating the race would require digging down into the configmaps, which I wasn't keen to do unless necessary. In my testing with the in-memory coordinator I wasn't able to hit the 5s backoff anymore. I suspect pgcoord will actually fare better, considering its extra round-trip latency compared to the in-memory coordinator.
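For concreteness, here is a minimal, self-contained sketch of the relay idea being discussed: the coordinator forwards the agent's ack to the client's response channel so the client knows it may begin pinging. All types and names below are illustrative assumptions, not the PR's actual code.
```go
package main

import "fmt"

// Illustrative types only; the real coordinator works with the tailnet
// proto messages and per-peer response channels.
type peerUpdateKind int

const (
	peerUpdateNode peerUpdateKind = iota
	peerUpdateReadyForHandshake
)

type peerUpdate struct {
	peerID string
	kind   peerUpdateKind
}

type coordinator struct {
	// resps maps a peer ID to the buffered channel its coordinate
	// responses are written to.
	resps map[string]chan peerUpdate
}

// relayReadyForHandshake forwards the agent's ack to the client that sent
// its node, so the client knows it may begin pinging.
func (c *coordinator) relayReadyForHandshake(agentID, clientID string) {
	ch, ok := c.resps[clientID]
	if !ok {
		return // client already disconnected; nothing to relay
	}
	ch <- peerUpdate{peerID: agentID, kind: peerUpdateReadyForHandshake}
}

func main() {
	c := &coordinator{resps: map[string]chan peerUpdate{
		"client-1": make(chan peerUpdate, 1),
	}}
	c.relayReadyForHandshake("agent-1", "client-1")
	fmt.Println(<-c.resps["client-1"])
}
```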
tailnet/configmaps.go
Outdated
@@ -387,33 +408,78 @@ func (c *configMaps) updatePeerLocked(update *proto.CoordinateResponse_PeerUpdate
 // SSH connections don't send packets while idle, so we use keep alives
 // to avoid random hangs while we set up the connection again after
 // inactivity.
-node.KeepAlive = ok && peerStatus.Active
+node.KeepAlive = (statusOk && peerStatus.Active) || (peerOk && lc.node != nil && lc.node.KeepAlive)
Do we still need this status-based keep-alive business? My understanding was that it was a bit of a hack to avoid building what you are now building. That is, we didn't want to turn on keep-alives at the source here because that would cause it to start the handshake too early. But now, at the source we can wait until the destination sends us READY_FOR_HANDSHAKE.
IIRC, we never really needed keep-alive turned on at the destination, but until now the Conn was unaware of the distinction.
Putting this calculation here made sense when it was only based on the status, but now there is logic in some of the case statements below that sets the KeepAlive.
I think we need a single function that computes the KeepAlive value, which we can call after all the changes are made and use consistently.
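Something along these lines is presumably what is meant: one place that computes KeepAlive from the locked state, called once after all updates are applied rather than setting the flag ad hoc in each case branch. A hedged, self-contained sketch with stand-in types, not the PR's actual code:
```go
package main

import "fmt"

// Simplified stand-ins for the real tailnet types; names are illustrative.
type node struct{ KeepAlive bool }

type peerLifecycle struct {
	node   *node
	active bool // whether the peer's last status reported activity
}

// keepAlive computes the KeepAlive value in one place: keep-alives stay on
// if the peer is known-active via its status, or if they were already
// enabled on the existing node.
func (lc *peerLifecycle) keepAlive(statusOk bool) bool {
	if statusOk && lc.active {
		return true
	}
	return lc.node != nil && lc.node.KeepAlive
}

func main() {
	lc := &peerLifecycle{node: &node{KeepAlive: true}}
	fmt.Println(lc.keepAlive(false)) // true: preserved from the existing node
}
```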
I'd like to leave this be, mostly since I don't want to change the behavior of how tunnel destinations handle keep alives for now. Once the RFH stuff is all in and working properly, I'll revisit.
Yeah, I guess we should leave the status-based keep-alive processing as-is until everything is working, including the PGCoordinator.
Some stylistic suggestions inline, but I don't need to review again. Looks good overall!
When an agent receives a node, it responds with an ACK which is relayed to the client. After the client receives the ACK, it's allowed to begin pinging.
Force-pushed from dfdef50 to cae734d
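As an illustration of the description above, here is a hedged, self-contained sketch of the client-side gating: pings toward an agent are held back until a READY_FOR_HANDSHAKE ack for that agent has been relayed. All names below are assumptions; the real change lives in tailnet's configMaps and Conn.
```go
package main

import (
	"fmt"
	"sync"
)

// tunnel is an illustrative stand-in for the client-side state that tracks
// which agents have acked our node.
type tunnel struct {
	mu          sync.Mutex
	readyToPing map[string]bool // agent ID -> READY_FOR_HANDSHAKE received
}

// handleReadyForHandshake records that the agent acked our node, which is
// the signal that this client may begin pinging it.
func (t *tunnel) handleReadyForHandshake(agentID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.readyToPing[agentID] = true
}

// canPing reports whether pings to the given agent are allowed yet.
func (t *tunnel) canPing(agentID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.readyToPing[agentID]
}

func main() {
	t := &tunnel{readyToPing: map[string]bool{}}
	fmt.Println(t.canPing("agent-1")) // false: no ack relayed yet
	t.handleReadyForHandshake("agent-1")
	fmt.Println(t.canPing("agent-1")) // true: agent acked, client may ping
}
```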