feat: add agent acks to in-memory coordinator #12786
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking.
Force-pushed from 1f1e8f4 to c066f24
tailnet/coordinator.go
Outdated
// potentially be smarter to only send an ACK once per client,
// but there's nothing currently stopping clients from reusing
// IDs.
for _, peer := range resp.GetPeerUpdates() {
I worry this is too superficial --- here we are only acknowledging the fact that we received a peer update, not that it was programmed into wireguard, which is what is actually needed for the handshake to complete.
I guess this is an OK start in that it cuts the Coordinator's propagation delay out of the race condition, but it still leaves the race there. I realize that you have yet to add support to the PGCoordinator, which is where we suspect the real problems are, so we will need to test this out and confirm that missed handshakes are substantially reduced. We can embed the ack deeper into tailnet in a later PR if we are still missing handshakes.
Yeah, completely eliminating the race would require digging down into the configmaps, which I wasn't keen to do unless necessary. In my testing with the in-memory coordinator I wasn't able to hit the 5s backoff anymore. I suspect pgcoord will actually fare better, considering its extra round-trip latency compared to the in-memory coordinator.
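For concreteness, here is a minimal, self-contained sketch of the relay idea being discussed: the coordinator forwards the agent's ack to the client's response channel so the client knows it may begin pinging. All types and names below are illustrative assumptions, not the PR's actual code.
```go
package main

import "fmt"

// Illustrative types only; the real coordinator works with the tailnet
// proto messages and per-peer response channels.
type peerUpdateKind int

const (
	peerUpdateNode peerUpdateKind = iota
	peerUpdateReadyForHandshake
)

type peerUpdate struct {
	peerID string
	kind   peerUpdateKind
}

type coordinator struct {
	// resps maps a peer ID to the buffered channel its coordinate
	// responses are written to.
	resps map[string]chan peerUpdate
}

// relayReadyForHandshake forwards the agent's ack to the client that sent
// its node, so the client knows it may begin pinging.
func (c *coordinator) relayReadyForHandshake(agentID, clientID string) {
	ch, ok := c.resps[clientID]
	if !ok {
		return // client already disconnected; nothing to relay
	}
	ch <- peerUpdate{peerID: agentID, kind: peerUpdateReadyForHandshake}
}

func main() {
	c := &coordinator{resps: map[string]chan peerUpdate{
		"client-1": make(chan peerUpdate, 1),
	}}
	c.relayReadyForHandshake("agent-1", "client-1")
	fmt.Println(<-c.resps["client-1"])
}
```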
tailnet/configmaps.go
Outdated
@@ -387,33 +408,78 @@ func (c *configMaps) updatePeerLocked(update *proto.CoordinateResponse_PeerUpdate
 // SSH connections don't send packets while idle, so we use keep alives
 // to avoid random hangs while we set up the connection again after
 // inactivity.
-node.KeepAlive = ok && peerStatus.Active
+node.KeepAlive = (statusOk && peerStatus.Active) || (peerOk && lc.node != nil && lc.node.KeepAlive)
Do we still need this status-based keep-alive business? My understanding was that it was a bit of a hack to avoid building what you are now building. That is, we didn't want to turn on keep-alives at the source here because that would cause it to start the handshake too early. But now, at the source we can wait until the destination sends us READY_FOR_HANDSHAKE.
IIRC, we never really needed keep-alive turned on at the destination, but until now the Conn was unaware of the distinction.
Putting this calculation here made sense when it was only based on the status, but now there is logic in some of the case statements below that sets the KeepAlive.
I think we need a single function that computes the KeepAlive value, which we can call after all the changes are made and use consistently.
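Something along these lines is presumably what is meant: one place that computes KeepAlive from the locked state, called once after all updates are applied rather than setting the flag ad hoc in each case branch. A hedged, self-contained sketch with stand-in types, not the PR's actual code:
```go
package main

import "fmt"

// Simplified stand-ins for the real tailnet types; names are illustrative.
type node struct{ KeepAlive bool }

type peerLifecycle struct {
	node   *node
	active bool // whether the peer's last status reported activity
}

// keepAlive computes the KeepAlive value in one place: keep-alives stay on
// if the peer is known-active via its status, or if they were already
// enabled on the existing node.
func (lc *peerLifecycle) keepAlive(statusOk bool) bool {
	if statusOk && lc.active {
		return true
	}
	return lc.node != nil && lc.node.KeepAlive
}

func main() {
	lc := &peerLifecycle{node: &node{KeepAlive: true}}
	fmt.Println(lc.keepAlive(false)) // true: preserved from the existing node
}
```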
I'd like to leave this be, mostly since I don't want to change the behavior of how tunnel destinations handle keep alives for now. Once the RFH stuff is all in and working properly, I'll revisit.
Yeah, I guess we should leave the status-based keep-alive processing as-is until everything is working, including the PGCoordinator.
Some stylistic suggestions inline, but I don't need to review again. Looks good overall!
When an agent receives a node, it responds with an ACK which is relayed to the client. After the client receives the ACK, it's allowed to begin pinging.
Force-pushed from dfdef50 to cae734d
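As an illustration of the description above, here is a hedged, self-contained sketch of the client-side gating: pings toward an agent are held back until a READY_FOR_HANDSHAKE ack for that agent has been relayed. All names below are assumptions; the real change lives in tailnet's configMaps and Conn.
```go
package main

import (
	"fmt"
	"sync"
)

// tunnel is an illustrative stand-in for the client-side state that tracks
// which agents have acked our node.
type tunnel struct {
	mu          sync.Mutex
	readyToPing map[string]bool // agent ID -> READY_FOR_HANDSHAKE received
}

// handleReadyForHandshake records that the agent acked our node, which is
// the signal that this client may begin pinging it.
func (t *tunnel) handleReadyForHandshake(agentID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.readyToPing[agentID] = true
}

// canPing reports whether pings to the given agent are allowed yet.
func (t *tunnel) canPing(agentID string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.readyToPing[agentID]
}

func main() {
	t := &tunnel{readyToPing: map[string]bool{}}
	fmt.Println(t.canPing("agent-1")) // false: no ack relayed yet
	t.handleReadyForHandshake("agent-1")
	fmt.Println(t.canPing("agent-1")) // true: agent acked, client may ping
}
```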