fix(tailnet): Improve tailnet setup and agentconn stability #6292
Conversation
```go
}
c.listeners = nil
c.mutex.Unlock()
```
I doubt these changes are necessary; I simply tried to mimic the order a bit more closely to what's done in: https://github.com/tailscale/tailscale/blob/cd18bb68a49608b86f60e22cb081f0156d3d11b5/tsnet/tsnet.go#L152-L192
Force-pushed from 5667170 to 1a90599
An amazing find 😍! I'm so happy you dug into this code; you found races I had extreme difficulty tracking down.

> but since we're not using it, we probably need a better way to delay connections until the netStack is truly ready

How do you think we should do this?
coderd/coderd_test.go (Outdated)
```go
w2Ready := make(chan struct{}, 1)
w1.SetNodeCallback(func(node *tailnet.Node) {
	w2.UpdateNodes([]*tailnet.Node{node})
	select {
	case w2Ready <- struct{}{}:
	default:
	}
}
```
Since this is only used once, it might be cleaner to just close the channel instead of sending to it.
I did it this way to avoid a double close of the channel, since this will get called multiple times, but I can wrap it in a sync.Once instead.
I've been on-and-off trying to figure out why we sometimes see a lot of failures in e.g. `TestSpeedtest` (`cli` package). I feel this got worse after #6160, so a potential culprit is timing; slow things down and it "just works" :tm:.

While combing over the `tailnet` package, I did notice we start the `netStack` pretty early; after delaying it, connection issues were much easier to reproduce. It seems the following race exists in our code:

`tailnet.NewConn` sets up a new connection `tailnet.Conn`, but it has no peers/nodes -> connection will fail.

For the most part, I've been able to improve robustness by using `agentConn.AwaitReachable`; however, this feels like a band-aid. The problem is quite well demonstrated by the fix in `TestDERP` (`coderd`).

If we disregard the changes here, reproduction is very easy. Simply move this part to the bottom of `tailnet.NewConn`:

And then run the test:

My best guess is that Tailscale handles this more seamlessly via `ipnlocal.LocalBackend` (https://github.com/tailscale/tailscale/blob/cd18bb68a49608b86f60e22cb081f0156d3d11b5/tsnet/tsnet.go#L341-L344), but since we're not using it, we probably need a better way to delay connections until the `netStack` is truly ready.