Skip to content

fix(tailnet): Improve tailnet setup and agentconn stability #6292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 24, 2023

Conversation

mafredri
Copy link
Member

I've been on-and-off trying to figure out why we sometimes see a lot of failues in e.g. TestSpeedtest (cli package).

I feel this got worse after #6160, so a potential culprit is timing, slow things down and it "just works":tm:.

While combing over the tailnet package, I did notice we start the netStack pretty early, after delaying it connection issues were much easier to reproduce.

It seems the following race exists in our code:

  1. tailnet.NewConn sets up a new connection
  2. Connection is up, but DERP server has not yet been contacted
  3. We try to use the tailnet.Conn, but it has no peers/nodes -> connection will fail

For the most part, I've been able to improve robustness by using agentConn.AwaitReachable, however, this feels like a band-aid. The problem is quite well demonstrated by the fix in TestDERP (coderd).

If we disregard the changes here, reproduction is very easy. Simply move this part to the bottom of tailnet.NewConn:

	err = netStack.Start(nil)
	if err != nil {
		return nil, xerrors.Errorf("start netstack: %w", err)
	}

And then run the test:

> go test ./coderd -run=TestDERP\$ -count=1
    coderd_test.go:102:
        	Error Trace:	/home/mafredri/src/coder/coder/coderd/coderd_test.go:102
        	Error:      	Received unexpected error:
        	            	connect tcp [fd7a:115c:a1e0:4999:a9ca:45e9:c861:2e43]:35565: no route to host
        	Test:       	TestDERP

My best guess is that Tailscale handles this more seemlessly via ipnlocal.LocalBackend (https://github.com/tailscale/tailscale/blob/cd18bb68a49608b86f60e22cb081f0156d3d11b5/tsnet/tsnet.go#L341-L344), but since we're not using it, we probably need a better way to delay connections until the netStack is truly ready.

  • fix(tailnet): Improve start and close to detect connection races
  • fix: Prevent agentConn use before ready via AwaitReachable

@mafredri mafredri self-assigned this Feb 21, 2023
}
c.listeners = nil
c.mutex.Unlock()
Copy link
Member Author

@mafredri mafredri Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt these changes are necessary, I simply tried to mimic the order a bit closer to what's done in: https://github.com/tailscale/tailscale/blob/cd18bb68a49608b86f60e22cb081f0156d3d11b5/tsnet/tsnet.go#L152-L192

@mafredri mafredri force-pushed the mafredri/fix-agentconn-races branch from 5667170 to 1a90599 Compare February 21, 2023 13:57
Copy link
Member

@kylecarbs kylecarbs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An amazing find 😍! I'm so happy you dug into this code; you found races I had extreme difficulty tracking down.

but since we're not using it, we probably need a better way to delay connections until the netStack is truly ready

How do you think we should do this?

Comment on lines 81 to 87
w2Ready := make(chan struct{}, 1)
w1.SetNodeCallback(func(node *tailnet.Node) {
w2.UpdateNodes([]*tailnet.Node{node})
select {
case w2Ready <- struct{}{}:
default:
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is only used once it might be cleaner to just close the channel instead of sending to it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did it this way to avoid double close of the channel since this will get called multiple times, but I can wrap it in a sync.Once instead.

@mafredri mafredri merged commit a414de9 into main Feb 24, 2023
@mafredri mafredri deleted the mafredri/fix-agentconn-races branch February 24, 2023 11:11
@github-actions github-actions bot locked and limited conversation to collaborators Feb 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants