fix(tailnet): Improve tailnet setup and agentconn stability #6292

mafredri · 2023-02-21T13:11:45Z

I've been on-and-off trying to figure out why we sometimes see a lot of failures in e.g. TestSpeedtest (cli package).

I feel this got worse after #6160, so a potential culprit is timing: slow things down and it "just works":tm:.

While combing over the tailnet package, I noticed that we start the netStack pretty early. After delaying it, connection issues were much easier to reproduce.

It seems the following race exists in our code:

1. tailnet.NewConn sets up a new connection
2. Connection is up, but DERP server has not yet been contacted
3. We try to use the tailnet.Conn, but it has no peers/nodes -> connection will fail

For the most part, I've been able to improve robustness by using agentConn.AwaitReachable. However, this feels like a band-aid. The problem is quite well demonstrated by the fix in TestDERP (coderd).

If we disregard the changes here, reproduction is very easy. Simply move this part to the bottom of tailnet.NewConn:

	err = netStack.Start(nil)
	if err != nil {
		return nil, xerrors.Errorf("start netstack: %w", err)
	}

And then run the test:

> go test ./coderd -run=TestDERP\$ -count=1
    coderd_test.go:102:
        	Error Trace:	/home/mafredri/src/coder/coder/coderd/coderd_test.go:102
        	Error:      	Received unexpected error:
        	            	connect tcp [fd7a:115c:a1e0:4999:a9ca:45e9:c861:2e43]:35565: no route to host
        	Test:       	TestDERP

My best guess is that Tailscale handles this more seamlessly via ipnlocal.LocalBackend (https://github.com/tailscale/tailscale/blob/cd18bb68a49608b86f60e22cb081f0156d3d11b5/tsnet/tsnet.go#L341-L344), but since we're not using it, we probably need a better way to delay connections until the netStack is truly ready.

fix(tailnet): Improve start and close to detect connection races
fix: Prevent agentConn use before ready via AwaitReachable

mafredri · 2023-02-21T13:14:30Z

tailnet/conn.go

 	}
+	c.listeners = nil
+	c.mutex.Unlock()


I doubt these changes are necessary; I simply tried to mimic the order a bit more closely to what's done in: https://github.com/tailscale/tailscale/blob/cd18bb68a49608b86f60e22cb081f0156d3d11b5/tsnet/tsnet.go#L152-L192

kylecarbs

An amazing find 😍! I'm so happy you dug into this code; you found races I had extreme difficulty tracking down.

> but since we're not using it, we probably need a better way to delay connections until the netStack is truly ready

How do you think we should do this?

coadler · 2023-02-22T21:09:08Z

coderd/coderd_test.go

+	w2Ready := make(chan struct{}, 1)
 	w1.SetNodeCallback(func(node *tailnet.Node) {
 		w2.UpdateNodes([]*tailnet.Node{node})
+		select {
+		case w2Ready <- struct{}{}:
+		default:
+		}


Since this is only used once it might be cleaner to just close the channel instead of sending to it.

mafredri · 2023-02-23

I did it this way to avoid a double close of the channel, since this callback will get called multiple times, but I can wrap it in a sync.Once instead.
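For readers unfamiliar with the pattern being discussed, the following is a small sketch of closing a channel from a callback that may fire several times, guarded by sync.Once. The helper name readySignal is hypothetical and is not taken from the actual test; it only illustrates the idea.

```go
package main

import (
	"fmt"
	"sync"
)

// readySignal returns a channel plus a function that closes it at most once.
// Calling the returned function repeatedly is safe: sync.Once guarantees that
// close runs a single time, avoiding a "close of closed channel" panic.
func readySignal() (<-chan struct{}, func()) {
	ready := make(chan struct{})
	var once sync.Once
	return ready, func() {
		once.Do(func() { close(ready) })
	}
}

func main() {
	ready, markReady := readySignal()

	// Stand-in for a node callback that is invoked on every node update.
	callback := func(node string) {
		markReady()
	}

	for _, n := range []string{"node-a", "node-a-updated", "node-a-again"} {
		callback(n)
	}

	<-ready // Returns immediately because the channel is closed.
	fmt.Println("ready")
}
```

Compared with the buffered send plus select/default in the diff above, a closed channel can be observed by any number of waiters, at the cost of the extra sync.Once.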

mafredri added 2 commits February 21, 2023 13:05

fix(tailnet): Improve start and close to detect connection races

03b0917

fix: Prevent agentConn use before ready via AwaitReachable

18aabca

mafredri self-assigned this Feb 21, 2023

mafredri commented Feb 21, 2023


mafredri requested review from kylecarbs, coadler and deansheather February 21, 2023 13:15

deansheather approved these changes Feb 21, 2023


fix(tailnet): Ensure connstats are closed on conn close

1a90599

mafredri force-pushed the mafredri/fix-agentconn-races branch from 5667170 to 1a90599 Compare February 21, 2023 13:57

kylecarbs approved these changes Feb 21, 2023


mafredri added 2 commits February 21, 2023 18:27

fix(codersdk): Use AwaitReachable in DialWorkspaceAgent

19023e7

Improve logging via slog.Helper()

cdf01f8

coadler approved these changes Feb 22, 2023


Use sync.Once in test

ac9dd2b

mafredri mentioned this pull request Feb 23, 2023

fix(tailnet): Skip nodes without DERP, avoid use of RemoveAllPeers #6320

Merged

Merge branch 'main' into mafredri/fix-agentconn-races

fb05f70

mafredri merged commit a414de9 into main Feb 24, 2023

mafredri deleted the mafredri/fix-agentconn-races branch February 24, 2023 11:11

github-actions bot locked and limited conversation to collaborators Feb 24, 2023
