bug: Agent can deadlock if closed while creating tailnet listeners #17328

Closed
@spikecurtis

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Flaky test with deadlock: https://github.com/coder/coder/actions/runs/14367439231/job/40283613891

In agent.createTailnet we create several network listeners and the corresponding tracked goroutines. Creating a tracked goroutine requires holding agent.closeMutex. However, if Close is called during this process, it can take agent.closeMutex and hold it while waiting for all tracked goroutines to finish.

Because the tailnet is still being created, Close does not yet have a reference to it and cannot close it, so the listeners are never stopped. The result is a deadlock among:

  1. a listener accept loop.
  2. the close routine, waiting for 1.
  3. the createTailnet, waiting for the closeMutex held by 2.

Relevant Log Output

Excerpts from deadlock stack traces:

2025-04-09T21:51:34.2931808Z goroutine 159511 [sync.Mutex.Lock, 9 minutes]:
2025-04-09T21:51:34.2932190Z internal/sync.runtime_SemacquireMutex(0xffffffff?, 0xa9?, 0xc029cf7508?)
2025-04-09T21:51:34.2932649Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/runtime/sema.go:95 +0x25
2025-04-09T21:51:34.2933014Z internal/sync.(*Mutex).lockSlow(0xc06e6bd2c0)
2025-04-09T21:51:34.2933400Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/internal/sync/mutex.go:149 +0x210
2025-04-09T21:51:34.2933807Z internal/sync.(*Mutex).Lock(0xc06e6bd2c0)
2025-04-09T21:51:34.2934186Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/internal/sync/mutex.go:70 +0x55
2025-04-09T21:51:34.2934549Z sync.(*Mutex).Lock(0xc06e6bd2c0)
2025-04-09T21:51:34.2934907Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/sync/mutex.go:46 +0x29
2025-04-09T21:51:34.2935343Z github.com/coder/coder/v2/agent.(*agent).trackGoroutine(0xc06e6bd188, 0xc02b2a3da0)
2025-04-09T21:51:34.2935803Z 	/home/runner/work/coder/coder/agent/agent.go:1329 +0x5d
2025-04-09T21:51:34.2936383Z github.com/coder/coder/v2/agent.(*agent).createTailnet(0xc06e6bd188, {0x61c3260, 0xc05db90d20}, {0xd5, 0x63, 0xa3, 0x91, 0xb, 0x17, 0x4c, ...}, ...)
2025-04-09T21:51:34.2936952Z 	/home/runner/work/coder/coder/agent/agent.go:1426 +0x1569
2025-04-09T21:51:34.2937516Z github.com/coder/coder/v2/agent.(*agent).run.(*agent).createOrUpdateNetwork.func10({0x61c3260, 0xc0cbc763c0}, {0x1?, 0x0?})
2025-04-09T21:51:34.2938046Z 	/home/runner/work/coder/coder/agent/agent.go:1195 +0x72d
2025-04-09T21:51:34.2938481Z github.com/coder/coder/v2/agent.(*apiConnRoutineManager).startAgentAPI.func1()
2025-04-09T21:51:34.2938987Z 	/home/runner/work/coder/coder/agent/agent.go:1996 +0x146
2025-04-09T21:51:34.2939339Z golang.org/x/sync/errgroup.(*Group).Go.func1()
2025-04-09T21:51:34.2939785Z 	/home/runner/go/pkg/mod/golang.org/x/sync@v0.13.0/errgroup/errgroup.go:79 +0x92
2025-04-09T21:51:34.2940246Z created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 159392
2025-04-09T21:51:34.2940707Z 	/home/runner/go/pkg/mod/golang.org/x/sync@v0.13.0/errgroup/errgroup.go:76 +0x125

2025-04-09T21:51:34.3475122Z goroutine 55865 [sync.WaitGroup.Wait, 9 minutes]:
2025-04-09T21:51:34.3475233Z sync.runtime_SemacquireWaitGroup(0xc06e6bd2b0?)
2025-04-09T21:51:34.3475396Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/runtime/sema.go:110 +0x25
2025-04-09T21:51:34.3475487Z sync.(*WaitGroup).Wait(0xc06e6bd2b0)
2025-04-09T21:51:34.3475653Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/sync/waitgroup.go:118 +0x89
2025-04-09T21:51:34.3475791Z github.com/coder/coder/v2/agent.(*agent).Close(0xc06e6bd188)
2025-04-09T21:51:34.3475932Z 	/home/runner/work/coder/coder/agent/agent.go:1866 +0x1110
2025-04-09T21:51:34.3476164Z github.com/coder/coder/v2/coderd/workspaceapps/apptest.createWorkspaceWithApps.func2()
2025-04-09T21:51:34.3476371Z 	/home/runner/work/coder/coder/coderd/workspaceapps/apptest/setup.go:445 +0x3a
2025-04-09T21:51:34.3476460Z testing.(*common).Cleanup.func1()
2025-04-09T21:51:34.3476636Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1211 +0x170
2025-04-09T21:51:34.3476746Z testing.(*common).runCleanup(0xc05503ee00, 0x0)
2025-04-09T21:51:34.3476919Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1445 +0x2b4
2025-04-09T21:51:34.3477000Z testing.tRunner.func2()
2025-04-09T21:51:34.3477168Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1786 +0x4d
2025-04-09T21:51:34.3477267Z testing.tRunner(0xc05503ee00, 0xc05e11af18)
2025-04-09T21:51:34.3477429Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1798 +0x25f
2025-04-09T21:51:34.3477532Z created by testing.(*T).Run in goroutine 3900
2025-04-09T21:51:34.3477695Z 	/opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1851 +0x8f3

2025-04-09T21:51:34.3730997Z goroutine 160625 [select, 9 minutes]:
2025-04-09T21:51:34.3731144Z github.com/coder/coder/v2/tailnet.(*listener).Accept(0xc00855ac00)
2025-04-09T21:51:34.3731278Z 	/home/runner/work/coder/coder/tailnet/conn.go:949 +0xe7
2025-04-09T21:51:34.3731736Z github.com/coder/coder/v2/agent/reconnectingpty.(*Server).Serve(0xc065ca1a00, {0x61c3260, 0xc05db90d20}, {0x61c3260, 0xc05db90cd0}, {0x61baa90, 0xc00855ac00})
2025-04-09T21:51:34.3731933Z 	/home/runner/work/coder/coder/agent/reconnectingpty/server.go:68 +0xdf
2025-04-09T21:51:34.3732072Z github.com/coder/coder/v2/agent.(*agent).createTailnet.func5()
2025-04-09T21:51:34.3732211Z 	/home/runner/work/coder/coder/agent/agent.go:1407 +0x119
2025-04-09T21:51:34.3732351Z github.com/coder/coder/v2/agent.(*agent).trackGoroutine.func1()
2025-04-09T21:51:34.3732487Z 	/home/runner/work/coder/coder/agent/agent.go:1337 +0x93
2025-04-09T21:51:34.3732679Z created by github.com/coder/coder/v2/agent.(*agent).trackGoroutine in goroutine 159511
2025-04-09T21:51:34.3732811Z 	/home/runner/work/coder/coder/agent/agent.go:1335 +0x2d3

Expected Behavior

The agent should close cleanly, without deadlocking.

Steps to Reproduce

Close the agent while the tailnet is being created.

Environment

  • Host OS:
  • Coder version:

Additional Context

No response

Labels

s3: Bugs that confuse, annoy, or are purely cosmetic