Description
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Flaky test with deadlock: https://github.com/coder/coder/actions/runs/14367439231/job/40283613891
What happens is that we create several network listeners in agent.createTailnet, along with corresponding tracked goroutines. Creating a tracked goroutine requires holding the agent.closeMutex. However, if a call to Close comes in during this process, it can take the agent.closeMutex and hold it while waiting for all tracked goroutines to finish.
Because we are in the process of creating the tailnet, the Close routine does not yet have a reference to it, so it cannot close it and the listeners are not stopped. We deadlock between:
1. a listener accept loop,
2. the Close routine, waiting for 1,
3. createTailnet, waiting for the closeMutex held by 2.
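The cycle above can be modeled in a few lines of Go. This is a minimal sketch under stated assumptions, not the agent's actual code: the `listener` type, the `closing` channel, `errClosing`, the `between` hook, and `closeDuringCreate` are all hypothetical names introduced for illustration. It shows one possible way to break the cycle: Close marks the agent as closing while holding closeMutex, releases the mutex before waiting, and createTailnet closes the listeners it has already created when trackGoroutine refuses new work.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// listener is a hypothetical stand-in for a tailnet listener.
// Accept blocks until the listener is closed.
type listener struct {
	closed chan struct{}
	once   sync.Once
}

func newListener() *listener { return &listener{closed: make(chan struct{})} }

func (l *listener) Accept() error {
	<-l.closed
	return errors.New("listener closed")
}

func (l *listener) Close() { l.once.Do(func() { close(l.closed) }) }

var errClosing = errors.New("agent is closing")

// agent is a simplified model, not the real agent struct.
type agent struct {
	closeMutex sync.Mutex
	closing    chan struct{} // closed once Close has been called
	closeWG    sync.WaitGroup
}

func newAgent() *agent { return &agent{closing: make(chan struct{})} }

// trackGoroutine starts fn unless the agent is already closing.
func (a *agent) trackGoroutine(fn func()) error {
	a.closeMutex.Lock()
	defer a.closeMutex.Unlock()
	select {
	case <-a.closing:
		return errClosing
	default:
	}
	a.closeWG.Add(1)
	go func() {
		defer a.closeWG.Done()
		fn()
	}()
	return nil
}

// Close marks the agent as closing under the mutex, then waits for tracked
// goroutines WITHOUT holding the mutex, so trackGoroutine can never block on it.
func (a *agent) Close() {
	a.closeMutex.Lock()
	select {
	case <-a.closing:
	default:
		close(a.closing)
	}
	a.closeMutex.Unlock()
	a.closeWG.Wait()
}

// createTailnet starts n accept loops. If tracking fails partway through, it
// closes every listener created so far, because Close has no reference to them.
// The between hook exists only so the demo can inject a Close mid-creation.
func (a *agent) createTailnet(n int, between func()) error {
	var listeners []*listener
	for i := 0; i < n; i++ {
		l := newListener()
		listeners = append(listeners, l)
		if err := a.trackGoroutine(func() { _ = l.Accept() }); err != nil {
			for _, created := range listeners {
				created.Close()
			}
			return err
		}
		between()
	}
	return nil
}

// closeDuringCreate calls Close after the first listener is running and
// before the second one is tracked. It covers only the mid-creation case.
func closeDuringCreate() error {
	a := newAgent()
	done := make(chan struct{})
	err := a.createTailnet(3, func() {
		go func() { a.Close(); close(done) }()
		<-a.closing // wait until Close has marked the agent as closing
	})
	<-done // Close returns once the accept loop has exited
	return err
}

func main() {
	fmt.Println("createTailnet returned:", closeDuringCreate())
}
```

In this model, holding the mutex only long enough to flip the closing state is what keeps step 3 from blocking, and the cleanup in createTailnet is what lets the accept loop in step 1 exit so that step 2 can finish. Whether the same shape fits the real agent depends on details not shown in this issue.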
Relevant Log Output
Excerpts from deadlock stack traces:
2025-04-09T21:51:34.2931808Z goroutine 159511 [sync.Mutex.Lock, 9 minutes]:
2025-04-09T21:51:34.2932190Z internal/sync.runtime_SemacquireMutex(0xffffffff?, 0xa9?, 0xc029cf7508?)
2025-04-09T21:51:34.2932649Z /opt/hostedtoolcache/go/1.24.1/x64/src/runtime/sema.go:95 +0x25
2025-04-09T21:51:34.2933014Z internal/sync.(*Mutex).lockSlow(0xc06e6bd2c0)
2025-04-09T21:51:34.2933400Z /opt/hostedtoolcache/go/1.24.1/x64/src/internal/sync/mutex.go:149 +0x210
2025-04-09T21:51:34.2933807Z internal/sync.(*Mutex).Lock(0xc06e6bd2c0)
2025-04-09T21:51:34.2934186Z /opt/hostedtoolcache/go/1.24.1/x64/src/internal/sync/mutex.go:70 +0x55
2025-04-09T21:51:34.2934549Z sync.(*Mutex).Lock(0xc06e6bd2c0)
2025-04-09T21:51:34.2934907Z /opt/hostedtoolcache/go/1.24.1/x64/src/sync/mutex.go:46 +0x29
2025-04-09T21:51:34.2935343Z github.com/coder/coder/v2/agent.(*agent).trackGoroutine(0xc06e6bd188, 0xc02b2a3da0)
2025-04-09T21:51:34.2935803Z /home/runner/work/coder/coder/agent/agent.go:1329 +0x5d
2025-04-09T21:51:34.2936383Z github.com/coder/coder/v2/agent.(*agent).createTailnet(0xc06e6bd188, {0x61c3260, 0xc05db90d20}, {0xd5, 0x63, 0xa3, 0x91, 0xb, 0x17, 0x4c, ...}, ...)
2025-04-09T21:51:34.2936952Z /home/runner/work/coder/coder/agent/agent.go:1426 +0x1569
2025-04-09T21:51:34.2937516Z github.com/coder/coder/v2/agent.(*agent).run.(*agent).createOrUpdateNetwork.func10({0x61c3260, 0xc0cbc763c0}, {0x1?, 0x0?})
2025-04-09T21:51:34.2938046Z /home/runner/work/coder/coder/agent/agent.go:1195 +0x72d
2025-04-09T21:51:34.2938481Z github.com/coder/coder/v2/agent.(*apiConnRoutineManager).startAgentAPI.func1()
2025-04-09T21:51:34.2938987Z /home/runner/work/coder/coder/agent/agent.go:1996 +0x146
2025-04-09T21:51:34.2939339Z golang.org/x/sync/errgroup.(*Group).Go.func1()
2025-04-09T21:51:34.2939785Z /home/runner/go/pkg/mod/golang.org/x/sync@v0.13.0/errgroup/errgroup.go:79 +0x92
2025-04-09T21:51:34.2940246Z created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 159392
2025-04-09T21:51:34.2940707Z /home/runner/go/pkg/mod/golang.org/x/sync@v0.13.0/errgroup/errgroup.go:76 +0x125
2025-04-09T21:51:34.3475122Z goroutine 55865 [sync.WaitGroup.Wait, 9 minutes]:
2025-04-09T21:51:34.3475233Z sync.runtime_SemacquireWaitGroup(0xc06e6bd2b0?)
2025-04-09T21:51:34.3475396Z /opt/hostedtoolcache/go/1.24.1/x64/src/runtime/sema.go:110 +0x25
2025-04-09T21:51:34.3475487Z sync.(*WaitGroup).Wait(0xc06e6bd2b0)
2025-04-09T21:51:34.3475653Z /opt/hostedtoolcache/go/1.24.1/x64/src/sync/waitgroup.go:118 +0x89
2025-04-09T21:51:34.3475791Z github.com/coder/coder/v2/agent.(*agent).Close(0xc06e6bd188)
2025-04-09T21:51:34.3475932Z /home/runner/work/coder/coder/agent/agent.go:1866 +0x1110
2025-04-09T21:51:34.3476164Z github.com/coder/coder/v2/coderd/workspaceapps/apptest.createWorkspaceWithApps.func2()
2025-04-09T21:51:34.3476371Z /home/runner/work/coder/coder/coderd/workspaceapps/apptest/setup.go:445 +0x3a
2025-04-09T21:51:34.3476460Z testing.(*common).Cleanup.func1()
2025-04-09T21:51:34.3476636Z /opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1211 +0x170
2025-04-09T21:51:34.3476746Z testing.(*common).runCleanup(0xc05503ee00, 0x0)
2025-04-09T21:51:34.3476919Z /opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1445 +0x2b4
2025-04-09T21:51:34.3477000Z testing.tRunner.func2()
2025-04-09T21:51:34.3477168Z /opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1786 +0x4d
2025-04-09T21:51:34.3477267Z testing.tRunner(0xc05503ee00, 0xc05e11af18)
2025-04-09T21:51:34.3477429Z /opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1798 +0x25f
2025-04-09T21:51:34.3477532Z created by testing.(*T).Run in goroutine 3900
2025-04-09T21:51:34.3477695Z /opt/hostedtoolcache/go/1.24.1/x64/src/testing/testing.go:1851 +0x8f3
2025-04-09T21:51:34.3730997Z goroutine 160625 [select, 9 minutes]:
2025-04-09T21:51:34.3731144Z github.com/coder/coder/v2/tailnet.(*listener).Accept(0xc00855ac00)
2025-04-09T21:51:34.3731278Z /home/runner/work/coder/coder/tailnet/conn.go:949 +0xe7
2025-04-09T21:51:34.3731736Z github.com/coder/coder/v2/agent/reconnectingpty.(*Server).Serve(0xc065ca1a00, {0x61c3260, 0xc05db90d20}, {0x61c3260, 0xc05db90cd0}, {0x61baa90, 0xc00855ac00})
2025-04-09T21:51:34.3731933Z /home/runner/work/coder/coder/agent/reconnectingpty/server.go:68 +0xdf
2025-04-09T21:51:34.3732072Z github.com/coder/coder/v2/agent.(*agent).createTailnet.func5()
2025-04-09T21:51:34.3732211Z /home/runner/work/coder/coder/agent/agent.go:1407 +0x119
2025-04-09T21:51:34.3732351Z github.com/coder/coder/v2/agent.(*agent).trackGoroutine.func1()
2025-04-09T21:51:34.3732487Z /home/runner/work/coder/coder/agent/agent.go:1337 +0x93
2025-04-09T21:51:34.3732679Z created by github.com/coder/coder/v2/agent.(*agent).trackGoroutine in goroutine 159511
2025-04-09T21:51:34.3732811Z /home/runner/work/coder/coder/agent/agent.go:1335 +0x2d3
Expected Behavior
The agent should close cleanly, without deadlocking, even if Close is called while the tailnet is being created.
Steps to Reproduce
Close the agent while the tailnet is being created.
Environment
- Host OS:
- Coder version:
Additional Context
No response