Description
This causes flakes like this one: https://github.com/coder/coder/runs/32527171455
The problem is that since coder/coder#14923, GetWorkspaceAgentScriptTimingsByBuildID() is implemented in dbmem by calling other Get* functions from within the main function. This causes us to attempt to lock the RWMutex a second time, even though we have already locked it.
```
goroutine 74405 [sync.RWMutex.RLock, 9 minutes]:
sync.runtime_SemacquireRWMutexR(0x0?, 0x0?, 0x1401f64d0c8?)
	/Users/runner/hostedtoolcache/go/1.22.8/arm64/src/runtime/sema.go:82 +0x28
sync.(*RWMutex).RLock(0x5?)
	/Users/runner/hostedtoolcache/go/1.22.8/arm64/src/sync/rwmutex.go:70 +0x64
github.com/coder/coder/v2/coderd/database/dbmem.(*FakeQuerier).GetWorkspaceBuildByID(_, {_, _}, {0x6, 0xf0, 0xd7, 0x9f, 0x7f, 0xa7, 0x4d, ...})
	/Users/runner/work/coder/coder/coderd/database/dbmem/dbmem.go:6406 +0x64
github.com/coder/coder/v2/coderd/database/dbmem.(*FakeQuerier).GetWorkspaceAgentScriptTimingsByBuildID(0x140044200d8, {0x1078e3298, 0x1401f7a1770}, {0x6, 0xf0, 0xd7, 0x9f, 0x7f, 0xa7, 0x4d, ...})
	/Users/runner/work/coder/coder/coderd/database/dbmem/dbmem.go:5857 +0xc8
github.com/coder/coder/v2/coderd/database/dbauthz.(*querier).GetWorkspaceAgentScriptTimingsByBuildID(0x14002a19830, {0x1078e3298, 0x1401f7a1770}, {0x6, 0xf0, 0xd7, 0x9f, 0x7f, 0xa7, 0x4d, ...})
```
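For illustration, the pattern looks roughly like this. This is a simplified sketch, not the actual dbmem code; the types and method bodies are hypothetical, only the locking structure matters:

```go
package main

import (
	"fmt"
	"sync"
)

// fakeQuerier is a stand-in for dbmem's FakeQuerier.
type fakeQuerier struct {
	mutex  sync.RWMutex
	builds map[string]string
}

// Inner getter: takes its own read lock.
func (q *fakeQuerier) GetWorkspaceBuildByID(id string) (string, bool) {
	q.mutex.RLock() // second RLock on the same mutex when called from below
	defer q.mutex.RUnlock()
	b, ok := q.builds[id]
	return b, ok
}

// Outer getter: takes the read lock, then calls the other exported getter,
// which tries to RLock the same mutex again.
func (q *fakeQuerier) GetWorkspaceAgentScriptTimingsByBuildID(id string) (string, bool) {
	q.mutex.RLock()
	defer q.mutex.RUnlock()
	return q.GetWorkspaceBuildByID(id)
}

func main() {
	q := &fakeQuerier{builds: map[string]string{"b1": "build one"}}
	// With no concurrent writer this appears to work, which is why the
	// pattern is easy to miss.
	fmt.Println(q.GetWorkspaceAgentScriptTimingsByBuildID("b1"))
}
```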
At first glance, this might seem to work: it's an RWMutex, and multiple RLock() calls can succeed simultaneously. However, once any goroutine attempts to Lock() the mutex for writing, all further RLock() calls wait, giving the write lock priority. (Without that priority, on a system with many readers running concurrently there might never be a moment when no reader holds the lock, and writes could hang indefinitely.) This then causes a deadlock: the re-entrant RLock call blocks waiting for the write, and the write is blocked by the RLock further up the stack.
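The sequence can be reproduced in isolation. The following is a self-contained repro of the interleaving described above, not the actual test; running it ends in the runtime's "all goroutines are asleep - deadlock!" error:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // outer read lock, as in GetWorkspaceAgentScriptTimingsByBuildID

	// A writer arrives while the read lock is held. Lock() blocks, and from
	// now on new RLock() calls queue behind it.
	go func() {
		mu.Lock()
		defer mu.Unlock()
		fmt.Println("writer acquired the lock")
	}()
	time.Sleep(100 * time.Millisecond) // give the writer time to start waiting

	// The re-entrant read lock, as in the nested GetWorkspaceBuildByID call.
	// It waits for the writer, the writer waits for the outer RUnlock, and
	// the outer RUnlock never happens: deadlock.
	mu.RLock()
	fmt.Println("never reached")
}
```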
So, the rule is: never attempt to lock the mutex from a call path that already holds it; sync.RWMutex is not re-entrant.
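One conventional way to follow that rule, continuing the hypothetical fakeQuerier sketch above (the actual fix in dbmem may differ), is to move the shared logic into an unexported helper that assumes the caller already holds the lock, so each exported method locks exactly once:

```go
// getWorkspaceBuildByIDNoLock holds the shared logic and assumes the caller
// already holds q.mutex. (Hypothetical name, for illustration only.)
func (q *fakeQuerier) getWorkspaceBuildByIDNoLock(id string) (string, bool) {
	b, ok := q.builds[id]
	return b, ok
}

// Exported methods take the lock once and delegate to the helper instead of
// calling each other.
func (q *fakeQuerier) GetWorkspaceBuildByID(id string) (string, bool) {
	q.mutex.RLock()
	defer q.mutex.RUnlock()
	return q.getWorkspaceBuildByIDNoLock(id)
}

func (q *fakeQuerier) GetWorkspaceAgentScriptTimingsByBuildID(id string) (string, bool) {
	q.mutex.RLock()
	defer q.mutex.RUnlock()
	return q.getWorkspaceBuildByIDNoLock(id)
}
```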