fix: coordinator node update race #7345
Merged
This fixes the most pressing bug from #7295 but leaves the `haCoordinator` mainly as is, meaning we're still subject to various distributed races that can leave us in a bad state where clients can't connect. A full fix for the `haCoordinator` is quite a large change, so I plan to write an RFC for it.

The main idea of this fix is that we establish an output queue of updates on each connection. This allows us to queue up the initial node update while holding the lock, ensuring we don't miss any updates coming from the other side just as we are connecting. The queue is buffered, and the actual I/O on the connection is done on a separate goroutine, where it will never contend for the lock.
There was also a lot of complex manipulation of locking and unlocking the mutex in the existing routines, so I refactored the `coordinator` to contain a mutex-protected, in-memory `core`. All operations on this `core` are via methods that hold the lock, so it should be much easier to follow what is holding the lock and what isn't (`core` operations vs. the rest of the `coordinator` methods).
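A minimal sketch of that split, again with illustrative names (`core`, `setNode`, `handleUpdate`) rather than the real `haCoordinator` API: all state mutation goes through `core` methods that acquire the lock themselves, so the surrounding `coordinator` methods never touch the mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// core holds all of the coordinator's mutable state behind a single mutex.
// Field and method names here are illustrative, not the real haCoordinator's.
type core struct {
	mu    sync.Mutex
	nodes map[string]string // node ID -> serialized node update
}

// setNode is a core method: it acquires and releases the lock itself,
// so callers never manipulate the mutex directly.
func (c *core) setNode(id, update string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[id] = update
}

// node looks up a stored update, also entirely under the lock.
func (c *core) node(id string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	update, ok := c.nodes[id]
	return update, ok
}

// coordinator methods outside the core do orchestration and I/O and never
// touch the mutex, which makes it easy to see what runs under the lock.
type coordinator struct {
	core core
}

func newCoordinator() *coordinator {
	c := &coordinator{}
	c.core.nodes = make(map[string]string)
	return c
}

func (c *coordinator) handleUpdate(id, update string) {
	c.core.setNode(id, update) // the only locked part of this path
	fmt.Printf("stored update for %s\n", id)
}

func main() {
	c := newCoordinator()
	c.handleUpdate("agent-1", `{"addr":"fd7a::1"}`)
	if update, ok := c.core.node("agent-1"); ok {
		fmt.Println("current:", update)
	}
}
```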