fix: coordinator node update race #7345


Merged
merged 3 commits into main from spike/7295-coordinator-node-update
May 2, 2023

Conversation

spikecurtis
Contributor

@spikecurtis spikecurtis commented May 1, 2023

This fixes the most pressing bug in #7295 but leaves the haCoordinator mostly as-is, meaning we're still subject to various distributed races that can leave us in a bad state where clients can't connect.

A full fix for haCoordinator is quite a large change, so I plan to write an RFC for it.

The main idea of this fix is that we establish an output queue of updates on each connection. This allows us to queue up the initial node update while holding the lock, ensuring we don't miss any updates coming from the other side just as we are connecting. The queue is buffered, and actual I/O on the connection is done on a separate goroutine, where it will never contend for the lock.
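As a rough sketch of the queue idea (simplified and hypothetical, not the actual coder/coder code; the `node` type, `connection` struct, and write callback are stand-ins): enqueueing only places the update on a buffered channel, so it is safe to do while holding the coordinator lock, and all socket I/O happens on a dedicated writer goroutine.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a tailnet node update payload.
type node struct {
	ID int `json:"id"`
}

// connection pairs a buffered outbound queue with a single writer
// goroutine, so enqueueing an update never blocks on socket I/O and
// therefore never contends for the coordinator mutex.
type connection struct {
	updates chan []node
	wg      sync.WaitGroup
}

func newConnection(write func(data []byte) error) *connection {
	c := &connection{
		// Buffered: Enqueue is non-blocking in the common case, so the
		// initial node update can be queued while the lock is held.
		updates: make(chan []node, 64),
	}
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		for nodes := range c.updates {
			data, err := json.Marshal(nodes)
			if err != nil {
				continue
			}
			_ = write(data) // actual I/O happens here, off the hot path
		}
	}()
	return c
}

// Enqueue is safe to call while holding the coordinator lock: it only
// places the update on the buffered channel.
func (c *connection) Enqueue(nodes []node) {
	c.updates <- nodes
}

// Close drains remaining updates and waits for the writer goroutine.
// (In this simplified sketch, the caller must not Enqueue after Close.)
func (c *connection) Close() {
	close(c.updates)
	c.wg.Wait()
}

func main() {
	var sent [][]byte
	conn := newConnection(func(data []byte) error {
		sent = append(sent, data)
		return nil
	})
	conn.Enqueue([]node{{ID: 1}})
	conn.Enqueue([]node{{ID: 2}})
	conn.Close()
	fmt.Println(len(sent), string(sent[0])) // 2 [{"id":1}]
}
```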

There was also a lot of complex manipulation of locking and unlocking the mutex in the existing routines, so I refactored the coordinator to contain a mutex-protected memory "core". All operations on this core are via methods that

c.mutex.Lock()
defer c.mutex.Unlock()

so it should be much easier to follow which operations hold the lock and which don't (core operations vs. the rest of the coordinator methods).
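In outline, the pattern looks like this (a minimal sketch, not the actual coordinator; the field names and payload shape are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// core holds all mutable coordinator state behind a single mutex.
// Every method locks on entry and unlocks via defer, so nothing
// outside these methods ever touches the maps directly.
type core struct {
	mutex  sync.Mutex
	agents map[string][]byte // agent ID -> last node update (hypothetical shape)
}

func newCore() *core {
	return &core{agents: make(map[string][]byte)}
}

func (c *core) agentUpdate(id string, node []byte) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	c.agents[id] = node
	// ...fan the update out to subscribed connections' queues here,
	// which is safe because enqueueing never does I/O...
}

func (c *core) agentNode(id string) ([]byte, bool) {
	c.mutex.Lock()
	defer c.mutex.Unlock()
	n, ok := c.agents[id]
	return n, ok
}

func main() {
	c := newCore()
	c.agentUpdate("agent-1", []byte(`{"id":1}`))
	n, ok := c.agentNode("agent-1")
	fmt.Println(ok, string(n)) // true {"id":1}
}
```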

Signed-off-by: Spike Curtis <spike@coder.com>
@@ -159,8 +158,30 @@ type coordinator struct {
agentNameCache *lru.Cache[uuid.UUID, string]
}

func NewCore(logger slog.Logger) *Core {
Member

Thoughts on putting the coordinator in its own package? The names NewCore and Core don't make much sense in the context of tailnet itself, which is about providing connections.

An alternative would be putting the in-memory coordinator in its own package.

Contributor Author


I originally exported those names thinking we'd want the Core for the haCoordinator, but I'm now thinking they won't be that reusable. So, if the concern is that the public tailnet API contains these and it's not clear, we could just make them private.

Member


Makes sense to me!

Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis merged commit bd63011 into main May 2, 2023
@spikecurtis spikecurtis deleted the spike/7295-coordinator-node-update branch May 2, 2023 16:58
@github-actions github-actions bot locked and limited conversation to collaborators May 2, 2023