fix: coordinator node update race #7345
Merged
This fixes the most pressing bug from #7295 but leaves the `haCoordinator` mainly as is, meaning we're still subject to various distributed races that can leave us in a bad state where clients can't connect. A full fix for the `haCoordinator` is quite a large change, so I plan to write an RFC for it.

The main idea of this fix is that we establish an output queue of updates on each connection. This allows us to queue up the initial node update while holding the lock, ensuring we don't miss any updates coming from the other side just as we are connecting. The queue is buffered, and the actual I/O on the connection is done on a separate goroutine, where it will never contend for the lock.
There was also a lot of complex manipulation of locking and unlocking the mutex in the existing routines, so I refactored the `coordinator` to contain a mutex-protected, in-memory `core`. All operations on this `core` are via methods that hold the lock, so it should be much easier to follow what is holding the lock and what isn't (`core` operations vs. the rest of the `coordinator` methods).
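A minimal sketch of that split, again with illustrative names (`core`, `setNode`, `handleUpdate`) rather than the real `haCoordinator` API: all state mutation goes through `core` methods that acquire the lock themselves, so the surrounding `coordinator` methods never touch the mutex.

```go
package main

import (
	"fmt"
	"sync"
)

// core holds all of the coordinator's mutable state behind a single mutex.
// Field and method names here are illustrative, not the real haCoordinator's.
type core struct {
	mu    sync.Mutex
	nodes map[string]string // node ID -> serialized node update
}

// setNode is a core method: it acquires and releases the lock itself,
// so callers never manipulate the mutex directly.
func (c *core) setNode(id, update string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.nodes[id] = update
}

// node looks up a stored update, also entirely under the lock.
func (c *core) node(id string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	update, ok := c.nodes[id]
	return update, ok
}

// coordinator methods outside the core do orchestration and I/O and never
// touch the mutex, which makes it easy to see what runs under the lock.
type coordinator struct {
	core core
}

func newCoordinator() *coordinator {
	c := &coordinator{}
	c.core.nodes = make(map[string]string)
	return c
}

func (c *coordinator) handleUpdate(id, update string) {
	c.core.setNode(id, update) // the only locked part of this path
	fmt.Printf("stored update for %s\n", id)
}

func main() {
	c := newCoordinator()
	c.handleUpdate("agent-1", `{"addr":"fd7a::1"}`)
	if update, ok := c.core.node("agent-1"); ok {
		fmt.Println("current:", update)
	}
}
```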