fix: fix race in PGCoord at startup #9144

Merged
spikecurtis merged 1 commit into main from spike/pgcoord-dual-mainline-flake
Aug 18, 2023

Conversation

spikecurtis
Contributor

I've noticed a couple of flakes in TestPGCoordinatorDual_Mainline, e.g. https://github.com/coder/coder/actions/runs/5869096650/job/15913511403

I've identified a race condition during the startup of PGCoord that could be responsible.

In PGCoord's querier, we start the pubsub subscriptions and the workers on separate goroutines. That means that right at startup, the following sequence is possible:

  1. we get a mapping request
  2. we query for it and find no data
  3. an update hits the database, but we miss it because we aren't subscribed yet
  4. we subscribe to updates

This can only happen if something connects right as we are starting up, which is basically impossible in the field, but happens all the time in our tests.

The fix is to ensure the subscription happens before step 2, so that if an update comes in after step 2, we will get notified by our subscription.
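
To illustrate the ordering problem, here is a minimal sketch using a toy in-memory store rather than the real PGCoord querier. The names `store`, `lookup`, `subscribeFirst`, and `afterQuery` are all hypothetical and exist only for this example; `afterQuery` stands in for an update that lands right after the initial query.

```go
package main

import "sync"

// store is a hypothetical stand-in for the database and its pubsub channel.
type store struct {
	mu   sync.Mutex
	data map[string]string
	subs []chan string
}

func newStore() *store {
	return &store{data: make(map[string]string)}
}

// Subscribe registers a listener; it only sees updates published afterwards.
func (s *store) Subscribe() <-chan string {
	ch := make(chan string, 16) // buffered so the few updates in this sketch do not block
	s.mu.Lock()
	defer s.mu.Unlock()
	s.subs = append(s.subs, ch)
	return ch
}

// Update writes the value and notifies the listeners registered so far.
func (s *store) Update(key, val string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = val
	for _, ch := range s.subs {
		ch <- key
	}
}

// Query returns the current value for key, if any.
func (s *store) Query(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

// lookup resolves key. afterQuery simulates an update that arrives
// immediately after the initial query (step 3 in the list above).
func lookup(s *store, key string, subscribeFirst bool, afterQuery func()) (string, bool) {
	var updates <-chan string
	if subscribeFirst {
		updates = s.Subscribe() // fixed order: subscribe before querying
	}
	val, ok := s.Query(key)
	afterQuery()
	if !subscribeFirst {
		updates = s.Subscribe() // racy order: the update has already been published
	}
	if ok {
		return val, true
	}
	// Check whether a notification for this key is pending.
	select {
	case k := <-updates:
		if k == key {
			return s.Query(key)
		}
	default:
	}
	return "", false
}
```

With `subscribeFirst` set to `false` and an `afterQuery` that calls `Update`, `lookup` reports the key as missing even though the store now holds it. With `subscribeFirst` set to `true`, the same update is delivered through the subscription and `lookup` returns the new value.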

I can't be 100% sure this is the cause of the test failures, because we don't have logs that detail the ordering of these events. I've also added better logging, so that if something like this happens again we can debug it.

Signed-off-by: Spike Curtis <spike@coder.com>
@spikecurtis spikecurtis merged commit 2f46f23 into main Aug 18, 2023
@spikecurtis spikecurtis deleted the spike/pgcoord-dual-mainline-flake branch August 18, 2023 05:53
@github-actions github-actions bot locked and limited conversation to collaborators Aug 18, 2023
2 participants