PGCoord fails to get pubsub updates #11950

Closed
spikecurtis opened this issue Jan 31, 2024 · 5 comments · Fixed by #12518
Labels: networking (Area: networking), s2 (Broken use cases or features (with a workaround). Only humans may set this.)

Comments

@spikecurtis (Contributor)

Reproduced on our dogfood instance.

One of our 3 replicas is no longer receiving peer and tunnel updates, so if your agent or client connects to that coordinator, it fails to set up networking. I can see this via debug logging: 2 of 3 coordinators log peer and tunnel updates, but the 3rd coordinator logs nothing about receiving updates.

When I connected from a client and landed on the bad coordinator, my client never got the agent node update, but the agent, connected to a "good" coordinator, did.

@spikecurtis added the s0 (Major regression, all-hands-on-deck to fix) and networking (Area: networking) labels on Jan 31, 2024
@spikecurtis self-assigned this on Jan 31, 2024
@cdr-bot added the bug label on Jan 31, 2024
@spikecurtis (Contributor, Author)

The broken instance got some updates for roughly 10ms right at startup, then never received any more. Logs in and around that time look fine.

SQL logs and connections look fine.

Unfortunately our pgPubsub is entirely uninstrumented: no logs, no metrics.

spikecurtis added a commit that referenced this issue Jan 31, 2024
Adds logging around unsubscribing from peer and tunnel updates in pgcoordinator, since #11950 seems to be a problem with these subscriptions
@spikecurtis (Contributor, Author)

After redeploy, everything seems to be working fine. I'm adding instrumentation to pubsub, and hopefully we'll see a recurrence that the new instrumentation will help us understand.

spikecurtis added a commit that referenced this issue Jan 31, 2024
Should be helpful for #11950

Adds a logger to pgPubsub and logs various events, most especially connection and disconnection from postgres.
spikecurtis added a commit that referenced this issue Feb 1, 2024
Adds prometheus metrics to PGPubsub for monitoring its health and performance in production.

Related to #11950 --- additional diagnostics to help figure out what's happening
@spikecurtis (Contributor, Author) commented Feb 5, 2024

Given we have seen this from multiple customers, but have yet to reproduce it in our own infrastructure since the logging & metrics upgrades to pubsub, I wonder how people would feel about a pubsub "watchdog" that periodically sends a heartbeat on a well-known channel, subscribes to that channel, and then kills the whole process if we don't get messages for 3x the heartbeat period. (I'm thinking a 15s period, thus 45s of zombie pubsub before we self-immolate.)

It's definitely a workaround rather than a fix, and killing the process is fairly extreme. Anecdotally, we believe that restarting the process restores functionality, and a coderd without a working pubsub causes all kinds of problems. Before killing the process I'd want to dump goroutine stack traces to the log (à la the tailscale engine watchdog), so that if the pubsub failure is due to some deadlock in the process we'll have a prayer of finding it.

DRAFT PR: #12011

@kylecarbs @bpmct @ammario @sreya @mtojek I'm interested in your thoughts here.
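
For concreteness, here's a minimal sketch of the watchdog idea. The `Pubsub` interface, channel name, and periods below are illustrative assumptions, not the actual coder/coder API, and the real change triggers a graceful shutdown of `coder server` rather than a hard exit:

```go
// Sketch only: "Pubsub", the channel name, and the periods are illustrative.
package watchdog

import (
	"context"
	"os"
	"runtime/pprof"
	"time"
)

const (
	heartbeatChannel = "watchdog_heartbeat" // hypothetical well-known channel
	heartbeatPeriod  = 15 * time.Second
	timeout          = 3 * heartbeatPeriod // 45s of silence => assume zombie pubsub
)

// Pubsub is a stand-in for the real pubsub interface.
type Pubsub interface {
	Publish(channel string, message []byte) error
	Subscribe(channel string, fn func(ctx context.Context, message []byte)) (cancel func(), err error)
}

// Run publishes a heartbeat every heartbeatPeriod and bails out if none is
// received for timeout.
func Run(ctx context.Context, ps Pubsub) error {
	beats := make(chan struct{}, 1)
	cancel, err := ps.Subscribe(heartbeatChannel, func(context.Context, []byte) {
		// Non-blocking send: drop extra beats rather than stall the callback.
		select {
		case beats <- struct{}{}:
		default:
		}
	})
	if err != nil {
		return err
	}
	defer cancel()

	tick := time.NewTicker(heartbeatPeriod)
	defer tick.Stop()
	lastBeat := time.Now()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-beats:
			lastBeat = time.Now()
		case <-tick.C:
			_ = ps.Publish(heartbeatChannel, []byte("ping"))
			if time.Since(lastBeat) > timeout {
				// Dump goroutine stacks first (à la the tailscale engine
				// watchdog) so an in-process deadlock is diagnosable, then
				// bail out (the real change shuts down gracefully instead).
				_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
				os.Exit(1)
			}
		}
	}
}
```

Buffering the heartbeat channel and dropping extra beats keeps the subscriber callback from ever blocking, so the watchdog itself can't contribute to the kind of stall it's trying to detect.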

spikecurtis added a commit that referenced this issue Feb 6, 2024
Adds a watchdog to our pubsub and runs it for Coder server.

If the watchdog times out, it triggers a graceful exit in `coder server` to give any provisioner jobs a chance to shut down.

c.f. #11950
@bpmct added the s1 (Bugs that break core workflows. Only humans may set this.) label on Feb 15, 2024
@cdr-bot removed the s1 label on Feb 15, 2024
@bpmct added the s1 and s2 (Broken use cases or features (with a workaround). Only humans may set this.) labels and removed the s0 (Major regression, all-hands-on-deck to fix) label on Feb 15, 2024
@cdr-bot removed the s2 label on Feb 15, 2024
@bpmct added the s2 label and removed the s1 label on Feb 15, 2024
@bpmct (Member) commented Feb 15, 2024

We're going to set up alerts on our test instances to see if this occurs.

@spikecurtis (Contributor, Author)

The watchdog triggered and we got a stack trace.

tl;dr: PGPubsub can deadlock with pq.Listener because of the way its channels and locking interact.


PGPubsub is protected by a single global mutex that is held (among other times) while

a) subscribing to a new event stream and
b) delivering notifications

The pq.Listener uses a single main loop to read postgres responses, both about subscriptions and about notifications. The deadlock works like this:

  1. A new request to subscribe comes into PGPubsub and we lock the mutex.
  2. While holding the mutex, we call pq.(*Listener).Listen(), which sends a request to postgres and then waits for a response.
  3. Meanwhile, PGPubsub attempts to deliver a notification, and blocks trying to acquire the mutex.
  4. Meanwhile, the pq.Listener main loop receives a notification and attempts to send it over a channel to PGPubsub, but this blocks because the goroutine in step 3 is blocked. That same main loop is the one that would read the Listen() response step 2 is waiting for, so the goroutines in steps 2, 3, and 4 are deadlocked and cannot make forward progress.

A relatively simple fix is to release the mutex while we call pq.(*Listener).Listen(). This should allow the other goroutines to make forward progress and avoid the deadlock.
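
A condensed sketch of that shape of fix, not the actual PGPubsub code: the struct, fields, and method names here are made up for illustration, and the real change is the fix referenced below (#12518). The only point is that the mutex is released before calling into pq.(*Listener).Listen(), so the notification-delivery goroutine can keep draining the Listener:

```go
package pubsub

import (
	"sync"

	"github.com/lib/pq"
)

// Illustrative stand-in for PGPubsub; only the locking shape matters here.
type pubsub struct {
	mu         sync.Mutex
	listeners  map[string][]func(msg []byte)
	pqListener *pq.Listener
}

// subscribe registers a listener for an event channel. Crucially, p.mu is NOT
// held across pq.(*Listener).Listen(): Listen blocks until the pq.Listener
// main loop reads the server's response, and that same loop may be blocked
// handing a notification to deliver(), which in turn is blocked on p.mu.
func (p *pubsub) subscribe(event string, fn func(msg []byte)) error {
	p.mu.Lock()
	needsListen := len(p.listeners[event]) == 0
	p.listeners[event] = append(p.listeners[event], fn)
	p.mu.Unlock() // released before calling into pq

	if needsListen {
		return p.pqListener.Listen(event)
	}
	return nil
}

// deliver runs on the goroutine that drains pqListener.Notify.
func (p *pubsub) deliver(n *pq.Notification) {
	p.mu.Lock()
	fns := append([]func(msg []byte){}, p.listeners[n.Channel]...)
	p.mu.Unlock()
	for _, fn := range fns {
		fn([]byte(n.Extra))
	}
}
```

This leaves out unsubscription and error handling; the essential change is just that we never block on the pq.Listener main loop while holding the lock that notification delivery needs.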

spikecurtis added a commit that referenced this issue Mar 12, 2024
fixes #11950

#11950 (comment) explains the bug

We were also calling into `Unlisten()` and `Close()` while holding the mutex.  I don't believe that `Close()` depends on the notification loop being unblocked, but it's hard to be sure, and the safest thing to do is assume it could block.

So, I added a unit test that fakes out `pq.Listener` and sends a bunch of notifies every time we call into it, to hopefully prevent a regression where we hold the mutex while calling into these functions.

It also removes the use of a `context.Context` to stop the PubSub -- it must be explicitly `Closed()`.  This simplifies a bunch of the logic, and is how we use the pubsub anyway.