PGCoord fails to get pubsub updates #11950
Comments
The broken instance gets some updates for about 10ms right at the start of the day, then doesn't get any more. Logs in and around that time seem fine. SQL logs and connections look fine. Unfortunately our pgPubsub is entirely uninstrumented: no logs, no metrics.
Adds logging to unsubscribing from peer and tunnel updates in pgcoordinator, since #11950 seems to be a problem with these subscriptions.
After redeploy, everything seems to be working fine. I'm adding instrumentation to pubsub, and hopefully we'll see a recurrence that the new instrumentation will help us understand.
Should be helpful for #11950. Adds a logger to pgPubsub and logs various events, most especially connection and disconnection from Postgres.
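For context, lib/pq surfaces connection lifecycle events through the listener's event callback. A minimal sketch of hooking a logger into it (not coder's actual implementation; the connection string and reconnect intervals are placeholders):

```go
package pubsub

import (
	"log"
	"time"

	"github.com/lib/pq"
)

// newLoggedListener wires a logger into lib/pq's event callback so that
// connection and disconnection events from postgres show up in the logs.
func newLoggedListener(connStr string) *pq.Listener {
	logEvent := func(ev pq.ListenerEventType, err error) {
		switch ev {
		case pq.ListenerEventConnected:
			log.Println("pubsub: connected to postgres")
		case pq.ListenerEventDisconnected:
			log.Printf("pubsub: disconnected from postgres: %v", err)
		case pq.ListenerEventReconnected:
			log.Println("pubsub: reconnected to postgres")
		case pq.ListenerEventConnectionAttemptFailed:
			log.Printf("pubsub: connection attempt failed: %v", err)
		}
	}
	// Reconnect intervals here are illustrative values only.
	return pq.NewListener(connStr, time.Second, 30*time.Second, logEvent)
}
```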
Adds Prometheus metrics to PGPubsub for monitoring its health and performance in production. Related to #11950: additional diagnostics to help figure out what's happening.
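A hedged sketch of the kind of instrumentation this describes; the metric names below are invented for illustration and are not coder's actual metric names:

```go
package pubsub

import "github.com/prometheus/client_golang/prometheus"

// metrics holds illustrative counters for pubsub health.
type metrics struct {
	publishesTotal   prometheus.Counter
	receivedTotal    prometheus.Counter
	disconnectsTotal prometheus.Counter
}

func newMetrics(reg prometheus.Registerer) *metrics {
	m := &metrics{
		publishesTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "pubsub_publishes_total",
			Help: "Messages published to the pubsub.",
		}),
		receivedTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "pubsub_received_total",
			Help: "Notifications received from postgres.",
		}),
		disconnectsTotal: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "pubsub_disconnects_total",
			Help: "Times the listener disconnected from postgres.",
		}),
	}
	reg.MustRegister(m.publishesTotal, m.receivedTotal, m.disconnectsTotal)
	return m
}
```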
Given we have seen this from multiple customers, but have yet to reproduce it in our own infrastructure since the logging & metrics upgrades to pubsub, I wonder how people would feel about a pubsub "watchdog" that periodically sends a heartbeat on a well-known channel, subscribes to that channel, and then kills the whole process if we don't get messages for 3x the heartbeat period. (I'm thinking a 15s period, thus 45s of zombie pubsub before we self-immolate.) It's definitely a workaround rather than a fix, and killing the process is fairly extreme. Anecdotally, we believe that restarting the process restores functionality, and a coderd without a working pubsub causes all kinds of problems. Before killing the process I'd want to dump goroutine stack traces to the log (à la the Tailscale engine watchdog) so that if the pubsub failure is due to some deadlock in the process we'll have a prayer of finding it. DRAFT PR: #12011 @kylecarbs @bpmct @ammario @sreya @mtojek I'm interested in your thoughts here.
Adds a watchdog to our pubsub and runs it for Coder server. If the watchdog times out, it triggers a graceful exit in `coder server` to give any provisioner jobs a chance to shut down. cf. #11950
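A minimal sketch of the heartbeat/timeout mechanism described above, assuming a generic `Pubsub` interface and a made-up channel name; the real implementation is in the draft PR, and the graceful `coder server` exit happens in the caller:

```go
package pubsub

import (
	"context"
	"errors"
	"time"
)

// Pubsub is a stand-in for the real pubsub interface.
type Pubsub interface {
	Publish(event string, message []byte) error
	Subscribe(event string, listener func(ctx context.Context, message []byte)) (cancel func(), err error)
}

const (
	heartbeatChannel = "watchdog_heartbeat" // assumed channel name
	heartbeatPeriod  = 15 * time.Second
	heartbeatTimeout = 3 * heartbeatPeriod // 45s of silence => assume the pubsub is dead
)

var ErrUnresponsive = errors.New("pubsub watchdog: no heartbeat received")

// RunWatchdog publishes a heartbeat every period on a well-known channel and
// subscribes to the same channel. It returns ErrUnresponsive if no heartbeat
// arrives within the timeout, at which point the caller can dump goroutine
// stacks and begin a graceful shutdown.
func RunWatchdog(ctx context.Context, ps Pubsub) error {
	beats := make(chan struct{}, 1)
	cancel, err := ps.Subscribe(heartbeatChannel, func(_ context.Context, _ []byte) {
		select {
		case beats <- struct{}{}:
		default:
		}
	})
	if err != nil {
		return err
	}
	defer cancel()

	ticker := time.NewTicker(heartbeatPeriod)
	defer ticker.Stop()
	lastBeat := time.Now()

	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-beats:
			lastBeat = time.Now()
		case <-ticker.C:
			// A failed publish is caught by the timeout check below.
			_ = ps.Publish(heartbeatChannel, []byte("beat"))
			if time.Since(lastBeat) > heartbeatTimeout {
				return ErrUnresponsive
			}
		}
	}
}
```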
We're going to set up alerts on our test instances to see if this occurs.
The watchdog triggered and we got a stack trace. tl;dr: PGPubsub is protected by a single global mutex that is held (among other times) while a) subscribing to a new event stream and b) delivering notifications from the notification loop to subscribers. The calls into `pq.Listener` can block until the notification loop is unblocked, so holding the mutex across them can deadlock the whole pubsub.
A relatively simple fix is to release the mutex while we call into the `pq.Listener`.
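A hedged sketch of that shape of fix, not the actual coder code; the struct and its fields are simplified, and concurrent first-subscribers are not handled for brevity:

```go
package pubsub

import (
	"sync"

	"github.com/lib/pq"
)

type listenerFunc func(message []byte)

type pubsub struct {
	mu         sync.Mutex
	queues     map[string][]listenerFunc
	pgListener *pq.Listener
}

// Subscribe registers l for event. The mutex only guards the queue map and is
// released before calling pq.Listener.Listen, which can block until the
// notification loop drains -- and the notification loop needs this same mutex
// to dispatch notifications to subscribers.
func (p *pubsub) Subscribe(event string, l listenerFunc) error {
	p.mu.Lock()
	first := len(p.queues[event]) == 0 // first subscriber for this event stream?
	p.queues[event] = append(p.queues[event], l)
	p.mu.Unlock()

	if first {
		// Potentially blocking call made *without* holding p.mu.
		return p.pgListener.Listen(event)
	}
	return nil
}
```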
Fixes #11950. #11950 (comment) explains the bug. We were also calling into `Unlisten()` and `Close()` while holding the mutex. I don't believe that `Close()` depends on the notification loop being unblocked, but it's hard to be sure, and the safest thing to do is assume it could block. So, I added a unit test that fakes out `pq.Listener` and sends a bunch of notifies every time we call into it, to hopefully prevent a regression where we hold the mutex while calling into these functions. It also removes the use of a `context.Context` to stop the pubsub -- it must be explicitly `Closed()`. This simplifies a bunch of the logic, and is how we use the pubsub anyway.
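Under the assumption that the pubsub reaches `pq.Listener` through a small interface it owns, the regression test's fake could look roughly like this: every call into it floods the notification channel, so any code path that still holds the dispatch mutex across these calls would hang the test:

```go
package pubsub

import "github.com/lib/pq"

// pqListener is an assumed seam over the pq.Listener methods the pubsub uses.
type pqListener interface {
	Listen(channel string) error
	Unlisten(channel string) error
	Close() error
	NotificationChannel() <-chan *pq.Notification
}

// fakeListener sends a burst of notifications on every call, mimicking a busy
// postgres so that holding the mutex across these calls deadlocks the test.
type fakeListener struct {
	notify chan *pq.Notification
}

func (f *fakeListener) flood(channel string) {
	for i := 0; i < 100; i++ {
		f.notify <- &pq.Notification{Channel: channel, Extra: "fake"}
	}
}

func (f *fakeListener) Listen(channel string) error   { f.flood(channel); return nil }
func (f *fakeListener) Unlisten(channel string) error { f.flood(channel); return nil }
func (f *fakeListener) Close() error                  { close(f.notify); return nil }

func (f *fakeListener) NotificationChannel() <-chan *pq.Notification { return f.notify }
```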
Reproduced on our dogfood instance.
One of our 3 replicas is no longer receiving peer and tunnel updates, so if your agent or client connects to that coordinator, it will fail to set up networking. I can see this via debug logging: 2 of 3 coordinators log peer and tunnel updates, but the 3rd coordinator emits no logs about receiving updates.
When I connected from a client and landed on the bad coordinator, my client never got the agent node update, but the agent, connected to a "good" coordinator, did.