Skip to content

Commit 7519bd1

Browse files
committed
Fix race condition between shutdown and unstarted background workers.
If a database shutdown (smart or fast) is commanded between the time some process decides to request a new background worker and the time that the postmaster can launch that worker, then nothing happens because the postmaster won't launch any bgworkers once it's exited PM_RUN state. This is fine ... unless the requesting process is waiting for that worker to finish (or even for it to start); in that case the requestor is stuck, and only manual intervention will get us to the point of being able to shut down. To fix, cancel pending requests for workers when the postmaster sends shutdown (SIGTERM) signals, and similarly cancel any new requests that arrive after that point. (We can optimize things slightly by only doing the cancellation for workers that have waiters.) To fit within the existing bgworker APIs, the "cancel" is made to look like the worker was started and immediately stopped, causing deregistration of the bgworker entry. Waiting processes would have to deal with premature worker exit anyway, so this should introduce no bugs that weren't there before. We do have a side effect that registration records for restartable bgworkers might disappear when theoretically they should have remained in place; but since we're shutting down, that shouldn't matter. Back-patch to v10. There might be value in putting this into 9.6 as well, but the management of bgworkers is a bit different there (notably see 8ff5186) and I'm not convinced it's worth the effort to validate the patch for that branch. Discussion: https://postgr.es/m/661570.1608673226@sss.pgh.pa.us
1 parent 7e784d1 commit 7519bd1

File tree

4 files changed

+90
-20
lines changed

4 files changed

+90
-20
lines changed

contrib/pg_prewarm/autoprewarm.c

Lines changed: 1 addition & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -396,12 +396,7 @@ apw_load_buffers(void)
396396

397397
/*
398398
* Likewise, don't launch if we've already been told to shut down.
399-
*
400-
* There is a race condition here: if the postmaster has received a
401-
* fast-shutdown signal, but we've not heard about it yet, then the
402-
* postmaster will ignore our worker start request and we'll wait
403-
* forever. However, that's a bug in the general background-worker
404-
* logic, not the fault of this module.
399+
* (The launch would fail anyway, but we might as well skip it.)
405400
*/
406401
if (ShutdownRequestPending)
407402
break;

src/backend/postmaster/bgworker.c

Lines changed: 71 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -231,13 +231,15 @@ FindRegisteredWorkerBySlotNumber(int slotno)
231231
}
232232

233233
/*
234-
* Notice changes to shared memory made by other backends. This code
235-
* runs in the postmaster, so we must be very careful not to assume that
236-
* shared memory contents are sane. Otherwise, a rogue backend could take
237-
* out the postmaster.
234+
* Notice changes to shared memory made by other backends.
235+
* Accept new worker requests only if allow_new_workers is true.
236+
*
237+
* This code runs in the postmaster, so we must be very careful not to assume
238+
* that shared memory contents are sane. Otherwise, a rogue backend could
239+
* take out the postmaster.
238240
*/
239241
void
240-
BackgroundWorkerStateChange(void)
242+
BackgroundWorkerStateChange(bool allow_new_workers)
241243
{
242244
int slotno;
243245

@@ -297,6 +299,15 @@ BackgroundWorkerStateChange(void)
297299
continue;
298300
}
299301

302+
/*
303+
* If we aren't allowing new workers, then immediately mark it for
304+
* termination; the next stanza will take care of cleaning it up.
305+
* Doing this ensures that any process waiting for the worker will get
306+
* awoken, even though the worker will never be allowed to run.
307+
*/
308+
if (!allow_new_workers)
309+
slot->terminate = true;
310+
300311
/*
301312
* If the worker is marked for termination, we don't need to add it to
302313
* the registered workers list; we can just free the slot. However, if
@@ -503,12 +514,55 @@ BackgroundWorkerStopNotifications(pid_t pid)
503514
}
504515
}
505516

517+
/*
518+
* Cancel any not-yet-started worker requests that have waiting processes.
519+
*
520+
* This is called during a normal ("smart" or "fast") database shutdown.
521+
* After this point, no new background workers will be started, so anything
522+
* that might be waiting for them needs to be kicked off its wait. We do
523+
* that by cancelling the bgworker registration entirely, which is perhaps
524+
* overkill, but since we're shutting down it does not matter whether the
525+
* registration record sticks around.
526+
*
527+
* This function should only be called from the postmaster.
528+
*/
529+
void
530+
ForgetUnstartedBackgroundWorkers(void)
531+
{
532+
slist_mutable_iter iter;
533+
534+
slist_foreach_modify(iter, &BackgroundWorkerList)
535+
{
536+
RegisteredBgWorker *rw;
537+
BackgroundWorkerSlot *slot;
538+
539+
rw = slist_container(RegisteredBgWorker, rw_lnode, iter.cur);
540+
Assert(rw->rw_shmem_slot < max_worker_processes);
541+
slot = &BackgroundWorkerData->slot[rw->rw_shmem_slot];
542+
543+
/* If it's not yet started, and there's someone waiting ... */
544+
if (slot->pid == InvalidPid &&
545+
rw->rw_worker.bgw_notify_pid != 0)
546+
{
547+
/* ... then zap it, and notify the waiter */
548+
int notify_pid = rw->rw_worker.bgw_notify_pid;
549+
550+
ForgetBackgroundWorker(&iter);
551+
if (notify_pid != 0)
552+
kill(notify_pid, SIGUSR1);
553+
}
554+
}
555+
}
556+
506557
/*
507558
* Reset background worker crash state.
508559
*
509560
* We assume that, after a crash-and-restart cycle, background workers without
510561
* the never-restart flag should be restarted immediately, instead of waiting
511-
* for bgw_restart_time to elapse.
562+
* for bgw_restart_time to elapse. On the other hand, workers with that flag
563+
* should be forgotten immediately, since we won't ever restart them.
564+
*
565+
* This function should only be called from the postmaster.
512566
*/
513567
void
514568
ResetBackgroundWorkerCrashTimes(void)
@@ -548,6 +602,11 @@ ResetBackgroundWorkerCrashTimes(void)
548602
* resetting.
549603
*/
550604
rw->rw_crashed_at = 0;
605+
606+
/*
607+
* If there was anyone waiting for it, they're history.
608+
*/
609+
rw->rw_worker.bgw_notify_pid = 0;
551610
}
552611
}
553612
}
@@ -1077,6 +1136,9 @@ GetBackgroundWorkerPid(BackgroundWorkerHandle *handle, pid_t *pidp)
10771136
* returned. However, if the postmaster has died, we give up and return
10781137
* BGWH_POSTMASTER_DIED, since it that case we know that startup will not
10791138
* take place.
1139+
*
1140+
* The caller *must* have set our PID as the worker's bgw_notify_pid,
1141+
* else we will not be awoken promptly when the worker's state changes.
10801142
*/
10811143
BgwHandleStatus
10821144
WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pidp)
@@ -1119,6 +1181,9 @@ WaitForBackgroundWorkerStartup(BackgroundWorkerHandle *handle, pid_t *pidp)
11191181
* and then return BGWH_STOPPED. However, if the postmaster has died, we give
11201182
* up and return BGWH_POSTMASTER_DIED, because it's the postmaster that
11211183
* notifies us when a worker's state changes.
1184+
*
1185+
* The caller *must* have set our PID as the worker's bgw_notify_pid,
1186+
* else we will not be awoken promptly when the worker's state changes.
11221187
*/
11231188
BgwHandleStatus
11241189
WaitForBackgroundWorkerShutdown(BackgroundWorkerHandle *handle)

src/backend/postmaster/postmaster.c

Lines changed: 16 additions & 7 deletions
Original file line numberDiff line numberDiff line change
@@ -3800,6 +3800,13 @@ PostmasterStateMachine(void)
38003800
*/
38013801
if (pmState == PM_STOP_BACKENDS)
38023802
{
3803+
/*
3804+
* Forget any pending requests for background workers, since we're no
3805+
* longer willing to launch any new workers. (If additional requests
3806+
* arrive, BackgroundWorkerStateChange will reject them.)
3807+
*/
3808+
ForgetUnstartedBackgroundWorkers();
3809+
38033810
/* Signal all backend children except walsenders */
38043811
SignalSomeChildren(SIGTERM,
38053812
BACKEND_TYPE_ALL - BACKEND_TYPE_WALSND);
@@ -5154,13 +5161,6 @@ sigusr1_handler(SIGNAL_ARGS)
51545161
PG_SETMASK(&BlockSig);
51555162
#endif
51565163

5157-
/* Process background worker state change. */
5158-
if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
5159-
{
5160-
BackgroundWorkerStateChange();
5161-
StartWorkerNeeded = true;
5162-
}
5163-
51645164
/*
51655165
* RECOVERY_STARTED and BEGIN_HOT_STANDBY signals are ignored in
51665166
* unexpected states. If the startup process quickly starts up, completes
@@ -5206,6 +5206,7 @@ sigusr1_handler(SIGNAL_ARGS)
52065206

52075207
pmState = PM_RECOVERY;
52085208
}
5209+
52095210
if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
52105211
pmState == PM_RECOVERY && Shutdown == NoShutdown)
52115212
{
@@ -5231,6 +5232,14 @@ sigusr1_handler(SIGNAL_ARGS)
52315232
StartWorkerNeeded = true;
52325233
}
52335234

5235+
/* Process background worker state changes. */
5236+
if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
5237+
{
5238+
/* Accept new worker requests only if not stopping. */
5239+
BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS);
5240+
StartWorkerNeeded = true;
5241+
}
5242+
52345243
if (StartWorkerNeeded || HaveCrashedWorker)
52355244
maybe_start_bgworkers();
52365245

src/include/postmaster/bgworker_internals.h

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,11 +46,12 @@ extern slist_head BackgroundWorkerList;
4646

4747
extern Size BackgroundWorkerShmemSize(void);
4848
extern void BackgroundWorkerShmemInit(void);
49-
extern void BackgroundWorkerStateChange(void);
49+
extern void BackgroundWorkerStateChange(bool allow_new_workers);
5050
extern void ForgetBackgroundWorker(slist_mutable_iter *cur);
5151
extern void ReportBackgroundWorkerPID(RegisteredBgWorker *);
5252
extern void ReportBackgroundWorkerExit(slist_mutable_iter *cur);
5353
extern void BackgroundWorkerStopNotifications(pid_t pid);
54+
extern void ForgetUnstartedBackgroundWorkers(void);
5455
extern void ResetBackgroundWorkerCrashTimes(void);
5556

5657
/* Function to start a background worker, called from postmaster.c */

0 commit comments

Comments
 (0)