Skip to content

Commit c922ae2

Browse files
committed
Fix race with synchronous_standby_names at startup
synchronous_standby_names cannot be reloaded safely by backends, and the checkpointer is in charge of updating a state in shared memory if the GUC is enabled in WalSndCtl, to let the backends know if they should wait or not for a given LSN. This provides a strict control on the timing of the waiting queues if the GUC is enabled or disabled, then reloaded. The checkpointer is also in charge of waking up the backends that could be waiting for a LSN when the GUC is disabled. This logic had a race condition at startup, where it would be possible for backends to not wait for a LSN even if synchronous_standby_names is enabled. This would cause visibility issues with transactions that we should be waiting for but they were not. The problem lasts until the checkpointer does its initial update of the shared memory state when it loads synchronous_standby_names. In order to take care of this problem, the shared memory state in WalSndCtl is extended to detect if it has been initialized by the checkpointer, and not only check if synchronous_standby_names is defined. In WalSndCtlData, sync_standbys_defined is renamed to sync_standbys_status, a bits8 able to know about two states: - If the shared memory state has been initialized. This flag is set by the checkpointer at startup once, and never removed. - If synchronous_standby_names is known as defined in the shared memory state. This is the same as the previous sync_standbys_defined in WalSndCtl. This method gives a way for backends to decide what they should do until the shared memory area is initialized, and they now ultimately fall back to a check on the GUC value in this case, which is the best thing that can be done. Fortunately, SyncRepUpdateSyncStandbysDefined() is called immediately by the checkpointer when this process starts, so the window is very narrow. It is possible to enlarge the problematic window by making the checkpointer wait at the beginning of SyncRepUpdateSyncStandbysDefined() with a hardcoded sleep for example, and doing so has showed that a 2PC visibility test is indeed failing. On machines slow enough, this bug would cause spurious failures. In 17~, we have looked at the possibility of adding an injection point to have a reproducible test, but as the problematic window happens at early startup, we would need to invent a way to make an injection point optionally persistent across restarts when attached, something that would be fine for this case as it would involve the checkpointer. This issue is quite old, and can be reproduced on all the stable branches. Author: Melnikov Maksim <m.melnikov@postgrespro.ru> Co-authored-by: Michael Paquier <michael@paquier.xyz> Discussion: https://postgr.es/m/163fcbec-900b-4b07-beaa-d2ead8634bec@postgrespro.ru Backpatch-through: 13
1 parent 047495f commit c922ae2

File tree

3 files changed

+104
-22
lines changed

3 files changed

+104
-22
lines changed

src/backend/replication/syncrep.c

Lines changed: 82 additions & 15 deletions
Original file line numberDiff line numberDiff line change
@@ -162,16 +162,23 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
162162
* sync replication standby names defined.
163163
*
164164
* Since this routine gets called every commit time, it's important to
165-
* exit quickly if sync replication is not requested. So we check
166-
* WalSndCtl->sync_standbys_defined flag without the lock and exit
167-
* immediately if it's false. If it's true, we need to check it again
168-
* later while holding the lock, to check the flag and operate the sync
169-
* rep queue atomically. This is necessary to avoid the race condition
170-
* described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
171-
* it's false, the lock is not necessary because we don't touch the queue.
165+
* exit quickly if sync replication is not requested.
166+
*
167+
* We check WalSndCtl->sync_standbys_status flag without the lock and exit
168+
* immediately if SYNC_STANDBY_INIT is set (the checkpointer has
169+
* initialized this data) but SYNC_STANDBY_DEFINED is missing (no sync
170+
* replication requested).
171+
*
172+
* If SYNC_STANDBY_DEFINED is set, we need to check the status again later
173+
* while holding the lock, to check the flag and operate the sync rep
174+
* queue atomically. This is necessary to avoid the race condition
175+
* described in SyncRepUpdateSyncStandbysDefined(). On the other hand, if
176+
* SYNC_STANDBY_DEFINED is not set, the lock is not necessary because we
177+
* don't touch the queue.
172178
*/
173179
if (!SyncRepRequested() ||
174-
!((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
180+
((((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status) &
181+
(SYNC_STANDBY_INIT | SYNC_STANDBY_DEFINED)) == SYNC_STANDBY_INIT)
175182
return;
176183

177184
/* Cap the level for anything other than commit to remote flush only. */
@@ -187,16 +194,52 @@ SyncRepWaitForLSN(XLogRecPtr lsn, bool commit)
187194
Assert(MyProc->syncRepState == SYNC_REP_NOT_WAITING);
188195

189196
/*
190-
* We don't wait for sync rep if WalSndCtl->sync_standbys_defined is not
191-
* set. See SyncRepUpdateSyncStandbysDefined.
197+
* We don't wait for sync rep if SYNC_STANDBY_DEFINED is not set. See
198+
* SyncRepUpdateSyncStandbysDefined().
192199
*
193200
* Also check that the standby hasn't already replied. Unlikely race
194201
* condition but we'll be fetching that cache line anyway so it's likely
195202
* to be a low cost check.
203+
*
204+
* If the sync standby data has not been initialized yet
205+
* (SYNC_STANDBY_INIT is not set), fall back to a check based on the LSN,
206+
* then do a direct GUC check.
196207
*/
197-
if (!WalSndCtl->sync_standbys_defined ||
198-
lsn <= WalSndCtl->lsn[mode])
208+
if (WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT)
209+
{
210+
if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) == 0 ||
211+
lsn <= WalSndCtl->lsn[mode])
212+
{
213+
LWLockRelease(SyncRepLock);
214+
return;
215+
}
216+
}
217+
else if (lsn <= WalSndCtl->lsn[mode])
199218
{
219+
/*
220+
* The LSN is older than what we need to wait for. The sync standby
221+
* data has not been initialized yet, but we are OK to not wait
222+
* because we know that there is no point in doing so based on the
223+
* LSN.
224+
*/
225+
LWLockRelease(SyncRepLock);
226+
return;
227+
}
228+
else if (!SyncStandbysDefined())
229+
{
230+
/*
231+
* If we are here, the sync standby data has not been initialized yet,
232+
* and the LSN is newer than what need to wait for, so we have fallen
233+
* back to the best thing we could do in this case: a check on
234+
* SyncStandbysDefined() to see if the GUC is set or not.
235+
*
236+
* When the GUC has a value, we wait until the checkpointer updates
237+
* the status data because we cannot be sure yet if we should wait or
238+
* not. Here, the GUC has *no* value, we are sure that there is no
239+
* point to wait; this matters for example when initializing a
240+
* cluster, where we should never wait, and no sync standbys is the
241+
* default behavior.
242+
*/
200243
LWLockRelease(SyncRepLock);
201244
return;
202245
}
@@ -918,7 +961,7 @@ SyncRepWakeQueue(bool all, int mode)
918961

919962
/*
920963
* The checkpointer calls this as needed to update the shared
921-
* sync_standbys_defined flag, so that backends don't remain permanently wedged
964+
* sync_standbys_status flag, so that backends don't remain permanently wedged
922965
* if synchronous_standby_names is unset. It's safe to check the current value
923966
* without the lock, because it's only ever updated by one process. But we
924967
* must take the lock to change it.
@@ -928,7 +971,8 @@ SyncRepUpdateSyncStandbysDefined(void)
928971
{
929972
bool sync_standbys_defined = SyncStandbysDefined();
930973

931-
if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
974+
if (sync_standbys_defined !=
975+
((WalSndCtl->sync_standbys_status & SYNC_STANDBY_DEFINED) != 0))
932976
{
933977
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
934978

@@ -952,7 +996,30 @@ SyncRepUpdateSyncStandbysDefined(void)
952996
* backend that hasn't yet reloaded its config might go to sleep on
953997
* the queue (and never wake up). This prevents that.
954998
*/
955-
WalSndCtl->sync_standbys_defined = sync_standbys_defined;
999+
WalSndCtl->sync_standbys_status = SYNC_STANDBY_INIT |
1000+
(sync_standbys_defined ? SYNC_STANDBY_DEFINED : 0);
1001+
1002+
LWLockRelease(SyncRepLock);
1003+
}
1004+
else if ((WalSndCtl->sync_standbys_status & SYNC_STANDBY_INIT) == 0)
1005+
{
1006+
LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
1007+
1008+
/*
1009+
* Note that there is no need to wake up the queues here. We would
1010+
* reach this path only if SyncStandbysDefined() returns false, or it
1011+
* would mean that some backends are waiting with the GUC set. See
1012+
* SyncRepWaitForLSN().
1013+
*/
1014+
Assert(!SyncStandbysDefined());
1015+
1016+
/*
1017+
* Even if there is no sync standby defined, let the readers of this
1018+
* information know that the sync standby data has been initialized.
1019+
* This can just be done once, hence the previous check on
1020+
* SYNC_STANDBY_INIT to avoid useless work.
1021+
*/
1022+
WalSndCtl->sync_standbys_status |= SYNC_STANDBY_INIT;
9561023

9571024
LWLockRelease(SyncRepLock);
9581025
}

src/backend/replication/walsender.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1499,13 +1499,13 @@ WalSndUpdateProgress(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId
14991499
* When skipping empty transactions in synchronous replication, we send a
15001500
* keepalive message to avoid delaying such transactions.
15011501
*
1502-
* It is okay to check sync_standbys_defined flag without lock here as in
1503-
* the worst case we will just send an extra keepalive message when it is
1502+
* It is okay to check sync_standbys_status without lock here as in the
1503+
* worst case we will just send an extra keepalive message when it is
15041504
* really not required.
15051505
*/
15061506
if (skipped_xact &&
15071507
SyncRepRequested() &&
1508-
((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_defined)
1508+
(((volatile WalSndCtlData *) WalSndCtl)->sync_standbys_status & SYNC_STANDBY_DEFINED))
15091509
{
15101510
WalSndKeepalive(false, lsn);
15111511

src/include/replication/walsender_private.h

Lines changed: 19 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -103,11 +103,11 @@ typedef struct
103103
XLogRecPtr lsn[NUM_SYNC_REP_WAIT_MODE];
104104

105105
/*
106-
* Are any sync standbys defined? Waiting backends can't reload the
107-
* config file safely, so checkpointer updates this value as needed.
108-
* Protected by SyncRepLock.
106+
* Status of data related to the synchronous standbys. Waiting backends
107+
* can't reload the config file safely, so checkpointer updates this value
108+
* as needed. Protected by SyncRepLock.
109109
*/
110-
bool sync_standbys_defined;
110+
bits8 sync_standbys_status;
111111

112112
/* used as a registry of physical / logical walsenders to wake */
113113
ConditionVariable wal_flush_cv;
@@ -116,6 +116,21 @@ typedef struct
116116
WalSnd walsnds[FLEXIBLE_ARRAY_MEMBER];
117117
} WalSndCtlData;
118118

119+
/* Flags for WalSndCtlData->sync_standbys_status */
120+
121+
/*
122+
* Is the synchronous standby data initialized from the GUC? This is set the
123+
* first time synchronous_standby_names is processed by the checkpointer.
124+
*/
125+
#define SYNC_STANDBY_INIT (1 << 0)
126+
127+
/*
128+
* Is the synchronous standby data defined? This is set when
129+
* synchronous_standby_names has some data, after being processed by the
130+
* checkpointer.
131+
*/
132+
#define SYNC_STANDBY_DEFINED (1 << 1)
133+
119134
extern PGDLLIMPORT WalSndCtlData *WalSndCtl;
120135

121136

0 commit comments

Comments
 (0)