Skip to content

Commit 161021d

Browse files
committed
Ensure recovery pause feature doesn't pause unless users can connect.
If we're not in hot standby mode, then there's no way for users to connect to reset the recoveryPause flag, so we shouldn't pause. The code was aware of this but the test to see if pausing was safe was seriously inadequate: it wasn't paying attention to reachedConsistency, and besides what it was testing was that we could legally enter hot standby, not that we have done so. Get rid of that in favor of checking LocalHotStandbyActive, which because of the coding in CheckRecoveryConsistency is tantamount to checking that we have told the postmaster to enter hot standby. Also, move the recoveryPausesHere() call that reacts to asynchronous recoveryPause requests so that it's not in the middle of application of a WAL record. I put it next to the recoveryStopsHere() call --- in future those are going to need to interact significantly, so this seems like a good waystation. Also, don't bother trying to read another WAL record if we've already decided not to continue recovery. This was no big deal when the code was written originally, but now that reading a record might entail actions like fetching an archive file, it seems a bit silly to do it like that. Per report from Jeff Janes and subsequent discussion. The pause feature needs quite a lot more work, but this gets rid of some indisputable bugs, and seems safe enough to back-patch.
1 parent d56b629 commit 161021d

File tree

1 file changed

+34
-16
lines changed
  • src/backend/access/transam

1 file changed

+34
-16
lines changed

src/backend/access/transam/xlog.c

Lines changed: 34 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -5916,13 +5916,19 @@ recoveryStopsHere(XLogRecord *record, bool *includeThis)
59165916
}
59175917

59185918
/*
5919-
* Recheck shared recoveryPause by polling.
5919+
* Wait until shared recoveryPause flag is cleared.
59205920
*
5921-
* XXX Can also be done with shared latch.
5921+
* XXX Could also be done with shared latch, avoiding the pg_usleep loop.
5922+
* Probably not worth the trouble though. This state shouldn't be one that
5923+
* anyone cares about server power consumption in.
59225924
*/
59235925
static void
59245926
recoveryPausesHere(void)
59255927
{
5928+
/* Don't pause unless users can connect! */
5929+
if (!LocalHotStandbyActive)
5930+
return;
5931+
59265932
ereport(LOG,
59275933
(errmsg("recovery has paused"),
59285934
errhint("Execute pg_xlog_replay_resume() to continue.")));
@@ -6650,7 +6656,6 @@ StartupXLOG(void)
66506656
{
66516657
bool recoveryContinue = true;
66526658
bool recoveryApply = true;
6653-
bool recoveryPause = false;
66546659
ErrorContextCallback errcontext;
66556660
TimestampTz xtime;
66566661

@@ -6692,22 +6697,36 @@ StartupXLOG(void)
66926697
/* Allow read-only connections if we're consistent now */
66936698
CheckRecoveryConsistency();
66946699

6700+
/*
6701+
* Pause WAL replay, if requested by a hot-standby session via
6702+
* SetRecoveryPause().
6703+
*
6704+
* Note that we intentionally don't take the info_lck spinlock
6705+
* here. We might therefore read a slightly stale value of
6706+
* the recoveryPause flag, but it can't be very stale (no
6707+
* worse than the last spinlock we did acquire). Since a
6708+
* pause request is a pretty asynchronous thing anyway,
6709+
* possibly responding to it one WAL record later than we
6710+
* otherwise would is a minor issue, so it doesn't seem worth
6711+
* adding another spinlock cycle to prevent that.
6712+
*/
6713+
if (xlogctl->recoveryPause)
6714+
recoveryPausesHere();
6715+
66956716
/*
66966717
* Have we reached our recovery target?
66976718
*/
66986719
if (recoveryStopsHere(record, &recoveryApply))
66996720
{
6700-
/*
6701-
* Pause only if users can connect to send a resume
6702-
* message
6703-
*/
6704-
if (recoveryPauseAtTarget && standbyState == STANDBY_SNAPSHOT_READY)
6721+
if (recoveryPauseAtTarget)
67056722
{
67066723
SetRecoveryPause(true);
67076724
recoveryPausesHere();
67086725
}
67096726
reachedStopPoint = true; /* see below */
67106727
recoveryContinue = false;
6728+
6729+
/* Exit loop if we reached non-inclusive recovery target */
67116730
if (!recoveryApply)
67126731
break;
67136732
}
@@ -6740,15 +6759,8 @@ StartupXLOG(void)
67406759
*/
67416760
SpinLockAcquire(&xlogctl->info_lck);
67426761
xlogctl->replayEndRecPtr = EndRecPtr;
6743-
recoveryPause = xlogctl->recoveryPause;
67446762
SpinLockRelease(&xlogctl->info_lck);
67456763

6746-
/*
6747-
* Pause only if users can connect to send a resume message
6748-
*/
6749-
if (recoveryPause && standbyState == STANDBY_SNAPSHOT_READY)
6750-
recoveryPausesHere();
6751-
67526764
/*
67536765
* If we are attempting to enter Hot Standby mode, process
67546766
* XIDs we see
@@ -6792,10 +6804,16 @@ StartupXLOG(void)
67926804
xlogctl->recoveryLastRecPtr = EndRecPtr;
67936805
SpinLockRelease(&xlogctl->info_lck);
67946806

6807+
/* Remember this record as the last-applied one */
67956808
LastRec = ReadRecPtr;
67966809

6810+
/* Exit loop if we reached inclusive recovery target */
6811+
if (!recoveryContinue)
6812+
break;
6813+
6814+
/* Else, try to fetch the next WAL record */
67976815
record = ReadRecord(NULL, LOG, false);
6798-
} while (record != NULL && recoveryContinue);
6816+
} while (record != NULL);
67996817

68006818
/*
68016819
* end of main redo apply loop

0 commit comments

Comments
 (0)