Commit f352f43

Committed Jul 5, 2018
Prevent references to invalid relation pages after fresh promotion
If a standby crashes after promotion before having completed its first post-recovery checkpoint, then the minimal recovery point which marks the LSN position where the cluster is able to reach consistency may be set to a position older than the first end-of-recovery checkpoint while all the WAL available should be replayed. This leads to the instance thinking that it contains inconsistent pages, causing a PANIC and a hard instance crash even if all the WAL available has not been replayed for certain sets of records replayed. When in crash recovery, minRecoveryPoint is expected to always be set to InvalidXLogRecPtr, which forces the recovery to replay all the WAL available, so this commit makes sure that the local copy of minRecoveryPoint from the control file is initialized properly and stays as it is while crash recovery is performed. Once switching to archive recovery or if crash recovery finishes, then the local copy minRecoveryPoint can be safely updated.

Pavan Deolasee has reported and diagnosed the failure in the first place, and the base fix idea to rely on the local copy of minRecoveryPoint comes from Kyotaro Horiguchi, which has been expanded into a full-fledged patch by me. The test included in this commit has been written by Álvaro Herrera and Pavan Deolasee, which I have modified to make it faster and more reliable with sleep phases.

Backpatch down to all supported versions where the bug appears, aka 9.3 which is where the end-of-recovery checkpoint is not run by the startup process anymore. The test gets easily supported down to 10, still it has been tested on all branches.

Reported-by: Pavan Deolasee
Diagnosed-by: Pavan Deolasee
Reviewed-by: Pavan Deolasee, Kyotaro Horiguchi
Author: Michael Paquier, Kyotaro Horiguchi, Pavan Deolasee, Álvaro Herrera
Discussion: https://postgr.es/m/CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com
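The behaviour described in the commit message can be illustrated with a small standalone model. This is only a sketch, not PostgreSQL code: the `RecoveryModel` struct and the `model_*` functions are hypothetical names, the control file is reduced to a single field, locking is omitted, and the update step simply advances to the requested LSN instead of the last replayed position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's types; not the real definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical model of the state the startup process tracks. */
typedef struct RecoveryModel
{
    XLogRecPtr control_min_recovery_point; /* value kept in the control file */
    XLogRecPtr local_min_recovery_point;   /* local copy in the startup process */
    bool       in_archive_recovery;
    bool       update_min_recovery_point;
} RecoveryModel;

/*
 * Models the initialization done in StartupXLOG(): the control file value
 * is trusted only in archive recovery; crash recovery starts from an
 * invalid local copy so that all available WAL gets replayed.
 */
static void
model_init(RecoveryModel *m, XLogRecPtr control_value, bool in_archive_recovery)
{
    assert(m != NULL);
    m->control_min_recovery_point = control_value;
    m->in_archive_recovery = in_archive_recovery;
    m->update_min_recovery_point = true;
    m->local_min_recovery_point =
        in_archive_recovery ? control_value : InvalidXLogRecPtr;
}

/*
 * Models the early exit in UpdateMinRecoveryPoint(). Returns true only if
 * the (modelled) control file value was advanced.
 */
static bool
model_update_min_recovery_point(RecoveryModel *m, XLogRecPtr lsn, bool force)
{
    assert(m != NULL);
    if (!m->update_min_recovery_point ||
        (!force && lsn <= m->local_min_recovery_point))
        return false;

    /* Invalid local copy: crash recovery, so never touch the control file. */
    if (m->local_min_recovery_point == InvalidXLogRecPtr)
    {
        m->update_min_recovery_point = false;
        return false;
    }

    /* Refresh the local copy, then advance if needed (simplified). */
    m->local_min_recovery_point = m->control_min_recovery_point;
    if (force || m->local_min_recovery_point < lsn)
    {
        m->control_min_recovery_point = lsn;
        m->local_min_recovery_point = lsn;
        return true;
    }
    return false;
}

/* Models the switch from crash recovery to archive recovery in ReadRecord(). */
static void
model_switch_to_archive_recovery(RecoveryModel *m)
{
    assert(m != NULL);
    m->in_archive_recovery = true;
    m->local_min_recovery_point = m->control_min_recovery_point;
    m->update_min_recovery_point = true;
}
```

In this model, a crash-recovery start with a stale control file value of 100 leaves the local copy invalid, and a later request to advance to 200 is ignored. Only after `model_switch_to_archive_recovery()` does the same request move the control file value.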
1 parent 8c8c9f3 commit f352f43

src/backend/access/transam/xlog.c

Lines changed: 70 additions & 31 deletions
@@ -749,8 +749,14 @@ static XLogSource XLogReceiptSource = 0; /* XLOG_FROM_* code */
 static XLogRecPtr ReadRecPtr; /* start of last record read */
 static XLogRecPtr EndRecPtr; /* end+1 of last record read */
 
-static XLogRecPtr minRecoveryPoint; /* local copy of
-                                     * ControlFile->minRecoveryPoint */
+/*
+ * Local copies of equivalent fields in the control file. When running
+ * crash recovery, minRecoveryPoint is set to InvalidXLogRecPtr as we
+ * expect to replay all the WAL available, and updateMinRecoveryPoint is
+ * switched to false to prevent any updates while replaying records.
+ * Those values are kept consistent as long as crash recovery runs.
+ */
+static XLogRecPtr minRecoveryPoint;
 static TimeLineID minRecoveryPointTLI;
 static bool updateMinRecoveryPoint = true;
 
@@ -2706,20 +2712,26 @@ UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
     if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
         return;
 
+    /*
+     * An invalid minRecoveryPoint means that we need to recover all the WAL,
+     * i.e., we're doing crash recovery. We never modify the control file's
+     * value in that case, so we can short-circuit future checks here too. The
+     * local values of minRecoveryPoint and minRecoveryPointTLI should not be
+     * updated until crash recovery finishes.
+     */
+    if (XLogRecPtrIsInvalid(minRecoveryPoint))
+    {
+        updateMinRecoveryPoint = false;
+        return;
+    }
+
     LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 
     /* update local copy */
     minRecoveryPoint = ControlFile->minRecoveryPoint;
     minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
-    /*
-     * An invalid minRecoveryPoint means that we need to recover all the WAL,
-     * i.e., we're doing crash recovery. We never modify the control file's
-     * value in that case, so we can short-circuit future checks here too.
-     */
-    if (minRecoveryPoint == 0)
-        updateMinRecoveryPoint = false;
-    else if (force || minRecoveryPoint < lsn)
+    if (force || minRecoveryPoint < lsn)
     {
         /* use volatile pointer to prevent code rearrangement */
         volatile XLogCtlData *xlogctl = XLogCtl;
@@ -3069,7 +3081,16 @@ XLogNeedsFlush(XLogRecPtr record)
      */
     if (RecoveryInProgress())
     {
-        /* Quick exit if already known updated */
+        /*
+         * An invalid minRecoveryPoint means that we need to recover all the
+         * WAL, i.e., we're doing crash recovery. We never modify the control
+         * file's value in that case, so we can short-circuit future checks
+         * here too.
+         */
+        if (XLogRecPtrIsInvalid(minRecoveryPoint))
+            updateMinRecoveryPoint = false;
+
+        /* Quick exit if already known to be updated or cannot be updated */
         if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
             return false;
 
@@ -3083,20 +3104,8 @@ XLogNeedsFlush(XLogRecPtr record)
         minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
         LWLockRelease(ControlFileLock);
 
-        /*
-         * An invalid minRecoveryPoint means that we need to recover all the
-         * WAL, i.e., we're doing crash recovery. We never modify the control
-         * file's value in that case, so we can short-circuit future checks
-         * here too.
-         */
-        if (minRecoveryPoint == 0)
-            updateMinRecoveryPoint = false;
-
         /* check again */
-        if (record <= minRecoveryPoint || !updateMinRecoveryPoint)
-            return false;
-        else
-            return true;
+        return record > minRecoveryPoint;
     }
 
     /* Quick exit if already known flushed */
@@ -4259,6 +4268,12 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
                 minRecoveryPoint = ControlFile->minRecoveryPoint;
                 minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
 
+                /*
+                 * The startup process can update its local copy of
+                 * minRecoveryPoint from this point.
+                 */
+                updateMinRecoveryPoint = true;
+
                 UpdateControlFile();
                 LWLockRelease(ControlFileLock);
 
@@ -6639,9 +6654,26 @@ StartupXLOG(void)
         /* No need to hold ControlFileLock yet, we aren't up far enough */
         UpdateControlFile();
 
-        /* initialize our local copy of minRecoveryPoint */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        /*
+         * Initialize our local copy of minRecoveryPoint. When doing crash
+         * recovery we want to replay up to the end of WAL. Particularly, in
+         * the case of a promoted standby minRecoveryPoint value in the
+         * control file is only updated after the first checkpoint. However,
+         * if the instance crashes before the first post-recovery checkpoint
+         * is completed then recovery will use a stale location causing the
+         * startup process to think that there are still invalid page
+         * references when checking for data consistency.
+         */
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        else
+        {
+            minRecoveryPoint = InvalidXLogRecPtr;
+            minRecoveryPointTLI = 0;
+        }
 
         /*
          * Reset pgstat data, because it may be invalid after recovery.
@@ -7463,6 +7495,8 @@ CheckRecoveryConsistency(void)
     if (XLogRecPtrIsInvalid(minRecoveryPoint))
         return;
 
+    Assert(InArchiveRecovery);
+
     /*
      * assume that we are called in the startup process, and hence don't need
      * a lock to read lastReplayedEndRecPtr
@@ -9640,11 +9674,16 @@ xlog_redo(XLogRecPtr lsn, XLogRecord *record)
          * This is particularly important if wal_level was set to 'archive'
          * before, and is now 'hot_standby', to ensure you don't run queries
          * against the WAL preceding the wal_level change. Same applies to
-         * decreasing max_* settings.
+         * decreasing max_* settings. The local copies cannot be updated as
+         * long as crash recovery is happening and we expect all the WAL to
+         * be replayed.
          */
-        minRecoveryPoint = ControlFile->minRecoveryPoint;
-        minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
-        if (minRecoveryPoint != 0 && minRecoveryPoint < lsn)
+        if (InArchiveRecovery)
+        {
+            minRecoveryPoint = ControlFile->minRecoveryPoint;
+            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
+        }
+        if (minRecoveryPoint != InvalidXLogRecPtr && minRecoveryPoint < lsn)
         {
             ControlFile->minRecoveryPoint = lsn;
             ControlFile->minRecoveryPointTLI = ThisTimeLineID;
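
For readers who want to trace the reworked recovery branch of XLogNeedsFlush() outside the server, the following standalone sketch models it. It is an illustration under assumptions, not the actual function: `FlushModel` and `model_needs_flush_in_recovery` are hypothetical names, the control file is a single field, and the lock handling around the refresh of the local copy is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's types; not the real definitions. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical model of the state consulted while recovery is in progress. */
typedef struct FlushModel
{
    XLogRecPtr control_min_recovery_point; /* value kept in the control file */
    XLogRecPtr local_min_recovery_point;   /* local copy of that value */
    bool       update_min_recovery_point;
} FlushModel;

/*
 * Models the recovery branch of XLogNeedsFlush() after the patch.
 * Returns true if the page LSN `record` is beyond the known minimum
 * recovery point, i.e. the minimum recovery point would have to advance.
 */
static bool
model_needs_flush_in_recovery(FlushModel *m, XLogRecPtr record)
{
    assert(m != NULL);

    /* Invalid local copy means crash recovery: updates are disabled. */
    if (m->local_min_recovery_point == InvalidXLogRecPtr)
        m->update_min_recovery_point = false;

    /* Quick exit if already known to be updated or cannot be updated. */
    if (record <= m->local_min_recovery_point ||
        !m->update_min_recovery_point)
        return false;

    /* Refresh the local copy from the control file (locking omitted). */
    m->local_min_recovery_point = m->control_min_recovery_point;

    /* Check again against the refreshed value. */
    return record > m->local_min_recovery_point;
}
```

With an invalid local copy the function always answers false and disables further updates, which matches the intent that crash recovery replays all available WAL without consulting the control file value. With a valid local copy of 300 and a control file value of 500, a record at 400 triggers a refresh and still returns false, while a record at 600 returns true.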
