Commit 9a9db08

Fix replica backward scan race condition.
It was possible for the logic used by backward scans (which must reason about concurrent page splits/deletions in its own peculiar way) to become confused when running on a replica. Concurrent replay of a WAL record that describes the second phase of page deletion could cause _bt_walk_left() to get confused. btree_xlog_unlink_page() simply failed to adhere to the same locking protocol that we use on the primary, which is obviously wrong once you consider these two disparate functions together. This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left() from observing inconsistencies between the deletion's target page and its original sibling pages when running on a replica. This is true even though the second phase of page deletion is supposed to work as a single atomic action. Queries running on replicas raised "could not find left sibling of block %u in index %s" can't-happen errors when they went back to their scan's "original" page and observed that the page had not been marked deleted (even though it really was concurrently deleted). There is no evidence that this actually happened in the real world. The issue came to light during unrelated feature development work.

Note that _bt_walk_left() is the only code that cares about the difference between a half-dead page and a fully deleted page that isn't also exclusively used by nbtree VACUUM (unless you include contrib/amcheck code). It seems very likely that backward scans are the only thing that could become confused by the inconsistency. Even amcheck's complex bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target, and right sibling pages in that order before releasing any locks (just like _bt_unlink_halfdead_page()). This is the simplest possible approach. There doesn't seem to be any opportunity to be more clever about lock acquisition in the REDO routine, and it hardly seems worth the trouble in any case.

This fix might enable contrib/amcheck verification of leaf page sibling links with only an AccessShareLock on the relation. An amcheck patch from Andrey Borodin was rejected back in January because it clashed with btree_xlog_unlink_page()'s lax approach to locking pages. It now seems likely that the real problem was with btree_xlog_unlink_page(), not the patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
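The essence of the fix is a lock-ordering rule shared with the primary. The following is a minimal toy model of that rule, not PostgreSQL code: SimPage and its pthread-mutex "locks" are illustrative stand-ins for nbtree buffers and buffer content locks, the no-left-sibling (P_NONE) case is omitted, and all updates are shown after all lock acquisitions (the real routine updates each page as it locks it, but holds every lock until all three pages are done, which reads the same to concurrent scans).

#include <stdbool.h>
#include <pthread.h>

/* Toy stand-in for a B-tree page and its buffer content lock. */
typedef struct SimPage
{
    pthread_mutex_t lock;
    struct SimPage *prev;       /* btpo_prev analogue */
    struct SimPage *next;       /* btpo_next analogue */
    bool            deleted;    /* BTP_DELETED analogue */
} SimPage;

/*
 * Unlink "target" from its sibling chain the way the patched
 * btree_xlog_unlink_page() (and _bt_unlink_halfdead_page() on the
 * primary) does: lock left sibling, target, and right sibling in that
 * order, update all three pages, and only then release anything.
 */
static void
unlink_page(SimPage *left, SimPage *target, SimPage *right)
{
    pthread_mutex_lock(&left->lock);    /* 1: left sibling */
    pthread_mutex_lock(&target->lock);  /* 2: target */
    pthread_mutex_lock(&right->lock);   /* 3: right sibling */

    left->next = right;         /* left sibling's right-link skips target */
    target->deleted = true;     /* target now reads as deleted */
    right->prev = left;         /* right sibling's left-link skips target */

    pthread_mutex_unlock(&right->lock);
    pthread_mutex_unlock(&target->lock);
    pthread_mutex_unlock(&left->lock);
}

Because no lock is released until every page reflects the deletion, a concurrent backward scan that locks any of the three pages observes either the entire before state or the entire after state. The buggy replay path instead released each page's lock immediately after touching it, opening the window that _bt_walk_left() fell into.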
1 parent a451b7d commit 9a9db08

2 files changed (+72, -34 lines)


src/backend/access/nbtree/README (+18)

@@ -572,6 +572,24 @@ replay of page deletion records does not hold a write lock on the target
 leaf page throughout; only the primary needs to block out concurrent
 writers that insert on to the page being deleted.)
 
+There are also locking differences between the primary and WAL replay
+for the first stage of a page split (i.e. same-level differences in
+locking). Replay of the first phase of a page split can get away with
+locking and updating the original right sibling page (which is also the
+new right sibling page's right sibling) after locks on the original page
+and its new right sibling have been released. Again, this is okay
+because there are no writers. Page deletion WAL replay cannot get away
+with being lax about same-level locking during replay, though -- doing
+so risks confusing concurrent backwards scans.
+
+Page deletion's second phase locks the left sibling page, target page,
+and right page in order on the standby, just like on the primary. This
+allows backwards scans running on a standby to reason about page
+deletion on the leaf level; a page cannot appear deleted without that
+being reflected in the sibling pages. It's probably possible to be more
+lax about how locks are acquired on the standby during the second phase
+of page deletion, but that hardly seems worth it.
+
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
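The invariant in the paragraph above can be phrased as a predicate over the toy SimPage model from the earlier sketch (again illustrative, not PostgreSQL code): a page may appear bypassed by its siblings' links only if it also reads as deleted. Pre-fix WAL replay violated this during brief windows, which is exactly the inconsistency that confused _bt_walk_left().

/*
 * Hypothetical checker: deletion looks atomic to readers iff a page
 * that either sibling link bypasses is already marked deleted.
 */
static bool
deletion_looks_atomic(const SimPage *left, const SimPage *target,
                      const SimPage *right)
{
    bool    bypassed = (left->next != target) || (right->prev != target);

    return !bypassed || target->deleted;
}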

src/backend/access/nbtree/nbtxlog.c (+54, -34)

@@ -774,7 +774,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
 	BlockNumber leftsib;
 	BlockNumber rightsib;
-	Buffer		buffer;
+	Buffer		leftbuf;
+	Buffer		target;
+	Buffer		rightbuf;
 	Page		page;
 	BTPageOpaque pageop;
 
@@ -783,46 +785,39 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 
 	/*
 	 * In normal operation, we would lock all the pages this WAL record
-	 * touches before changing any of them. In WAL replay, it should be okay
-	 * to lock just one page at a time, since no concurrent index updates can
-	 * be happening, and readers should not care whether they arrive at the
-	 * target page or not (since it's surely empty).
+	 * touches before changing any of them. In WAL replay, we at least lock
+	 * the pages in the same standard left-to-right order (leftsib, target,
+	 * rightsib), and don't release the sibling locks until the target is
+	 * marked deleted.
+	 *
+	 * btree_xlog_split() can get away with fixing its right sibling page's
+	 * left link last of all, after dropping all other locks. We prefer to
+	 * avoid dropping locks on same-level pages early compared to normal
+	 * operation. This keeps things simple for backwards scans. See
+	 * nbtree/README.
 	 */
 
-	/* Fix left-link of right sibling */
-	if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-		pageop->btpo_prev = leftsib;
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
 	/* Fix right-link of left sibling, if any */
 	if (leftsib != P_NONE)
 	{
-		if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+		if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
 		{
-			page = (Page) BufferGetPage(buffer);
+			page = (Page) BufferGetPage(leftbuf);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 			pageop->btpo_next = rightsib;
 
 			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			MarkBufferDirty(leftbuf);
 		}
-		if (BufferIsValid(buffer))
-			UnlockReleaseBuffer(buffer);
 	}
+	else
+		leftbuf = InvalidBuffer;
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogInitBufferForRedo(record, 0);
-	page = (Page) BufferGetPage(buffer);
+	target = XLogInitBufferForRedo(record, 0);
+	page = (Page) BufferGetPage(target);
 
-	_bt_pageinit(page, BufferGetPageSize(buffer));
+	_bt_pageinit(page, BufferGetPageSize(target));
 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	pageop->btpo_prev = leftsib;
@@ -832,8 +827,27 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	pageop->btpo_cycleid = 0;
 
 	PageSetLSN(page, lsn);
-	MarkBufferDirty(buffer);
-	UnlockReleaseBuffer(buffer);
+	MarkBufferDirty(target);
+
+	/* Fix left-link of right sibling */
+	if (XLogReadBufferForRedo(record, 2, &rightbuf) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(rightbuf);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		pageop->btpo_prev = leftsib;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(rightbuf);
+	}
+
+	/* Release siblings */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	if (BufferIsValid(rightbuf))
+		UnlockReleaseBuffer(rightbuf);
+
+	/* Release target */
+	UnlockReleaseBuffer(target);
 
 	/*
 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
@@ -845,13 +859,19 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		/*
 		 * There is no real data on the page, so we just re-create it from
 		 * scratch using the information from the WAL record.
+		 *
+		 * Note that we don't end up here when the target page is also the
+		 * leafbuf page. There is no need to add a dummy hikey item with a
+		 * top parent link when deleting leafbuf because it's the last page
+		 * we'll delete in the subtree undergoing deletion.
 		 */
-		IndexTupleData trunctuple;
+		Buffer		leafbuf;
+		IndexTupleData trunctuple;
 
-		buffer = XLogInitBufferForRedo(record, 3);
-		page = (Page) BufferGetPage(buffer);
+		leafbuf = XLogInitBufferForRedo(record, 3);
+		page = (Page) BufferGetPage(leafbuf);
 
-		_bt_pageinit(page, BufferGetPageSize(buffer));
+		_bt_pageinit(page, BufferGetPageSize(leafbuf));
 		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -870,8 +890,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 			elog(ERROR, "could not add dummy high key to half-dead page");
 
 		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-		UnlockReleaseBuffer(buffer);
+		MarkBufferDirty(leafbuf);
+		UnlockReleaseBuffer(leafbuf);
 	}
 
 	/* Update metapage if needed */
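To tie the two sketches above together, a hypothetical driver: build a three-page chain, unlink the middle page with the fixed ordering, and confirm the property that backward scans rely on.

#include <assert.h>

int
main(void)
{
    SimPage l = {PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false};
    SimPage t = {PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false};
    SimPage r = {PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false};

    /* chain: l <-> t <-> r */
    l.next = &t;
    t.prev = &l;
    t.next = &r;
    r.prev = &t;

    unlink_page(&l, &t, &r);

    assert(deletion_looks_atomic(&l, &t, &r));
    assert(l.next == &r && r.prev == &l && t.deleted);
    return 0;
}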
