Commit 9a9db08

Fix replica backward scan race condition.
It was possible for the logic used by backward scans (which must reason about concurrent page splits/deletions in its own peculiar way) to become confused when running on a replica. Concurrent replay of a WAL record that describes the second phase of page deletion could cause _bt_walk_left() to get confused. btree_xlog_unlink_page() simply failed to adhere to the same locking protocol that we use on the primary, which is obviously wrong once you consider these two disparate functions together. This bug is present in all stable branches.

More concretely, the problem was that nothing stopped _bt_walk_left() from observing inconsistencies between the deletion's target page and its original sibling pages when running on a replica. This is true even though the second phase of page deletion is supposed to work as a single atomic action. Queries running on replicas raised "could not find left sibling of block %u in index %s" can't-happen errors when they went back to their scan's "original" page and observed that the page had not been marked deleted (even though it really was concurrently deleted). There is no evidence that this actually happened in the real world. The issue came to light during unrelated feature development work.

Note that _bt_walk_left() is the only code that cares about the difference between a half-dead page and a fully deleted page that isn't also exclusively used by nbtree VACUUM (unless you include contrib/amcheck code). It seems very likely that backward scans are the only thing that could become confused by the inconsistency. Even amcheck's complex bt_right_page_check_scankey() dance was unaffected.

To fix, teach btree_xlog_unlink_page() to lock the left sibling, target, and right sibling pages in that order before releasing any locks (just like _bt_unlink_halfdead_page()). This is the simplest possible approach. There doesn't seem to be any opportunity to be more clever about lock acquisition in the REDO routine, and it hardly seems worth the trouble in any case.

This fix might enable contrib/amcheck verification of leaf page sibling links with only an AccessShareLock on the relation. An amcheck patch from Andrey Borodin was rejected back in January because it clashed with btree_xlog_unlink_page()'s lax approach to locking pages. It now seems likely that the real problem was with btree_xlog_unlink_page(), not the patch.

This is a low severity, low likelihood bug, so no backpatch.

Author: Michail Nikolaev
Diagnosed-By: Michail Nikolaev
Discussion: https://postgr.es/m/CANtu0ohkR-evAWbpzJu54V8eCOtqjJyYp3PQ_SGoBTRGXWhWRw@mail.gmail.com
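The essence of the fix is a lock-ordering rule shared with the primary. The following is a minimal toy model of that rule, not PostgreSQL code: SimPage and its pthread-mutex "locks" are illustrative stand-ins for nbtree buffers and buffer content locks, the no-left-sibling (P_NONE) case is omitted, and all updates are shown after all lock acquisitions (the real routine updates each page as it locks it, but holds every lock until all three pages are done, which reads the same to concurrent scans).

#include <stdbool.h>
#include <pthread.h>

/* Toy stand-in for a B-tree page and its buffer content lock. */
typedef struct SimPage
{
    pthread_mutex_t lock;
    struct SimPage *prev;       /* btpo_prev analogue */
    struct SimPage *next;       /* btpo_next analogue */
    bool            deleted;    /* BTP_DELETED analogue */
} SimPage;

/*
 * Unlink "target" from its sibling chain the way the patched
 * btree_xlog_unlink_page() (and _bt_unlink_halfdead_page() on the
 * primary) does: lock left sibling, target, and right sibling in that
 * order, update all three pages, and only then release anything.
 */
static void
unlink_page(SimPage *left, SimPage *target, SimPage *right)
{
    pthread_mutex_lock(&left->lock);    /* 1: left sibling */
    pthread_mutex_lock(&target->lock);  /* 2: target */
    pthread_mutex_lock(&right->lock);   /* 3: right sibling */

    left->next = right;         /* left sibling's right-link skips target */
    target->deleted = true;     /* target now reads as deleted */
    right->prev = left;         /* right sibling's left-link skips target */

    pthread_mutex_unlock(&right->lock);
    pthread_mutex_unlock(&target->lock);
    pthread_mutex_unlock(&left->lock);
}

Because no lock is released until every page reflects the deletion, a concurrent backward scan that locks any of the three pages observes either the entire before state or the entire after state. The buggy replay path instead released each page's lock immediately after touching it, opening the window that _bt_walk_left() fell into.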
1 parent a451b7d commit 9a9db08

2 files changed (+72, -34 lines)


src/backend/access/nbtree/README (+18)

@@ -572,6 +572,24 @@ replay of page deletion records does not hold a write lock on the target
 leaf page throughout; only the primary needs to block out concurrent
 writers that insert on to the page being deleted.)
 
+There are also locking differences between the primary and WAL replay
+for the first stage of a page split (i.e. same-level differences in
+locking). Replay of the first phase of a page split can get away with
+locking and updating the original right sibling page (which is also the
+new right sibling page's right sibling) after locks on the original page
+and its new right sibling have been released. Again, this is okay
+because there are no writers. Page deletion WAL replay cannot get away
+with being lax about same-level locking during replay, though -- doing
+so risks confusing concurrent backwards scans.
+
+Page deletion's second phase locks the left sibling page, target page,
+and right page in order on the standby, just like on the primary. This
+allows backwards scans running on a standby to reason about page
+deletion on the leaf level; a page cannot appear deleted without that
+being reflected in the sibling pages. It's probably possible to be more
+lax about how locks are acquired on the standby during the second phase
+of page deletion, but that hardly seems worth it.
+
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the primary
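The invariant in the paragraph above can be phrased as a predicate over the toy SimPage model from the earlier sketch (again illustrative, not PostgreSQL code): a page may appear bypassed by its siblings' links only if it also reads as deleted. Pre-fix WAL replay violated this during brief windows, which is exactly the inconsistency that confused _bt_walk_left().

/*
 * Hypothetical checker: deletion looks atomic to readers iff a page
 * that either sibling link bypasses is already marked deleted.
 */
static bool
deletion_looks_atomic(const SimPage *left, const SimPage *target,
                      const SimPage *right)
{
    bool    bypassed = (left->next != target) || (right->prev != target);

    return !bypassed || target->deleted;
}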

src/backend/access/nbtree/nbtxlog.c (+54, -34)

@@ -774,7 +774,9 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	xl_btree_unlink_page *xlrec = (xl_btree_unlink_page *) XLogRecGetData(record);
 	BlockNumber leftsib;
 	BlockNumber rightsib;
-	Buffer		buffer;
+	Buffer		leftbuf;
+	Buffer		target;
+	Buffer		rightbuf;
 	Page		page;
 	BTPageOpaque pageop;
 
@@ -783,46 +785,39 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 
 	/*
 	 * In normal operation, we would lock all the pages this WAL record
-	 * touches before changing any of them. In WAL replay, it should be okay
-	 * to lock just one page at a time, since no concurrent index updates can
-	 * be happening, and readers should not care whether they arrive at the
-	 * target page or not (since it's surely empty).
+	 * touches before changing any of them. In WAL replay, we at least lock
+	 * the pages in the same standard left-to-right order (leftsib, target,
+	 * rightsib), and don't release the sibling locks until the target is
+	 * marked deleted.
+	 *
+	 * btree_xlog_split() can get away with fixing its right sibling page's
+	 * left link last of all, after dropping all other locks. We prefer to
+	 * avoid dropping locks on same-level pages early compared to normal
+	 * operation. This keeps things simple for backwards scans. See
+	 * nbtree/README.
 	 */
 
-	/* Fix left-link of right sibling */
-	if (XLogReadBufferForRedo(record, 2, &buffer) == BLK_NEEDS_REDO)
-	{
-		page = (Page) BufferGetPage(buffer);
-		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
-		pageop->btpo_prev = leftsib;
-
-		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-	}
-	if (BufferIsValid(buffer))
-		UnlockReleaseBuffer(buffer);
-
 	/* Fix right-link of left sibling, if any */
 	if (leftsib != P_NONE)
 	{
-		if (XLogReadBufferForRedo(record, 1, &buffer) == BLK_NEEDS_REDO)
+		if (XLogReadBufferForRedo(record, 1, &leftbuf) == BLK_NEEDS_REDO)
 		{
-			page = (Page) BufferGetPage(buffer);
+			page = (Page) BufferGetPage(leftbuf);
 			pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 			pageop->btpo_next = rightsib;
 
 			PageSetLSN(page, lsn);
-			MarkBufferDirty(buffer);
+			MarkBufferDirty(leftbuf);
 		}
-		if (BufferIsValid(buffer))
-			UnlockReleaseBuffer(buffer);
 	}
+	else
+		leftbuf = InvalidBuffer;
 
 	/* Rewrite target page as empty deleted page */
-	buffer = XLogInitBufferForRedo(record, 0);
-	page = (Page) BufferGetPage(buffer);
+	target = XLogInitBufferForRedo(record, 0);
+	page = (Page) BufferGetPage(target);
 
-	_bt_pageinit(page, BufferGetPageSize(buffer));
+	_bt_pageinit(page, BufferGetPageSize(target));
 	pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 	pageop->btpo_prev = leftsib;
@@ -832,8 +827,27 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	pageop->btpo_cycleid = 0;
 
 	PageSetLSN(page, lsn);
-	MarkBufferDirty(buffer);
-	UnlockReleaseBuffer(buffer);
+	MarkBufferDirty(target);
+
+	/* Fix left-link of right sibling */
+	if (XLogReadBufferForRedo(record, 2, &rightbuf) == BLK_NEEDS_REDO)
+	{
+		page = (Page) BufferGetPage(rightbuf);
+		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
+		pageop->btpo_prev = leftsib;
+
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(rightbuf);
+	}
+
+	/* Release siblings */
+	if (BufferIsValid(leftbuf))
+		UnlockReleaseBuffer(leftbuf);
+	if (BufferIsValid(rightbuf))
+		UnlockReleaseBuffer(rightbuf);
+
+	/* Release target */
+	UnlockReleaseBuffer(target);
 
 	/*
 	 * If we deleted a parent of the targeted leaf page, instead of the leaf
@@ -845,13 +859,19 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 		/*
 		 * There is no real data on the page, so we just re-create it from
 		 * scratch using the information from the WAL record.
+		 *
+		 * Note that we don't end up here when the target page is also the
+		 * leafbuf page. There is no need to add a dummy hikey item with a
+		 * top parent link when deleting leafbuf because it's the last page
+		 * we'll delete in the subtree undergoing deletion.
 		 */
-		IndexTupleData trunctuple;
+		Buffer		leafbuf;
+		IndexTupleData trunctuple;
 
-		buffer = XLogInitBufferForRedo(record, 3);
-		page = (Page) BufferGetPage(buffer);
+		leafbuf = XLogInitBufferForRedo(record, 3);
+		page = (Page) BufferGetPage(leafbuf);
 
-		_bt_pageinit(page, BufferGetPageSize(buffer));
+		_bt_pageinit(page, BufferGetPageSize(leafbuf));
 		pageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
 		pageop->btpo_flags = BTP_HALF_DEAD | BTP_LEAF;
@@ -870,8 +890,8 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 			elog(ERROR, "could not add dummy high key to half-dead page");
 
 		PageSetLSN(page, lsn);
-		MarkBufferDirty(buffer);
-		UnlockReleaseBuffer(buffer);
+		MarkBufferDirty(leafbuf);
+		UnlockReleaseBuffer(leafbuf);
 	}
 
 	/* Update metapage if needed */
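To tie the two sketches above together, a hypothetical driver: build a three-page chain, unlink the middle page with the fixed ordering, and confirm the property that backward scans rely on.

#include <assert.h>

int
main(void)
{
    SimPage l = {PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false};
    SimPage t = {PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false};
    SimPage r = {PTHREAD_MUTEX_INITIALIZER, NULL, NULL, false};

    /* chain: l <-> t <-> r */
    l.next = &t;
    t.prev = &l;
    t.next = &r;
    r.prev = &t;

    unlink_page(&l, &t, &r);

    assert(deletion_looks_atomic(&l, &t, &r));
    assert(l.next == &r && r.prev == &l && t.deleted);
    return 0;
}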
