Commit 0a7d771

Make nbtree split REDO locking match original execution.
Make the nbtree page split REDO routine consistent with original execution in its approach to acquiring and releasing buffer locks (at least for pages on the tree level of the page being split). This brings btree_xlog_split() in line with btree_xlog_unlink_page(), which was taught to couple buffer locks by commit 9a9db08.

Note that the precise order in which we both acquire and release sibling buffer locks in btree_xlog_split() now matches original execution exactly (the precise order in which the locks are released probably doesn't matter much, but we might as well be consistent about it).

The rule for nbtree REDO routines from here on is that same-level locks should be acquired in an order that's consistent with original execution. It's not practical to have a similar rule for cross-level page locks, since for the most part original execution holds those locks for a period that spans multiple atomic actions/WAL records. It's also not necessary, because the cross-level lock coupling is only truly needed during original execution on account of the presence of concurrent inserters.

This is not a bug fix (unlike the similar aforementioned commit, commit 9a9db08). The immediate reason to tighten things up in this area is to enable an upcoming enhancement to contrib/amcheck that allows it to verify that sibling links are in agreement with only an AccessShareLock (this check produced false positives when run on a replica server on account of the inconsistency fixed by this commit). But that's not the only reason to be stricter here.

It is generally useful to make locking on replicas as close to what happens during original execution as practically possible. It makes it less likely that hard-to-catch bugs will slip in in the future. The previous state of affairs seems to be a holdover from before the introduction of Hot Standby, when buffer lock acquisitions during recovery were totally unnecessary.

See also: commit 3bbf668, which tightened things up in this area a few years after the introduction of Hot Standby.

Discussion: https://postgr.es/m/CAH2-Wz=465cJj11YXD9RKH8z=nhQa2dofOZ_23h67EXUGOJ00Q@mail.gmail.com
1 parent cea3d55 commit 0a7d771

File tree

2 files changed: +35 -46 lines

src/backend/access/nbtree/README

+6-17
@@ -572,23 +572,12 @@ replay of page deletion records does not hold a write lock on the target
 leaf page throughout; only the primary needs to block out concurrent
 writers that insert on to the page being deleted.)
 
-There are also locking differences between the primary and WAL replay
-for the first stage of a page split (i.e. same-level differences in
-locking).  Replay of the first phase of a page split can get away with
-locking and updating the original right sibling page (which is also the
-new right sibling page's right sibling) after locks on the original page
-and its new right sibling have been released.  Again, this is okay
-because there are no writers.  Page deletion WAL replay cannot get away
-with being lax about same-level locking during replay, though -- doing
-so risks confusing concurrent backwards scans.
-
-Page deletion's second phase locks the left sibling page, target page,
-and right page in order on the standby, just like on the primary.  This
-allows backwards scans running on a standby to reason about page
-deletion on the leaf level; a page cannot appear deleted without that
-being reflected in the sibling pages.  It's probably possible to be more
-lax about how locks are acquired on the standby during the second phase
-of page deletion, but that hardly seems worth it.
+WAL replay holds same-level locks in a way that matches the approach
+taken during original execution, though.  This prevents readers from
+observing same-level inconsistencies.  It's probably possible to be more
+lax about how same-level locks are acquired during recovery (most kinds
+of readers could still move right to recover if we didn't couple
+same-level locks), but we prefer to be conservative here.
 
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple.  We do this because the oldest xmin

src/backend/access/nbtree/nbtxlog.c

+29-29
@@ -172,10 +172,10 @@ btree_xlog_insert(bool isleaf, bool ismeta, bool posting,
 	 * Insertion to an internal page finishes an incomplete split at the child
 	 * level.  Clear the incomplete-split flag in the child.  Note: during
 	 * normal operation, the child and parent pages are locked at the same
-	 * time, so that clearing the flag and inserting the downlink appear
-	 * atomic to other backends.  We don't bother with that during replay,
-	 * because readers don't care about the incomplete-split flag and there
-	 * cannot be updates happening.
+	 * time (the locks are coupled), so that clearing the flag and inserting
+	 * the downlink appear atomic to other backends.  We don't bother with
+	 * that during replay, because readers don't care about the
+	 * incomplete-split flag and there cannot be updates happening.
 	 */
 	if (!isleaf)
 		_bt_clear_incomplete_split(record, 1);
@@ -272,9 +272,17 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
 		spagenumber = P_NONE;
 
 	/*
-	 * Clear the incomplete split flag on the left sibling of the child page
-	 * this is a downlink for.  (Like in btree_xlog_insert, this can be done
-	 * before locking the other pages)
+	 * Clear the incomplete split flag on the appropriate child page one level
+	 * down when origpage/buf is an internal page (there must have been
+	 * cascading page splits during original execution in the event of an
+	 * internal page split).  This is like the corresponding btree_xlog_insert
+	 * call for internal pages.  We're not clearing the incomplete split flag
+	 * for the current page split here (you can think of this as part of the
+	 * insert of newitem that the page split action needs to perform in
+	 * passing).
+	 *
+	 * Like in btree_xlog_insert, this can be done before locking other pages.
+	 * We never need to couple cross-level locks in REDO routines.
 	 */
 	if (!isleaf)
 		_bt_clear_incomplete_split(record, 3);
@@ -427,22 +435,7 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
 		MarkBufferDirty(buf);
 	}
 
-	/*
-	 * We no longer need the buffers.  They must be released together, so that
-	 * readers cannot observe two inconsistent halves.
-	 */
-	if (BufferIsValid(buf))
-		UnlockReleaseBuffer(buf);
-	UnlockReleaseBuffer(rbuf);
-
-	/*
-	 * Fix left-link of the page to the right of the new right sibling.
-	 *
-	 * Note: in normal operation, we do this while still holding lock on the
-	 * two split pages.  However, that's not necessary for correctness in WAL
-	 * replay, because no other index update can be in progress, and readers
-	 * will cope properly when following an obsolete left-link.
-	 */
+	/* Fix left-link of the page to the right of the new right sibling */
 	if (spagenumber != P_NONE)
 	{
 		Buffer		sbuf;
@@ -460,6 +453,14 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record)
 		if (BufferIsValid(sbuf))
 			UnlockReleaseBuffer(sbuf);
 	}
+
+	/*
+	 * Finally, release the remaining buffers.  sbuf, rbuf, and buf must be
+	 * released together, so that readers cannot observe inconsistencies.
+	 */
+	UnlockReleaseBuffer(rbuf);
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
 }
 
 static void
@@ -733,6 +734,11 @@ btree_xlog_mark_page_halfdead(uint8 info, XLogReaderState *record)
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
 	}
+
+	/*
+	 * Don't need to couple cross-level locks in REDO routines, so release
+	 * lock on internal page immediately
+	 */
 	if (BufferIsValid(buffer))
 		UnlockReleaseBuffer(buffer);
 
@@ -789,12 +795,6 @@ btree_xlog_unlink_page(uint8 info, XLogReaderState *record)
 	 * the pages in the same standard left-to-right order (leftsib, target,
 	 * rightsib), and don't release the sibling locks until the target is
 	 * marked deleted.
-	 *
-	 * btree_xlog_split() can get away with fixing its right sibling page's
-	 * left link last of all, after dropping all other locks.  We prefer to
-	 * avoid dropping locks on same-level pages early compared to normal
-	 * operation.  This keeps things simple for backwards scans.  See
-	 * nbtree/README.
 	 */
 
 	/* Fix right-link of left sibling, if any */