Commit c0db1ec

Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the master.
This required an O(N) operation that became a significant problem with
large indexes, causing replication delays of seconds or, in some cases,
minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the "pin scan" that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on the master is not touched here.

No tests included. Manual tests using an additional patch to view WAL
records and their timing have shown the change in WAL records and their
handling has successfully reduced replication delay.

This is a back-patch of commits 687f2cd, 3e4b7d8, b602842 by Simon Riggs,
to branches 9.4 and 9.5. No further backpatch is possible because this
depends on catalog scans being MVCC. I (Álvaro) additionally updated a
slight problem in the README, which explains why this touches the 9.6 and
master branches.
1 parent 8951f92 commit c0db1ec
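The heart of the change is the guard retained (but compiled out) in replay: the pin scan would run only when hot standby is active in replay AND the WAL record carries a valid lastBlockVacuumed. Below is a minimal, self-contained C model of that decision; it is not the PostgreSQL source, though InvalidBlockNumber (0xFFFFFFFF) and BlockNumberIsValid() mirror their definitions in src/include/storage/block.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Standalone model, not the PostgreSQL source. InvalidBlockNumber and
 * BlockNumberIsValid() mirror src/include/storage/block.h.
 */
typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define BlockNumberIsValid(blk) ((BlockNumber) (blk) != InvalidBlockNumber)

/*
 * The (now-disabled) replay path would perform the O(N) pin scan only when
 * hot standby is active AND the record's lastBlockVacuumed is a valid block
 * number; InvalidBlockNumber in the record means "no pin scan needed".
 */
static bool
pin_scan_needed(bool hot_standby_active, BlockNumber lastBlockVacuumed)
{
    return hot_standby_active && BlockNumberIsValid(lastBlockVacuumed);
}
```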

File tree

4 files changed: +59 −9 lines changed


src/backend/access/nbtree/README

Lines changed: 22 additions & 0 deletions

@@ -520,6 +520,28 @@ normal running after recovery has completed. This is a key capability
 because it allows running applications to continue while the standby
 changes state into a normally running server.
 
+The interlocking required to avoid returning incorrect results from
+non-MVCC scans is not required on standby nodes. That is because
+HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
+ever used during write transactions, which cannot exist on the standby.
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visible (live) tuple the corresponding toast rows
+will also be visible, so we do not need to recheck MVCC on them.
+There is one minor exception, which is that the optimizer sometimes
+looks at the boundaries of value ranges using SnapshotDirty, which
+could result in returning a newer value for query statistics; this
+would affect the query plan in rare cases, but not the correctness.
+The risk window is small since the stats look at the min and max values
+in the index, so the scan retrieves a tid then immediately uses it
+to look in the heap. It is unlikely that the tid could have been
+deleted, vacuumed and re-inserted in the time taken to look in the heap
+via direct tid access. So we ignore that scan type as a problem.
+
 Other Things That Are Handy to Know
 -----------------------------------

src/backend/access/nbtree/nbtree.c

Lines changed: 11 additions & 4 deletions

@@ -804,6 +804,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	}
 
 	/*
+	 * Check to see if we need to issue one final WAL record for this index,
+	 * which may be needed for correctness on a hot standby node when non-MVCC
+	 * index scans could take place.
+	 *
 	 * If the WAL is replayed in hot standby, the replay process needs to get
 	 * cleanup locks on all index leaf pages, just as we've been doing here.
 	 * However, we won't issue any WAL records about pages that have no items
@@ -1013,10 +1017,13 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
 	if (ndeletable > 0)
 	{
 		/*
-		 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
-		 * instruction to the replay code to get cleanup lock on all pages
-		 * between the previous lastBlockVacuumed and this page. This
-		 * ensures that WAL replay locks all leaf pages at some point.
+		 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
+		 * all information to the replay code to allow it to get a cleanup
+		 * lock on all pages between the previous lastBlockVacuumed and
+		 * this page. This ensures that WAL replay locks all leaf pages at
+		 * some point, which is important should non-MVCC scans be
+		 * requested. This is currently unused on standby, but we record
+		 * it anyway, so that the WAL contains the required information.
 		 *
 		 * Since we can visit leaf pages out-of-order when recursing,
 		 * replay might end up locking such pages an extra time, but it
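The comment added to btvacuumscan() above concerns the master-side decision to emit one final XLOG_BTREE_VACUUM record. A toy C model of that decision follows; the names standby_info_active, lastBlockVacuumed and lastBlockLocked are borrowed from the surrounding PostgreSQL code, but the exact guard in btvacuumscan() carries additional conditions, so treat this as a sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Sketch (not the PostgreSQL source): a final WAL record is worth issuing
 * when standby information is being logged and some leaf pages past the
 * last block actually vacuumed were cleanup-locked without producing any
 * WAL of their own, so replay would otherwise never see them.
 */
static bool
final_vacuum_record_needed(bool standby_info_active,
                           BlockNumber lastBlockVacuumed,
                           BlockNumber lastBlockLocked)
{
    return standby_info_active && lastBlockVacuumed < lastBlockLocked;
}
```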

src/backend/access/nbtree/nbtxlog.c

Lines changed: 22 additions & 3 deletions

@@ -385,12 +385,29 @@ static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
-	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
+#ifdef UNUSED
+	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
 
 	/*
+	 * This section of code is thought to be no longer needed, after analysis
+	 * of the calling paths. It is retained to allow the code to be reinstated
+	 * if a flaw is revealed in that thinking.
+	 *
+	 * If we are running non-MVCC scans using this index we need to do some
+	 * additional work to ensure correctness, which is known as a "pin scan"
+	 * described in more detail in next paragraphs. We used to do the extra
+	 * work in all cases, whereas we now avoid that work in most cases. If
+	 * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
+	 * additional work required for the pin scan.
+	 *
+	 * Avoiding this extra work is important since it requires us to touch
+	 * every page in the index, so is an O(N) operation. Worse, it is an
+	 * operation performed in the foreground during redo, so it delays
+	 * replication directly.
+	 *
 	 * If queries might be active then we need to ensure every leaf page is
 	 * unpinned between the lastBlockVacuumed and the current block, if there
 	 * are any. This prevents replay of the VACUUM from reaching the stage of
@@ -412,7 +429,7 @@ btree_xlog_vacuum(XLogReaderState *record)
 	 * isn't yet consistent; so we need not fear reading still-corrupt blocks
 	 * here during crash recovery.
 	 */
-	if (HotStandbyActiveInReplay())
+	if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
 	{
 		RelFileNode thisrnode;
 		BlockNumber thisblkno;
@@ -433,7 +450,8 @@ btree_xlog_vacuum(XLogReaderState *record)
 		 * XXX we don't actually need to read the block, we just need to
 		 * confirm it is unpinned. If we had a special call into the
 		 * buffer manager we could optimise this so that if the block is
-		 * not in shared_buffers we confirm it as unpinned.
+		 * not in shared_buffers we confirm it as unpinned. Optimizing
+		 * this is now moot, since in most cases we avoid the scan.
 		 */
 		buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
 										RBM_NORMAL_NO_LOG);
@@ -444,6 +462,7 @@ btree_xlog_vacuum(XLogReaderState *record)
 			}
 		}
 	}
+#endif
 
 	/*
 	 * Like in btvacuumpage(), we need to take a cleanup lock on every leaf
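The "O(N) operation performed in the foreground during redo" that the disabled block describes is simply a walk over every block between lastBlockVacuumed and the block named in the record. A toy C model of that cost (not the PostgreSQL source; the loop bounds mirror the pin-scan loop in btree_xlog_vacuum(), and a counter stands in for the buffer reads):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/*
 * Toy model of the cost the skipped pin scan used to incur: replay had to
 * visit every block strictly between lastBlockVacuumed and the block the
 * record targets. The counter increment stands in for reading the buffer
 * and confirming it is unpinned. Callers are assumed to have checked that
 * lastBlockVacuumed is a valid block number before entering the loop.
 */
static unsigned long
pin_scan_blocks_visited(BlockNumber lastBlockVacuumed, BlockNumber thisblkno)
{
    unsigned long visited = 0;

    for (BlockNumber blkno = lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
        visited++;
    return visited;
}
```

With a large index, this loop visits on the order of the index's total block count per vacuum record, which is what produced the multi-second replication delays the commit message describes.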

src/include/access/nbtree.h

Lines changed: 4 additions & 2 deletions

@@ -331,8 +331,10 @@ typedef struct xl_btree_reuse_page
  * The WAL record can represent deletion of any number of index tuples on a
  * single index page when executed by VACUUM.
  *
- * The correctness requirement for applying these changes during recovery is
- * that we must do one of these two things for every block in the index:
+ * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
+ * For non-MVCC index scans there is an additional correctness requirement
+ * for applying these changes during recovery, which is that we must do one
+ * of these two things for every block in the index:
  *	* lock the block for cleanup and apply any required changes
  *	* EnsureBlockUnpinned()
  * The purpose of this is to ensure that no index scans started before we
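For reference, the record body this header comment documents is small: on these branches xl_btree_vacuum carries only the lastBlockVacuumed field, with the offset numbers of the deleted tuples following the struct in the WAL payload. The sketch below reproduces that shape; it is an illustration, not a substitute for the definition in src/include/access/nbtree.h.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/*
 * Sketch of the xl_btree_vacuum record body on these branches. The target
 * offset numbers of the deleted index tuples follow the struct in the WAL
 * record payload rather than appearing as a member here.
 */
typedef struct xl_btree_vacuum
{
    BlockNumber lastBlockVacuumed;
    /* TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_vacuum;
```

Setting lastBlockVacuumed to InvalidBlockNumber is how a record marks that replay need not perform any pin scan.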
