Commit 3e4b7d8

Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases

Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require complex interlocking that matched the requirements on the master. This required an O(N) operation that became a significant problem with large indexes, causing replication delays of seconds or in some cases minutes while the XLOG_BTREE_VACUUM was replayed.

This commit skips the pin scan that was previously required, by observing in detail when and how it is safe to do so, with full documentation. The pin scan is skipped only in replay; the VACUUM code path on master is not touched here and WAL is identical.

The current commit applies in all cases, effectively replacing commit 687f2cd.

1 parent 3cc38ca commit 3e4b7d8
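
To make the cost argument in the commit message concrete, here is a minimal sketch of why the replay-side pin scan scales with index size. It is a toy model, not PostgreSQL code: `ToyBlockNumber`, `TOY_INVALID_BLOCK`, and `toy_pin_scan_pages` are hypothetical names, and the exact loop bounds (every page strictly between `lastBlockVacuumed` and the current page) are an assumption based on the comments in the diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for a block number type and its "invalid" sentinel. */
typedef uint32_t ToyBlockNumber;
#define TOY_INVALID_BLOCK ((ToyBlockNumber) 0xFFFFFFFF)

/*
 * Hypothetical model of the pin scan: count the pages lying strictly
 * between lastBlockVacuumed and the page named in the WAL record.
 * A real replay would take and release a cleanup lock on each of them.
 */
static uint64_t
toy_pin_scan_pages(ToyBlockNumber lastBlockVacuumed,
                   ToyBlockNumber thisblkno,
                   bool pin_scan_enabled)
{
    uint64_t        touched = 0;
    ToyBlockNumber  blkno;

    assert(thisblkno != TOY_INVALID_BLOCK);

    /* Skipped entirely when disabled or when no start block was recorded. */
    if (!pin_scan_enabled || lastBlockVacuumed == TOY_INVALID_BLOCK)
        return 0;

    /* One unit of work per intervening page: this is the O(N) part. */
    for (blkno = lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
        touched++;

    return touched;
}
```

In this model, a record whose target page is far from the previously vacuumed page forces replay to walk every page in between, whereas disabling the scan makes the extra work constant regardless of index size.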

3 files changed, +23 -35 lines changed

src/backend/access/nbtree/README

+6 -9
@@ -525,8 +525,12 @@ MVCC scans is not required on standby nodes. That is because
 HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
 HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
 ever used during write transactions, which cannot exist on the standby.
-This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast(), so
-HeapTupleSatisfiesToast() is the only non-MVCC scan type used on standbys.
+This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast().
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visible (live) tuple the corresponding toast rows
+will also be visible, so we do not need to recheck MVCC on them.
 There is one minor exception, which is that the optimizer sometimes
 looks at the boundaries of value ranges using SnapshotDirty, which
 could result in returning a newer value for query statistics; this
@@ -536,13 +540,6 @@ in the index, so the scan retrieves a tid then immediately uses it
 to look in the heap. It is unlikely that the tid could have been
 deleted, vacuumed and re-inserted in the time taken to look in the heap
 via direct tid access. So we ignore that scan type as a problem.
-This means if we re-check the results of any scan of a toast index we
-will be able to completely avoid performing the "pin scan" operation
-during replay of VACUUM WAL records.
-
-XXX FIXME: Toast re-checks are not yet added, so we still perform the
-pin scan when replaying vacuum records of toast indexes.
-
 
 Other Things That Are Handy to Know
 -----------------------------------
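
The README change above argues that toast rows need no visibility check of their own, because they are only reached through a heap tuple that has already passed one. A minimal sketch of that access pattern follows. It is a toy model only: `ToyHeapRow`, `toy_fetch_toast`, and the array-backed "toast store" are hypothetical and do not correspond to PostgreSQL's actual detoasting functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy heap row: a visibility verdict plus an optional toast pointer. */
typedef struct ToyHeapRow
{
    bool visible;   /* outcome of the MVCC check on the main heap row */
    int  toast_id;  /* index into the toy toast store, or -1 for none */
} ToyHeapRow;

/*
 * Hypothetical lookup: the only visibility test is the one already made
 * on the main row. Once that passes, the toast value is returned with no
 * second check, mirroring the reasoning in the README text.
 */
static const char *
toy_fetch_toast(const ToyHeapRow *row,
                const char *const *toast_store,
                int store_len)
{
    assert(row != NULL && toast_store != NULL);

    if (!row->visible)
        return NULL;            /* never follow a pointer from a dead row */
    if (row->toast_id < 0 || row->toast_id >= store_len)
        return NULL;            /* row has no toasted value */

    return toast_store[row->toast_id];
}
```

The design point is that the toast store is never scanned on its own terms; every access starts from a row whose visibility has already been established.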

src/backend/access/nbtree/nbtree.c

+9 -23
@@ -22,7 +22,6 @@
 #include "access/relscan.h"
 #include "access/xlog.h"
 #include "catalog/index.h"
-#include "catalog/pg_namespace.h"
 #include "commands/vacuum.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
@@ -833,8 +832,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     /*
      * Check to see if we need to issue one final WAL record for this index,
      * which may be needed for correctness on a hot standby node when
-     * non-MVCC index scans could take place. This now only occurs when we
-     * perform a TOAST scan, so only occurs for TOAST indexes.
+     * non-MVCC index scans could take place.
      *
      * If the WAL is replayed in hot standby, the replay process needs to get
      * cleanup locks on all index leaf pages, just as we've been doing here.
@@ -846,7 +844,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
      * against the last leaf page in the index, if that one wasn't vacuumed.
      */
     if (XLogStandbyInfoActive() &&
-        rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE &&
         vstate.lastBlockVacuumed < vstate.lastBlockLocked)
     {
         Buffer buf;
@@ -1045,33 +1042,22 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
          */
         if (ndeletable > 0)
         {
-            BlockNumber lastBlockVacuumed = InvalidBlockNumber;
-
-            /*
-             * We may need to record the lastBlockVacuumed for use when
-             * non-MVCC scans might be performed on the index on a
-             * hot standby. See explanation in btree_xlog_vacuum().
-             *
-             * On a hot standby, a non-MVCC scan can only take place
-             * when we access a Toast Index, so we need only record
-             * the lastBlockVacuumed if we are vacuuming a Toast Index.
-             */
-            if (rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE)
-                lastBlockVacuumed = vstate->lastBlockVacuumed;
-
             /*
-             * Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
-             * instruction to the replay code to get cleanup lock on all pages
-             * between the previous lastBlockVacuumed and this page. This
-             * ensures that WAL replay locks all leaf pages at some point.
+             * Notice that the issued XLOG_BTREE_VACUUM WAL record includes all
+             * information to the replay code to allow it to get a cleanup lock
+             * on all pages between the previous lastBlockVacuumed and this page.
+             * This ensures that WAL replay locks all leaf pages at some point,
+             * which is important should non-MVCC scans be requested.
+             * This is currently unused on standby, but we record it anyway, so
+             * that the WAL contains the required information.
              *
              * Since we can visit leaf pages out-of-order when recursing,
              * replay might end up locking such pages an extra time, but it
              * doesn't seem worth the amount of bookkeeping it'd take to avoid
              * that.
              */
             _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
-                                lastBlockVacuumed);
+                                vstate->lastBlockVacuumed);
 
             /*
              * Remember highest leaf page number we've issued a

src/backend/access/nbtree/nbtxlog.c

+8 -3
@@ -385,17 +385,21 @@ static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
     XLogRecPtr lsn = record->EndRecPtr;
-    xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
     Buffer buffer;
     Page page;
     BTPageOpaque opaque;
+#ifdef UNUSED
+    xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
 
     /*
+     * This section of code is thought to be no longer needed, after
+     * analysis of the calling paths. It is retained to allow the code
+     * to be reinstated if a flaw is revealed in that thinking.
+     *
      * If we are running non-MVCC scans using this index we need to do some
      * additional work to ensure correctness, which is known as a "pin scan"
      * described in more detail in next paragraphs. We used to do the extra
-     * work in all cases, whereas we now avoid that work except when the index
-     * is a toast index, since toast scans aren't fully MVCC compliant.
+     * work in all cases, whereas we now avoid that work in most cases.
      * If lastBlockVacuumed is set to InvalidBlockNumber then we skip the
      * additional work required for the pin scan.
      *
@@ -458,6 +462,7 @@ btree_xlog_vacuum(XLogReaderState *record)
             }
         }
     }
+#endif
 
     /*
      * Like in btvacuumpage(), we need to take a cleanup lock on every leaf