
Commit 9f83468

Remove unneeded "pin scan" nbtree VACUUM code.
The REDO routine for nbtree's xl_btree_vacuum record type hasn't performed a "pin scan" since commit 3e4b7d8 went in, so clearly there isn't any point in VACUUM WAL-logging information that won't actually be used. Finish off the work of commit 3e4b7d8 (and the closely related preceding commit 687f2cd) by removing the code that generates this unused information. Also remove the REDO routine code disabled by commit 3e4b7d8.

Replace the unneeded lastBlockVacuumed field in xl_btree_vacuum with a new "ndeleted" field. The new field isn't actually needed right now, since we could continue to infer the array length from the overall record length. However, an upcoming patch to add deduplication to nbtree needs to add an "items updated" field to xl_btree_vacuum, so we might as well start being explicit about the number of items now. (Besides, it doesn't seem like a good idea to leave the xl_btree_vacuum struct without any fields; the C standard says that that's undefined.)

nbtree VACUUM no longer forces writing a WAL record for the last block in the index. Writing out a WAL record with no items for the final block was supposed to force processing of a lastBlockVacuumed field by a pin scan.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Discussion: https://postgr.es/m/CAH2-WzmY_mT7UnTzFB5LBQDBkKpdV5UxP3B5bLb7uP%3D%3D6UQJRQ%40mail.gmail.com
1 parent b93e9a5 commit 9f83468
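The xl_btree_vacuum struct itself is declared in nbtxlog.h, one of the changed files not excerpted on this page. As a minimal sketch of the field swap the commit message describes (the exact field type and comments are assumptions, not taken from the hunks shown below):

/*
 * Sketch only: nbtxlog.h is not excerpted on this page, so the type of
 * ndeleted is an assumption.  Old layout, removed by this commit:
 *
 *     typedef struct xl_btree_vacuum
 *     {
 *         BlockNumber lastBlockVacuumed;
 *         TARGET OFFSET NUMBERS FOLLOW
 *     } xl_btree_vacuum;
 */
typedef struct xl_btree_vacuum
{
    uint32      ndeleted;       /* number of deleted item offsets */

    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_vacuum;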

File tree

8 files changed, +101 -245 lines


src/backend/access/nbtree/README

+43 -20

@@ -508,7 +508,9 @@ the parent is finished and the flag in the child cleared, but can be
 released immediately after that, before recursing up the tree if the parent
 also needs to be split. This ensures that incompletely split pages should
 not be seen under normal circumstances; only if insertion to the parent
-has failed for some reason.
+has failed for some reason. (It's also possible for a reader to observe
+a page with the incomplete split flag set during recovery; see later
+section on "Scans during Recovery" for details.)
 
 We flag the left page, even though it's the right page that's missing the
 downlink, because it's more convenient to know already when following the
@@ -528,7 +530,7 @@ next VACUUM will find the half-dead leaf page and continue the deletion.
 
 Before 9.4, we used to keep track of incomplete splits and page deletions
 during recovery and finish them immediately at end of recovery, instead of
-doing it lazily at the next insertion or vacuum. However, that made the
+doing it lazily at the next insertion or vacuum.  However, that made the
 recovery much more complicated, and only fixed the problem when crash
 recovery was performed. An incomplete split can also occur if an otherwise
 recoverable error, like out-of-memory or out-of-disk-space, happens while
@@ -537,23 +539,41 @@ inserting the downlink to the parent.
 Scans during Recovery
 ---------------------
 
-The btree index type can be safely used during recovery. During recovery
-we have at most one writer and potentially many readers. In that
-situation the locking requirements can be relaxed and we do not need
-double locking during block splits. Each WAL record makes changes to a
-single level of the btree using the correct locking sequence and so
-is safe for concurrent readers. Some readers may observe a block split
-in progress as they descend the tree, but they will simply move right
-onto the correct page.
+nbtree indexes support read queries in Hot Standby mode. Every atomic
+action/WAL record makes isolated changes that leave the tree in a
+consistent state for readers. Readers lock pages according to the same
+rules that readers follow on the primary. (Readers may have to move
+right to recover from a "concurrent" page split or page deletion, just
+like on the primary.)
+
+However, there are a couple of differences in how pages are locked by
+replay/the startup process as compared to the original write operation
+on the primary. The exceptions involve page splits and page deletions.
+The first phase and second phase of a page split are processed
+independently during replay, since they are independent atomic actions.
+We do not attempt to recreate the coupling of parent and child page
+write locks that took place on the primary. This is safe because readers
+never care about the incomplete split flag anyway. Holding on to an
+extra write lock on the primary is only necessary so that a second
+writer cannot observe the incomplete split flag before the first writer
+finishes the split. If we let concurrent writers on the primary observe
+an incomplete split flag on the same page, each writer would attempt to
+complete the unfinished split, corrupting the parent page. (Similarly,
+replay of page deletion records does not hold a write lock on the leaf
+page throughout; only the primary needs to block out concurrent writers
+that insert on to the page being deleted.)
 
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the master
-server, which means tuples can be marked as killed even when they are
-still visible on the standby. We don't WAL log tuple killed bits, but
+server, which means tuples can be marked LP_DEAD even when they are
+still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
 they can still appear in the standby because of full page writes. So
 we must always ignore them in standby, and that means it's not worth
-setting them either.
+setting them either. (When LP_DEAD-marked tuples are eventually deleted
+on the primary, the deletion is WAL-logged. Queries that run on a
+standby therefore get much of the benefit of any LP_DEAD setting that
+takes place on the primary.)
 
 Note that we talk about scans that are started during recovery. We go to
 a little trouble to allow a scan to start during recovery and end during
@@ -562,14 +582,17 @@ because it allows running applications to continue while the standby
 changes state into a normally running server.
 
 The interlocking required to avoid returning incorrect results from
-non-MVCC scans is not required on standby nodes. That is because
+non-MVCC scans is not required on standby nodes. We still get a
+super-exclusive lock ("cleanup lock") when replaying VACUUM records
+during recovery, but recovery does not need to lock every leaf page
+(only those leaf pages that have items to delete). That is safe because
 HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
-HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
-ever used during write transactions, which cannot exist on the standby.
-MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
-is not a problem. The optimizer looks at the boundaries of value ranges
-using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which is
-also safe. That leaves concern only for HeapTupleSatisfiesToast().
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
+used during write transactions, which cannot exist on the standby. MVCC
+scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. The optimizer looks at the boundaries of value ranges
+using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
+is also safe. That leaves concern only for HeapTupleSatisfiesToast().
 
 HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
 because it doesn't need to - if the main heap row is visible then the
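The replay side that the new "Scans during Recovery" text refers to lives in nbtxlog.c, which is not among the files excerpted on this page. Below is a minimal, hedged sketch of what XLOG_BTREE_VACUUM replay looks like once the pin-scan logic is gone; everything beyond what the commit message and README state (helper usage, hint-flag handling) is an assumption rather than a quotation of the real routine.

/* Hedged sketch only -- the actual routine is in nbtxlog.c, not shown here. */
static void
btree_xlog_vacuum(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
    Buffer      buffer;

    /*
     * Take a cleanup lock on the one registered leaf page, per the README:
     * recovery no longer touches leaf pages that have nothing to delete.
     */
    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
                                      &buffer) == BLK_NEEDS_REDO)
    {
        Page        page = (Page) BufferGetPage(buffer);
        BTPageOpaque opaque;
        Size        len;
        OffsetNumber *deletable;

        /* offsets were attached with XLogRegisterBufData() on the primary */
        deletable = (OffsetNumber *) XLogRecGetBlockData(record, 0, &len);
        Assert(len == xlrec->ndeleted * sizeof(OffsetNumber));

        PageIndexMultiDelete(page, deletable, xlrec->ndeleted);

        /* Mirror the primary: clear cycle ID and the "has garbage" hint */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        opaque->btpo_cycleid = 0;
        opaque->btpo_flags &= ~BTP_HAS_GARBAGE;

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);
}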

src/backend/access/nbtree/nbtpage.c

+15 -19

@@ -968,32 +968,28 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given itemnos *must* appear in increasing order in the array.
+ * Also, the given deletable array *must* be sorted in ascending order.
  *
- * We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
- * we need to be able to pin all of the blocks in the btree in physical
- * order when replaying the effects of a VACUUM, just as we do for the
- * original VACUUM itself. lastBlockVacuumed allows us to tell whether an
- * intermediate range of blocks has had no changes at all by VACUUM,
- * and so must be scanned anyway during replay. We always write a WAL record
- * for the last block in the index, whether or not it contained any items
- * to be removed. This allows us to scan right up to end of index to
- * ensure correct locking.
+ * We record VACUUMs and b-tree deletes differently in WAL. Deletes must
+ * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
+ * can rely on the initial heap scan taking care of the problem (pruning would
+ * have generated the conflicts needed for hot standby already).
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-                    OffsetNumber *itemnos, int nitems,
-                    BlockNumber lastBlockVacuumed)
+                    OffsetNumber *deletable, int ndeletable)
 {
     Page        page = BufferGetPage(buf);
     BTPageOpaque opaque;
 
+    /* Shouldn't be called unless there's something to do */
+    Assert(ndeletable > 0);
+
     /* No ereport(ERROR) until changes are logged */
     START_CRIT_SECTION();
 
     /* Fix the page */
-    if (nitems > 0)
-        PageIndexMultiDelete(page, itemnos, nitems);
+    PageIndexMultiDelete(page, deletable, ndeletable);
 
     /*
      * We can clear the vacuum cycle ID since this page has certainly been
@@ -1019,7 +1015,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
         XLogRecPtr  recptr;
         xl_btree_vacuum xlrec_vacuum;
 
-        xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+        xlrec_vacuum.ndeleted = ndeletable;
 
         XLogBeginInsert();
         XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1030,8 +1026,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
          * is. When XLogInsert stores the whole buffer, the offsets array
          * need not be stored too.
          */
-        if (nitems > 0)
-            XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+        XLogRegisterBufData(0, (char *) deletable,
+                            ndeletable * sizeof(OffsetNumber));
 
         recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1050,8 +1046,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * Also, the given itemnos *must* appear in increasing order in the array.
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but the WAL logging considerations are quite different. See
- * comments for _bt_delitems_vacuum.
+ * the page, but it needs to generate its own recovery conflicts by accessing
+ * the heap. See comments for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
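The contrast drawn in the rewritten _bt_delitems_vacuum header comment -- deletes generate their own recovery conflicts, while VACUUM piggybacks on conflicts from heap pruning -- is acted on by the XLOG_BTREE_DELETE replay code in nbtxlog.c, also not excerpted here. A hedged sketch of just that conflict step follows; "record" is the XLogReaderState handed to the redo routine, and the hnode field name is an assumption.

/*
 * Hedged sketch of the start of XLOG_BTREE_DELETE replay.  XLOG_BTREE_VACUUM
 * replay has no equivalent step, because heap pruning during VACUUM's initial
 * heap scan already generated any conflicts a standby needs.
 */
xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

if (InHotStandby)
{
    TransactionId latestRemovedXid;

    /* derived during replay by visiting the heap tuples the items point to */
    latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
    ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->hnode);
}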

src/backend/access/nbtree/nbtree.c

+23 -88

@@ -46,8 +46,6 @@ typedef struct
     IndexBulkDeleteCallback callback;
     void       *callback_state;
     BTCycleId   cycleid;
-    BlockNumber lastBlockVacuumed;  /* highest blkno actually vacuumed */
-    BlockNumber lastBlockLocked;    /* highest blkno we've cleanup-locked */
     BlockNumber totFreePages;   /* true total # of free pages */
     TransactionId oldestBtpoXact;
     MemoryContext pagedelcontext;
@@ -978,8 +976,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     vstate.callback = callback;
     vstate.callback_state = callback_state;
     vstate.cycleid = cycleid;
-    vstate.lastBlockVacuumed = BTREE_METAPAGE;  /* Initialise at first block */
-    vstate.lastBlockLocked = BTREE_METAPAGE;
     vstate.totFreePages = 0;
     vstate.oldestBtpoXact = InvalidTransactionId;
 
@@ -1040,39 +1036,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
         }
     }
 
-    /*
-     * Check to see if we need to issue one final WAL record for this index,
-     * which may be needed for correctness on a hot standby node when non-MVCC
-     * index scans could take place.
-     *
-     * If the WAL is replayed in hot standby, the replay process needs to get
-     * cleanup locks on all index leaf pages, just as we've been doing here.
-     * However, we won't issue any WAL records about pages that have no items
-     * to be deleted. For pages between pages we've vacuumed, the replay code
-     * will take locks under the direction of the lastBlockVacuumed fields in
-     * the XLOG_BTREE_VACUUM WAL records. To cover pages after the last one
-     * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
-     * against the last leaf page in the index, if that one wasn't vacuumed.
-     */
-    if (XLogStandbyInfoActive() &&
-        vstate.lastBlockVacuumed < vstate.lastBlockLocked)
-    {
-        Buffer      buf;
-
-        /*
-         * The page should be valid, but we can't use _bt_getbuf() because we
-         * want to use a nondefault buffer access strategy. Since we aren't
-         * going to delete any items, getting cleanup lock again is probably
-         * overkill, but for consistency do that anyway.
-         */
-        buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
-                                 RBM_NORMAL, info->strategy);
-        LockBufferForCleanup(buf);
-        _bt_checkpage(rel, buf);
-        _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
-        _bt_relbuf(rel, buf);
-    }
-
     MemoryContextDelete(vstate.pagedelcontext);
 
     /*
@@ -1203,13 +1166,6 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
         LockBufferForCleanup(buf);
 
-        /*
-         * Remember highest leaf page number we've taken cleanup lock on; see
-         * notes in btvacuumscan
-         */
-        if (blkno > vstate->lastBlockLocked)
-            vstate->lastBlockLocked = blkno;
-
         /*
         * Check whether we need to recurse back to earlier pages. What we
         * are concerned about is a page split that happened since we started
@@ -1225,8 +1181,10 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
             recurse_to = opaque->btpo_next;
 
         /*
-         * Scan over all items to see which ones need deleted according to the
-         * callback function.
+         * When each VACUUM begins, it determines an OldestXmin cutoff value.
+         * Tuples before the cutoff are removed by VACUUM. Scan over all
+         * items to see which ones need to be deleted according to cutoff
+         * point using callback.
          */
         ndeletable = 0;
         minoff = P_FIRSTDATAKEY(opaque);
@@ -1245,25 +1203,24 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
             htup = &(itup->t_tid);
 
             /*
-             * During Hot Standby we currently assume that
-             * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-             * only true as long as the callback function depends only
-             * upon whether the index tuple refers to heap tuples removed
-             * in the initial heap scan. When vacuum starts it derives a
-             * value of OldestXmin. Backends taking later snapshots could
-             * have a RecentGlobalXmin with a later xid than the vacuum's
-             * OldestXmin, so it is possible that row versions deleted
-             * after OldestXmin could be marked as killed by other
-             * backends. The callback function *could* look at the index
-             * tuple state in isolation and decide to delete the index
-             * tuple, though currently it does not. If it ever did, we
-             * would need to reconsider whether XLOG_BTREE_VACUUM records
-             * should cause conflicts. If they did cause conflicts they
-             * would be fairly harsh conflicts, since we haven't yet
-             * worked out a way to pass a useful value for
-             * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-             * applies to *any* type of index that marks index tuples as
-             * killed.
+             * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
+             * records do not produce their own conflicts. This is safe
+             * as long as the callback function only considers whether the
+             * index tuple refers to pre-cutoff heap tuples that were
+             * certainly already pruned away during VACUUM's initial heap
+             * scan by the time we get here. (We can rely on conflicts
+             * produced by heap pruning, rather than producing our own
+             * now.)
+             *
+             * Backends with snapshots acquired after a VACUUM starts but
+             * before it finishes could have a RecentGlobalXmin with a
+             * later xid than the VACUUM's OldestXmin cutoff. These
+             * backends might happen to opportunistically mark some index
+             * tuples LP_DEAD before we reach them, even though they may
+             * be after our cutoff. We don't try to kill these "extra"
+             * index tuples in _bt_delitems_vacuum(). This keeps things
+             * simple, and allows us to always avoid generating our own
+             * conflicts.
              */
             if (callback(htup, callback_state))
                 deletable[ndeletable++] = offnum;
@@ -1276,29 +1233,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
          */
         if (ndeletable > 0)
         {
-            /*
-             * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
-             * all information to the replay code to allow it to get a cleanup
-             * lock on all pages between the previous lastBlockVacuumed and
-             * this page. This ensures that WAL replay locks all leaf pages at
-             * some point, which is important should non-MVCC scans be
-             * requested. This is currently unused on standby, but we record
-             * it anyway, so that the WAL contains the required information.
-             *
-             * Since we can visit leaf pages out-of-order when recursing,
-             * replay might end up locking such pages an extra time, but it
-             * doesn't seem worth the amount of bookkeeping it'd take to avoid
-             * that.
-             */
-            _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
-                                vstate->lastBlockVacuumed);
-
-            /*
-             * Remember highest leaf page number we've issued a
-             * XLOG_BTREE_VACUUM WAL record for.
-             */
-            if (blkno > vstate->lastBlockVacuumed)
-                vstate->lastBlockVacuumed = blkno;
+            _bt_delitems_vacuum(rel, buf, deletable, ndeletable);
 
             stats->tuples_removed += ndeletable;
             /* must recompute maxoff */
