
Commit 9f83468

Remove unneeded "pin scan" nbtree VACUUM code.
The REDO routine for nbtree's xl_btree_vacuum record type hasn't performed a "pin scan" since commit 3e4b7d8 went in, so clearly there isn't any point in VACUUM WAL-logging information that won't actually be used. Finish off the work of commit 3e4b7d8 (and the closely related preceding commit 687f2cd) by removing the code that generates this unused information. Also remove the REDO routine code disabled by commit 3e4b7d8.

Replace the unneeded lastBlockVacuumed field in xl_btree_vacuum with a new "ndeleted" field. The new field isn't actually needed right now, since we could continue to infer the array length from the overall record length. However, an upcoming patch to add deduplication to nbtree needs to add an "items updated" field to xl_btree_vacuum, so we might as well start being explicit about the number of items now. (Besides, it doesn't seem like a good idea to leave the xl_btree_vacuum struct without any fields; the C standard says that that's undefined.)

nbtree VACUUM no longer forces writing a WAL record for the last block in the index. Writing out a WAL record with no items for the final block was supposed to force processing of a lastBlockVacuumed field by a pin scan.

Bump XLOG_PAGE_MAGIC because xl_btree_vacuum changed.

Discussion: https://postgr.es/m/CAH2-WzmY_mT7UnTzFB5LBQDBkKpdV5UxP3B5bLb7uP%3D%3D6UQJRQ%40mail.gmail.com
1 parent b93e9a5 commit 9f83468
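The xl_btree_vacuum struct itself is declared in nbtxlog.h, one of the changed files not excerpted on this page. As a minimal sketch of the field swap the commit message describes (the exact field type and comments are assumptions, not taken from the hunks shown below):

/*
 * Sketch only: nbtxlog.h is not excerpted on this page, so the type of
 * ndeleted is an assumption.  Old layout, removed by this commit:
 *
 *     typedef struct xl_btree_vacuum
 *     {
 *         BlockNumber lastBlockVacuumed;
 *         TARGET OFFSET NUMBERS FOLLOW
 *     } xl_btree_vacuum;
 */
typedef struct xl_btree_vacuum
{
    uint32      ndeleted;       /* number of deleted item offsets */

    /* DELETED TARGET OFFSET NUMBERS FOLLOW */
} xl_btree_vacuum;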

File tree

8 files changed, +101 -245 lines


src/backend/access/nbtree/README

+43 -20

@@ -508,7 +508,9 @@ the parent is finished and the flag in the child cleared, but can be
 released immediately after that, before recursing up the tree if the parent
 also needs to be split. This ensures that incompletely split pages should
 not be seen under normal circumstances; only if insertion to the parent
-has failed for some reason.
+has failed for some reason. (It's also possible for a reader to observe
+a page with the incomplete split flag set during recovery; see later
+section on "Scans during Recovery" for details.)
 
 We flag the left page, even though it's the right page that's missing the
 downlink, because it's more convenient to know already when following the
@@ -528,7 +530,7 @@ next VACUUM will find the half-dead leaf page and continue the deletion.
 
 Before 9.4, we used to keep track of incomplete splits and page deletions
 during recovery and finish them immediately at end of recovery, instead of
-doing it lazily at the next insertion or vacuum. However, that made the
+doing it lazily at the next insertion or vacuum.  However, that made the
 recovery much more complicated, and only fixed the problem when crash
 recovery was performed. An incomplete split can also occur if an otherwise
 recoverable error, like out-of-memory or out-of-disk-space, happens while
@@ -537,23 +539,41 @@ inserting the downlink to the parent.
 Scans during Recovery
 ---------------------
 
-The btree index type can be safely used during recovery. During recovery
-we have at most one writer and potentially many readers. In that
-situation the locking requirements can be relaxed and we do not need
-double locking during block splits. Each WAL record makes changes to a
-single level of the btree using the correct locking sequence and so
-is safe for concurrent readers. Some readers may observe a block split
-in progress as they descend the tree, but they will simply move right
-onto the correct page.
+nbtree indexes support read queries in Hot Standby mode. Every atomic
+action/WAL record makes isolated changes that leave the tree in a
+consistent state for readers. Readers lock pages according to the same
+rules that readers follow on the primary. (Readers may have to move
+right to recover from a "concurrent" page split or page deletion, just
+like on the primary.)
+
+However, there are a couple of differences in how pages are locked by
+replay/the startup process as compared to the original write operation
+on the primary. The exceptions involve page splits and page deletions.
+The first phase and second phase of a page split are processed
+independently during replay, since they are independent atomic actions.
+We do not attempt to recreate the coupling of parent and child page
+write locks that took place on the primary. This is safe because readers
+never care about the incomplete split flag anyway. Holding on to an
+extra write lock on the primary is only necessary so that a second
+writer cannot observe the incomplete split flag before the first writer
+finishes the split. If we let concurrent writers on the primary observe
+an incomplete split flag on the same page, each writer would attempt to
+complete the unfinished split, corrupting the parent page. (Similarly,
+replay of page deletion records does not hold a write lock on the leaf
+page throughout; only the primary needs to block out concurrent writers
+that insert on to the page being deleted.)
 
 During recovery all index scans start with ignore_killed_tuples = false
 and we never set kill_prior_tuple. We do this because the oldest xmin
 on the standby server can be older than the oldest xmin on the master
-server, which means tuples can be marked as killed even when they are
-still visible on the standby. We don't WAL log tuple killed bits, but
+server, which means tuples can be marked LP_DEAD even when they are
+still visible on the standby. We don't WAL log tuple LP_DEAD bits, but
 they can still appear in the standby because of full page writes. So
 we must always ignore them in standby, and that means it's not worth
-setting them either.
+setting them either. (When LP_DEAD-marked tuples are eventually deleted
+on the primary, the deletion is WAL-logged. Queries that run on a
+standby therefore get much of the benefit of any LP_DEAD setting that
+takes place on the primary.)
 
 Note that we talk about scans that are started during recovery. We go to
 a little trouble to allow a scan to start during recovery and end during
@@ -562,14 +582,17 @@ because it allows running applications to continue while the standby
 changes state into a normally running server.
 
 The interlocking required to avoid returning incorrect results from
-non-MVCC scans is not required on standby nodes. That is because
+non-MVCC scans is not required on standby nodes. We still get a
+super-exclusive lock ("cleanup lock") when replaying VACUUM records
+during recovery, but recovery does not need to lock every leaf page
+(only those leaf pages that have items to delete). That is safe because
 HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
-HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
-ever used during write transactions, which cannot exist on the standby.
-MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
-is not a problem. The optimizer looks at the boundaries of value ranges
-using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which is
-also safe. That leaves concern only for HeapTupleSatisfiesToast().
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
+used during write transactions, which cannot exist on the standby. MVCC
+scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. The optimizer looks at the boundaries of value ranges
+using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
+is also safe. That leaves concern only for HeapTupleSatisfiesToast().
 
 HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
 because it doesn't need to - if the main heap row is visible then the
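The replay side that the new "Scans during Recovery" text refers to lives in nbtxlog.c, which is not among the files excerpted on this page. Below is a minimal, hedged sketch of what XLOG_BTREE_VACUUM replay looks like once the pin-scan logic is gone; everything beyond what the commit message and README state (helper usage, hint-flag handling) is an assumption rather than a quotation of the real routine.

/* Hedged sketch only -- the actual routine is in nbtxlog.c, not shown here. */
static void
btree_xlog_vacuum(XLogReaderState *record)
{
    XLogRecPtr  lsn = record->EndRecPtr;
    xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
    Buffer      buffer;

    /*
     * Take a cleanup lock on the one registered leaf page, per the README:
     * recovery no longer touches leaf pages that have nothing to delete.
     */
    if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL, true,
                                      &buffer) == BLK_NEEDS_REDO)
    {
        Page        page = (Page) BufferGetPage(buffer);
        BTPageOpaque opaque;
        Size        len;
        OffsetNumber *deletable;

        /* offsets were attached with XLogRegisterBufData() on the primary */
        deletable = (OffsetNumber *) XLogRecGetBlockData(record, 0, &len);
        Assert(len == xlrec->ndeleted * sizeof(OffsetNumber));

        PageIndexMultiDelete(page, deletable, xlrec->ndeleted);

        /* Mirror the primary: clear cycle ID and the "has garbage" hint */
        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        opaque->btpo_cycleid = 0;
        opaque->btpo_flags &= ~BTP_HAS_GARBAGE;

        PageSetLSN(page, lsn);
        MarkBufferDirty(buffer);
    }
    if (BufferIsValid(buffer))
        UnlockReleaseBuffer(buffer);
}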

src/backend/access/nbtree/nbtpage.c

+15 -19

@@ -968,32 +968,28 @@ _bt_page_recyclable(Page page)
  * deleting the page it points to.
  *
  * This routine assumes that the caller has pinned and locked the buffer.
- * Also, the given itemnos *must* appear in increasing order in the array.
+ * Also, the given deletable array *must* be sorted in ascending order.
  *
- * We record VACUUMs and b-tree deletes differently in WAL. InHotStandby
- * we need to be able to pin all of the blocks in the btree in physical
- * order when replaying the effects of a VACUUM, just as we do for the
- * original VACUUM itself. lastBlockVacuumed allows us to tell whether an
- * intermediate range of blocks has had no changes at all by VACUUM,
- * and so must be scanned anyway during replay. We always write a WAL record
- * for the last block in the index, whether or not it contained any items
- * to be removed. This allows us to scan right up to end of index to
- * ensure correct locking.
+ * We record VACUUMs and b-tree deletes differently in WAL. Deletes must
+ * generate recovery conflicts by accessing the heap inline, whereas VACUUMs
+ * can rely on the initial heap scan taking care of the problem (pruning would
+ * have generated the conflicts needed for hot standby already).
  */
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
-                    OffsetNumber *itemnos, int nitems,
-                    BlockNumber lastBlockVacuumed)
+                    OffsetNumber *deletable, int ndeletable)
 {
     Page        page = BufferGetPage(buf);
     BTPageOpaque opaque;
 
+    /* Shouldn't be called unless there's something to do */
+    Assert(ndeletable > 0);
+
     /* No ereport(ERROR) until changes are logged */
     START_CRIT_SECTION();
 
     /* Fix the page */
-    if (nitems > 0)
-        PageIndexMultiDelete(page, itemnos, nitems);
+    PageIndexMultiDelete(page, deletable, ndeletable);
 
     /*
      * We can clear the vacuum cycle ID since this page has certainly been
@@ -1019,7 +1015,7 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
         XLogRecPtr  recptr;
         xl_btree_vacuum xlrec_vacuum;
 
-        xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+        xlrec_vacuum.ndeleted = ndeletable;
 
         XLogBeginInsert();
         XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1030,8 +1026,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
          * is. When XLogInsert stores the whole buffer, the offsets array
          * need not be stored too.
          */
-        if (nitems > 0)
-            XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
+        XLogRegisterBufData(0, (char *) deletable,
+                            ndeletable * sizeof(OffsetNumber));
 
         recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
@@ -1050,8 +1046,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
  * Also, the given itemnos *must* appear in increasing order in the array.
  *
  * This is nearly the same as _bt_delitems_vacuum as far as what it does to
- * the page, but the WAL logging considerations are quite different. See
- * comments for _bt_delitems_vacuum.
+ * the page, but it needs to generate its own recovery conflicts by accessing
+ * the heap. See comments for _bt_delitems_vacuum.
  */
 void
 _bt_delitems_delete(Relation rel, Buffer buf,
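The contrast drawn in the rewritten _bt_delitems_vacuum header comment -- deletes generate their own recovery conflicts, while VACUUM piggybacks on conflicts from heap pruning -- is acted on by the XLOG_BTREE_DELETE replay code in nbtxlog.c, also not excerpted here. A hedged sketch of just that conflict step follows; "record" is the XLogReaderState handed to the redo routine, and the hnode field name is an assumption.

/*
 * Hedged sketch of the start of XLOG_BTREE_DELETE replay.  XLOG_BTREE_VACUUM
 * replay has no equivalent step, because heap pruning during VACUUM's initial
 * heap scan already generated any conflicts a standby needs.
 */
xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);

if (InHotStandby)
{
    TransactionId latestRemovedXid;

    /* derived during replay by visiting the heap tuples the items point to */
    latestRemovedXid = btree_xlog_delete_get_latestRemovedXid(record);
    ResolveRecoveryConflictWithSnapshot(latestRemovedXid, xlrec->hnode);
}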

src/backend/access/nbtree/nbtree.c

+23 -88

@@ -46,8 +46,6 @@ typedef struct
     IndexBulkDeleteCallback callback;
     void       *callback_state;
     BTCycleId   cycleid;
-    BlockNumber lastBlockVacuumed;  /* highest blkno actually vacuumed */
-    BlockNumber lastBlockLocked;    /* highest blkno we've cleanup-locked */
     BlockNumber totFreePages;   /* true total # of free pages */
     TransactionId oldestBtpoXact;
     MemoryContext pagedelcontext;
@@ -978,8 +976,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
     vstate.callback = callback;
     vstate.callback_state = callback_state;
     vstate.cycleid = cycleid;
-    vstate.lastBlockVacuumed = BTREE_METAPAGE;  /* Initialise at first block */
-    vstate.lastBlockLocked = BTREE_METAPAGE;
     vstate.totFreePages = 0;
     vstate.oldestBtpoXact = InvalidTransactionId;
 
@@ -1040,39 +1036,6 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
         }
     }
 
-    /*
-     * Check to see if we need to issue one final WAL record for this index,
-     * which may be needed for correctness on a hot standby node when non-MVCC
-     * index scans could take place.
-     *
-     * If the WAL is replayed in hot standby, the replay process needs to get
-     * cleanup locks on all index leaf pages, just as we've been doing here.
-     * However, we won't issue any WAL records about pages that have no items
-     * to be deleted. For pages between pages we've vacuumed, the replay code
-     * will take locks under the direction of the lastBlockVacuumed fields in
-     * the XLOG_BTREE_VACUUM WAL records. To cover pages after the last one
-     * we vacuum, we need to issue a dummy XLOG_BTREE_VACUUM WAL record
-     * against the last leaf page in the index, if that one wasn't vacuumed.
-     */
-    if (XLogStandbyInfoActive() &&
-        vstate.lastBlockVacuumed < vstate.lastBlockLocked)
-    {
-        Buffer      buf;
-
-        /*
-         * The page should be valid, but we can't use _bt_getbuf() because we
-         * want to use a nondefault buffer access strategy. Since we aren't
-         * going to delete any items, getting cleanup lock again is probably
-         * overkill, but for consistency do that anyway.
-         */
-        buf = ReadBufferExtended(rel, MAIN_FORKNUM, vstate.lastBlockLocked,
-                                 RBM_NORMAL, info->strategy);
-        LockBufferForCleanup(buf);
-        _bt_checkpage(rel, buf);
-        _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
-        _bt_relbuf(rel, buf);
-    }
-
     MemoryContextDelete(vstate.pagedelcontext);
 
     /*
@@ -1203,13 +1166,6 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
         LockBuffer(buf, BUFFER_LOCK_UNLOCK);
         LockBufferForCleanup(buf);
 
-        /*
-         * Remember highest leaf page number we've taken cleanup lock on; see
-         * notes in btvacuumscan
-         */
-        if (blkno > vstate->lastBlockLocked)
-            vstate->lastBlockLocked = blkno;
-
         /*
         * Check whether we need to recurse back to earlier pages. What we
         * are concerned about is a page split that happened since we started
@@ -1225,8 +1181,10 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
             recurse_to = opaque->btpo_next;
 
         /*
-         * Scan over all items to see which ones need deleted according to the
-         * callback function.
+         * When each VACUUM begins, it determines an OldestXmin cutoff value.
+         * Tuples before the cutoff are removed by VACUUM. Scan over all
+         * items to see which ones need to be deleted according to cutoff
+         * point using callback.
          */
         ndeletable = 0;
         minoff = P_FIRSTDATAKEY(opaque);
@@ -1245,25 +1203,24 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
             htup = &(itup->t_tid);
 
             /*
-             * During Hot Standby we currently assume that
-             * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-             * only true as long as the callback function depends only
-             * upon whether the index tuple refers to heap tuples removed
-             * in the initial heap scan. When vacuum starts it derives a
-             * value of OldestXmin. Backends taking later snapshots could
-             * have a RecentGlobalXmin with a later xid than the vacuum's
-             * OldestXmin, so it is possible that row versions deleted
-             * after OldestXmin could be marked as killed by other
-             * backends. The callback function *could* look at the index
-             * tuple state in isolation and decide to delete the index
-             * tuple, though currently it does not. If it ever did, we
-             * would need to reconsider whether XLOG_BTREE_VACUUM records
-             * should cause conflicts. If they did cause conflicts they
-             * would be fairly harsh conflicts, since we haven't yet
-             * worked out a way to pass a useful value for
-             * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-             * applies to *any* type of index that marks index tuples as
-             * killed.
+             * Hot Standby assumes that it's okay that XLOG_BTREE_VACUUM
+             * records do not produce their own conflicts. This is safe
+             * as long as the callback function only considers whether the
+             * index tuple refers to pre-cutoff heap tuples that were
+             * certainly already pruned away during VACUUM's initial heap
+             * scan by the time we get here. (We can rely on conflicts
+             * produced by heap pruning, rather than producing our own
+             * now.)
+             *
+             * Backends with snapshots acquired after a VACUUM starts but
+             * before it finishes could have a RecentGlobalXmin with a
+             * later xid than the VACUUM's OldestXmin cutoff. These
+             * backends might happen to opportunistically mark some index
+             * tuples LP_DEAD before we reach them, even though they may
+             * be after our cutoff. We don't try to kill these "extra"
+             * index tuples in _bt_delitems_vacuum(). This keeps things
+             * simple, and allows us to always avoid generating our own
+             * conflicts.
              */
             if (callback(htup, callback_state))
                 deletable[ndeletable++] = offnum;
@@ -1276,29 +1233,7 @@ btvacuumpage(BTVacState *vstate, BlockNumber blkno, BlockNumber orig_blkno)
          */
         if (ndeletable > 0)
         {
-            /*
-             * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
-             * all information to the replay code to allow it to get a cleanup
-             * lock on all pages between the previous lastBlockVacuumed and
-             * this page. This ensures that WAL replay locks all leaf pages at
-             * some point, which is important should non-MVCC scans be
-             * requested. This is currently unused on standby, but we record
-             * it anyway, so that the WAL contains the required information.
-             *
-             * Since we can visit leaf pages out-of-order when recursing,
-             * replay might end up locking such pages an extra time, but it
-             * doesn't seem worth the amount of bookkeeping it'd take to avoid
-             * that.
-             */
-            _bt_delitems_vacuum(rel, buf, deletable, ndeletable,
-                                vstate->lastBlockVacuumed);
-
-            /*
-             * Remember highest leaf page number we've issued a
-             * XLOG_BTREE_VACUUM WAL record for.
-             */
-            if (blkno > vstate->lastBlockVacuumed)
-                vstate->lastBlockVacuumed = blkno;
+            _bt_delitems_vacuum(rel, buf, deletable, ndeletable);
 
             stats->tuples_removed += ndeletable;
             /* must recompute maxoff */
