Commit dad1539

Fix possible HOT corruption when RECENTLY_DEAD changes to DEAD while pruning.
Since dc7420c the horizon used for pruning is determined "lazily". A more accurate horizon is built on demand, rather than in GetSnapshotData(). If a horizon computation is triggered between two HeapTupleSatisfiesVacuum() calls for the same tuple, the result can change from RECENTLY_DEAD to DEAD.

heap_page_prune() can process the same tid multiple times (once following an update chain, once "directly"). When the result of HeapTupleSatisfiesVacuum() for a tuple changes from RECENTLY_DEAD during the first access to DEAD in the second, the "tuple is DEAD and doesn't chain to anything else" path in heap_prune_chain() can end up marking the target of an LP_REDIRECT ItemId unused.

The corruption is initially not easily visible. But once the target of an LP_REDIRECT ItemId is marked unused, a new tuple version can reuse it. At that point the corruption may become visible, as index entries pointing to the "original" redirect item now point to an unrelated tuple.

To fix, compute HTSV for all tuples on a page only once. This fixes the entire class of problems of HTSV changing inside heap_page_prune(). However, visibility changes can obviously still occur between HTSV checks inside heap_page_prune() and outside (e.g. in lazy_scan_prune()).

The computation of HTSV is now done in bulk, in heap_page_prune(), rather than on demand in heap_prune_chain(). Besides being a bit simpler, it is also faster: memory accesses can happen sequentially, rather than in the order of HOT chains.

There are other causes of HeapTupleSatisfiesVacuum() results changing between two visibility checks for the same tuple, even before dc7420c. E.g. HEAPTUPLE_INSERT_IN_PROGRESS can change to HEAPTUPLE_DEAD when a transaction aborts between the two checks. None of these other visibility status changes are known to cause corruption, but heap_page_prune()'s approach makes it hard to be confident.
A patch implementing a more fundamental redesign of heap_page_prune(), which fixes this bug and simplifies pruning substantially, has been proposed by Peter Geoghegan in https://postgr.es/m/CAH2-WzmNk6V6tqzuuabxoxM8HJRaWU6h12toaS-bqYcLiht16A@mail.gmail.com

However, that redesign is a larger change than desirable for backpatching. As the new design still benefits from the batched visibility determination introduced in this commit, it makes sense to commit this narrower fix to 14 and master, and then commit Peter's improvement in master.

The precise sequence required to trigger the bug is complicated and hard to exercise in an isolation test (until we have wait points). Due to that, the isolation test initially posted at https://postgr.es/m/20211119003623.d3jusiytzjqwb62p%40alap3.anarazel.de and updated in https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb%40alap3.anarazel.de isn't committable.

A followup commit will introduce additional assertions, to detect problems like this more easily.

Bug: #17255
Reported-By: Alexander Lakhin <exclusion@gmail.com>
Debugged-By: Andres Freund <andres@anarazel.de>
Debugged-By: Peter Geoghegan <pg@bowt.ie>
Author: Andres Freund <andres@anarazel.de>
Reviewed-By: Peter Geoghegan <pg@bowt.ie>
Discussion: https://postgr.es/m/20211122175914.ayk6gg6nvdwuhrzb@alap3.anarazel.de
Backpatch: 14-, the oldest branch containing dc7420c
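The race the commit message describes can be modeled outside PostgreSQL. The sketch below is toy code with hypothetical names (not PostgreSQL internals): it mimics a visibility check whose lazily-advanced horizon can move between two calls, so two checks of the same tuple disagree, while caching the first result in an array — the approach this commit takes — keeps every later access consistent.

```c
/*
 * Toy model of the bug, NOT PostgreSQL code: all names here are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>

typedef enum { TOY_RECENTLY_DEAD, TOY_DEAD } ToyVisibility;

/* Simulates the lazily-built horizon: any visibility check may advance it. */
static int	toy_horizon;

static ToyVisibility
toy_satisfies_vacuum(int tuple_xmax)
{
	toy_horizon++;				/* the horizon can move on every check */
	return (tuple_xmax < toy_horizon) ? TOY_DEAD : TOY_RECENTLY_DEAD;
}

/*
 * The hazard: checking the same tuple twice (once directly, once while
 * following its HOT chain) can yield different answers.
 */
static bool
toy_uncached_checks_disagree(void)
{
	ToyVisibility first, second;

	toy_horizon = 0;
	first = toy_satisfies_vacuum(1);	/* horizon becomes 1: RECENTLY_DEAD */
	second = toy_satisfies_vacuum(1);	/* horizon becomes 2: DEAD */
	return first != second;
}

/*
 * The commit's approach: compute visibility once per item into an array;
 * every later access reads the cached value, so all passes agree.
 */
static bool
toy_cached_checks_agree(void)
{
	int8_t		htsv[2];
	ToyVisibility first, second;

	toy_horizon = 0;
	htsv[1] = (int8_t) toy_satisfies_vacuum(1); /* computed exactly once */
	first = (ToyVisibility) htsv[1];	/* "direct" access */
	second = (ToyVisibility) htsv[1];	/* access via the chain */
	return first == second;
}
```

This only illustrates the consistency argument; the real fix additionally has to handle items with no tuple (the -1 sentinel) and the interaction with LP_REDIRECT line pointers shown in the diff below.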
1 parent ad5b6f2 commit dad1539

File tree

1 file changed: +82 −25 lines changed

src/backend/access/heap/pruneheap.c

Lines changed: 82 additions & 25 deletions
@@ -56,11 +56,30 @@ typedef struct
 	OffsetNumber redirected[MaxHeapTuplesPerPage * 2];
 	OffsetNumber nowdead[MaxHeapTuplesPerPage];
 	OffsetNumber nowunused[MaxHeapTuplesPerPage];
-	/* marked[i] is true if item i is entered in one of the above arrays */
+
+	/*
+	 * marked[i] is true if item i is entered in one of the above arrays.
+	 *
+	 * This needs to be MaxHeapTuplesPerPage + 1 long as FirstOffsetNumber is
+	 * 1. Otherwise every access would need to subtract 1.
+	 */
 	bool		marked[MaxHeapTuplesPerPage + 1];
+
+	/*
+	 * Tuple visibility is only computed once for each tuple, for correctness
+	 * and efficiency reasons; see comment in heap_page_prune() for details.
+	 * This is of type int8[], instead of HTSV_Result[], so we can use -1 to
+	 * indicate no visibility has been computed, e.g. for LP_DEAD items.
+	 *
+	 * Same indexing as ->marked.
+	 */
+	int8		htsv[MaxHeapTuplesPerPage + 1];
 } PruneState;
 
 /* Local functions */
+static HTSV_Result heap_prune_satisfies_vacuum(PruneState *prstate,
+											   HeapTuple tup,
+											   Buffer buffer);
 static int	heap_prune_chain(Buffer buffer,
 							 OffsetNumber rootoffnum,
 							 PruneState *prstate);
@@ -228,6 +247,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 	OffsetNumber offnum,
 				maxoff;
 	PruneState	prstate;
+	HeapTupleData tup;
 
 	/*
 	 * Our strategy is to scan the page and make lists of items to change,
@@ -250,8 +270,60 @@ heap_page_prune(Relation relation, Buffer buffer,
 	prstate.nredirected = prstate.ndead = prstate.nunused = 0;
 	memset(prstate.marked, 0, sizeof(prstate.marked));
 
-	/* Scan the page */
 	maxoff = PageGetMaxOffsetNumber(page);
+	tup.t_tableOid = RelationGetRelid(prstate.rel);
+
+	/*
+	 * Determine HTSV for all tuples.
+	 *
+	 * This is required for correctness to deal with cases where running HTSV
+	 * twice could result in different results (e.g. RECENTLY_DEAD can turn to
+	 * DEAD if another checked item causes GlobalVisTestIsRemovableFullXid()
+	 * to update the horizon, INSERT_IN_PROGRESS can change to DEAD if the
+	 * inserting transaction aborts, ...). That in turn could cause
+	 * heap_prune_chain() to behave incorrectly if a tuple is reached twice,
+	 * once directly via a heap_prune_chain() and once following a HOT chain.
+	 *
+	 * It's also good for performance. Most commonly tuples within a page are
+	 * stored at decreasing offsets (while the items are stored at increasing
+	 * offsets). When processing all tuples on a page this leads to reading
+	 * memory at decreasing offsets within a page, with a variable stride.
+	 * That's hard for CPU prefetchers to deal with. Processing the items in
+	 * reverse order (and thus the tuples in increasing order) increases
+	 * prefetching efficiency significantly / decreases the number of cache
+	 * misses.
+	 */
+	for (offnum = maxoff;
+		 offnum >= FirstOffsetNumber;
+		 offnum = OffsetNumberPrev(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		HeapTupleHeader htup;
+
+		/* Nothing to do if slot doesn't contain a tuple */
+		if (!ItemIdIsNormal(itemid))
+		{
+			prstate.htsv[offnum] = -1;
+			continue;
+		}
+
+		htup = (HeapTupleHeader) PageGetItem(page, itemid);
+		tup.t_data = htup;
+		tup.t_len = ItemIdGetLength(itemid);
+		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
+
+		/*
+		 * Set the offset number so that we can display it along with any
+		 * error that occurred while processing this tuple.
+		 */
+		if (off_loc)
+			*off_loc = offnum;
+
+		prstate.htsv[offnum] = heap_prune_satisfies_vacuum(&prstate, &tup,
+														   buffer);
+	}
+
+	/* Scan the page */
 	for (offnum = FirstOffsetNumber;
 		 offnum <= maxoff;
 		 offnum = OffsetNumberNext(offnum))
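The comment in the new loop argues that scanning items in reverse order makes tuple memory accesses ascending, which hardware prefetchers handle better than a descending, variable-stride pattern. A standalone toy model of that layout argument (hypothetical names and sizes, not PostgreSQL code):

```c
/*
 * Toy model, NOT PostgreSQL code. Line-pointer items grow from the start of
 * a page while tuples are allocated from the end, so a later item's tuple
 * usually sits at a lower address. Walking items in reverse therefore reads
 * tuple memory at increasing addresses.
 */
#define TOY_PAGE_SIZE	64
#define TOY_NITEMS		4
#define TOY_TUPLE_SIZE	8

static int	toy_tuple_offset[TOY_NITEMS + 1];	/* 1-based, like offsets */

/* lay out TOY_NITEMS toy tuples from the end of the page towards the start */
static void
toy_layout(void)
{
	int			upper = TOY_PAGE_SIZE;

	for (int off = 1; off <= TOY_NITEMS; off++)
	{
		upper -= TOY_TUPLE_SIZE;
		toy_tuple_offset[off] = upper;
	}
}

/*
 * Returns 1 if visiting items from TOY_NITEMS down to 1 touches strictly
 * increasing tuple offsets, i.e. the prefetch-friendly order.
 */
static int
toy_reverse_scan_is_ascending(void)
{
	int			last = -1;

	for (int off = TOY_NITEMS; off >= 1; off--)
	{
		if (toy_tuple_offset[off] <= last)
			return 0;
		last = toy_tuple_offset[off];
	}
	return 1;
}
```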
@@ -262,10 +334,7 @@ heap_page_prune(Relation relation, Buffer buffer,
 		if (prstate.marked[offnum])
 			continue;
 
-		/*
-		 * Set the offset number so that we can display it along with any
-		 * error that occurred while processing this tuple.
-		 */
+		/* see preceding loop */
 		if (off_loc)
 			*off_loc = offnum;
 
@@ -488,17 +557,14 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
  * the HOT chain is pruned by removing all DEAD tuples at the start of the HOT
  * chain.  We also prune any RECENTLY_DEAD tuples preceding a DEAD tuple.
  * This is OK because a RECENTLY_DEAD tuple preceding a DEAD tuple is really
- * DEAD, the OldestXmin test is just too coarse to detect it.
+ * DEAD, our visibility test is just too coarse to detect it.
  *
  * The root line pointer is redirected to the tuple immediately after the
  * latest DEAD tuple.  If all tuples in the chain are DEAD, the root line
  * pointer is marked LP_DEAD.  (This includes the case of a DEAD simple
  * tuple, which we treat as a chain of length 1.)
  *
- * OldestXmin is the cutoff XID used to identify dead tuples.
- *
- * We don't actually change the page here, except perhaps for hint-bit updates
- * caused by HeapTupleSatisfiesVacuum.  We just add entries to the arrays in
+ * We don't actually change the page here.  We just add entries to the arrays in
  * prstate showing the changes to be made.  Items to be redirected are added
  * to the redirected[] array (two entries per redirection); items to be set to
  * LP_DEAD state are added to nowdead[]; and items to be set to LP_UNUSED
@@ -520,9 +586,6 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 	OffsetNumber chainitems[MaxHeapTuplesPerPage];
 	int			nchain = 0,
 				i;
-	HeapTupleData tup;
-
-	tup.t_tableOid = RelationGetRelid(prstate->rel);
 
 	rootlp = PageGetItemId(dp, rootoffnum);
 
@@ -531,12 +594,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 	 */
 	if (ItemIdIsNormal(rootlp))
 	{
+		Assert(prstate->htsv[rootoffnum] != -1);
 		htup = (HeapTupleHeader) PageGetItem(dp, rootlp);
 
-		tup.t_data = htup;
-		tup.t_len = ItemIdGetLength(rootlp);
-		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), rootoffnum);
-
 		if (HeapTupleHeaderIsHeapOnly(htup))
 		{
 			/*
@@ -557,8 +617,8 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 		 * either here or while following a chain below.  Whichever path
 		 * gets there first will mark the tuple unused.
 		 */
-		if (heap_prune_satisfies_vacuum(prstate, &tup, buffer)
-			== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
+		if (prstate->htsv[rootoffnum] == HEAPTUPLE_DEAD &&
+			!HeapTupleHeaderIsHotUpdated(htup))
 		{
 			heap_prune_record_unused(prstate, rootoffnum);
 			HeapTupleHeaderAdvanceLatestRemovedXid(htup,
@@ -618,12 +678,9 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 			break;
 
 		Assert(ItemIdIsNormal(lp));
+		Assert(prstate->htsv[offnum] != -1);
 		htup = (HeapTupleHeader) PageGetItem(dp, lp);
 
-		tup.t_data = htup;
-		tup.t_len = ItemIdGetLength(lp);
-		ItemPointerSet(&(tup.t_self), BufferGetBlockNumber(buffer), offnum);
-
 		/*
 		 * Check the tuple XMIN against prior XMAX, if any
 		 */
@@ -641,7 +698,7 @@ heap_prune_chain(Buffer buffer, OffsetNumber rootoffnum, PruneState *prstate)
 		 */
 		tupdead = recent_dead = false;
 
-		switch (heap_prune_satisfies_vacuum(prstate, &tup, buffer))
+		switch ((HTSV_Result) prstate->htsv[offnum])
 		{
 			case HEAPTUPLE_DEAD:
 				tupdead = true;
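The diff stores visibility results in `int8 htsv[MaxHeapTuplesPerPage + 1]`, using -1 as a "never computed" sentinel and 1-based offsets as direct indexes (the +1 avoids subtracting 1 on every access). A minimal standalone sketch of that convention, with toy sizes and hypothetical names rather than the real PostgreSQL definitions:

```c
/*
 * Standalone sketch of the htsv[] convention in the diff; toy sizes and
 * hypothetical names, NOT PostgreSQL code.
 */
#include <stdint.h>

#define TOY_MAX_TUPLES_PER_PAGE 4
#define TOY_FIRST_OFFSET		1	/* offsets are 1-based */

typedef enum { TOY_LIVE, TOY_DEAD, TOY_RECENTLY_DEAD } ToyHTSV;

/* +1 element so a 1-based offset indexes directly, without subtracting 1 */
static int8_t toy_htsv[TOY_MAX_TUPLES_PER_PAGE + 1];

static void
toy_fill(void)
{
	/* -1 marks items whose visibility was never computed (e.g. LP_DEAD) */
	for (int off = TOY_FIRST_OFFSET; off <= TOY_MAX_TUPLES_PER_PAGE; off++)
		toy_htsv[off] = -1;

	/* pretend offset 2 held a normal tuple whose visibility was computed */
	toy_htsv[2] = (int8_t) TOY_RECENTLY_DEAD;
}

static int
toy_was_computed(int off)
{
	return toy_htsv[off] != -1;
}
```

Storing the enum in an `int8_t` rather than the enum type itself is what makes the out-of-band -1 value possible while keeping the array small; readers cast back to the enum before use, as the `switch ((HTSV_Result) prstate->htsv[offnum])` in the diff does.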
