Skip to content

Commit 634e148

Browse files
committed
Fix multiple problems in WAL replay.
Most of the replay functions for WAL record types that modify more than one page failed to ensure that those pages were locked correctly to ensure that concurrent queries could not see inconsistent page states. This is a hangover from coding decisions made long before Hot Standby was added, when it was hardly necessary to acquire buffer locks during WAL replay at all, let alone hold them for carefully-chosen periods. The key problem was that RestoreBkpBlocks was written to hold lock on each page restored from a full-page image for only as long as it took to update that page. This was guaranteed to break any WAL replay function in which there was any update-ordering constraint between pages, because even if the nominal order of the pages is the right one, any mixture of full-page and non-full-page updates in the same record would result in out-of-order updates. Moreover, it wouldn't work for situations where there's a requirement to maintain lock on one page while updating another. Failure to honor an update ordering constraint in this way is thought to be the cause of bug #7648 from Daniel Farina: what seems to have happened there is that a btree page being split was rewritten from a full-page image before the new right sibling page was written, and because lock on the original page was not maintained it was possible for hot standby queries to try to traverse the page's right-link to the not-yet-existing sibling page. To fix, get rid of RestoreBkpBlocks as such, and instead create a new function RestoreBackupBlock that restores just one full-page image at a time. This function can be invoked by WAL replay functions at the points where they would otherwise perform non-full-page updates; in this way, the physical order of page updates remains the same no matter which pages are replaced by full-page images. We can then further adjust the logic in individual replay functions if it is necessary to hold buffer locks for overlapping periods. A side benefit is that we can simplify the handling of concurrency conflict resolution by moving that code into the record-type-specfic functions; there's no more need to contort the code layout to keep conflict resolution in front of the RestoreBkpBlocks call. In connection with that, standardize on zero-based numbering rather than one-based numbering for referencing the full-page images. In HEAD, I removed the macros XLR_BKP_BLOCK_1 through XLR_BKP_BLOCK_4. They are still there in the header files in previous branches, but are no longer used by the code. In addition, fix some other bugs identified in the course of making these changes: spgRedoAddNode could fail to update the parent downlink at all, if the parent tuple is in the same page as either the old or new split tuple and we're not doing a full-page image: it would get fooled by the LSN having been advanced already. This would result in permanent index corruption, not just transient failure of concurrent queries. Also, ginHeapTupleFastInsert's "merge lists" case failed to mark the old tail page as a candidate for a full-page image; in the worst case this could result in torn-page corruption. heap_xlog_freeze() was inconsistent about using a cleanup lock or plain exclusive lock: it did the former in the normal path but the latter for a full-page image. A plain exclusive lock seems sufficient, so change to that. Also, remove gistRedoPageDeleteRecord(), which has been dead code since VACUUM FULL was rewritten. Back-patch to 9.0, where hot standby was introduced. Note however that 9.0 had a significantly different WAL-logging scheme for GIST index updates, and it doesn't appear possible to make that scheme safe for concurrent hot standby queries, because it can leave inconsistent states in the index even between WAL records. Given the lack of complaints from the field, we won't work too hard on fixing that branch.
1 parent f8ffe62 commit 634e148

File tree

8 files changed

+529
-319
lines changed

8 files changed

+529
-319
lines changed

src/backend/access/gin/ginfast.c

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -291,7 +291,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
291291
if (metadata->head == InvalidBlockNumber)
292292
{
293293
/*
294-
* Main list is empty, so just copy sublist into main list
294+
* Main list is empty, so just insert sublist as main list
295295
*/
296296
START_CRIT_SECTION();
297297

@@ -314,6 +314,14 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector)
314314
LockBuffer(buffer, GIN_EXCLUSIVE);
315315
page = BufferGetPage(buffer);
316316

317+
rdata[0].next = rdata + 1;
318+
319+
rdata[1].buffer = buffer;
320+
rdata[1].buffer_std = true;
321+
rdata[1].data = NULL;
322+
rdata[1].len = 0;
323+
rdata[1].next = NULL;
324+
317325
Assert(GinPageGetOpaque(page)->rightlink == InvalidBlockNumber);
318326

319327
START_CRIT_SECTION();

src/backend/access/gin/ginxlog.c

Lines changed: 85 additions & 37 deletions
Original file line numberDiff line numberDiff line change
@@ -78,6 +78,9 @@ ginRedoCreateIndex(XLogRecPtr lsn, XLogRecord *record)
7878
MetaBuffer;
7979
Page page;
8080

81+
/* Backup blocks are not used in create_index records */
82+
Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
83+
8184
MetaBuffer = XLogReadBuffer(*node, GIN_METAPAGE_BLKNO, true);
8285
Assert(BufferIsValid(MetaBuffer));
8386
page = (Page) BufferGetPage(MetaBuffer);
@@ -110,6 +113,9 @@ ginRedoCreatePTree(XLogRecPtr lsn, XLogRecord *record)
110113
Buffer buffer;
111114
Page page;
112115

116+
/* Backup blocks are not used in create_ptree records */
117+
Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
118+
113119
buffer = XLogReadBuffer(data->node, data->blkno, true);
114120
Assert(BufferIsValid(buffer));
115121
page = (Page) BufferGetPage(buffer);
@@ -160,9 +166,12 @@ ginRedoInsert(XLogRecPtr lsn, XLogRecord *record)
160166
}
161167
}
162168

163-
/* nothing else to do if page was backed up */
164-
if (record->xl_info & XLR_BKP_BLOCK_1)
169+
/* If we have a full-page image, restore it and we're done */
170+
if (record->xl_info & XLR_BKP_BLOCK(0))
171+
{
172+
(void) RestoreBackupBlock(lsn, record, 0, false, false);
165173
return;
174+
}
166175

167176
buffer = XLogReadBuffer(data->node, data->blkno, false);
168177
if (!BufferIsValid(buffer))
@@ -257,6 +266,9 @@ ginRedoSplit(XLogRecPtr lsn, XLogRecord *record)
257266
if (data->isData)
258267
flags |= GIN_DATA;
259268

269+
/* Backup blocks are not used in split records */
270+
Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
271+
260272
lbuffer = XLogReadBuffer(data->node, data->lblkno, true);
261273
Assert(BufferIsValid(lbuffer));
262274
lpage = (Page) BufferGetPage(lbuffer);
@@ -370,9 +382,12 @@ ginRedoVacuumPage(XLogRecPtr lsn, XLogRecord *record)
370382
Buffer buffer;
371383
Page page;
372384

373-
/* nothing to do if page was backed up (and no info to do it with) */
374-
if (record->xl_info & XLR_BKP_BLOCK_1)
385+
/* If we have a full-page image, restore it and we're done */
386+
if (record->xl_info & XLR_BKP_BLOCK(0))
387+
{
388+
(void) RestoreBackupBlock(lsn, record, 0, false, false);
375389
return;
390+
}
376391

377392
buffer = XLogReadBuffer(data->node, data->blkno, false);
378393
if (!BufferIsValid(buffer))
@@ -421,63 +436,74 @@ static void
421436
ginRedoDeletePage(XLogRecPtr lsn, XLogRecord *record)
422437
{
423438
ginxlogDeletePage *data = (ginxlogDeletePage *) XLogRecGetData(record);
424-
Buffer buffer;
439+
Buffer dbuffer;
440+
Buffer pbuffer;
441+
Buffer lbuffer;
425442
Page page;
426443

427-
if (!(record->xl_info & XLR_BKP_BLOCK_1))
444+
if (record->xl_info & XLR_BKP_BLOCK(0))
445+
dbuffer = RestoreBackupBlock(lsn, record, 0, false, true);
446+
else
428447
{
429-
buffer = XLogReadBuffer(data->node, data->blkno, false);
430-
if (BufferIsValid(buffer))
448+
dbuffer = XLogReadBuffer(data->node, data->blkno, false);
449+
if (BufferIsValid(dbuffer))
431450
{
432-
page = BufferGetPage(buffer);
451+
page = BufferGetPage(dbuffer);
433452
if (!XLByteLE(lsn, PageGetLSN(page)))
434453
{
435454
Assert(GinPageIsData(page));
436455
GinPageGetOpaque(page)->flags = GIN_DELETED;
437456
PageSetLSN(page, lsn);
438457
PageSetTLI(page, ThisTimeLineID);
439-
MarkBufferDirty(buffer);
458+
MarkBufferDirty(dbuffer);
440459
}
441-
UnlockReleaseBuffer(buffer);
442460
}
443461
}
444462

445-
if (!(record->xl_info & XLR_BKP_BLOCK_2))
463+
if (record->xl_info & XLR_BKP_BLOCK(1))
464+
pbuffer = RestoreBackupBlock(lsn, record, 1, false, true);
465+
else
446466
{
447-
buffer = XLogReadBuffer(data->node, data->parentBlkno, false);
448-
if (BufferIsValid(buffer))
467+
pbuffer = XLogReadBuffer(data->node, data->parentBlkno, false);
468+
if (BufferIsValid(pbuffer))
449469
{
450-
page = BufferGetPage(buffer);
470+
page = BufferGetPage(pbuffer);
451471
if (!XLByteLE(lsn, PageGetLSN(page)))
452472
{
453473
Assert(GinPageIsData(page));
454474
Assert(!GinPageIsLeaf(page));
455475
GinPageDeletePostingItem(page, data->parentOffset);
456476
PageSetLSN(page, lsn);
457477
PageSetTLI(page, ThisTimeLineID);
458-
MarkBufferDirty(buffer);
478+
MarkBufferDirty(pbuffer);
459479
}
460-
UnlockReleaseBuffer(buffer);
461480
}
462481
}
463482

464-
if (!(record->xl_info & XLR_BKP_BLOCK_3) && data->leftBlkno != InvalidBlockNumber)
483+
if (record->xl_info & XLR_BKP_BLOCK(2))
484+
(void) RestoreBackupBlock(lsn, record, 2, false, false);
485+
else if (data->leftBlkno != InvalidBlockNumber)
465486
{
466-
buffer = XLogReadBuffer(data->node, data->leftBlkno, false);
467-
if (BufferIsValid(buffer))
487+
lbuffer = XLogReadBuffer(data->node, data->leftBlkno, false);
488+
if (BufferIsValid(lbuffer))
468489
{
469-
page = BufferGetPage(buffer);
490+
page = BufferGetPage(lbuffer);
470491
if (!XLByteLE(lsn, PageGetLSN(page)))
471492
{
472493
Assert(GinPageIsData(page));
473494
GinPageGetOpaque(page)->rightlink = data->rightLink;
474495
PageSetLSN(page, lsn);
475496
PageSetTLI(page, ThisTimeLineID);
476-
MarkBufferDirty(buffer);
497+
MarkBufferDirty(lbuffer);
477498
}
478-
UnlockReleaseBuffer(buffer);
499+
UnlockReleaseBuffer(lbuffer);
479500
}
480501
}
502+
503+
if (BufferIsValid(pbuffer))
504+
UnlockReleaseBuffer(pbuffer);
505+
if (BufferIsValid(dbuffer))
506+
UnlockReleaseBuffer(dbuffer);
481507
}
482508

483509
static void
@@ -506,7 +532,9 @@ ginRedoUpdateMetapage(XLogRecPtr lsn, XLogRecord *record)
506532
/*
507533
* insert into tail page
508534
*/
509-
if (!(record->xl_info & XLR_BKP_BLOCK_1))
535+
if (record->xl_info & XLR_BKP_BLOCK(0))
536+
(void) RestoreBackupBlock(lsn, record, 0, false, false);
537+
else
510538
{
511539
buffer = XLogReadBuffer(data->node, data->metadata.tail, false);
512540
if (BufferIsValid(buffer))
@@ -554,20 +582,25 @@ ginRedoUpdateMetapage(XLogRecPtr lsn, XLogRecord *record)
554582
/*
555583
* New tail
556584
*/
557-
buffer = XLogReadBuffer(data->node, data->prevTail, false);
558-
if (BufferIsValid(buffer))
585+
if (record->xl_info & XLR_BKP_BLOCK(0))
586+
(void) RestoreBackupBlock(lsn, record, 0, false, false);
587+
else
559588
{
560-
Page page = BufferGetPage(buffer);
561-
562-
if (!XLByteLE(lsn, PageGetLSN(page)))
589+
buffer = XLogReadBuffer(data->node, data->prevTail, false);
590+
if (BufferIsValid(buffer))
563591
{
564-
GinPageGetOpaque(page)->rightlink = data->newRightlink;
592+
Page page = BufferGetPage(buffer);
565593

566-
PageSetLSN(page, lsn);
567-
PageSetTLI(page, ThisTimeLineID);
568-
MarkBufferDirty(buffer);
594+
if (!XLByteLE(lsn, PageGetLSN(page)))
595+
{
596+
GinPageGetOpaque(page)->rightlink = data->newRightlink;
597+
598+
PageSetLSN(page, lsn);
599+
PageSetTLI(page, ThisTimeLineID);
600+
MarkBufferDirty(buffer);
601+
}
602+
UnlockReleaseBuffer(buffer);
569603
}
570-
UnlockReleaseBuffer(buffer);
571604
}
572605
}
573606

@@ -586,8 +619,12 @@ ginRedoInsertListPage(XLogRecPtr lsn, XLogRecord *record)
586619
tupsize;
587620
IndexTuple tuples = (IndexTuple) (XLogRecGetData(record) + sizeof(ginxlogInsertListPage));
588621

589-
if (record->xl_info & XLR_BKP_BLOCK_1)
622+
/* If we have a full-page image, restore it and we're done */
623+
if (record->xl_info & XLR_BKP_BLOCK(0))
624+
{
625+
(void) RestoreBackupBlock(lsn, record, 0, false, false);
590626
return;
627+
}
591628

592629
buffer = XLogReadBuffer(data->node, data->blkno, true);
593630
Assert(BufferIsValid(buffer));
@@ -633,6 +670,9 @@ ginRedoDeleteListPages(XLogRecPtr lsn, XLogRecord *record)
633670
Page metapage;
634671
int i;
635672

673+
/* Backup blocks are not used in delete_listpage records */
674+
Assert(!(record->xl_info & XLR_BKP_BLOCK_MASK));
675+
636676
metabuffer = XLogReadBuffer(data->node, GIN_METAPAGE_BLKNO, false);
637677
if (!BufferIsValid(metabuffer))
638678
return; /* assume index was deleted, nothing to do */
@@ -646,6 +686,16 @@ ginRedoDeleteListPages(XLogRecPtr lsn, XLogRecord *record)
646686
MarkBufferDirty(metabuffer);
647687
}
648688

689+
/*
690+
* In normal operation, shiftList() takes exclusive lock on all the
691+
* pages-to-be-deleted simultaneously. During replay, however, it should
692+
* be all right to lock them one at a time. This is dependent on the fact
693+
* that we are deleting pages from the head of the list, and that readers
694+
* share-lock the next page before releasing the one they are on. So we
695+
* cannot get past a reader that is on, or due to visit, any page we are
696+
* going to delete. New incoming readers will block behind our metapage
697+
* lock and then see a fully updated page list.
698+
*/
649699
for (i = 0; i < data->ndeleted; i++)
650700
{
651701
Buffer buffer = XLogReadBuffer(data->node, data->toDelete[i], false);
@@ -678,8 +728,6 @@ gin_redo(XLogRecPtr lsn, XLogRecord *record)
678728
* GIN indexes do not require any conflict processing.
679729
*/
680730

681-
RestoreBkpBlocks(lsn, record, false);
682-
683731
topCtx = MemoryContextSwitchTo(opCtx);
684732
switch (info)
685733
{

0 commit comments

Comments
 (0)