Commit 0d1fe9f

Move page initialization from RelationAddExtraBlocks() to use, take 2.
Previously we initialized pages when bulk extending in RelationAddExtraBlocks(). That has a major disadvantage: it ties RelationAddExtraBlocks() to heap, as other types of storage are likely to need different amounts of special space and have different amounts of free space (previously determined by PageGetHeapFreeSpace()).

That we're relying on initializing pages, but not WAL logging the initialization, also means the risk of getting "WARNING: relation \"%s\" page %u is uninitialized --- fixing" style warnings in vacuums after crashes/immediate shutdowns is considerably higher. The warnings sound much more serious than they actually are.

Fix those two issues together by not initializing pages in RelationAddExtraBlocks() (but continue to do so in RelationGetBufferForTuple(), which is linked much more closely to heap), and by accepting uninitialized pages as normal in vacuumlazy.c. When vacuumlazy encounters an empty page it now adds it to the FSM, but does nothing else. We chose not to issue a debug message, much less a warning, in that case; it seems rarely useful and quite likely to scare people unnecessarily.

For now empty pages aren't added to the VM, because standbys would not re-discover such pages after a promotion. In contrast to other sources of empty pages, there are no corresponding WAL records triggering FSM updates during replay.

Previously, when extending the relation, there was a window between extending the relation and acquiring an exclusive lock on the new page in which another backend could lock the page. To avoid new content being put on that new page, vacuumlazy needed to acquire the extension lock for a brief moment when encountering a new page. A second corner case, only working somewhat by accident, was that RelationGetBufferForTuple() sometimes checks the last page in a relation for free space without consulting the FSM; that only worked because PageGetHeapFreeSpace() interprets the zero page header in a new page as no free space. The lack of handling this properly required reverting the previous attempt in 6842005.

This issue can be solved by using RBM_ZERO_AND_LOCK when extending the relation, thereby avoiding this window. There's some added complexity when RelationGetBufferForTuple() is called with another buffer (for updates), to avoid deadlocks, but that path is rarely hit at runtime.

Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivonhva@alap3.anarazel.de
1 parent ac3a9af commit 0d1fe9f

File tree

2 files changed: +126 -75 lines changed

src/backend/access/heap/hio.c (+82 -37)
@@ -74,23 +74,31 @@ RelationPutHeapTuple(Relation relation,
 }
 
 /*
- * Read in a buffer, using bulk-insert strategy if bistate isn't NULL.
+ * Read in a buffer in mode, using bulk-insert strategy if bistate isn't NULL.
  */
 static Buffer
 ReadBufferBI(Relation relation, BlockNumber targetBlock,
-             BulkInsertState bistate)
+             ReadBufferMode mode, BulkInsertState bistate)
 {
     Buffer      buffer;
 
     /* If not bulk-insert, exactly like ReadBuffer */
     if (!bistate)
-        return ReadBuffer(relation, targetBlock);
+        return ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
+                                  mode, NULL);
 
     /* If we have the desired block already pinned, re-pin and return it */
     if (bistate->current_buf != InvalidBuffer)
     {
         if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
         {
+            /*
+             * Currently the LOCK variants are only used for extending
+             * relation, which should never reach this branch.
+             */
+            Assert(mode != RBM_ZERO_AND_LOCK &&
+                   mode != RBM_ZERO_AND_CLEANUP_LOCK);
+
             IncrBufferRefCount(bistate->current_buf);
             return bistate->current_buf;
         }
@@ -101,7 +109,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
 
     /* Perform a read using the buffer strategy */
     buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
-                                RBM_NORMAL, bistate->strategy);
+                                mode, bistate->strategy);
 
     /* Save the selected block as target for future inserts */
     IncrBufferRefCount(buffer);
@@ -204,30 +212,29 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
         /*
         * Extend by one page.  This should generally match the main-line
         * extension code in RelationGetBufferForTuple, except that we hold
-        * the relation extension lock throughout.
+        * the relation extension lock throughout, and we don't immediately
+        * initialize the page (see below).
         */
-        buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-        LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+        buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
         page = BufferGetPage(buffer);
 
         if (!PageIsNew(page))
             elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
                 BufferGetBlockNumber(buffer),
                 RelationGetRelationName(relation));
 
-        PageInit(page, BufferGetPageSize(buffer), 0);
-
         /*
-        * We mark all the new buffers dirty, but do nothing to write them
-        * out; they'll probably get used soon, and even if they are not, a
-        * crash will leave an okay all-zeroes page on disk.
+        * Add the page to the FSM without initializing. If we were to
+        * initialize here, the page would potentially get flushed out to disk
+        * before we add any useful content. There's no guarantee that that'd
+        * happen before a potential crash, so we need to deal with
+        * uninitialized pages anyway, thus avoiding the potential for
+        * unnecessary writes.
         */
-        MarkBufferDirty(buffer);
 
         /* we'll need this info below */
         blockNum = BufferGetBlockNumber(buffer);
-        freespace = PageGetHeapFreeSpace(page);
+        freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
 
         UnlockReleaseBuffer(buffer);

@@ -412,7 +419,7 @@ RelationGetBufferForTuple(Relation relation, Size len,
     if (otherBuffer == InvalidBuffer)
     {
         /* easy case */
-        buffer = ReadBufferBI(relation, targetBlock, bistate);
+        buffer = ReadBufferBI(relation, targetBlock, RBM_NORMAL, bistate);
         if (PageIsAllVisible(BufferGetPage(buffer)))
             visibilitymap_pin(relation, targetBlock, vmbuffer);
         LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -479,6 +486,19 @@ RelationGetBufferForTuple(Relation relation, Size len,
         * we're done.
         */
        page = BufferGetPage(buffer);
+
+       /*
+        * If necessary initialize page, it'll be used soon.  We could avoid
+        * dirtying the buffer here, and rely on the caller to do so whenever
+        * it puts a tuple onto the page, but there seems not much benefit in
+        * doing so.
+        */
+       if (PageIsNew(page))
+       {
+           PageInit(page, BufferGetPageSize(buffer), 0);
+           MarkBufferDirty(buffer);
+       }
+
        pageFreeSpace = PageGetHeapFreeSpace(page);
        if (len + saveFreeSpace <= pageFreeSpace)
        {
@@ -571,42 +591,67 @@ RelationGetBufferForTuple(Relation relation, Size len,
     * it worth keeping an accurate file length in shared memory someplace,
     * rather than relying on the kernel to do it for us?
     */
-   buffer = ReadBufferBI(relation, P_NEW, bistate);
+   buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
 
    /*
-   * We can be certain that locking the otherBuffer first is OK, since it
-   * must have a lower page number.
+   * We need to initialize the empty new page.  Double-check that it really
+   * is empty (this should never happen, but if it does we don't want to
+   * risk wiping out valid data).
    */
-  if (otherBuffer != InvalidBuffer)
-      LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+  page = BufferGetPage(buffer);
 
-  /*
-   * Now acquire lock on the new page.
-   */
-  LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+  if (!PageIsNew(page))
+      elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
+           BufferGetBlockNumber(buffer),
+           RelationGetRelationName(relation));
+
+  PageInit(page, BufferGetPageSize(buffer), 0);
+  MarkBufferDirty(buffer);
 
    /*
     * Release the file-extension lock; it's now OK for someone else to extend
-    * the relation some more.  Note that we cannot release this lock before
-    * we have buffer lock on the new page, or we risk a race condition
-    * against vacuumlazy.c --- see comments therein.
+    * the relation some more.
     */
    if (needLock)
        UnlockRelationForExtension(relation, ExclusiveLock);
 
    /*
-   * We need to initialize the empty new page.  Double-check that it really
-   * is empty (this should never happen, but if it does we don't want to
-   * risk wiping out valid data).
+   * Lock the other buffer. It's guaranteed to be of a lower page number
+   * than the new page. To conform with the deadlock prevention rules, we
+   * ought to lock otherBuffer first, but that would give other backends a
+   * chance to put tuples on our page. To reduce the likelihood of that,
+   * attempt to lock the other buffer conditionally, that's very likely to
+   * work. Otherwise we need to lock buffers in the correct order, and retry
+   * if the space has been used in the mean time.
+   *
+   * Alternatively, we could acquire the lock on otherBuffer before
+   * extending the relation, but that'd require holding the lock while
+   * performing IO, which seems worse than an unlikely retry.
    */
-  page = BufferGetPage(buffer);
+  if (otherBuffer != InvalidBuffer)
+  {
+      Assert(otherBuffer != buffer);
 
-  if (!PageIsNew(page))
-      elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-           BufferGetBlockNumber(buffer),
-           RelationGetRelationName(relation));
+      if (unlikely(!ConditionalLockBuffer(otherBuffer)))
+      {
+          LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+          LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+          LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-  PageInit(page, BufferGetPageSize(buffer), 0);
+          /*
+           * Because the buffer was unlocked for a while, it's possible,
+           * although unlikely, that the page was filled. If so, just retry
+           * from start.
+           */
+          if (len > PageGetHeapFreeSpace(page))
+          {
+              LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+              UnlockReleaseBuffer(buffer);
+
+              goto loop;
+          }
+      }
+  }
 
    if (len > PageGetHeapFreeSpace(page))
    {

src/backend/access/heap/vacuumlazy.c (+44 -38)
@@ -860,43 +860,46 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
        if (PageIsNew(page))
        {
+           bool        still_new;
+
            /*
-           * An all-zeroes page could be left over if a backend extends the
-           * relation but crashes before initializing the page. Reclaim such
-           * pages for use.
-           *
-           * We have to be careful here because we could be looking at a
-           * page that someone has just added to the relation and not yet
-           * been able to initialize (see RelationGetBufferForTuple).  To
-           * protect against that, release the buffer lock, grab the
-           * relation extension lock momentarily, and re-lock the buffer. If
-           * the page is still uninitialized by then, it must be left over
-           * from a crashed backend, and we can initialize it.
+           * All-zeroes pages can be left over if either a backend extends
+           * the relation by a single page, but crashes before the newly
+           * initialized page has been written out, or when bulk-extending
+           * the relation (which creates a number of empty pages at the tail
+           * end of the relation, but enters them into the FSM).
            *
-           * We don't really need the relation lock when this is a new or
-           * temp relation, but it's probably not worth the code space to
-           * check that, since this surely isn't a critical path.
+           * Make sure these pages are in the FSM, to ensure they can be
+           * reused. Do that by testing if there's any space recorded for
+           * the page. If not, enter it.
            *
-           * Note: the comparable code in vacuum.c need not worry because
-           * it's got exclusive lock on the whole relation.
+           * Note we do not enter the page into the visibilitymap. That has
+           * the downside that we repeatedly visit this page in subsequent
+           * vacuums, but otherwise we'll never discover the space on a
+           * promoted standby. The harm of repeated checking ought to
+           * normally not be too bad - the space usually should be used at
+           * some point, otherwise there wouldn't be any regular vacuums.
            */
-          LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-          LockRelationForExtension(onerel, ExclusiveLock);
-          UnlockRelationForExtension(onerel, ExclusiveLock);
-          LockBufferForCleanup(buf);
-          if (PageIsNew(page))
+
+          /*
+           * Perform checking of FSM after releasing lock, the fsm is
+           * approximate, after all.
+           */
+          still_new = PageIsNew(page);
+          UnlockReleaseBuffer(buf);
+
+          if (still_new)
            {
-              ereport(WARNING,
-                      (errmsg("relation \"%s\" page %u is uninitialized --- fixing",
-                              relname, blkno)));
-              PageInit(page, BufferGetPageSize(buf), 0);
                empty_pages++;
-          }
-          freespace = PageGetHeapFreeSpace(page);
-          MarkBufferDirty(buf);
-          UnlockReleaseBuffer(buf);
 
-          RecordPageWithFreeSpace(onerel, blkno, freespace);
+              if (GetRecordedFreeSpace(onerel, blkno) == 0)
+              {
+                  Size        freespace;
+
+                  freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+                  RecordPageWithFreeSpace(onerel, blkno, freespace);
+              }
+          }
            continue;
        }

@@ -905,7 +908,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
            empty_pages++;
            freespace = PageGetHeapFreeSpace(page);
 
-           /* empty pages are always all-visible and all-frozen */
+           /*
+            * Empty pages are always all-visible and all-frozen (note that
+            * the same is currently not true for new pages, see above).
+            */
            if (!PageIsAllVisible(page))
            {
                START_CRIT_SECTION();
@@ -1639,12 +1645,13 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
 
    *hastup = false;
 
-   /* If we hit an uninitialized page, we want to force vacuuming it. */
-   if (PageIsNew(page))
-       return true;
-
-   /* Quick out for ordinary empty page. */
-   if (PageIsEmpty(page))
+   /*
+    * New and empty pages, obviously, don't contain tuples. We could make
+    * sure that the page is registered in the FSM, but it doesn't seem worth
+    * waiting for a cleanup lock just for that, especially because it's
+    * likely that the pin holder will do so.
+    */
+   if (PageIsNew(page) || PageIsEmpty(page))
        return false;
 
    maxoff = PageGetMaxOffsetNumber(page);
@@ -2029,7 +2036,6 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
        if (PageIsNew(page) || PageIsEmpty(page))
        {
-           /* PageIsNew probably shouldn't happen... */
            UnlockReleaseBuffer(buf);
            continue;
        }
