Skip to content

Commit 05b7326

Browse files
Fix bug in nbtree VACUUM "skip full scan" feature.
Commit 857f9c3 (which taught nbtree VACUUM to skip a scan of the index from btcleanup in situations where it doesn't seem worth it) made VACUUM maintain the oldest btpo.xact among all deleted pages for the index as a whole. It failed to handle all the details surrounding pages that are deleted by the current VACUUM operation correctly (though pages deleted by some previous VACUUM operation were processed correctly). The most immediate problem was that the special area of the page was examined without a buffer pin at one point. More fundamentally, the handling failed to account for the full range of _bt_pagedel() behaviors. For example, _bt_pagedel() sometimes deletes internal pages in passing, as part of deleting an entire subtree with btvacuumpage() caller's page as the leaf level page. The original leaf page passed to _bt_pagedel() might not be the page that it deletes first in cases where deletion can take place. It's unclear how disruptive this bug may have been, or what symptoms users might want to look out for. The issue was spotted during unrelated code review. To fix, push down the logic for maintaining the oldest btpo.xact to _bt_pagedel(). btvacuumpage() is now responsible for pages that were fully deleted by a previous VACUUM operation, while _bt_pagedel() is now responsible for pages that were deleted by the current VACUUM operation (this includes half-dead pages from a previous interrupted VACUUM operation that become fully deleted in _bt_pagedel()). Note that _bt_pagedel() should never encounter an existing deleted page. This commit theoretically breaks the ABI of a stable release by changing the signature of _bt_pagedel(). However, if any third party extension is actually affected by this, then it must already be completely broken (since there are numerous assumptions made in _bt_pagedel() that cannot be met outside of VACUUM). It seems highly unlikely that such an extension actually exists, in any case. Author: Peter Geoghegan Reviewed-By: Masahiko Sawada Discussion: https://postgr.es/m/CAH2-WzkrXBcMQWAYUJMFTTvzx_r4q=pYSjDe07JnUXhe+OZnJA@mail.gmail.com Backpatch: 11-, where the "skip full scan" feature was introduced.
1 parent a08bfe7 commit 05b7326

File tree

3 files changed

+151
-83
lines changed

3 files changed

+151
-83
lines changed

src/backend/access/nbtree/nbtpage.c

Lines changed: 96 additions & 35 deletions
Original file line numberDiff line numberDiff line change
@@ -34,9 +34,11 @@
3434
#include "utils/snapmgr.h"
3535

3636
static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
37-
static bool _bt_mark_page_halfdead(Relation rel, Buffer buf, BTStack stack);
37+
static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
38+
BTStack stack);
3839
static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
39-
bool *rightsib_empty);
40+
bool *rightsib_empty,
41+
TransactionId *oldestBtpoXact);
4042
static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
4143
BTStack stack, Buffer *topparent, OffsetNumber *topoff,
4244
BlockNumber *target, BlockNumber *rightsib);
@@ -1282,27 +1284,35 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
12821284
}
12831285

12841286
/*
1285-
* _bt_pagedel() -- Delete a page from the b-tree, if legal to do so.
1287+
* _bt_pagedel() -- Delete a leaf page from the b-tree, if legal to do so.
12861288
*
1287-
* This action unlinks the page from the b-tree structure, removing all
1289+
* This action unlinks the leaf page from the b-tree structure, removing all
12881290
* pointers leading to it --- but not touching its own left and right links.
12891291
* The page cannot be physically reclaimed right away, since other processes
12901292
* may currently be trying to follow links leading to the page; they have to
12911293
* be allowed to use its right-link to recover. See nbtree/README.
12921294
*
12931295
* On entry, the target buffer must be pinned and locked (either read or write
1294-
* lock is OK). This lock and pin will be dropped before exiting.
1296+
* lock is OK). The page must be an empty leaf page, which may be half-dead
1297+
* already (a half-dead page should only be passed to us when an earlier
1298+
* VACUUM operation was interrupted, though). Note in particular that caller
1299+
* should never pass a buffer containing an existing deleted page here. The
1300+
* lock and pin on caller's buffer will be dropped before we return.
12951301
*
12961302
* Returns the number of pages successfully deleted (zero if page cannot
1297-
* be deleted now; could be more than one if parent or sibling pages were
1298-
* deleted too).
1303+
* be deleted now; could be more than one if parent or right sibling pages
1304+
* were deleted too).
1305+
*
1306+
* Maintains *oldestBtpoXact for any pages that get deleted. Caller is
1307+
* responsible for maintaining *oldestBtpoXact in the case of pages that were
1308+
* deleted by a previous VACUUM.
12991309
*
13001310
* NOTE: this leaks memory. Rather than trying to clean up everything
13011311
* carefully, it's better to run it in a temp context that can be reset
13021312
* frequently.
13031313
*/
13041314
int
1305-
_bt_pagedel(Relation rel, Buffer buf)
1315+
_bt_pagedel(Relation rel, Buffer leafbuf, TransactionId *oldestBtpoXact)
13061316
{
13071317
int ndeleted = 0;
13081318
BlockNumber rightsib;
@@ -1323,14 +1333,21 @@ _bt_pagedel(Relation rel, Buffer buf)
13231333

13241334
for (;;)
13251335
{
1326-
page = BufferGetPage(buf);
1336+
page = BufferGetPage(leafbuf);
13271337
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
13281338

13291339
/*
13301340
* Internal pages are never deleted directly, only as part of deleting
13311341
* the whole branch all the way down to leaf level.
1342+
*
1343+
* Also check for deleted pages here. Caller never passes us a fully
1344+
* deleted page. Only VACUUM can delete pages, so there can't have
1345+
* been a concurrent deletion. Assume that we reached any deleted
1346+
* page encountered here by following a sibling link, and that the
1347+
* index is corrupt.
13321348
*/
1333-
if (!P_ISLEAF(opaque))
1349+
Assert(!P_ISDELETED(opaque));
1350+
if (!P_ISLEAF(opaque) || P_ISDELETED(opaque))
13341351
{
13351352
/*
13361353
* Pre-9.4 page deletion only marked internal pages as half-dead,
@@ -1349,13 +1366,22 @@ _bt_pagedel(Relation rel, Buffer buf)
13491366
errmsg("index \"%s\" contains a half-dead internal page",
13501367
RelationGetRelationName(rel)),
13511368
errhint("This can be caused by an interrupted VACUUM in version 9.3 or older, before upgrade. Please REINDEX it.")));
1352-
_bt_relbuf(rel, buf);
1369+
1370+
if (P_ISDELETED(opaque))
1371+
ereport(LOG,
1372+
(errcode(ERRCODE_INDEX_CORRUPTED),
1373+
errmsg_internal("found deleted block %u while following right link in index \"%s\"",
1374+
BufferGetBlockNumber(leafbuf),
1375+
RelationGetRelationName(rel))));
1376+
1377+
_bt_relbuf(rel, leafbuf);
13531378
return ndeleted;
13541379
}
13551380

13561381
/*
13571382
* We can never delete rightmost pages nor root pages. While at it,
1358-
* check that page is not already deleted and is empty.
1383+
* check that page is empty, since it's possible that the leafbuf page
1384+
* was empty a moment ago, but has since had some inserts.
13591385
*
13601386
* To keep the algorithm simple, we also never delete an incompletely
13611387
* split page (they should be rare enough that this doesn't make any
@@ -1370,14 +1396,14 @@ _bt_pagedel(Relation rel, Buffer buf)
13701396
* to. On subsequent iterations, we know we stepped right from a page
13711397
* that passed these tests, so it's OK.
13721398
*/
1373-
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_ISDELETED(opaque) ||
1399+
if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
13741400
P_FIRSTDATAKEY(opaque) <= PageGetMaxOffsetNumber(page) ||
13751401
P_INCOMPLETE_SPLIT(opaque))
13761402
{
13771403
/* Should never fail to delete a half-dead page */
13781404
Assert(!P_ISHALFDEAD(opaque));
13791405

1380-
_bt_relbuf(rel, buf);
1406+
_bt_relbuf(rel, leafbuf);
13811407
return ndeleted;
13821408
}
13831409

@@ -1415,7 +1441,7 @@ _bt_pagedel(Relation rel, Buffer buf)
14151441
* To avoid deadlocks, we'd better drop the leaf page lock
14161442
* before going further.
14171443
*/
1418-
LockBuffer(buf, BUFFER_LOCK_UNLOCK);
1444+
LockBuffer(leafbuf, BUFFER_LOCK_UNLOCK);
14191445

14201446
/*
14211447
* Fetch the left sibling, to check that it's not marked with
@@ -1439,10 +1465,10 @@ _bt_pagedel(Relation rel, Buffer buf)
14391465
* incompletely-split page to be split again. So we don't
14401466
* need to walk right here.
14411467
*/
1442-
if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
1468+
if (lopaque->btpo_next == BufferGetBlockNumber(leafbuf) &&
14431469
P_INCOMPLETE_SPLIT(lopaque))
14441470
{
1445-
ReleaseBuffer(buf);
1471+
ReleaseBuffer(leafbuf);
14461472
_bt_relbuf(rel, lbuf);
14471473
return ndeleted;
14481474
}
@@ -1458,40 +1484,59 @@ _bt_pagedel(Relation rel, Buffer buf)
14581484
_bt_relbuf(rel, lbuf);
14591485

14601486
/*
1461-
* Re-lock the leaf page, and start over, to re-check that the
1462-
* page can still be deleted.
1487+
* Re-lock the leaf page, and start over to use our stack
1488+
* within _bt_mark_page_halfdead. We must do it that way
1489+
* because it's possible that leafbuf can no longer be
1490+
* deleted. We need to recheck.
14631491
*/
1464-
LockBuffer(buf, BT_WRITE);
1492+
LockBuffer(leafbuf, BT_WRITE);
14651493
continue;
14661494
}
14671495

1468-
if (!_bt_mark_page_halfdead(rel, buf, stack))
1496+
/*
1497+
* See if it's safe to delete the leaf page, and determine how
1498+
* many parent/internal pages above the leaf level will be
1499+
* deleted. If it's safe then _bt_mark_page_halfdead will also
1500+
* perform the first phase of deletion, which includes marking the
1501+
* leafbuf page half-dead.
1502+
*/
1503+
Assert(P_ISLEAF(opaque) && !P_IGNORE(opaque));
1504+
if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
14691505
{
1470-
_bt_relbuf(rel, buf);
1506+
_bt_relbuf(rel, leafbuf);
14711507
return ndeleted;
14721508
}
14731509
}
14741510

14751511
/*
14761512
* Then unlink it from its siblings. Each call to
14771513
* _bt_unlink_halfdead_page unlinks the topmost page from the branch,
1478-
* making it shallower. Iterate until the leaf page is gone.
1514+
* making it shallower. Iterate until the leafbuf page is deleted.
1515+
*
1516+
* _bt_unlink_halfdead_page should never fail, since we established
1517+
* that deletion is generally safe in _bt_mark_page_halfdead.
14791518
*/
14801519
rightsib_empty = false;
1520+
Assert(P_ISLEAF(opaque) && P_ISHALFDEAD(opaque));
14811521
while (P_ISHALFDEAD(opaque))
14821522
{
1483-
/* will check for interrupts, once lock is released */
1484-
if (!_bt_unlink_halfdead_page(rel, buf, &rightsib_empty))
1523+
/* Check for interrupts in _bt_unlink_halfdead_page */
1524+
if (!_bt_unlink_halfdead_page(rel, leafbuf, &rightsib_empty,
1525+
oldestBtpoXact))
14851526
{
1486-
/* _bt_unlink_halfdead_page already released buffer */
1527+
/* _bt_unlink_halfdead_page failed, released buffer */
14871528
return ndeleted;
14881529
}
14891530
ndeleted++;
14901531
}
14911532

1533+
Assert(P_ISLEAF(opaque) && P_ISDELETED(opaque));
1534+
Assert(TransactionIdFollowsOrEquals(opaque->btpo.xact,
1535+
*oldestBtpoXact));
1536+
14921537
rightsib = opaque->btpo_next;
14931538

1494-
_bt_relbuf(rel, buf);
1539+
_bt_relbuf(rel, leafbuf);
14951540

14961541
/*
14971542
* Check here, as calling loops will have locks held, preventing
@@ -1511,7 +1556,7 @@ _bt_pagedel(Relation rel, Buffer buf)
15111556
if (!rightsib_empty)
15121557
break;
15131558

1514-
buf = _bt_getbuf(rel, rightsib, BT_WRITE);
1559+
leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
15151560
}
15161561

15171562
return ndeleted;
@@ -1714,17 +1759,28 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
17141759
* of the whole branch, including the leaf page itself, iterate until the
17151760
* leaf page is deleted.
17161761
*
1717-
* Returns 'false' if the page could not be unlinked (shouldn't happen).
1718-
* If the (new) right sibling of the page is empty, *rightsib_empty is set
1719-
* to true.
1762+
* Returns 'false' if the page could not be unlinked (shouldn't happen). If
1763+
* the right sibling of the current target page is empty, *rightsib_empty is
1764+
* set to true, allowing caller to delete the target's right sibling page in
1765+
* passing. Note that *rightsib_empty is only actually used by caller when
1766+
* target page is leafbuf, following last call here for leafbuf/the subtree
1767+
* containing leafbuf. (We always set *rightsib_empty for caller, just to be
1768+
* consistent.)
1769+
*
1770+
* We maintain *oldestBtpoXact for pages that are deleted by the current
1771+
* VACUUM operation here. This must be handled here because we conservatively
1772+
* assume that there needs to be a new call to ReadNewTransactionId() each
1773+
* time a page gets deleted. See comments about the underlying assumption
1774+
* below.
17201775
*
17211776
* Must hold pin and lock on leafbuf at entry (read or write doesn't matter).
17221777
* On success exit, we'll be holding pin and write lock. On failure exit,
17231778
* we'll release both pin and lock before returning (we define it that way
17241779
* to avoid having to reacquire a lock we already released).
17251780
*/
17261781
static bool
1727-
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
1782+
_bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty,
1783+
TransactionId *oldestBtpoXact)
17281784
{
17291785
BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
17301786
BlockNumber leafleftsib;
@@ -1861,9 +1917,9 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
18611917
lbuf = InvalidBuffer;
18621918

18631919
/*
1864-
* Next write-lock the target page itself. It should be okay to take just
1865-
* a write lock not a superexclusive lock, since no scans would stop on an
1866-
* empty page.
1920+
* Next write-lock the target page itself. It's okay to take a write lock
1921+
* rather than a superexclusive lock, since no scan will stop on an empty
1922+
* page.
18671923
*/
18681924
LockBuffer(buf, BT_WRITE);
18691925
page = BufferGetPage(buf);
@@ -2001,6 +2057,7 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
20012057
*/
20022058
page = BufferGetPage(buf);
20032059
opaque = (BTPageOpaque) PageGetSpecialPointer(page);
2060+
Assert(P_ISHALFDEAD(opaque) || !P_ISLEAF(opaque));
20042061
opaque->btpo_flags &= ~BTP_HALF_DEAD;
20052062
opaque->btpo_flags |= BTP_DELETED;
20062063
opaque->btpo.xact = ReadNewTransactionId();
@@ -2105,6 +2162,10 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
21052162
_bt_relbuf(rel, lbuf);
21062163
_bt_relbuf(rel, rbuf);
21072164

2165+
if (!TransactionIdIsValid(*oldestBtpoXact) ||
2166+
TransactionIdPrecedes(opaque->btpo.xact, *oldestBtpoXact))
2167+
*oldestBtpoXact = opaque->btpo.xact;
2168+
21082169
/*
21092170
* Release the target, if it was not the leaf block. The leaf is always
21102171
* kept locked.

0 commit comments

Comments
 (0)