
Commit 9dd963a

Recycle nbtree pages deleted during same VACUUM.
Maintain a simple array of metadata about pages that were deleted during
nbtree VACUUM's current btvacuumscan() call. Use this metadata at the end
of btvacuumscan() to attempt to place newly deleted pages in the FSM
without further delay. It might not yet be safe to place any of the pages
in the FSM by then (they may not be deemed recyclable), but we have
little to lose and plenty to gain by trying. In practice there is a very
good chance that this will work out when vacuuming larger indexes, where
scanning the index naturally takes quite a while.

This commit doesn't change the page recycling invariants; it merely
improves the efficiency of page recycling within the confines of the
existing design. Recycle safety is a part of nbtree's implementation of
what Lanin & Shasha call "the drain technique". The design happens to use
transaction IDs (they're stored in deleted pages), but that in itself
doesn't align the cutoff for recycle safety to any of the XID-based
cutoffs used by VACUUM (e.g., OldestXmin). All that matters is whether or
not _other_ backends might be able to observe various inconsistencies in
the tree structure (that they cannot just detect and recover from by
moving right). Recycle safety is purely a question of maintaining the
consistency (or the apparent consistency) of a physical data structure.

Note that running a simple serial test case involving a large range
DELETE followed by a VACUUM VERBOSE will probably show that any newly
deleted nbtree pages are not yet reusable/recyclable. This is expected in
the absence of even one concurrent XID assignment. It is an old
implementation restriction. In practice it's unlikely to be the thing
that makes recycling remain unsafe, at least with larger indexes, where
recycling newly deleted pages during the same VACUUM actually matters.

An important high-level goal of this commit (as well as related recent
commits e5d8a99 and 9f3665f) is to make expensive deferred cleanup
operations in index AMs rare in general. If index vacuuming frequently
depends on the next VACUUM operation finishing off work that the current
operation started, then the general behavior of index vacuuming is hard
to predict. This is relevant to ongoing work that adds a vacuumlazy.c
mechanism to skip index vacuuming in certain cases. Anything that makes
the real world behavior of index vacuuming simpler and more linear will
also make top-down modeling in vacuumlazy.c more robust.

Author: Peter Geoghegan <pg@bowt.ie>
Reviewed-By: Masahiko Sawada <sawada.mshk@gmail.com>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iUscb1UN44-gyZL-KgpsXbSxq_bdcMa7Q+wQ@mail.gmail.com
1 parent 4d399a6 commit 9dd963a

File tree

4 files changed: +260 additions, -23 deletions

  src/backend/access/nbtree/README
  src/backend/access/nbtree/nbtpage.c
  src/backend/access/nbtree/nbtree.c
  src/include/access/nbtree.h

src/backend/access/nbtree/README

Lines changed: 24 additions & 1 deletion

@@ -413,7 +413,30 @@ to be "visible to everyone". As collateral damage, we wait for snapshots
 taken until the next transaction to allocate an XID commits. We also wait
 for running XIDs with no snapshots.
 
-The need for this additional indirection after a page deletion operation
+Prior to PostgreSQL 14, VACUUM would only place _old_ deleted pages that
+it encounters during its linear scan (pages deleted by a previous VACUUM
+operation) in the FSM. Newly deleted pages were never placed in the FSM,
+because that was assumed to _always_ be unsafe. That assumption was
+unnecessarily pessimistic in practice, though -- it often doesn't take
+very long for newly deleted pages to become safe to place in the FSM.
+There is no truly principled way to predict when deleted pages will become
+safe to place in the FSM for recycling -- it might become safe almost
+immediately (long before the current VACUUM completes), or it might not
+even be safe by the time the next VACUUM takes place. Recycle safety is
+purely a question of maintaining the consistency (or at least the apparent
+consistency) of a physical data structure. The state within the backend
+running VACUUM is simply not relevant.
+
+PostgreSQL 14 added the ability for VACUUM to consider if it's possible to
+recycle newly deleted pages at the end of the full index scan where the
+page deletion took place. It is convenient to check if it's safe at that
+point. This does require that VACUUM keep around a little bookkeeping
+information about newly deleted pages, but that's very cheap. Using
+in-memory state for this avoids the need to revisit newly deleted pages a
+second time later on -- we can just use safexid values from the local
+bookkeeping state to determine recycle safety in a deferred fashion.
+
+The need for additional FSM indirection after a page deletion operation
 takes place is a natural consequence of the highly permissive rules for
 index scans with Lehman and Yao's design. In general an index scan
 doesn't have to hold a lock or even a pin on any page when it descends the
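The bookkeeping scheme described in the README addition is simple enough to
model outside of PostgreSQL. The following minimal standalone C sketch (all
names are invented for illustration; this is not nbtree code) shows the
essential idea: deleted pages are appended to an array in safexid order
during the scan, and one finalize pass at the end recycles the prefix of
entries whose safexid is now older than every backend's XID horizon,
stopping at the first entry that is still unsafe.

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for nbtree's BlockNumber/FullTransactionId */
typedef uint32_t blockno_t;
typedef uint64_t fullxid_t;

typedef struct pending_t
{
	blockno_t	target;		/* page deleted during the current scan */
	fullxid_t	safexid;	/* XID that must be "visible to everyone" */
} pending_t;

/*
 * Stand-in for GlobalVisCheckRemovableFullXid(): recycling is safe once
 * every backend's XID horizon has advanced past the page's safexid.
 */
static bool
xid_is_safe(fullxid_t safexid, fullxid_t horizon)
{
	return safexid < horizon;
}

/*
 * Finalize pass: entries were appended in safexid order, so we can stop
 * at the first entry that is still unsafe -- all later entries must be
 * unsafe too.
 */
static int
finalize_pending(const pending_t *pending, int npending, fullxid_t horizon)
{
	int			placed = 0;

	for (int i = 0; i < npending; i++)
	{
		if (!xid_is_safe(pending[i].safexid, horizon))
			break;
		printf("block %" PRIu32 " placed in FSM\n", pending[i].target);
		placed++;
	}
	return placed;
}

int
main(void)
{
	pending_t	pending[] = {{42, 100}, {77, 105}, {91, 230}};
	/* With a horizon of 200, only the first two entries are recyclable */
	int			placed = finalize_pending(pending, 3, 200);

	printf("%d of 3 pending pages recycled\n", placed);
	return EXIT_SUCCESS;
}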

src/backend/access/nbtree/nbtpage.c

Lines changed: 189 additions & 7 deletions

@@ -32,7 +32,9 @@
 #include "storage/indexfsm.h"
 #include "storage/lmgr.h"
 #include "storage/predicate.h"
+#include "storage/procarray.h"
 #include "utils/memdebug.h"
+#include "utils/memutils.h"
 #include "utils/snapmgr.h"
 
 static BTMetaPageData *_bt_getmeta(Relation rel, Buffer metabuf);
@@ -57,6 +59,8 @@ static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
 									OffsetNumber *poffset,
 									BlockNumber *topparent,
 									BlockNumber *topparentrightsib);
+static void _bt_pendingfsm_add(BTVacState *vstate, BlockNumber target,
+							   FullTransactionId safexid);
 
 /*
  * _bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -209,13 +213,9 @@ _bt_vacuum_needs_cleanup(Relation rel)
 	 * Trigger cleanup in rare cases where prev_num_delpages exceeds 5% of the
 	 * total size of the index. We can reasonably expect (though are not
 	 * guaranteed) to be able to recycle this many pages if we decide to do a
-	 * btvacuumscan call during the ongoing btvacuumcleanup.
-	 *
-	 * Our approach won't reliably avoid "wasted" cleanup-only btvacuumscan
-	 * calls. That is, we can end up scanning the entire index without ever
-	 * placing even 1 of the prev_num_delpages pages in the free space map, at
-	 * least in certain narrow cases (see nbtree/README section on recycling
-	 * deleted pages for details). This rarely comes up in practice.
+	 * btvacuumscan call during the ongoing btvacuumcleanup. For further
+	 * details see the nbtree/README section on placing deleted pages in the
+	 * FSM.
 	 */
 	if (prev_num_delpages > 0 &&
 		prev_num_delpages > RelationGetNumberOfBlocks(rel) / 20)
@@ -2729,6 +2729,14 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	if (target <= scanblkno)
 		stats->pages_deleted++;
 
+	/*
+	 * Remember information about the target page (now a newly deleted page)
+	 * in dedicated vstate space for later. The page will be considered as a
+	 * candidate to place in the FSM at the end of the current btvacuumscan()
+	 * call.
+	 */
+	_bt_pendingfsm_add(vstate, target, safexid);
+
 	return true;
 }
 
@@ -2873,3 +2881,177 @@ _bt_lock_subtree_parent(Relation rel, BlockNumber child, BTStack stack,
 									subtreeparent, poffset,
 									topparent, topparentrightsib);
 }
+
+/*
+ * Initialize local memory state used by VACUUM for _bt_pendingfsm_finalize
+ * optimization.
+ *
+ * Called at the start of a btvacuumscan(). Caller's cleanuponly argument
+ * indicates if ongoing VACUUM has not (and will not) call btbulkdelete().
+ *
+ * We expect to allocate memory inside VACUUM's top-level memory context here.
+ * The working buffer is subject to a limit based on work_mem. Our strategy
+ * when the array can no longer grow within the bounds of that limit is to
+ * stop saving additional newly deleted pages, while proceeding as usual with
+ * the pages that we can fit.
+ */
+void
+_bt_pendingfsm_init(Relation rel, BTVacState *vstate, bool cleanuponly)
+{
+	int64		maxbufsize;
+
+	/*
+	 * Don't bother with optimization in cleanup-only case -- we don't expect
+	 * any newly deleted pages. Besides, cleanup-only calls to btvacuumscan()
+	 * can only take place because this optimization didn't work out during
+	 * the last VACUUM.
+	 */
+	if (cleanuponly)
+		return;
+
+	/*
+	 * Cap maximum size of array so that we always respect work_mem. Avoid
+	 * int overflow here.
+	 */
+	vstate->bufsize = 256;
+	maxbufsize = (work_mem * 1024L) / sizeof(BTPendingFSM);
+	maxbufsize = Min(maxbufsize, INT_MAX);
+	maxbufsize = Min(maxbufsize, MaxAllocSize / sizeof(BTPendingFSM));
+	/* Stay sane with small work_mem */
+	maxbufsize = Max(maxbufsize, vstate->bufsize);
+	vstate->maxbufsize = maxbufsize;
+
+	/* Allocate buffer, indicate that there are currently 0 pending pages */
+	vstate->pendingpages = palloc(sizeof(BTPendingFSM) * vstate->bufsize);
+	vstate->npendingpages = 0;
+}
+
+/*
+ * Place any newly deleted pages (i.e. pages that _bt_pagedel() deleted during
+ * the ongoing VACUUM operation) into the free space map -- though only when
+ * it is actually safe to do so by now.
+ *
+ * Called at the end of a btvacuumscan(), just before free space map vacuuming
+ * takes place.
+ *
+ * Frees memory allocated by _bt_pendingfsm_init(), if any.
+ */
+void
+_bt_pendingfsm_finalize(Relation rel, BTVacState *vstate)
+{
+	IndexBulkDeleteResult *stats = vstate->stats;
+
+	Assert(stats->pages_newly_deleted >= vstate->npendingpages);
+
+	if (vstate->npendingpages == 0)
+	{
+		/* Just free memory when nothing to do */
+		if (vstate->pendingpages)
+			pfree(vstate->pendingpages);
+
+		return;
+	}
+
+#ifdef DEBUG_BTREE_PENDING_FSM
+
+	/*
+	 * Debugging aid: Sleep for 5 seconds to greatly increase the chances of
+	 * placing pending pages in the FSM. Note that the optimization will
+	 * never be effective without some other backend concurrently consuming
+	 * an XID.
+	 */
+	pg_usleep(5000000L);
+#endif
+
+	/*
+	 * Recompute VACUUM XID boundaries.
+	 *
+	 * We don't actually care about the oldest non-removable XID. Computing
+	 * the oldest such XID has a useful side-effect that we rely on: it
+	 * forcibly updates the XID horizon state for this backend. This step is
+	 * essential; GlobalVisCheckRemovableFullXid() will not reliably recognize
+	 * that it is now safe to recycle newly deleted pages without this step.
+	 */
+	GetOldestNonRemovableTransactionId(NULL);
+
+	for (int i = 0; i < vstate->npendingpages; i++)
+	{
+		BlockNumber target = vstate->pendingpages[i].target;
+		FullTransactionId safexid = vstate->pendingpages[i].safexid;
+
+		/*
+		 * Do the equivalent of checking BTPageIsRecyclable(), but without
+		 * accessing the page again a second time.
+		 *
+		 * Give up on finding the first non-recyclable page -- all later pages
+		 * must be non-recyclable too, since _bt_pendingfsm_add() adds pages
+		 * to the array in safexid order.
+		 */
+		if (!GlobalVisCheckRemovableFullXid(NULL, safexid))
+			break;
+
+		RecordFreeIndexPage(rel, target);
+		stats->pages_free++;
+	}
+
+	pfree(vstate->pendingpages);
+}
+
+/*
+ * Maintain array of pages that were deleted during current btvacuumscan()
+ * call, for use in _bt_pendingfsm_finalize()
+ */
+static void
+_bt_pendingfsm_add(BTVacState *vstate,
+				   BlockNumber target,
+				   FullTransactionId safexid)
+{
+	Assert(vstate->npendingpages <= vstate->bufsize);
+	Assert(vstate->bufsize <= vstate->maxbufsize);
+
+#ifdef USE_ASSERT_CHECKING
+
+	/*
+	 * Verify an assumption made by _bt_pendingfsm_finalize(): pages from the
+	 * array will always be in safexid order (since that is the order that we
+	 * save them in here)
+	 */
+	if (vstate->npendingpages > 0)
+	{
+		FullTransactionId lastsafexid =
+			vstate->pendingpages[vstate->npendingpages - 1].safexid;

+		Assert(FullTransactionIdFollowsOrEquals(safexid, lastsafexid));
+	}
+#endif
+
+	/*
+	 * If temp buffer reaches maxbufsize/work_mem capacity then we discard
+	 * information about this page.
+	 *
+	 * Note that this also covers the case where we opted to not use the
+	 * optimization in _bt_pendingfsm_init().
+	 */
+	if (vstate->npendingpages == vstate->maxbufsize)
+		return;
+
+	/* Consider enlarging buffer */
+	if (vstate->npendingpages == vstate->bufsize)
+	{
+		int			newbufsize = vstate->bufsize * 2;
+
+		/* Respect work_mem */
+		if (newbufsize > vstate->maxbufsize)
+			newbufsize = vstate->maxbufsize;
+
+		vstate->bufsize = newbufsize;
+		vstate->pendingpages =
+			repalloc(vstate->pendingpages,
+					 sizeof(BTPendingFSM) * vstate->bufsize);
+	}
+
+	/* Save metadata for newly deleted page */
+	vstate->pendingpages[vstate->npendingpages].target = target;
+	vstate->pendingpages[vstate->npendingpages].safexid = safexid;
+	vstate->npendingpages++;
+}
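As a rough worked example of the work_mem cap in _bt_pendingfsm_init(): with
the default work_mem of 4MB and assuming sizeof(BTPendingFSM) comes out to 16
bytes (a 4-byte BlockNumber plus an 8-byte FullTransactionId, padded for
alignment -- the exact size is platform dependent), maxbufsize is
(4096 * 1024) / 16 = 262,144 entries. A single btvacuumscan() call can
therefore track roughly a quarter of a million newly deleted pages before
_bt_pendingfsm_add() starts discarding entries.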

src/backend/access/nbtree/nbtree.c

Lines changed: 22 additions & 12 deletions

@@ -859,9 +859,13 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * Maintain num_delpages value in metapage for _bt_vacuum_needs_cleanup().
 	 *
 	 * num_delpages is the number of deleted pages now in the index that were
-	 * not safe to place in the FSM to be recycled just yet. We expect that
-	 * it will almost certainly be possible to place all of these pages in the
-	 * FSM during the next VACUUM operation.
+	 * not safe to place in the FSM to be recycled just yet. num_delpages is
+	 * greater than 0 only when _bt_pagedel() actually deleted pages during
+	 * our call to btvacuumscan(). Even then, _bt_pendingfsm_finalize() must
+	 * have failed to place any newly deleted pages in the FSM just moments
+	 * ago. (Actually, there are edge cases where recycling of the current
+	 * VACUUM's newly deleted pages does not even become safe by the time the
+	 * next VACUUM comes around. See nbtree/README.)
 	 */
 	Assert(stats->pages_deleted >= stats->pages_free);
 	num_delpages = stats->pages_deleted - stats->pages_free;
@@ -937,6 +941,14 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 												  "_bt_pagedel",
 												  ALLOCSET_DEFAULT_SIZES);
 
+	/* Initialize vstate fields used by _bt_pendingfsm_finalize */
+	vstate.bufsize = 0;
+	vstate.maxbufsize = 0;
+	vstate.pendingpages = NULL;
+	vstate.npendingpages = 0;
+	/* Consider applying _bt_pendingfsm_finalize optimization */
+	_bt_pendingfsm_init(rel, &vstate, (callback == NULL));
+
 	/*
 	 * The outer loop iterates over all index pages except the metapage, in
 	 * physical order (we hope the kernel will cooperate in providing
@@ -995,17 +1007,15 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MemoryContextDelete(vstate.pagedelcontext);
 
 	/*
-	 * If we found any recyclable pages (and recorded them in the FSM), then
-	 * forcibly update the upper-level FSM pages to ensure that searchers can
-	 * find them. It's possible that the pages were also found during
-	 * previous scans and so this is a waste of time, but it's cheap enough
-	 * relative to scanning the index that it shouldn't matter much, and
-	 * making sure that free pages are available sooner not later seems
-	 * worthwhile.
+	 * If there were any calls to _bt_pagedel() during scan of the index then
+	 * see if any of the resulting pages can be placed in the FSM now. When
+	 * it's not safe we'll have to leave it up to a future VACUUM operation.
 	 *
-	 * Note that if no recyclable pages exist, we don't bother vacuuming the
-	 * FSM at all.
+	 * Finally, if we placed any pages in the FSM (either just now or during
+	 * the scan), forcibly update the upper-level FSM pages to ensure that
+	 * searchers can find them.
 	 */
+	_bt_pendingfsm_finalize(rel, &vstate);
 	if (stats->pages_free > 0)
 		IndexFreeSpaceMapVacuum(rel);
 }
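Taken together, the nbtree.c changes give the pending-FSM state a simple
lifecycle across one btvacuumscan() call. The outline below is a simplified
sketch of that control flow (not the actual function body, and not
compilable on its own -- it only strings together the calls shown in the
diff):

/* Sketch: lifecycle of the pending-FSM state within btvacuumscan() */
void
btvacuumscan_sketch(Relation rel, BTVacState *vstate, bool cleanuponly)
{
	/* 1. Start with empty bookkeeping; a no-op for cleanup-only scans */
	_bt_pendingfsm_init(rel, vstate, cleanuponly);

	/*
	 * 2. Scan every page. Each page deletion reaches
	 * _bt_unlink_halfdead_page(), which calls _bt_pendingfsm_add() to record
	 * the page's (target, safexid) pair in safexid order.
	 */
	/* ... per-page vacuum loop elided ... */

	/*
	 * 3. Place the recyclable prefix of the pending pages in the FSM and
	 * free the bookkeeping memory.
	 */
	_bt_pendingfsm_finalize(rel, vstate);

	/* 4. Only vacuum the FSM when some pages were actually placed in it */
	if (vstate->stats->pages_free > 0)
		IndexFreeSpaceMapVacuum(rel);
}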

src/include/access/nbtree.h

Lines changed: 25 additions & 3 deletions

@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
  * Is an existing page recyclable?
  *
  * This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use. However, _bt_pendingfsm_finalize() duplicates some of the same
+ * logic because it doesn't work directly with pages -- keep the two in sync.
  *
  * Note: PageIsNew() pages are always safe to recycle, but we can't deal with
  * them here (caller is responsible for that case themselves). Caller might
@@ -305,6 +306,10 @@ BTPageIsRecyclable(Page page)
 		 * For that check if the deletion XID could still be visible to
 		 * anyone. If not, then no scan that's still in progress could have
 		 * seen its downlink, and we can recycle it.
+		 *
+		 * XXX: If we had the heap relation we could be more aggressive about
+		 * recycling deleted pages in non-catalog relations. For now we just
+		 * pass NULL. That is at least simple and consistent.
 		 */
 		return GlobalVisCheckRemovableFullXid(NULL, BTPageGetDeleteXid(page));
 	}
@@ -313,9 +318,15 @@ BTPageIsRecyclable(Page page)
 }
 
 /*
- * BTVacState is private nbtree.c state used during VACUUM. It is exported
- * for use by page deletion related code in nbtpage.c.
+ * BTVacState and BTPendingFSM are private nbtree.c state used during VACUUM.
+ * They are exported for use by page deletion related code in nbtpage.c.
  */
+typedef struct BTPendingFSM
+{
+	BlockNumber target;			/* Page deleted by current VACUUM */
+	FullTransactionId safexid;	/* Page's BTDeletedPageData.safexid */
+} BTPendingFSM;
+
 typedef struct BTVacState
 {
 	IndexVacuumInfo *info;
@@ -324,6 +335,14 @@ typedef struct BTVacState
 	void	   *callback_state;
 	BTCycleId	cycleid;
 	MemoryContext pagedelcontext;
+
+	/*
+	 * _bt_pendingfsm_finalize() state
+	 */
+	int			bufsize;		/* pendingpages space (in # elements) */
+	int			maxbufsize;		/* max bufsize that respects work_mem */
+	BTPendingFSM *pendingpages; /* One entry per newly deleted page */
+	int			npendingpages;	/* current # valid pendingpages */
 } BTVacState;
 
 /*
@@ -1195,6 +1214,9 @@ extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
 									   Relation heapRel,
 									   TM_IndexDeleteOp *delstate);
 extern void _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate);
+extern void _bt_pendingfsm_init(Relation rel, BTVacState *vstate,
+								bool cleanuponly);
+extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
 
 /*
  * prototypes for functions in nbtsearch.c
