@@ -490,24 +490,33 @@ lock on the leaf page).
Once an index tuple has been marked LP_DEAD it can actually be deleted
from the index immediately; since index scans only stop "between" pages,
no scan can lose its place from such a deletion. We separate the steps
- because we allow LP_DEAD to be set with only a share lock (it's exactly
- like a hint bit for a heap tuple), but physically removing tuples requires
- exclusive lock. Also, delaying the deletion often allows us to pick up
- extra index tuples that weren't initially safe for index scans to mark
- LP_DEAD. We do this with index tuples whose TIDs point to the same table
- blocks as an LP_DEAD-marked tuple. They're practically free to check in
- passing, and have a pretty good chance of being safe to delete due to
- various locality effects.
-
- We only try to delete LP_DEAD tuples (and nearby tuples) when we are
- otherwise faced with having to split a page to do an insertion (and hence
- have exclusive lock on it already). Deduplication and bottom-up index
- deletion can also prevent a page split, but simple deletion is always our
- preferred approach. (Note that posting list tuples can only have their
- LP_DEAD bit set when every table TID within the posting list is known
- dead. This isn't much of a problem in practice because LP_DEAD bits are
- just a starting point for simple deletion -- we still manage to perform
- granular deletes of posting list TIDs quite often.)
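
As a rough illustration of the split-avoidance order described above
(simple deletion of LP_DEAD items first, with deduplication and, where
available, bottom-up deletion as fallbacks), here is a hedged C sketch.
This is not the actual nbtree code: try_simple_deletion(),
try_bottomup_deletion(), and try_deduplication() are hypothetical
stand-ins, and the caller is assumed to already hold an exclusive lock
on "buf" because it was otherwise about to split the page.

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/bufmgr.h"

    /* Hypothetical helpers; each returns true if it freed any space */
    extern bool try_simple_deletion(Relation rel, Buffer buf);
    extern bool try_bottomup_deletion(Relation rel, Buffer buf);
    extern bool try_deduplication(Relation rel, Buffer buf);

    /*
     * Sketch only: try each strategy in order of preference before
     * giving up and letting the caller split the page.
     */
    static bool
    try_to_avoid_split(Relation rel, Buffer buf, Size newitemsz)
    {
        Page        page = BufferGetPage(buf);
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /* Simple deletion is preferred, but needs LP_DEAD items */
        if (P_HAS_GARBAGE(opaque) && try_simple_deletion(rel, buf) &&
            PageGetFreeSpace(page) >= newitemsz)
            return true;

        if (try_bottomup_deletion(rel, buf) &&
            PageGetFreeSpace(page) >= newitemsz)
            return true;

        if (try_deduplication(rel, buf) &&
            PageGetFreeSpace(page) >= newitemsz)
            return true;

        return false;           /* no luck; caller must split the page */
    }
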
+ because we allow LP_DEAD to be set with only a share lock (it's like a
+ hint bit for a heap tuple), but physically deleting tuples requires an
+ exclusive lock. We also need to generate a latestRemovedXid value for
+ each deletion operation's WAL record, which requires additional
+ coordination with the tableam when the deletion actually takes place.
+ (This latestRemovedXid value may be used to generate a recovery conflict
+ during subsequent REDO of the record by a standby.)
+
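To make the share-lock step concrete, here is a minimal hedged sketch,
loosely modeled on (but not copied from) nbtree's _bt_killitems(). It
assumes the caller holds a share lock on "buf" and has already
established that the tuple at "offnum" points only to dead table rows.
Like a heap hint bit, the change needs no WAL record of its own; only
the later physical deletion, performed under an exclusive lock, gets
WAL-logged along with its latestRemovedXid value.

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "storage/bufmgr.h"

    static void
    mark_index_tuple_dead(Buffer buf, OffsetNumber offnum)
    {
        Page        page = BufferGetPage(buf);
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);

        /* Set LP_DEAD in the tuple's line pointer, hint-bit style */
        ItemIdMarkDead(PageGetItemId(page, offnum));

        /* Remember that this page has deletable items */
        opaque->btpo_flags |= BTP_HAS_GARBAGE;

        /* Dirty the buffer without forcing a WAL record for the hint */
        MarkBufferDirtyHint(buf, true);
    }
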
+ Delaying and batching index tuple deletion like this enables a further
+ optimization: opportunistic checking of "extra" nearby index tuples
+ (tuples that are not LP_DEAD-set) when they happen to be very cheap to
+ check in passing (because we already know that the tableam will be
+ visiting their table block to generate a latestRemovedXid value). Any
+ index tuples that turn out to be safe to delete will also be deleted.
+ Simple deletion will behave as if the extra tuples that actually turn
+ out to be delete-safe had their LP_DEAD bits set right from the start.
+
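The following hedged sketch shows one way the "extra tuples" check
could look; it is not the actual implementation. Here "deadblocks" is
a hypothetical bitmapset of the table block numbers that LP_DEAD-marked
tuples already force the tableam to visit, so membership is practically
free to test in passing. Posting list tuples are skipped to keep the
sketch short.

    #include "postgres.h"
    #include "access/nbtree.h"
    #include "nodes/bitmapset.h"

    static void
    collect_extra_candidates(Page page, Bitmapset *deadblocks,
                             OffsetNumber *candidates, int *ncandidates)
    {
        BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
        OffsetNumber offnum,
                    maxoff = PageGetMaxOffsetNumber(page);

        *ncandidates = 0;
        for (offnum = P_FIRSTDATAKEY(opaque); offnum <= maxoff;
             offnum = OffsetNumberNext(offnum))
        {
            ItemId      itemid = PageGetItemId(page, offnum);
            IndexTuple  itup = (IndexTuple) PageGetItem(page, itemid);

            if (ItemIdIsDead(itemid))
                continue;       /* already LP_DEAD; counted elsewhere */
            if (BTreeTupleIsPosting(itup))
                continue;       /* keep the sketch simple */

            /* Cheap in passing: its table block is visited anyway */
            if (bms_is_member((int) ItemPointerGetBlockNumber(&itup->t_tid),
                              deadblocks))
                candidates[(*ncandidates)++] = offnum;
        }
    }
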
+ Deduplication can also prevent a page split, but index tuple deletion is
+ our preferred approach. Note that posting list tuples can only have
+ their LP_DEAD bit set when every table TID within the posting list is
+ known dead. This isn't much of a problem in practice because LP_DEAD
+ bits are just a starting point for deletion. What really matters is
+ that _some_ deletion operation that targets related nearby-in-table TIDs
+ takes place at some point before the page finally splits. That's all
+ that's required for the deletion process to perform granular removal of
+ groups of dead TIDs from posting list tuples (without the situation ever
+ being allowed to get out of hand).
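
As a small illustration of the posting list rule just stated (again a
hedged sketch, not PostgreSQL's actual code), the LP_DEAD bit may only
be set once every table TID in the tuple is known dead. Here
tid_is_dead() is a hypothetical callback that asks the tableam about a
single TID.

    #include "postgres.h"
    #include "access/nbtree.h"

    static bool
    safe_to_set_lp_dead(IndexTuple itup,
                        bool (*tid_is_dead) (ItemPointer tid))
    {
        /* Plain tuple: only one table TID to check */
        if (!BTreeTupleIsPosting(itup))
            return tid_is_dead(&itup->t_tid);

        /* Every TID in the posting list must be known dead */
        for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
        {
            if (!tid_is_dead(BTreeTupleGetPostingN(itup, i)))
                return false;   /* one live TID blocks the whole bit */
        }
        return true;
    }

Failing this check is fine in practice: granular removal of just the
dead TIDs inside a posting list still happens later, during the
deletion operation itself.
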
It's sufficient to have an exclusive lock on the index page, not a
super-exclusive lock, to do deletion of LP_DEAD items. It might seem