@@ -166,53 +166,40 @@ that the incoming item doesn't fit on the split page where it needs to go!
Deleting index tuples during VACUUM
-----------------------------------

- Before deleting a leaf item, we get a super-exclusive lock on the target
+ Before deleting a leaf item, we get a full cleanup lock on the target
page, so that no other backend has a pin on the page when the deletion
starts. This is not necessary for correctness in terms of the btree index
operations themselves; as explained above, index scans logically stop
"between" pages and so can't lose their place. The reason we do it is to
- provide an interlock between VACUUM and indexscans. Since VACUUM deletes
- index entries before reclaiming heap tuple line pointers, the
- super-exclusive lock guarantees that VACUUM can't reclaim for re-use a
- line pointer that an indexscanning process might be about to visit. This
- guarantee works only for simple indexscans that visit the heap in sync
- with the index scan, not for bitmap scans. We only need the guarantee
- when using non-MVCC snapshot rules; when using an MVCC snapshot, it
- doesn't matter if the heap tuple is replaced with an unrelated tuple at
- the same TID, because the new tuple won't be visible to our scan anyway.
- Therefore, a scan using an MVCC snapshot which has no other confounding
- factors will not hold the pin after the page contents are read. The
- current reasons for exceptions, where a pin is still needed, are if the
- index is not WAL-logged or if the scan is an index-only scan. If later
- work allows the pin to be dropped for all cases we will be able to
- simplify the vacuum code, since the concept of a super-exclusive lock
- for btree indexes will no longer be needed.
+ provide an interlock between VACUUM and index scans that are not prepared
+ to deal with concurrent TID recycling when visiting the heap. Since only
+ VACUUM can ever mark pointed-to items LP_UNUSED in the heap, and since
+ this only ever happens _after_ btbulkdelete returns, having index scans
+ hold on to the pin (used when reading from the leaf page) until _after_
+ they're done visiting the heap (for TIDs from the pinned leaf page)
+ prevents concurrent TID recycling. VACUUM cannot get a conflicting
+ cleanup lock until the index scan is totally finished processing its leaf
+ page.
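+
+ A minimal sketch of VACUUM's side of this interlock, using the buffer
+ manager's cleanup lock primitive (the helper function below is
+ illustrative only, not actual nbtree code):
+
+     #include "postgres.h"
+     #include "storage/bufmgr.h"
+
+     static void
+     vacuum_one_leaf_page(Relation rel, BlockNumber blkno)
+     {
+         Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+                                          RBM_NORMAL, NULL);
+
+         /* Waits until no other backend holds a pin on the page */
+         LockBufferForCleanup(buf);
+         /* ... delete index tuples; TIDs only become recyclable later ... */
+         UnlockReleaseBuffer(buf);
+     }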
+
+ This approach is fairly coarse, so we avoid it whenever possible. In
+ practice most index scans won't hold onto their pin, and so won't block
+ VACUUM. These index scans must deal with TID recycling directly, which is
+ more complicated and not always possible. See the "Making concurrent TID
+ recycling safe" section below.
+
+ Opportunistic index tuple deletion performs almost the same page-level
+ modifications while only holding an exclusive lock. This is safe because
+ there is no question of TID recycling taking place later on -- only VACUUM
+ can make TIDs recyclable. See also simple deletion and bottom-up
+ deletion, below.
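+
+ The locking difference is small but important; a hedged sketch (assuming
+ buf is an already-pinned leaf page buffer):
+
+     /* Opportunistic deletion: a plain exclusive lock suffices */
+     LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+     /* ... remove LP_DEAD-marked index tuples from the leaf page ... */
+     LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+
+     /* VACUUM's btbulkdelete must take LockBufferForCleanup(buf) instead */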
Because a pin is not always held, and a page can be split even while
someone does hold a pin on it, it is possible that an indexscan will
return items that are no longer stored on the page it has a pin on, but
rather somewhere to the right of that page. To ensure that VACUUM can't
- prematurely remove such heap tuples, we require btbulkdelete to obtain a
- super-exclusive lock on every leaf page in the index, even pages that
- don't contain any deletable tuples. Any scan which could yield incorrect
- results if the tuple at a TID matching the scan's range and filter
- conditions were replaced by a different tuple while the scan is in
- progress must hold the pin on each index page until all index entries read
- from the page have been processed. This guarantees that the btbulkdelete
- call cannot return while any indexscan is still holding a copy of a
- deleted index tuple if the scan could be confused by that. Note that this
- requirement does not say that btbulkdelete must visit the pages in any
- particular order. (See also simple deletion and bottom-up deletion,
- below.)
-
- There is no such interlocking for deletion of items in internal pages,
- since backends keep no lock nor pin on a page they have descended past.
- Hence, when a backend is ascending the tree using its stack, it must
- be prepared for the possibility that the item it wants is to the left of
- the recorded position (but it can't have moved left out of the recorded
- page). Since we hold a lock on the lower page (per L&Y) until we have
- re-found the parent item that links to it, we can be assured that the
- parent item does still exist and can't have been deleted.
+ prematurely make TIDs recyclable in this scenario, we require btbulkdelete
+ to obtain a cleanup lock on every leaf page in the index, even pages that
+ don't contain any deletable tuples. Note that this requirement does not
+ say that btbulkdelete must visit the pages in any particular order.
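+
+ A hedged sketch of that obligation, loosely modeled on btvacuumscan
+ (rel, num_pages, and strategy are assumed variables; the simplified loop
+ is illustrative only):
+
+     BlockNumber blkno;
+
+     for (blkno = BTREE_METAPAGE + 1; blkno < num_pages; blkno++)
+     {
+         Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
+                                          RBM_NORMAL, strategy);
+
+         /* Cleanup lock taken even when no items will be deleted here */
+         LockBufferForCleanup(buf);
+         /* ... remove index tuples whose TIDs VACUUM will recycle ... */
+         UnlockReleaseBuffer(buf);
+     }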
VACUUM's linear scan, concurrent page splits
--------------------------------------------
@@ -453,6 +440,55 @@ whenever it is subsequently taken from the FSM for reuse. The deleted
page's contents will be overwritten by the split operation (it will become
the new right sibling page).
+ Making concurrent TID recycling safe
+ ------------------------------------
+
+ As explained in the earlier section about deleting index tuples during
+ VACUUM, we implement a locking protocol that allows individual index scans
+ to avoid concurrent TID recycling. Index scans opt out (and so drop their
+ leaf page pin when visiting the heap) whenever it's safe to do so, though.
+ Dropping the pin early is useful because it avoids blocking progress by
+ VACUUM. This is particularly important with index scans used by cursors,
+ since idle cursors sometimes stop for relatively long periods of time. In
+ extreme cases, a client application may hold on to an idle cursor for
+ hours or even days. Blocking VACUUM for that long could be disastrous.
+
+ Index scans that don't hold on to a buffer pin are protected by holding an
+ MVCC snapshot instead. This more limited interlock prevents wrong answers
+ to queries, but it does not prevent concurrent TID recycling itself (only
+ holding onto the leaf page pin while accessing the heap ensures that).
+
+ Index-only scans can never drop their buffer pin, since they are unable to
+ tolerate having a referenced TID become recyclable. Index-only scans
+ typically just visit the visibility map (not the heap proper), and so will
+ not reliably notice that any stale TID reference (for a TID that pointed
+ to a dead-to-all heap item at first) was concurrently marked LP_UNUSED in
+ the heap by VACUUM. This could easily allow VACUUM to set the whole heap
+ page to all-visible in the visibility map immediately afterwards. An MVCC
+ snapshot is only sufficient to avoid problems during plain index scans
+ because they must access granular visibility information from the heap
+ proper. A plain index scan will even recognize LP_UNUSED items in the
+ heap (items that could be recycled but haven't been just yet) as "not
+ visible" -- even when the heap page is generally considered all-visible.
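+
+ A sketch of the visibility map short-circuit that makes stale TIDs
+ dangerous here (simplified; heapRel, tid, and vmbuf are assumed
+ variables):
+
+     if (VM_ALL_VISIBLE(heapRel, ItemPointerGetBlockNumber(tid), &vmbuf))
+     {
+         /*
+          * Heap never visited -- a concurrently recycled TID would be
+          * trusted blindly, quietly returning a wrong answer
+          */
+     }
+     else
+     {
+         /*
+          * Heap visited -- granular visibility checks catch stale TIDs,
+          * including items that are now LP_UNUSED
+          */
+     }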
+
+ LP_DEAD setting of index tuples by the kill_prior_tuple optimization
+ (described in full in simple deletion, below) is also more complicated for
+ index scans that drop their leaf page pins. We must be careful to avoid
+ LP_DEAD-marking any new index tuple that looks like a known-dead index
+ tuple because it happens to share the same TID, following concurrent TID
+ recycling. It's just about possible that some other session inserted a
+ new, unrelated index tuple on the same leaf page, which has the same
+ original TID. It would be totally wrong to LP_DEAD-set this new,
+ unrelated index tuple.
+
+ We handle this kill_prior_tuple race condition by having affected index
+ scans conservatively assume that any change to the leaf page at all
+ implies that it was reached by btbulkdelete in the interim period when no
+ buffer pin was held. This is implemented by not setting any LP_DEAD bits
+ on the leaf page at all when the page's LSN has changed. (That won't work
+ with an unlogged index, so for now we don't ever apply the "don't hold
+ onto pin" optimization there.)
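+
+ A hedged sketch of the LSN recheck (cf. _bt_killitems; the scan-state
+ field names below are illustrative, not exact):
+
+     if (so->droppedPin)
+     {
+         /* Re-pin the leaf page, then check that nothing at all changed */
+         buf = ReadBuffer(rel, so->currPage);
+         LockBuffer(buf, BUFFER_LOCK_SHARE);
+         if (BufferGetLSNAtomic(buf) != so->currPageLSN)
+         {
+             /* Page changed while unpinned; set no LP_DEAD bits at all */
+             UnlockReleaseBuffer(buf);
+             return;
+         }
+     }
+     /* ... safe to set LP_DEAD bits for known-dead index tuples ... */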
+
Fastpath For Index Insertion
----------------------------
@@ -518,22 +554,6 @@ that's required for the deletion process to perform granular removal of
groups of dead TIDs from posting list tuples (without the situation ever
being allowed to get out of hand).
- It's sufficient to have an exclusive lock on the index page, not a
- super-exclusive lock, to do deletion of LP_DEAD items. It might seem
- that this breaks the interlock between VACUUM and indexscans, but that is
- not so: as long as an indexscanning process has a pin on the page where
- the index item used to be, VACUUM cannot complete its btbulkdelete scan
- and so cannot remove the heap tuple. This is another reason why
- btbulkdelete has to get a super-exclusive lock on every leaf page, not only
- the ones where it actually sees items to delete.
-
- LP_DEAD setting by index scans cannot be sure that a TID whose index tuple
- it had planned on LP_DEAD-setting has not been recycled by VACUUM if it
- drops its pin in the meantime. It must conservatively also remember the
- LSN of the page, and only act to set LP_DEAD bits when the LSN has not
- changed at all. (Avoiding dropping the pin entirely also makes it safe, of
- course.)
-
Bottom-Up deletion
------------------
@@ -733,23 +753,21 @@ because it allows running applications to continue while the standby
changes state into a normally running server.
The interlocking required to avoid returning incorrect results from
- non-MVCC scans is not required on standby nodes. We still get a
- super-exclusive lock ("cleanup lock") when replaying VACUUM records
- during recovery, but recovery does not need to lock every leaf page
- (only those leaf pages that have items to delete). That is safe because
- HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
- HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only ever
- used during write transactions, which cannot exist on the standby. MVCC
- scans are already protected by definition, so HeapTupleSatisfiesMVCC()
- is not a problem. The optimizer looks at the boundaries of value ranges
- using HeapTupleSatisfiesNonVacuumable() with an index-only scan, which
- is also safe. That leaves concern only for HeapTupleSatisfiesToast().
-
- HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
- because it doesn't need to - if the main heap row is visible then the
- toast rows will also be visible. So as long as we follow a toast
- pointer from a visible (live) tuple the corresponding toast rows
- will also be visible, so we do not need to recheck MVCC on them.
+ non-MVCC scans is not required on standby nodes. We still get a full
+ cleanup lock when replaying VACUUM records during recovery, but recovery
+ does not need to lock every leaf page (only those leaf pages that have
+ items to delete) -- that's sufficient to avoid breaking index-only scans
+ during recovery (see section above about making TID recycling safe). That
+ leaves concern only for plain index scans. (XXX: Not actually clear why
+ this is totally unnecessary during recovery.)
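+
+ A sketch of the redo-side locking (loosely modeled on btree_xlog_vacuum;
+ the surrounding WAL record handling is omitted):
+
+     if (XLogReadBufferForRedoExtended(record, 0, RBM_NORMAL,
+                                       true,     /* full cleanup lock */
+                                       &buf) == BLK_NEEDS_REDO)
+     {
+         /* ... apply the recorded deletions to the leaf page ... */
+     }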
+
+ Plain index scans using an MVCC snapshot are always safe, for the same
+ reasons that they're safe during original execution.
+ HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+ because it doesn't need to -- if the main heap row is visible then the
+ toast rows will also be visible. So as long as we follow a toast pointer
+ from a visible (live) tuple the corresponding toast rows will also be
+ visible, so we do not need to recheck MVCC on them.
Other Things That Are Handy to Know
-----------------------------------