Skip to content

Commit 35c0cd3

Browse files
committed
Improve comment for tricky aspect of index-only scans.
Index-only scans avoid taking a lock on the VM buffer, which would cause a lot of contention. To be correct, that requires some intricate assumptions that weren't completely documented in the previous comment. Reviewed by Robert Haas.
1 parent 3a9d430 commit 35c0cd3

File tree

1 file changed

+25
-9
lines changed

1 file changed

+25
-9
lines changed

src/backend/executor/nodeIndexonlyscan.c

Lines changed: 25 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -88,15 +88,31 @@ IndexOnlyNext(IndexOnlyScanState *node)
8888
* Note on Memory Ordering Effects: visibilitymap_test does not lock
8989
* the visibility map buffer, and therefore the result we read here
9090
* could be slightly stale. However, it can't be stale enough to
91-
* matter. It suffices to show that (1) there is a read barrier
92-
* between the time we read the index TID and the time we test the
93-
* visibility map; and (2) there is a write barrier between the time
94-
* some other concurrent process clears the visibility map bit and the
95-
* time it inserts the index TID. Since acquiring or releasing a
96-
* LWLock interposes a full barrier, this is easy to show: (1) is
97-
* satisfied by the release of the index buffer content lock after
98-
* reading the TID; and (2) is satisfied by the acquisition of the
99-
* buffer content lock in order to insert the TID.
91+
* matter.
92+
*
93+
* We need to detect clearing a VM bit due to an insert right away,
94+
* because the tuple is present in the index page but not visible. The
95+
* reading of the TID by this scan (using a shared lock on the index
96+
* buffer) is serialized with the insert of the TID into the index
97+
* (using an exclusive lock on the index buffer). Because the VM bit
98+
* is cleared before updating the index, and locking/unlocking of the
99+
* index page acts as a full memory barrier, we are sure to see the
100+
* cleared bit if we see a recently-inserted TID.
101+
*
102+
* Deletes do not update the index page (only VACUUM will clear out
103+
* the TID), so the clearing of the VM bit by a delete is not
104+
* serialized with this test below, and we may see a value that is
105+
* significantly stale. However, we don't care about the delete right
106+
* away, because the tuple is still visible until the deleting
107+
* transaction commits or the statement ends (if it's our
108+
* transaction). In either case, the lock on the VM buffer will have
109+
* been released (acting as a write barrier) after clearing the
110+
* bit. And for us to have a snapshot that includes the deleting
111+
* transaction (making the tuple invisible), we must have acquired
112+
* ProcArrayLock after that time, acting as a read barrier.
113+
*
114+
* It's worth going through this complexity to avoid needing to lock
115+
* the VM buffer, which could cause significant contention.
100116
*/
101117
if (!visibilitymap_test(scandesc->heapRelation,
102118
ItemPointerGetBlockNumber(tid),

0 commit comments

Comments
 (0)