Skip to content

Commit fc545a2

Browse files
committed
Fix race condition in GIN posting tree page deletion.
If a page is deleted, and reused for something else, just as a search is following a rightlink to it from its left sibling, the search would continue scanning whatever the new contents of the page are. That could lead to incorrect query results, or even something more curious if the page is reused for a different kind of a page. To fix, modify the search algorithm to lock the next page before releasing the previous one, and refrain from deleting pages from the leftmost branch of the tree. Add a new Concurrency section to the README, explaining why this works. There is a lot more one could say about concurrency in GIN, but that's for another patch. Backpatch to all supported versions.
1 parent 9548bee commit fc545a2

File tree

5 files changed

+123
-63
lines changed

5 files changed

+123
-63
lines changed

src/backend/access/gin/README

Lines changed: 50 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -210,6 +210,56 @@ fit on one pending-list page must have those pages to itself, even if this
210210
results in wasting much of the space on the preceding page and the last
211211
page for the tuple.)
212212

213+
Concurrency
214+
-----------
215+
216+
The entry tree and each posting tree is a B-tree, with right-links connecting
217+
sibling pages at the same level. This is the same structure that is used in
218+
the regular B-tree indexam (invented by Lehman & Yao), but we don't support
219+
scanning a GIN trees backwards, so we don't need left-links.
220+
221+
To avoid deadlocks, B-tree pages must always be locked in the same order:
222+
left to right, and bottom to top. When searching, the tree is traversed from
223+
top to bottom, so the lock on the parent page must be released before
224+
descending to the next level. Concurrent page splits move the keyspace to
225+
right, so after following a downlink, the page actually containing the key
226+
we're looking for might be somewhere to the right of the page we landed on.
227+
In that case, we follow the right-links until we find the page we're looking
228+
for.
229+
230+
To delete a page, the page's left sibling, the target page, and its parent,
231+
are locked in that order, and the page is marked as deleted. However, a
232+
concurrent search might already have read a pointer to the page, and might be
233+
just about to follow it. A page can be reached via the right-link of its left
234+
sibling, or via its downlink in the parent.
235+
236+
To prevent a backend from reaching a deleted page via a right-link, when
237+
following a right-link the lock on the previous page is not released until
238+
the lock on next page has been acquired.
239+
240+
The downlink is more tricky. A search descending the tree must release the
241+
lock on the parent page before locking the child, or it could deadlock with
242+
a concurrent split of the child page; a page split locks the parent, while
243+
already holding a lock on the child page. However, posting trees are only
244+
fully searched from left to right, starting from the leftmost leaf. (The
245+
tree-structure is only needed by insertions, to quickly find the correct
246+
insert location). So as long as we don't delete the leftmost page on each
247+
level, a search can never follow a downlink to page that's about to be
248+
deleted.
249+
250+
The previous paragraph's reasoning only applies to searches, and only to
251+
posting trees. To protect from inserters following a downlink to a deleted
252+
page, vacuum simply locks out all concurrent insertions to the posting tree,
253+
by holding a super-exclusive lock on the posting tree root. Inserters hold a
254+
pin on the root page, but searches do not, so while new searches cannot begin
255+
while root page is locked, any already-in-progress scans can continue
256+
concurrently with vacuum. In the entry tree, we never delete pages.
257+
258+
(This is quite different from the mechanism the btree indexam uses to make
259+
page-deletions safe; it stamps the deleted pages with an XID and keeps the
260+
deleted pages around with the right-link intact until all concurrent scans
261+
have finished.)
262+
213263
Limitations
214264
-----------
215265

src/backend/access/gin/ginbtree.c

Lines changed: 42 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -112,10 +112,8 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
112112
/* rightmost page */
113113
break;
114114

115+
stack->buffer = ginStepRight(stack->buffer, btree->index, access);
115116
stack->blkno = rightlink;
116-
LockBuffer(stack->buffer, GIN_UNLOCK);
117-
stack->buffer = ReleaseAndReadBuffer(stack->buffer, btree->index, stack->blkno);
118-
LockBuffer(stack->buffer, access);
119117
page = BufferGetPage(stack->buffer);
120118
}
121119

@@ -148,6 +146,41 @@ ginFindLeafPage(GinBtree btree, GinBtreeStack *stack)
148146
}
149147
}
150148

149+
/*
150+
* Step right from current page.
151+
*
152+
* The next page is locked first, before releasing the current page. This is
153+
* crucial to protect from concurrent page deletion (see comment in
154+
* ginDeletePage).
155+
*/
156+
Buffer
157+
ginStepRight(Buffer buffer, Relation index, int lockmode)
158+
{
159+
Buffer nextbuffer;
160+
Page page = BufferGetPage(buffer);
161+
bool isLeaf = GinPageIsLeaf(page);
162+
bool isData = GinPageIsData(page);
163+
BlockNumber blkno = GinPageGetOpaque(page)->rightlink;
164+
165+
nextbuffer = ReadBuffer(index, blkno);
166+
LockBuffer(nextbuffer, lockmode);
167+
UnlockReleaseBuffer(buffer);
168+
169+
/* Sanity check that the page we stepped to is of similar kind. */
170+
page = BufferGetPage(nextbuffer);
171+
if (isLeaf != GinPageIsLeaf(page) || isData != GinPageIsData(page))
172+
elog(ERROR, "right sibling of GIN page is of different type");
173+
174+
/*
175+
* Given the proper lock sequence above, we should never land on a
176+
* deleted page.
177+
*/
178+
if (GinPageIsDeleted(page))
179+
elog(ERROR, "right sibling of GIN page was deleted");
180+
181+
return nextbuffer;
182+
}
183+
151184
void
152185
freeGinBtreeStack(GinBtreeStack *stack)
153186
{
@@ -235,12 +268,12 @@ ginFindParents(GinBtree btree, GinBtreeStack *stack,
235268
while ((offset = btree->findChildPtr(btree, page, stack->blkno, InvalidOffsetNumber)) == InvalidOffsetNumber)
236269
{
237270
blkno = GinPageGetOpaque(page)->rightlink;
238-
LockBuffer(buffer, GIN_UNLOCK);
239-
ReleaseBuffer(buffer);
240271
if (blkno == InvalidBlockNumber)
272+
{
273+
UnlockReleaseBuffer(buffer);
241274
break;
242-
buffer = ReadBuffer(btree->index, blkno);
243-
LockBuffer(buffer, GIN_EXCLUSIVE);
275+
}
276+
buffer = ginStepRight(buffer, btree->index, GIN_EXCLUSIVE);
244277
page = BufferGetPage(buffer);
245278
}
246279

@@ -439,23 +472,21 @@ ginInsertValue(GinBtree btree, GinBtreeStack *stack, GinStatsData *buildStats)
439472
{
440473
BlockNumber rightlink = GinPageGetOpaque(page)->rightlink;
441474

442-
LockBuffer(parent->buffer, GIN_UNLOCK);
443-
444475
if (rightlink == InvalidBlockNumber)
445476
{
446477
/*
447478
* rightmost page, but we don't find parent, we should use
448479
* plain search...
449480
*/
481+
LockBuffer(parent->buffer, GIN_UNLOCK);
450482
ginFindParents(btree, stack, rootBlkno);
451483
parent = stack->parent;
452484
Assert(parent != NULL);
453485
break;
454486
}
455487

488+
parent->buffer = ginStepRight(parent->buffer, btree->index, GIN_EXCLUSIVE);
456489
parent->blkno = rightlink;
457-
parent->buffer = ReleaseAndReadBuffer(parent->buffer, btree->index, parent->blkno);
458-
LockBuffer(parent->buffer, GIN_EXCLUSIVE);
459490
page = BufferGetPage(parent->buffer);
460491
}
461492

src/backend/access/gin/ginget.c

Lines changed: 8 additions & 23 deletions
Original file line numberDiff line numberDiff line change
@@ -105,16 +105,11 @@ moveRightIfItNeeded(GinBtreeData *btree, GinBtreeStack *stack)
105105
/*
106106
* We scanned the whole page, so we should take right page
107107
*/
108-
stack->blkno = GinPageGetOpaque(page)->rightlink;
109-
110108
if (GinPageRightMost(page))
111109
return false; /* no more pages */
112110

113-
LockBuffer(stack->buffer, GIN_UNLOCK);
114-
stack->buffer = ReleaseAndReadBuffer(stack->buffer,
115-
btree->index,
116-
stack->blkno);
117-
LockBuffer(stack->buffer, GIN_SHARE);
111+
stack->buffer = ginStepRight(stack->buffer, btree->index, GIN_SHARE);
112+
stack->blkno = BufferGetBlockNumber(stack->buffer);
118113
stack->off = FirstOffsetNumber;
119114
}
120115

@@ -132,7 +127,6 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
132127
GinPostingTreeScan *gdi;
133128
Buffer buffer;
134129
Page page;
135-
BlockNumber blkno;
136130

137131
/* Descend to the leftmost leaf page */
138132
gdi = ginPrepareScanPostingTree(index, rootPostingTree, TRUE);
@@ -162,10 +156,7 @@ scanPostingTree(Relation index, GinScanEntry scanEntry,
162156
if (GinPageRightMost(page))
163157
break; /* no more pages */
164158

165-
blkno = GinPageGetOpaque(page)->rightlink;
166-
LockBuffer(buffer, GIN_UNLOCK);
167-
buffer = ReleaseAndReadBuffer(buffer, index, blkno);
168-
LockBuffer(buffer, GIN_SHARE);
159+
buffer = ginStepRight(buffer, index, GIN_SHARE);
169160
}
170161

171162
UnlockReleaseBuffer(buffer);
@@ -542,7 +533,6 @@ static void
542533
entryGetNextItem(GinState *ginstate, GinScanEntry entry)
543534
{
544535
Page page;
545-
BlockNumber blkno;
546536

547537
for (;;)
548538
{
@@ -560,23 +550,18 @@ entryGetNextItem(GinState *ginstate, GinScanEntry entry)
560550
* It's needed to go by right link. During that we should refind
561551
* first ItemPointer greater that stored
562552
*/
563-
564-
blkno = GinPageGetOpaque(page)->rightlink;
565-
566-
LockBuffer(entry->buffer, GIN_UNLOCK);
567-
if (blkno == InvalidBlockNumber)
553+
if (GinPageRightMost(page))
568554
{
569-
ReleaseBuffer(entry->buffer);
555+
UnlockReleaseBuffer(entry->buffer);
570556
ItemPointerSetInvalid(&entry->curItem);
571557
entry->buffer = InvalidBuffer;
572558
entry->isFinished = TRUE;
573559
return;
574560
}
575561

576-
entry->buffer = ReleaseAndReadBuffer(entry->buffer,
577-
ginstate->index,
578-
blkno);
579-
LockBuffer(entry->buffer, GIN_SHARE);
562+
entry->buffer = ginStepRight(entry->buffer,
563+
ginstate->index,
564+
GIN_SHARE);
580565
page = BufferGetPage(entry->buffer);
581566

582567
entry->offset = InvalidOffsetNumber;

src/backend/access/gin/ginvacuum.c

Lines changed: 22 additions & 29 deletions
Original file line numberDiff line numberDiff line change
@@ -237,6 +237,9 @@ ginVacuumPostingTreeLeaves(GinVacuumState *gvs, BlockNumber blkno, bool isRoot,
237237
return hasVoidPage;
238238
}
239239

240+
/*
241+
* Delete a posting tree page.
242+
*/
240243
static void
241244
ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkno,
242245
BlockNumber parentBlkno, OffsetNumber myoff, bool isParentRoot)
@@ -246,39 +249,35 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
246249
Buffer pBuffer;
247250
Page page,
248251
parentPage;
252+
BlockNumber rightlink;
249253

254+
/*
255+
* Lock the pages in the same order as an insertion would, to avoid
256+
* deadlocks: left, then right, then parent.
257+
*/
258+
lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
259+
RBM_NORMAL, gvs->strategy);
250260
dBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, deleteBlkno,
251261
RBM_NORMAL, gvs->strategy);
252-
253-
if (leftBlkno != InvalidBlockNumber)
254-
lBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, leftBlkno,
255-
RBM_NORMAL, gvs->strategy);
256-
else
257-
lBuffer = InvalidBuffer;
258-
259262
pBuffer = ReadBufferExtended(gvs->index, MAIN_FORKNUM, parentBlkno,
260263
RBM_NORMAL, gvs->strategy);
261264

265+
LockBuffer(lBuffer, GIN_EXCLUSIVE);
262266
LockBuffer(dBuffer, GIN_EXCLUSIVE);
263267
if (!isParentRoot) /* parent is already locked by
264268
* LockBufferForCleanup() */
265269
LockBuffer(pBuffer, GIN_EXCLUSIVE);
266-
if (leftBlkno != InvalidBlockNumber)
267-
LockBuffer(lBuffer, GIN_EXCLUSIVE);
268270

269271
START_CRIT_SECTION();
270272

271-
if (leftBlkno != InvalidBlockNumber)
272-
{
273-
BlockNumber rightlink;
274-
275-
page = BufferGetPage(dBuffer);
276-
rightlink = GinPageGetOpaque(page)->rightlink;
273+
/* Unlink the page by changing left sibling's rightlink */
274+
page = BufferGetPage(dBuffer);
275+
rightlink = GinPageGetOpaque(page)->rightlink;
277276

278-
page = BufferGetPage(lBuffer);
279-
GinPageGetOpaque(page)->rightlink = rightlink;
280-
}
277+
page = BufferGetPage(lBuffer);
278+
GinPageGetOpaque(page)->rightlink = rightlink;
281279

280+
/* Delete downlink from parent */
282281
parentPage = BufferGetPage(pBuffer);
283282
#ifdef USE_ASSERT_CHECKING
284283
do
@@ -360,10 +359,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
360359
if (!isParentRoot)
361360
LockBuffer(pBuffer, GIN_UNLOCK);
362361
ReleaseBuffer(pBuffer);
363-
364-
if (leftBlkno != InvalidBlockNumber)
365-
UnlockReleaseBuffer(lBuffer);
366-
362+
UnlockReleaseBuffer(lBuffer);
367363
UnlockReleaseBuffer(dBuffer);
368364

369365
END_CRIT_SECTION();
@@ -431,15 +427,12 @@ ginScanToDelete(GinVacuumState *gvs, BlockNumber blkno, bool isRoot, DataPageDel
431427

432428
if (GinPageGetOpaque(page)->maxoff < FirstOffsetNumber)
433429
{
434-
if (!(me->leftBlkno == InvalidBlockNumber && GinPageRightMost(page)))
430+
/* we never delete the left- or rightmost branch */
431+
if (me->leftBlkno != InvalidBlockNumber && !GinPageRightMost(page))
435432
{
436-
/* we never delete right most branch */
437433
Assert(!isRoot);
438-
if (GinPageGetOpaque(page)->maxoff < FirstOffsetNumber)
439-
{
440-
ginDeletePage(gvs, blkno, me->leftBlkno, me->parent->blkno, myoff, me->parent->isRoot);
441-
meDelete = TRUE;
442-
}
434+
ginDeletePage(gvs, blkno, me->leftBlkno, me->parent->blkno, myoff, me->parent->isRoot);
435+
meDelete = TRUE;
443436
}
444437
}
445438

src/include/access/gin_private.h

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -513,6 +513,7 @@ typedef struct GinBtreeData
513513

514514
extern GinBtreeStack *ginPrepareFindLeafPage(GinBtree btree, BlockNumber blkno);
515515
extern GinBtreeStack *ginFindLeafPage(GinBtree btree, GinBtreeStack *stack);
516+
extern Buffer ginStepRight(Buffer buffer, Relation index, int lockmode);
516517
extern void freeGinBtreeStack(GinBtreeStack *stack);
517518
extern void ginInsertValue(GinBtree btree, GinBtreeStack *stack,
518519
GinStatsData *buildStats);

0 commit comments

Comments
 (0)