Skip to content

Commit d8adfc1

Browse files
Avoid parallel nbtree index scan hangs with SAOPs.
Commit 5bf748b, which enhanced nbtree ScalarArrayOp execution, made parallel index scans work with the new design for arrays via explicit scheduling of primitive index scans. A backend that successfully scheduled the scan's next primitive index scan saved its backend local array keys in shared memory. Any backend could pick up the scheduled primitive scan within _bt_first. This scheme decouples scheduling a primitive scan from starting the scan (by performing another descent of the index via a _bt_search call from _bt_first) to make things robust. The scheme had a deadlock hazard, at least when the leader process participated in the scan. _bt_parallel_seize had a code path that made backends that were not in an immediate position to start a scheduled primitive index scan wait for some other backend to do so instead. Under the right circumstances, the leader process could wait here forever: the leader would wait for any other backend to start the primitive scan, while every worker was busy waiting on the leader to consume tuples from the scan's tuple queue. To fix, don't wait for a scheduled primitive index scan to be started by some other eligible backend from within _bt_parallel_seize (when the calling backend isn't in a position to do so itself). Return false instead, while recording that the scan has a scheduled primitive index scan in backend local state. This leaves the backend in the same state as the existing case where a backend schedules (or tries to schedule) another primitive index scan from within _bt_advance_array_keys, before calling _bt_parallel_seize. _bt_parallel_seize already handles that case by returning false without waiting, and without unsetting the backend local state. Leaving the backend in this state enables it to start a previously scheduled primitive index scan once it gets back to _bt_first. Oversight in commit 5bf748b, which enhanced nbtree ScalarArrayOp execution. Matthias van de Meent, with tweaks by me. Author: Matthias van de Meent <boekewurm+postgres@gmail.com> Reported-By: Tomas Vondra <tomas@vondra.me> Reviewed-By: Peter Geoghegan <pg@bowt.ie> Discussion: https://postgr.es/m/CAH2-WzmMGaPa32u9x_FvEbPTUkP5e95i=QxR8054nvCRydP-sw@mail.gmail.com Backpatch: 17-, where nbtree SAOP execution was enhanced.
1 parent 89f908a commit d8adfc1

File tree

1 file changed

+33
-20
lines changed

1 file changed

+33
-20
lines changed

src/backend/access/nbtree/nbtree.c

+33-20
Original file line numberDiff line numberDiff line change
@@ -584,7 +584,10 @@ btparallelrescan(IndexScanDesc scan)
584584
* or _bt_parallel_done().
585585
*
586586
* The return value is true if we successfully seized the scan and false
587-
* if we did not. The latter case occurs if no pages remain.
587+
* if we did not. The latter case occurs when no pages remain, or when
588+
* another primitive index scan is scheduled that caller's backend cannot
589+
* start just yet (only backends that call from _bt_first are capable of
590+
* starting primitive index scans, which they indicate by passing first=true).
588591
*
589592
* If the return value is true, *pageno returns the next or current page
590593
* of the scan (depending on the scan direction). An invalid block number
@@ -595,10 +598,6 @@ btparallelrescan(IndexScanDesc scan)
595598
* scan will return false.
596599
*
597600
* Callers should ignore the value of pageno if the return value is false.
598-
*
599-
* Callers that are in a position to start a new primitive index scan must
600-
* pass first=true (all other callers pass first=false). We just return false
601-
* for first=false callers that require another primitive index scan.
602601
*/
603602
bool
604603
_bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
@@ -615,22 +614,16 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
615614
{
616615
/*
617616
* Initialize array related state when called from _bt_first, assuming
618-
* that this will either be the first primitive index scan for the
619-
* scan, or a previous explicitly scheduled primitive scan.
620-
*
621-
* Note: so->needPrimScan is only set when a scheduled primitive index
622-
* scan is set to be performed in caller's worker process. It should
623-
* not be set here by us for the first primitive scan, nor should we
624-
* ever set it for a parallel scan that has no array keys.
617+
* that this will be the first primitive index scan for the scan
625618
*/
626619
so->needPrimScan = false;
627620
so->scanBehind = false;
628621
}
629622
else
630623
{
631624
/*
632-
* Don't attempt to seize the scan when backend requires another
633-
* primitive index scan unless we're in a position to start it now
625+
* Don't attempt to seize the scan when it requires another primitive
626+
* index scan, since caller's backend cannot start it right now
634627
*/
635628
if (so->needPrimScan)
636629
return false;
@@ -652,12 +645,9 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
652645
{
653646
Assert(so->numArrayKeys);
654647

655-
/*
656-
* If we can start another primitive scan right away, do so.
657-
* Otherwise just wait.
658-
*/
659648
if (first)
660649
{
650+
/* Can start scheduled primitive scan right away, so do so */
661651
btscan->btps_pageStatus = BTPARALLEL_ADVANCING;
662652
for (int i = 0; i < so->numArrayKeys; i++)
663653
{
@@ -667,11 +657,25 @@ _bt_parallel_seize(IndexScanDesc scan, BlockNumber *pageno, bool first)
667657
array->cur_elem = btscan->btps_arrElems[i];
668658
skey->sk_argument = array->elem_values[array->cur_elem];
669659
}
670-
so->needPrimScan = true;
671-
so->scanBehind = false;
672660
*pageno = InvalidBlockNumber;
673661
exit_loop = true;
674662
}
663+
else
664+
{
665+
/*
666+
* Don't attempt to seize the scan when it requires another
667+
* primitive index scan, since caller's backend cannot start
668+
* it right now
669+
*/
670+
status = false;
671+
}
672+
673+
/*
674+
* Either way, update backend local state to indicate that a
675+
* pending primitive scan is required
676+
*/
677+
so->needPrimScan = true;
678+
so->scanBehind = false;
675679
}
676680
else if (btscan->btps_pageStatus != BTPARALLEL_ADVANCING)
677681
{
@@ -730,6 +734,7 @@ _bt_parallel_release(IndexScanDesc scan, BlockNumber scan_page)
730734
void
731735
_bt_parallel_done(IndexScanDesc scan)
732736
{
737+
BTScanOpaque so = (BTScanOpaque) scan->opaque;
733738
ParallelIndexScanDesc parallel_scan = scan->parallel_scan;
734739
BTParallelScanDesc btscan;
735740
bool status_changed = false;
@@ -738,6 +743,13 @@ _bt_parallel_done(IndexScanDesc scan)
738743
if (parallel_scan == NULL)
739744
return;
740745

746+
/*
747+
* Should not mark parallel scan done when there's still a pending
748+
* primitive index scan
749+
*/
750+
if (so->needPrimScan)
751+
return;
752+
741753
btscan = (BTParallelScanDesc) OffsetToPointer((void *) parallel_scan,
742754
parallel_scan->ps_offset);
743755

@@ -746,6 +758,7 @@ _bt_parallel_done(IndexScanDesc scan)
746758
* already
747759
*/
748760
SpinLockAcquire(&btscan->btps_mutex);
761+
Assert(btscan->btps_pageStatus != BTPARALLEL_NEED_PRIMSCAN);
749762
if (btscan->btps_pageStatus != BTPARALLEL_DONE)
750763
{
751764
btscan->btps_pageStatus = BTPARALLEL_DONE;

0 commit comments

Comments
 (0)