Commit 20d9489

Fix extreme skew detection in Parallel Hash Join.
After repartitioning the inner side of a hash join that would have exceeded the allowed size, we check if all the tuples from a parent partition moved to one child partition. That is evidence that it contains duplicate keys and later attempts to repartition will also fail, so we should give up trying to limit memory (for lack of a better fallback strategy).

A thinko prevented the check from working correctly in partition 0 (the one that is partially loaded into memory already). After repartitioning, we should check for extreme skew if the *parent* partition's space_exhausted flag was set, not the child partition's. The consequence was repeated futile repartitioning until per-partition data exceeded various limits including "ERROR: invalid DSA memory alloc request size 1811939328", OS allocation failure, or temporary disk space errors. (We could also do something about some of those symptoms, but that's material for separate patches.)

This problem only became likely when PostgreSQL 16 introduced support for Parallel Hash Right/Full Join, allowing NULL keys into the hash table. Repartitioning always leaves NULL in partition 0, no matter how many times you do it, because the hash value is all zero bits. That's unlikely for other hashed values, but they might still have caused wasted extra effort before giving up.

Back-patch to all supported releases.

Reported-by: Craig Milhiser <craig@milhiser.com>
Reviewed-by: Andrei Lepikhov <lepihov@gmail.com>
Discussion: https://postgr.es/m/CA%2BwnhO1OfgXbmXgC4fv_uu%3DOxcDQuHvfoQ4k0DFeB0Qqd-X-rQ%40mail.gmail.com
1 parent ab13c46 commit 20d9489

1 file changed: +12 −5 lines

src/backend/executor/nodeHash.c

```diff
@@ -1252,32 +1252,39 @@ ExecParallelHashIncreaseNumBatches(HashJoinTable hashtable)
 		if (BarrierArriveAndWait(&pstate->grow_batches_barrier,
 								 WAIT_EVENT_HASH_GROW_BATCHES_DECIDE))
 		{
+			ParallelHashJoinBatch *old_batches;
 			bool		space_exhausted = false;
 			bool		extreme_skew_detected = false;

 			/* Make sure that we have the current dimensions and buckets. */
 			ExecParallelHashEnsureBatchAccessors(hashtable);
 			ExecParallelHashTableSetCurrentBatch(hashtable, 0);

+			old_batches = dsa_get_address(hashtable->area, pstate->old_batches);
+
 			/* Are any of the new generation of batches exhausted? */
 			for (i = 0; i < hashtable->nbatch; ++i)
 			{
-				ParallelHashJoinBatch *batch = hashtable->batches[i].shared;
+				ParallelHashJoinBatch *batch;
+				ParallelHashJoinBatch *old_batch;
+				int			parent;

+				batch = hashtable->batches[i].shared;
 				if (batch->space_exhausted ||
 					batch->estimated_size > pstate->space_allowed)
-				{
-					int			parent;
-
 					space_exhausted = true;

+				parent = i % pstate->old_nbatch;
+				old_batch = NthParallelHashJoinBatch(old_batches, parent);
+				if (old_batch->space_exhausted ||
+					batch->estimated_size > pstate->space_allowed)
+				{
 					/*
 					 * Did this batch receive ALL of the tuples from its
 					 * parent batch? That would indicate that further
 					 * repartitioning isn't going to help (the hash values
 					 * are probably all the same).
 					 */
-					parent = i % pstate->old_nbatch;
 					if (batch->ntuples == hashtable->batches[parent].shared->old_ntuples)
 						extreme_skew_detected = true;
 				}
```
