Skip to content

Commit 5c0a132

Browse files
committed
Rotate instead of shifting hash join batch number.
Our algorithm for choosing batch numbers turned out not to work effectively for multi-billion key inner relations. We would use more hash bits than we have, and effectively concentrate all tuples into a smaller number of batches than we intended. While ideally we should switch to wider hashes, for now, change the algorithm to one that effectively gives up bits from the bucket number when we don't have enough bits. That means we'll finish up with longer bucket chains than would be ideal, but that's better than having batches that don't fit in work_mem and can't be divided. Batch-patch to all supported releases. Author: Thomas Munro Reviewed-by: Tom Lane, thanks also to Tomas Vondra, Alvaro Herrera, Andres Freund for testing and discussion Reported-by: James Coleman Discussion: https://postgr.es/m/16104-dc11ed911f1ab9df%40postgresql.org
1 parent 4a3cdb5 commit 5c0a132

File tree

1 file changed

+17
-4
lines changed

1 file changed

+17
-4
lines changed

src/backend/executor/nodeHash.c

+17-4
Original file line numberDiff line numberDiff line change
@@ -862,6 +862,15 @@ ExecHashGetHashValue(HashJoinTable hashtable,
862862
return true;
863863
}
864864

865+
/*
866+
* Rotate the bits of "word" to the right by n bits.
867+
*/
868+
static inline uint32
869+
pg_rotate_right32(uint32 word, int n)
870+
{
871+
return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
872+
}
873+
865874
/*
866875
* ExecHashGetBucketAndBatch
867876
* Determine the bucket number and batch number for a hash value
@@ -871,7 +880,7 @@ ExecHashGetHashValue(HashJoinTable hashtable,
871880
* chains), and must only cause the batch number to remain the same or
872881
* increase. Our algorithm is
873882
* bucketno = hashvalue MOD nbuckets
874-
* batchno = (hashvalue DIV nbuckets) MOD nbatch
883+
* batchno = ROR(hashvalue, log2_nbuckets) MOD nbatch
875884
* where nbuckets and nbatch are both expected to be powers of 2, so we can
876885
* do the computations by shifting and masking. (This assumes that all hash
877886
* functions are good about randomizing all their output bits, else we are
@@ -880,7 +889,11 @@ ExecHashGetHashValue(HashJoinTable hashtable,
880889
* nbuckets doesn't change over the course of the join.
881890
*
882891
* nbatch is always a power of 2; we increase it only by doubling it. This
883-
* effectively adds one more bit to the top of the batchno.
892+
* effectively adds one more bit to the top of the batchno. In very large
893+
* joins, we might run out of bits to add, so we do this by rotating the hash
894+
* value. This causes batchno to steal bits from bucketno when the number of
895+
* virtual buckets exceeds 2^32. It's better to have longer bucket chains
896+
* than to lose the ability to divide batches.
884897
*/
885898
void
886899
ExecHashGetBucketAndBatch(HashJoinTable hashtable,
@@ -893,9 +906,9 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
893906

894907
if (nbatch > 1)
895908
{
896-
/* we can do MOD by masking, DIV by shifting */
897909
*bucketno = hashvalue & (nbuckets - 1);
898-
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
910+
*batchno = pg_rotate_right32(hashvalue,
911+
hashtable->log2_nbuckets) & (nbatch - 1);
899912
}
900913
else
901914
{

0 commit comments

Comments
 (0)