Skip to content

Commit 8052aaf

Browse files
committed
Rotate instead of shifting hash join batch number.
Our algorithm for choosing batch numbers turned out not to work effectively for multi-billion key inner relations. We would use more hash bits than we have, and effectively concentrate all tuples into a smaller number of batches than we intended. While ideally we should switch to wider hashes, for now, change the algorithm to one that effectively gives up bits from the bucket number when we don't have enough bits. That means we'll finish up with longer bucket chains than would be ideal, but that's better than having batches that don't fit in work_mem and can't be divided. Batch-patch to all supported releases. Author: Thomas Munro Reviewed-by: Tom Lane, thanks also to Tomas Vondra, Alvaro Herrera, Andres Freund for testing and discussion Reported-by: James Coleman Discussion: https://postgr.es/m/16104-dc11ed911f1ab9df%40postgresql.org
1 parent b5e7569 commit 8052aaf

File tree

2 files changed

+18
-4
lines changed

2 files changed

+18
-4
lines changed

src/backend/executor/nodeHash.c

Lines changed: 9 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,7 @@
3737
#include "miscadmin.h"
3838
#include "pgstat.h"
3939
#include "port/atomics.h"
40+
#include "port/pg_bitutils.h"
4041
#include "utils/dynahash.h"
4142
#include "utils/memutils.h"
4243
#include "utils/lsyscache.h"
@@ -1878,7 +1879,7 @@ ExecHashGetHashValue(HashJoinTable hashtable,
18781879
* chains), and must only cause the batch number to remain the same or
18791880
* increase. Our algorithm is
18801881
* bucketno = hashvalue MOD nbuckets
1881-
* batchno = (hashvalue DIV nbuckets) MOD nbatch
1882+
* batchno = ROR(hashvalue, log2_nbuckets) MOD nbatch
18821883
* where nbuckets and nbatch are both expected to be powers of 2, so we can
18831884
* do the computations by shifting and masking. (This assumes that all hash
18841885
* functions are good about randomizing all their output bits, else we are
@@ -1890,7 +1891,11 @@ ExecHashGetHashValue(HashJoinTable hashtable,
18901891
* number the way we do here).
18911892
*
18921893
* nbatch is always a power of 2; we increase it only by doubling it. This
1893-
* effectively adds one more bit to the top of the batchno.
1894+
* effectively adds one more bit to the top of the batchno. In very large
1895+
* joins, we might run out of bits to add, so we do this by rotating the hash
1896+
* value. This causes batchno to steal bits from bucketno when the number of
1897+
* virtual buckets exceeds 2^32. It's better to have longer bucket chains
1898+
* than to lose the ability to divide batches.
18941899
*/
18951900
void
18961901
ExecHashGetBucketAndBatch(HashJoinTable hashtable,
@@ -1903,9 +1908,9 @@ ExecHashGetBucketAndBatch(HashJoinTable hashtable,
19031908

19041909
if (nbatch > 1)
19051910
{
1906-
/* we can do MOD by masking, DIV by shifting */
19071911
*bucketno = hashvalue & (nbuckets - 1);
1908-
*batchno = (hashvalue >> hashtable->log2_nbuckets) & (nbatch - 1);
1912+
*batchno = pg_rotate_right32(hashvalue,
1913+
hashtable->log2_nbuckets) & (nbatch - 1);
19091914
}
19101915
else
19111916
{

src/include/port/pg_bitutils.h

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -136,4 +136,13 @@ extern int (*pg_popcount64) (uint64 word);
136136
/* Count the number of one-bits in a byte array */
137137
extern uint64 pg_popcount(const char *buf, int bytes);
138138

139+
/*
140+
* Rotate the bits of "word" to the right by n bits.
141+
*/
142+
static inline uint32
143+
pg_rotate_right32(uint32 word, int n)
144+
{
145+
return (word >> n) | (word << (sizeof(word) * BITS_PER_BYTE - n));
146+
}
147+
139148
#endif /* PG_BITUTILS_H */

0 commit comments

Comments
 (0)