Skip to content

Commit 6b258e3

Browse files
committed
pgbench: Function to generate random permutations.
This adds a new function, permute(), that generates pseudorandom permutations of arbitrary sizes. This can be used to randomly shuffle a set of values to remove unwanted correlations. For example, permuting the output from a non-uniform random distribution so that all the most common values aren't collocated, allowing more realistic tests to be performed. Formerly, hash() was recommended for this purpose, but that suffers from collisions that might alter the distribution, so recommend permute() for this purpose instead. Fabien Coelho and Hironobu Suzuki, with additional hacking be me. Reviewed by Thomas Munro, Alvaro Herrera and Muhammad Usama. Discussion: https://postgr.es/m/alpine.DEB.2.21.1807280944370.5142@lancre
1 parent a8af856 commit 6b258e3

File tree

6 files changed

+273
-12
lines changed

6 files changed

+273
-12
lines changed

doc/src/sgml/ref/pgbench.sgml

+70-11
Original file line numberDiff line numberDiff line change
@@ -1057,7 +1057,7 @@ pgbench <optional> <replaceable>options</replaceable> </optional> <replaceable>d
10571057

10581058
<row>
10591059
<entry> <literal>default_seed</literal> </entry>
1060-
<entry>seed used in hash functions by default</entry>
1060+
<entry>seed used in hash and pseudorandom permutation functions by default</entry>
10611061
</row>
10621062

10631063
<row>
@@ -1864,6 +1864,24 @@ SELECT 4 AS four \; SELECT 5 AS five \aset
18641864
</para></entry>
18651865
</row>
18661866

1867+
<row>
1868+
<entry role="func_table_entry"><para role="func_signature">
1869+
<function>permute</function> ( <parameter>i</parameter>, <parameter>size</parameter> [, <parameter>seed</parameter> ] )
1870+
<returnvalue>integer</returnvalue>
1871+
</para>
1872+
<para>
1873+
Permuted value of <parameter>i</parameter>, in the range
1874+
<literal>[0, size)</literal>. This is the new position of
1875+
<parameter>i</parameter> (modulo <parameter>size</parameter>) in a
1876+
pseudorandom permutation of the integers <literal>0...size-1</literal>,
1877+
parameterized by <parameter>seed</parameter>, see below.
1878+
</para>
1879+
<para>
1880+
<literal>permute(0, 4)</literal>
1881+
<returnvalue>an integer between 0 and 3</returnvalue>
1882+
</para></entry>
1883+
</row>
1884+
18671885
<row>
18681886
<entry role="func_table_entry"><para role="func_signature">
18691887
<function>pi</function> ()
@@ -2071,29 +2089,70 @@ f(x) = PHI(2.0 * parameter * (x - mu) / (max - min + 1)) /
20712089
</listitem>
20722090
</itemizedlist>
20732091

2092+
<note>
2093+
<para>
2094+
When designing a benchmark which selects rows non-uniformly, be aware
2095+
that the rows chosen may be correlated with other data such as IDs from
2096+
a sequence or the physical row ordering, which may skew performance
2097+
measurements.
2098+
</para>
2099+
<para>
2100+
To avoid this, you may wish to use the <function>permute</function>
2101+
function, or some other additional step with similar effect, to shuffle
2102+
the selected rows and remove such correlations.
2103+
</para>
2104+
</note>
2105+
20742106
<para>
20752107
Hash functions <literal>hash</literal>, <literal>hash_murmur2</literal> and
20762108
<literal>hash_fnv1a</literal> accept an input value and an optional seed parameter.
20772109
In case the seed isn't provided the value of <literal>:default_seed</literal>
20782110
is used, which is initialized randomly unless set by the command-line
2079-
<literal>-D</literal> option. Hash functions can be used to scatter the
2080-
distribution of random functions such as <literal>random_zipfian</literal> or
2081-
<literal>random_exponential</literal>. For instance, the following pgbench
2082-
script simulates possible real world workload typical for social media and
2083-
blogging platforms where few accounts generate excessive load:
2111+
<literal>-D</literal> option.
2112+
</para>
2113+
2114+
<para>
2115+
<literal>permute</literal> accepts an input value, a size, and an optional
2116+
seed parameter. It generates a pseudorandom permutation of integers in
2117+
the range <literal>[0, size)</literal>, and returns the index of the input
2118+
value in the permuted values. The permutation chosen is parameterized by
2119+
the seed, which defaults to <literal>:default_seed</literal>, if not
2120+
specified. Unlike the hash functions, <literal>permute</literal> ensures
2121+
that there are no collisions or holes in the output values. Input values
2122+
outside the interval are interpreted modulo the size. The function raises
2123+
an error if the size is not positive. <function>permute</function> can be
2124+
used to scatter the distribution of non-uniform random functions such as
2125+
<literal>random_zipfian</literal> or <literal>random_exponential</literal>
2126+
so that values drawn more often are not trivially correlated. For
2127+
instance, the following <application>pgbench</application> script
2128+
simulates a possible real world workload typical for social media and
2129+
blogging platforms where a few accounts generate excessive load:
20842130

20852131
<programlisting>
2086-
\set r random_zipfian(0, 100000000, 1.07)
2087-
\set k abs(hash(:r)) % 1000000
2132+
\set size 1000000
2133+
\set r random_zipfian(1, :size, 1.07)
2134+
\set k 1 + permute(:r, :size)
20882135
</programlisting>
20892136

20902137
In some cases several distinct distributions are needed which don't correlate
2091-
with each other and this is when implicit seed parameter comes in handy:
2138+
with each other and this is when the optional seed parameter comes in handy:
20922139

20932140
<programlisting>
2094-
\set k1 abs(hash(:r, :default_seed + 123)) % 1000000
2095-
\set k2 abs(hash(:r, :default_seed + 321)) % 1000000
2141+
\set k1 1 + permute(:r, :size, :default_seed + 123)
2142+
\set k2 1 + permute(:r, :size, :default_seed + 321)
20962143
</programlisting>
2144+
2145+
A similar behavior can also be approximated with <function>hash</function>:
2146+
2147+
<programlisting>
2148+
\set size 1000000
2149+
\set r random_zipfian(1, 100 * :size, 1.07)
2150+
\set k 1 + abs(hash(:r)) % :size
2151+
</programlisting>
2152+
2153+
However, since <function>hash</function> generates collisions, some values
2154+
will not be reachable and others will be more frequent than expected from
2155+
the original distribution.
20972156
</para>
20982157

20992158
<para>

src/bin/pgbench/exprparse.y

+17
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,7 @@
1919
#define PGBENCH_NARGS_VARIABLE (-1)
2020
#define PGBENCH_NARGS_CASE (-2)
2121
#define PGBENCH_NARGS_HASH (-3)
22+
#define PGBENCH_NARGS_PERMUTE (-4)
2223

2324
PgBenchExpr *expr_parse_result;
2425

@@ -370,6 +371,9 @@ static const struct
370371
{
371372
"hash_fnv1a", PGBENCH_NARGS_HASH, PGBENCH_HASH_FNV1A
372373
},
374+
{
375+
"permute", PGBENCH_NARGS_PERMUTE, PGBENCH_PERMUTE
376+
},
373377
/* keep as last array element */
374378
{
375379
NULL, 0, 0
@@ -482,6 +486,19 @@ make_func(yyscan_t yyscanner, int fnumber, PgBenchExprList *args)
482486
}
483487
break;
484488

489+
/* pseudorandom permutation function with optional seed argument */
490+
case PGBENCH_NARGS_PERMUTE:
491+
if (len < 2 || len > 3)
492+
expr_yyerror_more(yyscanner, "unexpected number of arguments",
493+
PGBENCH_FUNCTIONS[fnumber].fname);
494+
495+
if (len == 2)
496+
{
497+
PgBenchExpr *var = make_variable("default_seed");
498+
args = make_elist(var, args);
499+
}
500+
break;
501+
485502
/* common case: positive arguments number */
486503
default:
487504
Assert(PGBENCH_FUNCTIONS[fnumber].nargs >= 0);

src/bin/pgbench/pgbench.c

+131
Original file line numberDiff line numberDiff line change
@@ -66,6 +66,7 @@
6666
#include "getopt_long.h"
6767
#include "libpq-fe.h"
6868
#include "pgbench.h"
69+
#include "port/pg_bitutils.h"
6970
#include "portability/instr_time.h"
7071

7172
#ifndef M_PI
@@ -1127,6 +1128,113 @@ getHashMurmur2(int64 val, uint64 seed)
11271128
return (int64) result;
11281129
}
11291130

1131+
/*
1132+
* Pseudorandom permutation function
1133+
*
1134+
* For small sizes, this generates each of the (size!) possible permutations
1135+
* of integers in the range [0, size) with roughly equal probability. Once
1136+
* the size is larger than 20, the number of possible permutations exceeds the
1137+
* number of distinct states of the internal pseudorandom number generators,
1138+
* and so not all possible permutations can be generated, but the permutations
1139+
* chosen should continue to give the appearance of being random.
1140+
*
1141+
* THIS FUNCTION IS NOT CRYPTOGRAPHICALLY SECURE.
1142+
* DO NOT USE FOR SUCH PURPOSE.
1143+
*/
1144+
static int64
1145+
permute(const int64 val, const int64 isize, const int64 seed)
1146+
{
1147+
RandomState random_state1;
1148+
RandomState random_state2;
1149+
uint64 size;
1150+
uint64 v;
1151+
int masklen;
1152+
uint64 mask;
1153+
int i;
1154+
1155+
if (isize < 2)
1156+
return 0; /* nothing to permute */
1157+
1158+
/* Initialize a pair of random states using the seed */
1159+
random_state1.xseed[0] = seed & 0xFFFF;
1160+
random_state1.xseed[1] = (seed >> 16) & 0xFFFF;
1161+
random_state1.xseed[2] = (seed >> 32) & 0xFFFF;
1162+
1163+
random_state2.xseed[0] = (((uint64) seed) >> 48) & 0xFFFF;
1164+
random_state2.xseed[1] = seed & 0xFFFF;
1165+
random_state2.xseed[2] = (seed >> 16) & 0xFFFF;
1166+
1167+
/* Computations are performed on unsigned values */
1168+
size = (uint64) isize;
1169+
v = (uint64) val % size;
1170+
1171+
/* Mask to work modulo largest power of 2 less than or equal to size */
1172+
masklen = pg_leftmost_one_pos64(size);
1173+
mask = (((uint64) 1) << masklen) - 1;
1174+
1175+
/*
1176+
* Permute the input value by applying several rounds of pseudorandom
1177+
* bijective transformations. The intention here is to distribute each
1178+
* input uniformly randomly across the range, and separate adjacent inputs
1179+
* approximately uniformly randomly from each other, leading to a fairly
1180+
* random overall choice of permutation.
1181+
*
1182+
* To separate adjacent inputs, we multiply by a random number modulo
1183+
* (mask + 1), which is a power of 2. For this to be a bijection, the
1184+
* multiplier must be odd. Since this is known to lead to less randomness
1185+
* in the lower bits, we also apply a rotation that shifts the topmost bit
1186+
* into the least significant bit. In the special cases where size <= 3,
1187+
* mask = 1 and each of these operations is actually a no-op, so we also
1188+
* XOR the value with a different random number to inject additional
1189+
* randomness. Since the size is generally not a power of 2, we apply
1190+
* this bijection on overlapping upper and lower halves of the input.
1191+
*
1192+
* To distribute the inputs uniformly across the range, we then also apply
1193+
* a random offset modulo the full range.
1194+
*
1195+
* Taken together, these operations resemble a modified linear
1196+
* congruential generator, as is commonly used in pseudorandom number
1197+
* generators. The number of rounds is fairly arbitrary, but six has been
1198+
* found empirically to give a fairly good tradeoff between performance
1199+
* and uniform randomness. For small sizes it selects each of the (size!)
1200+
* possible permutations with roughly equal probability. For larger
1201+
* sizes, not all permutations can be generated, but the intended random
1202+
* spread is still produced.
1203+
*/
1204+
for (i = 0; i < 6; i++)
1205+
{
1206+
uint64 m,
1207+
r,
1208+
t;
1209+
1210+
/* Random multiply (by an odd number), XOR and rotate of lower half */
1211+
m = (uint64) getrand(&random_state1, 0, mask) | 1;
1212+
r = (uint64) getrand(&random_state2, 0, mask);
1213+
if (v <= mask)
1214+
{
1215+
v = ((v * m) ^ r) & mask;
1216+
v = ((v << 1) & mask) | (v >> (masklen - 1));
1217+
}
1218+
1219+
/* Random multiply (by an odd number), XOR and rotate of upper half */
1220+
m = (uint64) getrand(&random_state1, 0, mask) | 1;
1221+
r = (uint64) getrand(&random_state2, 0, mask);
1222+
t = size - 1 - v;
1223+
if (t <= mask)
1224+
{
1225+
t = ((t * m) ^ r) & mask;
1226+
t = ((t << 1) & mask) | (t >> (masklen - 1));
1227+
v = size - 1 - t;
1228+
}
1229+
1230+
/* Random offset */
1231+
r = (uint64) getrand(&random_state2, 0, size - 1);
1232+
v = (v + r) % size;
1233+
}
1234+
1235+
return (int64) v;
1236+
}
1237+
11301238
/*
11311239
* Initialize the given SimpleStats struct to all zeroes
11321240
*/
@@ -2475,6 +2583,29 @@ evalStandardFunc(CState *st,
24752583
return true;
24762584
}
24772585

2586+
case PGBENCH_PERMUTE:
2587+
{
2588+
int64 val,
2589+
size,
2590+
seed;
2591+
2592+
Assert(nargs == 3);
2593+
2594+
if (!coerceToInt(&vargs[0], &val) ||
2595+
!coerceToInt(&vargs[1], &size) ||
2596+
!coerceToInt(&vargs[2], &seed))
2597+
return false;
2598+
2599+
if (size <= 0)
2600+
{
2601+
pg_log_error("permute size parameter must be greater than zero");
2602+
return false;
2603+
}
2604+
2605+
setIntValue(retval, permute(val, size, seed));
2606+
return true;
2607+
}
2608+
24782609
default:
24792610
/* cannot get here */
24802611
Assert(0);

src/bin/pgbench/pgbench.h

+2-1
Original file line numberDiff line numberDiff line change
@@ -99,7 +99,8 @@ typedef enum PgBenchFunction
9999
PGBENCH_IS,
100100
PGBENCH_CASE,
101101
PGBENCH_HASH_FNV1A,
102-
PGBENCH_HASH_MURMUR2
102+
PGBENCH_HASH_MURMUR2,
103+
PGBENCH_PERMUTE
103104
} PgBenchFunction;
104105

105106
typedef struct PgBenchExpr PgBenchExpr;

src/bin/pgbench/t/001_pgbench_with_server.pl

+43
Original file line numberDiff line numberDiff line change
@@ -4,6 +4,7 @@
44
use PostgresNode;
55
use TestLib;
66
use Test::More;
7+
use Config;
78

89
# start a pgbench specific server
910
my $node = get_new_node('main');
@@ -483,6 +484,17 @@ sub pgbench
483484
qr{command=98.: int 5432\b}, # :random_seed
484485
qr{command=99.: int -9223372036854775808\b}, # min int
485486
qr{command=100.: int 9223372036854775807\b}, # max int
487+
# pseudorandom permutation tests
488+
qr{command=101.: boolean true\b},
489+
qr{command=102.: boolean true\b},
490+
qr{command=103.: boolean true\b},
491+
qr{command=104.: boolean true\b},
492+
qr{command=105.: boolean true\b},
493+
qr{command=109.: boolean true\b},
494+
qr{command=110.: boolean true\b},
495+
qr{command=111.: boolean true\b},
496+
qr{command=112.: int 9223372036854775797\b},
497+
qr{command=113.: boolean true\b},
486498
],
487499
'pgbench expressions',
488500
{
@@ -610,6 +622,33 @@ sub pgbench
610622
-- minint constant parsing
611623
\set min debug(-9223372036854775808)
612624
\set max debug(-(:min + 1))
625+
-- parametric pseudorandom permutation function
626+
\set t debug(permute(0, 2) + permute(1, 2) = 1)
627+
\set t debug(permute(0, 3) + permute(1, 3) + permute(2, 3) = 3)
628+
\set t debug(permute(0, 4) + permute(1, 4) + permute(2, 4) + permute(3, 4) = 6)
629+
\set t debug(permute(0, 5) + permute(1, 5) + permute(2, 5) + permute(3, 5) + permute(4, 5) = 10)
630+
\set t debug(permute(0, 16) + permute(1, 16) + permute(2, 16) + permute(3, 16) + \
631+
permute(4, 16) + permute(5, 16) + permute(6, 16) + permute(7, 16) + \
632+
permute(8, 16) + permute(9, 16) + permute(10, 16) + permute(11, 16) + \
633+
permute(12, 16) + permute(13, 16) + permute(14, 16) + permute(15, 16) = 120)
634+
-- random sanity checks
635+
\set size random(2, 1000)
636+
\set v random(0, :size - 1)
637+
\set p permute(:v, :size)
638+
\set t debug(0 <= :p and :p < :size and :p = permute(:v + :size, :size) and :p <> permute(:v + 1, :size))
639+
-- actual values
640+
\set t debug(permute(:v, 1) = 0)
641+
\set t debug(permute(0, 2, 5432) = 0 and permute(1, 2, 5432) = 1 and \
642+
permute(0, 2, 5435) = 1 and permute(1, 2, 5435) = 0)
643+
-- 63 bits tests
644+
\set size debug(:max - 10)
645+
\set t debug(permute(:size-1, :size, 5432) = 5301702756001087507 and \
646+
permute(:size-2, :size, 5432) = 8968485976055840695 and \
647+
permute(:size-3, :size, 5432) = 6708495591295582115 and \
648+
permute(:size-4, :size, 5432) = 2801794404574855121 and \
649+
permute(:size-5, :size, 5432) = 1489011409218895840 and \
650+
permute(:size-6, :size, 5432) = 2267749475878240183 and \
651+
permute(:size-7, :size, 5432) = 1300324176838786780)
613652
}
614653
});
615654

@@ -1048,6 +1087,10 @@ sub pgbench
10481087
'bad boolean', 2,
10491088
[qr{malformed variable.*trueXXX}], q{\set b :badtrue or true}
10501089
],
1090+
[
1091+
'invalid permute size', 2,
1092+
[qr{permute size parameter must be greater than zero}], q{\set i permute(0, 0)}
1093+
],
10511094

10521095
# GSET
10531096
[

0 commit comments

Comments
 (0)