Commit 9eacee2

Add Result Cache executor node (take 2)
Here we add a new executor node type named "Result Cache". The planner can include this node type in the plan to have the executor cache the results from the inner side of parameterized nested loop joins. This allows caching of tuples for sets of parameters so that, in the event that the node sees the same parameter values again, it can just return the cached tuples instead of rescanning the inner side of the join all over again. Internally, the result cache uses a hash table in order to quickly find tuples that have been previously cached.

For certain data sets, this can significantly improve the performance of joins. The best cases for this new node type are join problems where a large portion of the tuples from the inner side of the join have no join partner on the outer side. In such cases, a hash join would have to hash values that are never looked up, bloating the hash table and possibly causing it to multi-batch. A merge join would have to skip over all of the unmatched rows. If we use a parameterized nested loop with a result cache, then we only cache tuples that have at least one join partner on the outer side of the join. The benefits increase when there are fewer distinct values being looked up and the number of lookups of each value is large. Also, hash probes into the cache can be much faster than the hash probe in a hash join, as it's common for the result cache's hash table to be much smaller than the hash join's, since the result cache caches only useful tuples rather than all tuples from the inner side of the join. This difference in hash probe performance is most significant when the hash join's hash table no longer fits into the CPU's L3 cache but the result cache's hash table does: the apparently "random" access of hash buckets with each hash probe can cause a poor L3 cache hit ratio for large hash tables, and smaller hash tables generally perform better.

The hash table used for the cache limits itself to not exceeding work_mem * hash_mem_multiplier in size. We maintain a dlist of keys for this cache, and when adding new tuples would exceed the memory budget, we evict cache entries starting with the least recently used ones until we have enough memory to add the new tuples to the cache (see the sketch below).

For parameterized nested loop joins, we now consider placing one of these result cache nodes between the nested loop node and its inner node. We determine when this might be useful based on cost, which is primarily driven by the expected cache hit ratio. Estimating the cache hit ratio relies on having good distinct estimates on the nested loop's parameters.

For now, the planner will only consider using a result cache for parameterized nested loop joins. This works both for normal joins and for LATERAL-type joins to subqueries. It would be possible to use this new node for other purposes in the future, for example to cache results from correlated subqueries. However, that's not done here due to some difficulties obtaining a distinct estimate on the outer plan to calculate the estimated cache hit ratio: we currently plan the inner plan before the outer plan, so there is no good way to know whether a result cache would be useful, since we can't estimate the number of times the subplan will be called until the outer plan is generated.
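As an illustration of the eviction scheme described above, here is a minimal, self-contained C sketch: a toy LRU-bounded hash cache keyed on a single integer parameter. This is not the actual nodeResultCache.c implementation; the bucket count, entry sizes, and memory limit are invented for the example.

#include <stdio.h>
#include <stdlib.h>

#define NBUCKETS  64        /* toy bucket count (power of two) */
#define MEM_LIMIT 4096      /* stand-in for work_mem * hash_mem_multiplier */

typedef struct Entry
{
    int           key;      /* the parameter value */
    size_t        mem;      /* bytes "cached" for this key */
    struct Entry *chain;    /* hash bucket chain */
    struct Entry *prev;     /* LRU list links */
    struct Entry *next;
} Entry;

static Entry  *bucket[NBUCKETS];
static Entry  *mru;         /* most recently used entry */
static Entry  *lru;         /* least recently used entry: evict from here */
static size_t  mem_used;

static void
list_unlink(Entry *e)
{
    if (e->prev) e->prev->next = e->next; else mru = e->next;
    if (e->next) e->next->prev = e->prev; else lru = e->prev;
}

static void
list_push_mru(Entry *e)
{
    e->prev = NULL;
    e->next = mru;
    if (mru) mru->prev = e;
    mru = e;
    if (lru == NULL) lru = e;
}

static Entry *
lookup(int key)
{
    for (Entry *e = bucket[key & (NBUCKETS - 1)]; e; e = e->chain)
        if (e->key == key)
        {
            list_unlink(e);     /* cache hit: move to MRU position */
            list_push_mru(e);
            return e;
        }
    return NULL;                /* cache miss */
}

static void
evict(Entry *victim)
{
    Entry **pp = &bucket[victim->key & (NBUCKETS - 1)];

    while (*pp != victim)       /* unchain from its hash bucket */
        pp = &(*pp)->chain;
    *pp = victim->chain;
    list_unlink(victim);
    mem_used -= victim->mem;
    free(victim);
}

static void
insert(int key, size_t mem)
{
    Entry *e;

    /* Evict least recently used entries until the new entry fits. */
    while (lru != NULL && mem_used + mem > MEM_LIMIT)
        evict(lru);

    e = calloc(1, sizeof(Entry));
    e->key = key;
    e->mem = mem;
    e->chain = bucket[key & (NBUCKETS - 1)];
    bucket[key & (NBUCKETS - 1)] = e;
    list_push_mru(e);
    mem_used += mem;
}

int
main(void)
{
    /* Three hot parameter values interleaved with a stream of cold ones:
     * the hot keys keep hitting while the cold keys get evicted in LRU
     * order once the memory budget is exceeded. */
    int pattern[] = {0, 1, 2, 3, 0, 1, 2, 4, 0, 1, 2, 5, 0, 1, 2, 6, 0, 1, 2, 7};

    for (int i = 0; i < (int) (sizeof(pattern) / sizeof(pattern[0])); i++)
    {
        int key = pattern[i];

        if (lookup(key) != NULL)
            printf("hit  key=%d\n", key);
        else
        {
            insert(key, 700);   /* pretend the inner scan produced 700 bytes */
            printf("miss key=%d (mem_used=%zu)\n", key, mem_used);
        }
    }
    return 0;
}

With eight distinct keys at 700 bytes each, the 4096-byte budget overflows, so each miss past the limit evicts the coldest key first, mirroring the "evict until the new tuples fit" loop described above.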
The functionality being added here newly introduces a dependency on the return value of estimate_num_groups() during the join search. Previously, during the join search, we only ever needed to perform selectivity estimations. With this commit, we need to use estimate_num_groups() in order to estimate what the hit ratio on the result cache will be. In simple terms, if we expect 10 distinct values and 1000 outer rows, then we'll estimate the hit ratio to be 99% (a worked version of this arithmetic follows the message). Since cache hits are very cheap compared to scanning the underlying nodes on the inner side of the nested loop join, this can significantly reduce the planner's cost for the join. However, it's fairly easy to see that things will go bad when estimate_num_groups() incorrectly returns a value that's significantly lower than the actual number of distinct values. If that happens, we may use a nested loop join with a result cache instead of some other join type, such as a merge or hash join. Our distinct estimations have been known to be a source of trouble in the past, so the extra reliance on them here could cause the planner to choose slower plans than it did prior to this feature. Distinct values are also fairly hard to estimate accurately when several tables have already been joined, or when a WHERE clause filters out a set of values that are correlated to the expressions whose number of distinct values we're estimating.

For now, the costing we perform during query planning for result caches puts quite a bit of faith in the distinct estimations being accurate. When they are accurate, we should generally see faster execution times for plans containing a result cache. However, in the real world, we may find that we need to either change the costing to put less trust in the distinct estimations or perhaps even disable this feature by default. There's always an element of risk when we teach the query planner a new trick: it may decide to use that trick at the wrong time and cause a regression. Users may opt for the old behavior by turning the feature off using the enable_resultcache GUC. Currently, this is enabled by default. It remains to be seen if we'll keep that setting for the release.

Additionally, the name "Result Cache" is the best name I could think of for this new node at the time I started writing the patch. Nobody seems to strongly dislike the name. A few people did suggest other names, but no other name seemed to dominate in the brief discussion there was about names. Let's allow the beta period to see if the current name pleases enough people. If there's some consensus on a better name, we can change it before the release. Please see the 2nd discussion link below for the discussion on the "Result Cache" name.

Author: David Rowley
Reviewed-by: Andy Fan, Justin Pryzby, Zhihong Yu, Hou Zhijie
Tested-By: Konstantin Knizhnik
Discussion: https://postgr.es/m/CAApHDvrPcQyQdWERGYWx8J%2B2DLUNgXu%2BfOSbQ1UscxrunyXyrQ%40mail.gmail.com
Discussion: https://postgr.es/m/CAApHDvq=yQXr5kqhRviT2RhNKwToaWr9JAN5t+5_PzhuRJ3wvg@mail.gmail.com
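To make the hit-ratio arithmetic above concrete, here is a small standalone sketch. It is not the planner's actual costing code (that lives in src/backend/optimizer/path/costsize.c); it only shows the simplest form of the estimate, assuming every distinct parameter value fits in the cache at once.

#include <stdio.h>

/*
 * Simplest form of the estimate described above: with ndistinct distinct
 * parameter values, at most ndistinct of the calls can be cache misses;
 * every other call is a hit.  This assumes the cache is large enough to
 * hold all ndistinct entries at the same time.
 */
static double
estimated_hit_ratio(double calls, double ndistinct)
{
    if (ndistinct >= calls)
        return 0.0;             /* each lookup sees a new parameter value */
    return (calls - ndistinct) / calls;
}

int
main(void)
{
    /* The commit message's example: 1000 outer rows, 10 distinct values. */
    printf("%.2f\n", estimated_hit_ratio(1000.0, 10.0));   /* prints 0.99 */
    return 0;
}

If estimate_num_groups() underestimates ndistinct, this ratio is pushed toward 1.0 and the join looks cheaper than it really is, which is exactly the risk discussed above.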
1 parent fe246d1 commit 9eacee2

Note: large commits have some content hidden by default, so only a subset of the changed files is shown below.

46 files changed: +2709 −76 lines

contrib/postgres_fdw/expected/postgres_fdw.out

Lines changed: 15 additions & 10 deletions
@@ -1602,6 +1602,7 @@ SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 LEFT JOIN ft2 t2 ON (t1.c1 = t2.c1) FULL
   20 | 0 | AAA020
 (10 rows)
 
+SET enable_resultcache TO off;
 -- right outer join + left outer join
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 RIGHT JOIN ft2 t2 ON (t1.c1 = t2.c1) LEFT JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;
@@ -1628,6 +1629,7 @@ SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 RIGHT JOIN ft2 t2 ON (t1.c1 = t2.c1) LEFT
   20 | 0 | AAA020
 (10 rows)
 
+RESET enable_resultcache;
 -- left outer join + right outer join
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 LEFT JOIN ft2 t2 ON (t1.c1 = t2.c1) RIGHT JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;
@@ -2139,22 +2141,25 @@ SELECT t1c1, avg(t1c1 + t2c1) FROM (SELECT t1.c1, t2.c1 FROM ft1 t1 JOIN ft2 t2
 -- join with lateral reference
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10;
-                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
+                                                                                     QUERY PLAN
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Limit
    Output: t1."C 1"
    ->  Nested Loop
          Output: t1."C 1"
          ->  Index Scan using t1_pkey on "S 1"."T 1" t1
                Output: t1."C 1", t1.c2, t1.c3, t1.c4, t1.c5, t1.c6, t1.c7, t1.c8
-         ->  HashAggregate
-               Output: t2.c1, t3.c1
-               Group Key: t2.c1, t3.c1
-               ->  Foreign Scan
-                     Output: t2.c1, t3.c1
-                     Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3)
-                     Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer))))
-(13 rows)
+         ->  Result Cache
+               Cache Key: t1.c2
+               ->  Subquery Scan on q
+                     ->  HashAggregate
+                           Output: t2.c1, t3.c1
+                           Group Key: t2.c1, t3.c1
+                           ->  Foreign Scan
+                                 Output: t2.c1, t3.c1
+                                 Relations: (public.ft1 t2) INNER JOIN (public.ft2 t3)
+                                 Remote SQL: SELECT r1."C 1", r2."C 1" FROM ("S 1"."T 1" r1 INNER JOIN "S 1"."T 1" r2 ON (((r1."C 1" = r2."C 1")) AND ((r1.c2 = $1::integer))))
+(16 rows)
 
 SELECT t1."C 1" FROM "S 1"."T 1" t1, LATERAL (SELECT DISTINCT t2.c1, t3.c1 FROM ft1 t2, ft2 t3 WHERE t2.c1 = t3.c1 AND t2.c2 = t1.c2) q ORDER BY t1."C 1" OFFSET 10 LIMIT 10;
   C 1

contrib/postgres_fdw/sql/postgres_fdw.sql

Lines changed: 2 additions & 0 deletions
@@ -502,10 +502,12 @@ SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 FULL JOIN ft2 t2 ON (t1.c1 = t2.c1) LEFT
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 LEFT JOIN ft2 t2 ON (t1.c1 = t2.c1) FULL JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 LEFT JOIN ft2 t2 ON (t1.c1 = t2.c1) FULL JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;
+SET enable_resultcache TO off;
 -- right outer join + left outer join
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 RIGHT JOIN ft2 t2 ON (t1.c1 = t2.c1) LEFT JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 RIGHT JOIN ft2 t2 ON (t1.c1 = t2.c1) LEFT JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;
+RESET enable_resultcache;
 -- left outer join + right outer join
 EXPLAIN (VERBOSE, COSTS OFF)
 SELECT t1.c1, t2.c2, t3.c3 FROM ft2 t1 LEFT JOIN ft2 t2 ON (t1.c1 = t2.c1) RIGHT JOIN ft4 t3 ON (t2.c1 = t3.c1) OFFSET 10 LIMIT 10;

doc/src/sgml/config.sgml

Lines changed: 22 additions & 2 deletions
@@ -1770,8 +1770,9 @@ include_dir 'conf.d'
         fact in mind when choosing the value.  Sort operations are used
         for <literal>ORDER BY</literal>, <literal>DISTINCT</literal>,
         and merge joins.
-        Hash tables are used in hash joins, hash-based aggregation, and
-        hash-based processing of <literal>IN</literal> subqueries.
+        Hash tables are used in hash joins, hash-based aggregation, result
+        cache nodes and hash-based processing of <literal>IN</literal>
+        subqueries.
        </para>
        <para>
         Hash-based operations are generally more sensitive to memory
@@ -4925,6 +4926,25 @@ ANY <replaceable class="parameter">num_sync</replaceable> ( <replaceable class="
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-enable-resultcache" xreflabel="enable_resultcache">
+      <term><varname>enable_resultcache</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>enable_resultcache</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables or disables the query planner's use of result cache plans for
+        caching results from parameterized scans inside nested-loop joins.
+        This plan type allows scans to the underlying plans to be skipped when
+        the results for the current parameters are already in the cache.  Less
+        commonly looked up results may be evicted from the cache when more
+        space is required for new entries.  The default is
+        <literal>on</literal>.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-enable-mergejoin" xreflabel="enable_mergejoin">
       <term><varname>enable_mergejoin</varname> (<type>boolean</type>)
       <indexterm>

src/backend/commands/explain.c

Lines changed: 144 additions & 0 deletions
@@ -108,6 +108,8 @@ static void show_sort_info(SortState *sortstate, ExplainState *es);
 static void show_incremental_sort_info(IncrementalSortState *incrsortstate,
                                        ExplainState *es);
 static void show_hash_info(HashState *hashstate, ExplainState *es);
+static void show_resultcache_info(ResultCacheState *rcstate, List *ancestors,
+                                  ExplainState *es);
 static void show_hashagg_info(AggState *hashstate, ExplainState *es);
 static void show_tidbitmap_info(BitmapHeapScanState *planstate,
                                 ExplainState *es);
@@ -1284,6 +1286,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Material:
             pname = sname = "Materialize";
             break;
+        case T_ResultCache:
+            pname = sname = "Result Cache";
+            break;
         case T_Sort:
             pname = sname = "Sort";
             break;
@@ -1996,6 +2001,10 @@ ExplainNode(PlanState *planstate, List *ancestors,
         case T_Hash:
             show_hash_info(castNode(HashState, planstate), es);
             break;
+        case T_ResultCache:
+            show_resultcache_info(castNode(ResultCacheState, planstate),
+                                  ancestors, es);
+            break;
         default:
             break;
     }
@@ -3063,6 +3072,141 @@ show_hash_info(HashState *hashstate, ExplainState *es)
     }
 }
 
+/*
+ * Show information on result cache hits/misses/evictions and memory usage.
+ */
+static void
+show_resultcache_info(ResultCacheState *rcstate, List *ancestors,
+                      ExplainState *es)
+{
+    Plan       *plan = ((PlanState *) rcstate)->plan;
+    ListCell   *lc;
+    List       *context;
+    StringInfoData keystr;
+    char       *seperator = "";
+    bool        useprefix;
+    int64       memPeakKb;
+
+    initStringInfo(&keystr);
+
+    /*
+     * It's hard to imagine having a result cache with fewer than 2 RTEs, but
+     * let's just keep the same useprefix logic as elsewhere in this file.
+     */
+    useprefix = list_length(es->rtable) > 1 || es->verbose;
+
+    /* Set up deparsing context */
+    context = set_deparse_context_plan(es->deparse_cxt,
+                                       plan,
+                                       ancestors);
+
+    foreach(lc, ((ResultCache *) plan)->param_exprs)
+    {
+        Node       *expr = (Node *) lfirst(lc);
+
+        appendStringInfoString(&keystr, seperator);
+
+        appendStringInfoString(&keystr, deparse_expression(expr, context,
+                                                           useprefix, false));
+        seperator = ", ";
+    }
+
+    if (es->format != EXPLAIN_FORMAT_TEXT)
+    {
+        ExplainPropertyText("Cache Key", keystr.data, es);
+    }
+    else
+    {
+        ExplainIndentText(es);
+        appendStringInfo(es->str, "Cache Key: %s\n", keystr.data);
+    }
+
+    pfree(keystr.data);
+
+    if (!es->analyze)
+        return;
+
+    /*
+     * mem_peak is only set when we freed memory, so we must use mem_used when
+     * mem_peak is 0.
+     */
+    if (rcstate->stats.mem_peak > 0)
+        memPeakKb = (rcstate->stats.mem_peak + 1023) / 1024;
+    else
+        memPeakKb = (rcstate->mem_used + 1023) / 1024;
+
+    if (rcstate->stats.cache_misses > 0)
+    {
+        if (es->format != EXPLAIN_FORMAT_TEXT)
+        {
+            ExplainPropertyInteger("Cache Hits", NULL, rcstate->stats.cache_hits, es);
+            ExplainPropertyInteger("Cache Misses", NULL, rcstate->stats.cache_misses, es);
+            ExplainPropertyInteger("Cache Evictions", NULL, rcstate->stats.cache_evictions, es);
+            ExplainPropertyInteger("Cache Overflows", NULL, rcstate->stats.cache_overflows, es);
+            ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb, es);
+        }
+        else
+        {
+            ExplainIndentText(es);
+            appendStringInfo(es->str,
+                             "Hits: " UINT64_FORMAT "  Misses: " UINT64_FORMAT "  Evictions: " UINT64_FORMAT "  Overflows: " UINT64_FORMAT "  Memory Usage: " INT64_FORMAT "kB\n",
+                             rcstate->stats.cache_hits,
+                             rcstate->stats.cache_misses,
+                             rcstate->stats.cache_evictions,
+                             rcstate->stats.cache_overflows,
+                             memPeakKb);
+        }
+    }
+
+    if (rcstate->shared_info == NULL)
+        return;
+
+    /* Show details from parallel workers */
+    for (int n = 0; n < rcstate->shared_info->num_workers; n++)
+    {
+        ResultCacheInstrumentation *si;
+
+        si = &rcstate->shared_info->sinstrument[n];
+
+        if (es->workers_state)
+            ExplainOpenWorker(n, es);
+
+        /*
+         * Since the worker's ResultCacheState.mem_used field is unavailable
+         * to us, ExecEndResultCache will have set the
+         * ResultCacheInstrumentation.mem_peak field for us.  No need to do
+         * the zero checks like we did for the serial case above.
+         */
+        memPeakKb = (si->mem_peak + 1023) / 1024;
+
+        if (es->format == EXPLAIN_FORMAT_TEXT)
+        {
+            ExplainIndentText(es);
+            appendStringInfo(es->str,
                             "Hits: " UINT64_FORMAT "  Misses: " UINT64_FORMAT "  Evictions: " UINT64_FORMAT "  Overflows: " UINT64_FORMAT "  Memory Usage: " INT64_FORMAT "kB\n",
+                             si->cache_hits, si->cache_misses,
+                             si->cache_evictions, si->cache_overflows,
+                             memPeakKb);
+        }
+        else
+        {
+            ExplainPropertyInteger("Cache Hits", NULL,
+                                   si->cache_hits, es);
+            ExplainPropertyInteger("Cache Misses", NULL,
+                                   si->cache_misses, es);
+            ExplainPropertyInteger("Cache Evictions", NULL,
+                                   si->cache_evictions, es);
+            ExplainPropertyInteger("Cache Overflows", NULL,
+                                   si->cache_overflows, es);
+            ExplainPropertyInteger("Peak Memory Usage", "kB", memPeakKb,
                                   es);
+        }
+
+        if (es->workers_state)
+            ExplainCloseWorker(n, es);
+    }
+}
+
 /*
  * Show information on hash aggregate memory usage and batches.
  */
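To make the text-format branch above concrete, here is a tiny standalone sketch that prints a line of the same shape, including the round-up-to-whole-kilobytes arithmetic used for Peak Memory Usage. The counter values are invented for illustration; in real output they come from the node's instrumentation.

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int
main(void)
{
    uint64_t hits = 980, misses = 20, evictions = 0, overflows = 0;
    int64_t  mem_peak_bytes = 13900;
    /* same rounding as the code above: round bytes up to whole kilobytes */
    int64_t  mem_peak_kb = (mem_peak_bytes + 1023) / 1024;

    printf("Hits: %" PRIu64 "  Misses: %" PRIu64 "  Evictions: %" PRIu64
           "  Overflows: %" PRIu64 "  Memory Usage: %" PRId64 "kB\n",
           hits, misses, evictions, overflows, mem_peak_kb);
    return 0;
}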

src/backend/executor/Makefile

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ OBJS = \
 	nodeProjectSet.o \
 	nodeRecursiveunion.o \
 	nodeResult.o \
+	nodeResultCache.o \
 	nodeSamplescan.o \
 	nodeSeqscan.o \
 	nodeSetOp.o \

src/backend/executor/execAmi.c

Lines changed: 5 additions & 0 deletions
@@ -44,6 +44,7 @@
 #include "executor/nodeProjectSet.h"
 #include "executor/nodeRecursiveunion.h"
 #include "executor/nodeResult.h"
+#include "executor/nodeResultCache.h"
 #include "executor/nodeSamplescan.h"
 #include "executor/nodeSeqscan.h"
 #include "executor/nodeSetOp.h"
@@ -254,6 +255,10 @@ ExecReScan(PlanState *node)
             ExecReScanMaterial((MaterialState *) node);
             break;
 
+        case T_ResultCacheState:
+            ExecReScanResultCache((ResultCacheState *) node);
+            break;
+
         case T_SortState:
             ExecReScanSort((SortState *) node);
             break;
