Skip to content

Commit 234d508

Browse files
committed
Add documentation on how statistics are used by the planner.
Mark Kirkwood
1 parent 4f51368 commit 234d508

File tree

3 files changed

+374
-2
lines changed

3 files changed

+374
-2
lines changed

doc/src/sgml/filelist.sgml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.42 2005/02/13 03:04:15 tgl Exp $ -->
1+
<!-- $PostgreSQL: pgsql/doc/src/sgml/filelist.sgml,v 1.43 2005/02/27 00:49:28 momjian Exp $ -->
22

33
<!entity history SYSTEM "history.sgml">
44
<!entity info SYSTEM "info.sgml">
@@ -77,6 +77,7 @@
7777
<!entity catalogs SYSTEM "catalogs.sgml">
7878
<!entity geqo SYSTEM "geqo.sgml">
7979
<!entity gist SYSTEM "gist.sgml">
80+
<!entity planstats SYSTEM "planstats.sgml">
8081
<!entity indexam SYSTEM "indexam.sgml">
8182
<!entity nls SYSTEM "nls.sgml">
8283
<!entity plhandler SYSTEM "plhandler.sgml">

doc/src/sgml/planstats.sgml

Lines changed: 370 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,370 @@
1+
<!--
2+
$PostgreSQL: pgsql/doc/src/sgml/planstats.sgml,v 1.1 2005/02/27 00:49:28 momjian Exp $
3+
-->
4+
5+
<chapter id="planner-stats">
6+
<title>How the Planner Uses Statistics</title>
7+
8+
<para>
9+
This chapter builds on the material covered in <xref linkend="using-explain">
10+
and <xref linkend="planner-stats">, and shows how the planner uses the
11+
system statistics to estimate the number of rows each stage in a query might
12+
return. This is a significant part of the planning / optimizing process,
13+
providing much of the raw material for cost calculation.
14+
</para>
15+
16+
<para>
17+
The intent of this chapter is not to document the code &mdash;
18+
better done in the code itself, but to present an overview of how it works.
19+
This will perhaps ease the learning curve for someone who subsequently
20+
wishes to read the code. As a consequence, the approach chosen is to analyze
21+
a series of incrementally more complex examples.
22+
</para>
23+
24+
<para>
25+
The outputs and algorithms shown below are taken from version 8.0.
26+
The behaviour of earlier (or later) versions may vary.
27+
</para>
28+
29+
<sect1 id="row-estimation-examples">
30+
<title>Row Estimation Examples</title>
31+
32+
<indexterm zone="row-estimation-examples">
33+
<primary>row estimation</primary>
34+
<secondary>planner</secondary>
35+
</indexterm>
36+
37+
<para>
38+
Using examples drawn from the regression test database, let's start with a
39+
very simple query:
40+
<programlisting>
41+
EXPLAIN SELECT * FROM tenk1;
42+
43+
QUERY PLAN
44+
-------------------------------------------------------------
45+
Seq Scan on tenk1 (cost=0.00..445.00 rows=10000 width=244)
46+
</programlisting>
47+
48+
How the planner determines the cardinality of <classname>tenk1</classname>
49+
is covered in <xref linkend="using-explain">, but is repeated here for
50+
completeness. The number of rows is looked up from
51+
<classname>pg_class</classname>:
52+
53+
<programlisting>
54+
SELECT reltuples, relpages FROM pg_class WHERE relname = 'tenk1';
55+
56+
relpages | reltuples
57+
----------+-----------
58+
345 | 10000
59+
</programlisting>
60+
The planner will check the <structfield>relpages<structfield> estimate
61+
(this is a cheap operation) and if incorrect may scale
62+
<structfield>reltuples<structfield> to obtain a row estimate. In this case it
63+
does not, thus:
64+
65+
<programlisting>
66+
rows = 10000
67+
</programlisting>
68+
69+
</para>
70+
71+
<para>
72+
let's move on to an example with a range condition in its
73+
<literal>WHERE</literal> clause:
74+
75+
<programlisting>
76+
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000;
77+
78+
QUERY PLAN
79+
------------------------------------------------------------
80+
Seq Scan on tenk1 (cost=0.00..470.00 rows=1031 width=244)
81+
Filter: (unique1 &lt; 1000)
82+
</programlisting>
83+
84+
The planner examines the <literal>WHERE</literal> clause condition:
85+
86+
<programlisting>
87+
unique1 &lt; 1000
88+
</programlisting>
89+
90+
and looks up the restriction function for the operator
91+
<literal>&lt;</literal> in <classname>pg_operator</classname>.
92+
This is held in the column <structfield>oprrest</structfield>,
93+
and the result in this case is <function>scalarltsel</function>.
94+
The <function>scalarltsel</function> function retrieves the histogram for
95+
<structfield>unique1</structfield> from <classname>pg_statistics</classname>
96+
- we can follow this by using the simpler <classname>pg_stats</classname>
97+
view:
98+
99+
<programlisting>
100+
SELECT histogram_bounds FROM pg_stats
101+
WHERE tablename='tenk1' AND attname='unique1';
102+
103+
histogram_bounds
104+
------------------------------------------------------
105+
{1,970,1943,2958,3971,5069,6028,7007,7919,8982,9995}
106+
</programlisting>
107+
108+
Next the fraction of the histogram occupied by <quote>&lt; 1000</quote>
109+
is worked out. This is the selectivity. The histogram divides the range
110+
into equal frequency buckets, so all we have to do is locate the bucket
111+
that our value is in and count <emphasis>part</emphasis> of it and
112+
<emphasis>all</emphasis> of the ones before. The value 1000 is clearly in
113+
the second (970 - 1943) bucket, so by assuming a linear distribution of
114+
values inside each bucket we can calculate the selectivity as:
115+
116+
<programlisting>
117+
selectivity = (1 + (1000 - bckt[2].min)/(bckt[2].max - bckt[2].min))/num_bckts
118+
= (1 + (1000 - 970)/(1943 - 970))/10
119+
= 0.1031
120+
</programlisting>
121+
122+
that is, one whole bucket plus a linear fraction of the second, divided by
123+
the number of buckets. The estimated number of rows can now be calculated as
124+
the product of the selectivity and the cardinality of
125+
<classname>tenk1</classname>:
126+
127+
<programlisting>
128+
rows = rel_cardinality * selectivity
129+
= 10000 * 0.1031
130+
= 1031
131+
</programlisting>
132+
133+
</para>
134+
135+
<para>
136+
Next let's consider an example with equality condition in its
137+
<literal>WHERE</literal> clause:
138+
139+
<programlisting>
140+
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'ATAAAA';
141+
142+
QUERY PLAN
143+
----------------------------------------------------------
144+
Seq Scan on tenk1 (cost=0.00..470.00 rows=31 width=244)
145+
Filter: (stringu1 = 'ATAAAA'::name)
146+
</programlisting>
147+
148+
Again the planner examines the <literal>WHERE</literal> clause condition:
149+
150+
<programlisting>
151+
stringu1 = 'ATAAAA'
152+
</programlisting>
153+
154+
and looks up the restriction function for <literal>=</literal>, which is
155+
<function>eqsel</function>. This case is a bit different, as the most
156+
common values &mdash; <acronym>MCV</acronym>s, are used to determine the
157+
selectivity. Let's have a look at these, with some extra columns that will
158+
be useful later:
159+
160+
<programlisting>
161+
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs FROM pg_stats
162+
WHERE tablename='tenk1' AND attname='stringu1';
163+
164+
null_frac | 0
165+
n_distinct | 672
166+
most_common_vals | {FDAAAA,NHAAAA,ATAAAA,BGAAAA,EBAAAA,MOAAAA,NDAAAA,OWAAAA,BHAAAA,BJAAAA}
167+
most_common_freqs | {0.00333333,0.00333333,0.003,0.003,0.003,0.003,0.003,0.003,0.00266667,0.00266667}
168+
</programlisting>
169+
170+
The selectivity is merely the most common frequency (<acronym>MCF</acronym>)
171+
corresponding to the third <acronym>MCV</acronym> &mdash; 'ATAAAA':
172+
173+
<programlisting>
174+
selectivity = mcf[3]
175+
= 0.003
176+
</programlisting>
177+
178+
The estimated number of rows is just the product of this with the
179+
cardinality of <classname>tenk1</classname> as before:
180+
181+
<programlisting>
182+
rows = 10000 * 0.003
183+
= 30
184+
</programlisting>
185+
186+
The number displayed by <command>EXPLAIN</command> is one more than this,
187+
due to some post estimation checks.
188+
</para>
189+
190+
<para>
191+
Now consider the same query, but with a constant that is not in the
192+
<acronym>MCV</acronym> list:
193+
194+
<programlisting>
195+
EXPLAIN SELECT * FROM tenk1 WHERE stringu1 = 'xxx';
196+
197+
QUERY PLAN
198+
----------------------------------------------------------
199+
Seq Scan on tenk1 (cost=0.00..470.00 rows=15 width=244)
200+
Filter: (stringu1 = 'xxx'::name)
201+
</programlisting>
202+
203+
This is quite a different problem, how to estimate the selectivity when the
204+
value is <emphasis>not</emphasis> in the <acronym>MCV</acronym> list.
205+
The approach is to use the fact that the value is not in the list,
206+
combined with the knowledge of the frequencies for all of the
207+
<acronym>MCV</acronym>s:
208+
209+
<programlisting>
210+
selectivity = (1 - sum(mvf))/(num_distinct - num_mcv)
211+
= (1 - (0.00333333 + 0.00333333 + 0.003 + 0.003 + 0.003
212+
+ 0.003 + 0.003 + 0.003 + 0.00266667 + 0.00266667))/(672 - 10)
213+
= 0.001465
214+
</programlisting>
215+
216+
That is, add up all the frequencies for the <acronym>MCV</acronym>s and
217+
subtract them from one &mdash; because it is <emphasis>not</emphasis> one
218+
of these, and divide by the <emphasis>remaining</emphasis> distinct values.
219+
Notice that there are no null values so we don't have to worry about those.
220+
The estimated number of rows is calculated as usual:
221+
222+
<programlisting>
223+
rows = 10000 * 0.001465
224+
= 15
225+
</programlisting>
226+
227+
</para>
228+
229+
<para>
230+
Let's increase the complexity to consider a case with more than one
231+
condition in the <literal>WHERE</literal> clause:
232+
233+
<programlisting>
234+
EXPLAIN SELECT * FROM tenk1 WHERE unique1 &lt; 1000 AND stringu1 = 'xxx';
235+
236+
QUERY PLAN
237+
-----------------------------------------------------------
238+
Seq Scan on tenk1 (cost=0.00..495.00 rows=2 width=244)
239+
Filter: ((unique1 &lt; 1000) AND (stringu1 = 'xxx'::name))
240+
</programlisting>
241+
242+
An assumption of independence is made and the selectivities of the
243+
individual restrictions are multiplied together:
244+
245+
<programlisting>
246+
selectivity = selectivity(unique1 &lt; 1000) * selectivity(stringu1 = 'xxx')
247+
= 0.1031 * 0.001465
248+
= 0.00015104
249+
</programlisting>
250+
251+
The row estimates are calculated as before:
252+
253+
<programlisting>
254+
rows = 10000 * 0.00015104
255+
= 2
256+
</programlisting>
257+
</para>
258+
259+
<para>
260+
Finally we will examine a query that includes a <literal>JOIN</literal>
261+
together with a <literal>WHERE</literal> clause:
262+
263+
<programlisting>
264+
EXPLAIN SELECT * FROM tenk1 t1, tenk2 t2
265+
WHERE t1.unique1 &lt; 50 AND t1.unique2 = t2.unique2;
266+
267+
QUERY PLAN
268+
-----------------------------------------------------------------------------------------
269+
Nested Loop (cost=0.00..346.90 rows=51 width=488)
270+
-> Index Scan using tenk1_unique1 on tenk1 t1 (cost=0.00..192.57 rows=51 width=244)
271+
Index Cond: (unique1 &lt; 50)
272+
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.00..3.01 rows=1 width=244)
273+
Index Cond: ("outer".unique2 = t2.unique2)
274+
</programlisting>
275+
276+
The restriction on <classname>tenk1</classname>
277+
<quote>unique1 &lt; 50</quote> is evaluated before the nested-loop join.
278+
This is handled analogously to the previous range example. The restriction
279+
operator for <literal>&lt;</literal> is <function>scalarlteqsel</function>
280+
as before, but this time the value 50 is in the first bucket of the
281+
<structfield>unique1</structfield> histogram:
282+
283+
<programlisting>
284+
selectivity = (0 + (50 - bckt[1].min)/(bckt[1].max - bckt[1].min))/num_bckts
285+
= (0 + (50 - 1)/(970 - 1))/10
286+
= 0.005057
287+
288+
rows = 10000 * 0.005057
289+
= 51
290+
</programlisting>
291+
292+
The restriction for the join is:
293+
294+
<programlisting>
295+
t2.unique2 = t1.unique2
296+
</programlisting>
297+
298+
This is due to the join method being nested-loop, with
299+
<classname>tenk1</classname> being in the outer loop. The operator is just
300+
our familiar <literal>=<literal>, however the restriction function is
301+
obtained from the <structfield>oprjoin</structfield> column of
302+
<classname>pg_operator</classname> - and is <function>eqjoinsel</function>.
303+
Additionally we use the statistical information for both
304+
<classname>tenk2</classname> and <classname>tenk1</classname>:
305+
306+
<programlisting>
307+
SELECT tablename, null_frac,n_distinct, most_common_vals FROM pg_stats
308+
WHERE tablename IN ('tenk1', 'tenk2') AND attname='unique2';
309+
310+
tablename | null_frac | n_distinct | most_common_vals
311+
-----------+-----------+------------+------------------
312+
tenk1 | 0 | -1 |
313+
tenk2 | 0 | -1 |
314+
</programlisting>
315+
316+
In this case there is no <acronym>MCV</acronym> information for
317+
<structfield>unique2</structfield> because all the values appear to be
318+
unique, so we can use an algorithm that relies only on the number of
319+
distinct values for both relations together with their null fractions:
320+
321+
<programlisting>
322+
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
323+
= (1 - 0) * (1 - 0) * min(1/10000, 1/1000)
324+
= 0.0001
325+
</programlisting>
326+
327+
This is, subtract the null fraction from one for each of the relations,
328+
and divide by the maximum of the two distinct values. The number of rows
329+
that the join is likely to emit is calculated as the cardinality of
330+
cartesian product of the two nodes in the nested-loop, multiplied by the
331+
selectivity:
332+
333+
<programlisting>
334+
rows = (outer_cardinality * inner_cardinality) * selectivity
335+
= (51 * 10000) * 0.0001
336+
= 51
337+
</programlisting>
338+
</para>
339+
340+
<para>
341+
For those interested in further details, estimation of the number of rows in
342+
a relation is covered in
343+
<filename>src/backend/optimizer/util/plancat.c</filename>. The calculation
344+
logic for clause selectivities is in
345+
<filename>src/backend/optimizer/path/clausesel.c</filename>. The actual
346+
implementations of the operator and join restriction functions can be found
347+
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
348+
</para>
349+
350+
</sect1>
351+
352+
353+
</chapter>
354+
355+
<!-- Keep this comment at the end of the file
356+
Local variables:
357+
mode:sgml
358+
sgml-omittag:nil
359+
sgml-shorttag:t
360+
sgml-minimize-attributes:nil
361+
sgml-always-quote-attributes:t
362+
sgml-indent-step:1
363+
sgml-indent-data:t
364+
sgml-parent-document:nil
365+
sgml-default-dtd-file:"./reference.ced"
366+
sgml-exposed-tags:nil
367+
sgml-local-catalogs:("/usr/lib/sgml/catalog")
368+
sgml-local-ecat-files:nil
369+
End:
370+
-->

doc/src/sgml/postgres.sgml

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,5 @@
11
<!--
2-
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.74 2005/02/13 03:04:15 tgl Exp $
2+
$PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.75 2005/02/27 00:49:28 momjian Exp $
33
-->
44

55
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook V4.2//EN" [
@@ -239,6 +239,7 @@ $PostgreSQL: pgsql/doc/src/sgml/postgres.sgml,v 1.74 2005/02/13 03:04:15 tgl Exp
239239
&gist;
240240
&storage;
241241
&bki;
242+
&planstats;
242243

243244
</part>
244245

0 commit comments

Comments
 (0)