Skip to content

Commit 919f6d7

Browse files
committed
Improve multivariate statistics documentation
Extended statistics commit 7b504eb did not include appropriate documentation next to where we document regular planner statistics (I ripped what was submitted before commit and then forgot to put it back), and while later commit 2686ee1 added some material, it structurally depended on what I had ripped out, so the end result wasn't proper. Fix those problems by shuffling what was added by 2686ee1 and including some additional material, so that now chapter 14 "Performance Tips" now describes the types of multivariate statistics we currently have, and chapter 68 "How the Planner Uses Statistics" shows some examples. The new text should be more in line with previous material, in (hopefully) the appropriate depth. While at it, fix a small bug in pg_statistic_ext docs: one column was listed in the wrong spot.
1 parent 8bcb31a commit 919f6d7

File tree

3 files changed

+311
-134
lines changed

3 files changed

+311
-134
lines changed

doc/src/sgml/catalogs.sgml

Lines changed: 11 additions & 11 deletions
Original file line numberDiff line numberDiff line change
@@ -4320,6 +4320,17 @@
43204320
<entry>Owner of the statistic</entry>
43214321
</row>
43224322

4323+
<row>
4324+
<entry><structfield>stxkeys</structfield></entry>
4325+
<entry><type>int2vector</type></entry>
4326+
<entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
4327+
<entry>
4328+
This is an array of values that indicate which table columns this
4329+
statistic covers. For example a value of <literal>1 3</literal> would
4330+
mean that the first and the third table columns make up the statistic key.
4331+
</entry>
4332+
</row>
4333+
43234334
<row>
43244335
<entry><structfield>stxkind</structfield></entry>
43254336
<entry><type>char[]</type></entry>
@@ -4332,17 +4343,6 @@
43324343
</entry>
43334344
</row>
43344345

4335-
<row>
4336-
<entry><structfield>stxkeys</structfield></entry>
4337-
<entry><type>int2vector</type></entry>
4338-
<entry><literal><link linkend="catalog-pg-attribute"><structname>pg_attribute</structname></link>.attnum</literal></entry>
4339-
<entry>
4340-
This is an array of values that indicate which table columns this
4341-
statistic covers. For example a value of <literal>1 3</literal> would
4342-
mean that the first and the third table columns make up the statistic key.
4343-
</entry>
4344-
</row>
4345-
43464346
<row>
43474347
<entry><structfield>stxndistinct</structfield></entry>
43484348
<entry><type>pg_ndistinct</type></entry>

doc/src/sgml/perform.sgml

Lines changed: 195 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -906,6 +906,8 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000
906906
<secondary>of the planner</secondary>
907907
</indexterm>
908908

909+
<sect2>
910+
<title>Single-Column Statistics</title>
909911
<para>
910912
As we saw in the previous section, the query planner needs to estimate
911913
the number of rows retrieved by a query in order to make good choices
@@ -1043,7 +1045,200 @@ WHERE tablename = 'road';
10431045
Further details about the planner's use of statistics can be found in
10441046
<xref linkend="planner-stats-details">.
10451047
</para>
1048+
</sect2>
1049+
1050+
<sect2 id="planner-stats-extended">
1051+
<title>Extended Statistics</title>
1052+
1053+
<indexterm zone="planner-stats-extended">
1054+
<primary>statistics</primary>
1055+
<secondary>of the planner</secondary>
1056+
</indexterm>
10461057

1058+
<indexterm>
1059+
<primary>correlation</primary>
1060+
<secondary>in the query planner</secondary>
1061+
</indexterm>
1062+
1063+
<indexterm>
1064+
<primary>pg_statistic_ext</primary>
1065+
</indexterm>
1066+
1067+
<para>
1068+
It is common to see slow queries running bad execution plans because
1069+
multiple columns used in the query clauses are correlated.
1070+
The planner normally assumes that multiple conditions
1071+
are independent of each other,
1072+
an assumption that does not hold when column values are correlated.
1073+
Regular statistics, because of their per-individual-column nature,
1074+
do not capture the knowledge of cross-column correlation;
1075+
<firstterm>multivariate statistics</firstterm> can be used to instruct
1076+
the server to obtain statistics across such a set of columns,
1077+
which are later used by the query optimizer
1078+
to determine cardinality and selectivity
1079+
of clauses involving those columns.
1080+
Multivariate statistics are currently the only use of
1081+
<firstterm>extended statistics</firstterm>.
1082+
</para>
1083+
1084+
<para>
1085+
Extended statistics are created using
1086+
<xref linkend="sql-createstatistics">, which see for more details.
1087+
Data collection is deferred until the next <command>ANALYZE</command>
1088+
on the table, after which the stored values can be examined in the
1089+
<link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
1090+
catalog.
1091+
</para>
1092+
1093+
<para>
1094+
The following subsections describe the types of extended statistics
1095+
that are currently supported.
1096+
</para>
1097+
1098+
<sect3>
1099+
<title>Functional Dependencies</title>
1100+
1101+
<para>
1102+
The simplest type of extended statistics are functional dependencies,
1103+
a concept used in definitions of database normal forms.
1104+
Put simply, it is said that column <literal>b</> is functionally
1105+
dependent on column <literal>a</> if knowledge of the value of
1106+
<literal>a</> is sufficient to determine the value of <literal>b</>.
1107+
In normalized databases, functional dependencies are allowed only on
1108+
primary keys and superkeys. However, many data sets are in practice not
1109+
fully normalized for various reasons; intentional denormalization for
1110+
performance reasons is a common example.
1111+
</para>
1112+
1113+
<para>
1114+
The existance of functional dependencies directly affects the accuracy
1115+
of estimates in certain queries.
1116+
The reason is that conditions on the dependent columns do not
1117+
restrict the result set, but the query planner (lacking functional
1118+
dependency knowledge) considers them independent, resulting in
1119+
underestimates.
1120+
To inform the planner about the functional dependencies, we collect
1121+
measurements of dependency during <command>ANALYZE</>. Assessing
1122+
the degree of dependency between all sets of columns would be
1123+
prohibitively expensive, so the search is limited to potential
1124+
dependencies defined using the <literal>dependencies</> option of
1125+
extended statistics. It is advisable to create
1126+
<literal>dependencies</> statistics if and only if functional
1127+
dependencies actually exist, to avoid unnecessary overhead on both
1128+
<command>ANALYZE</> and query planning.
1129+
</para>
1130+
1131+
<para>
1132+
To inspect functional dependencies on a statistics
1133+
<literal>stts</literal>, you may do this:
1134+
<programlisting>
1135+
CREATE STATISTICS stts WITH (dependencies)
1136+
ON (zip, city) FROM zipcodes;
1137+
ANALYZE zipcodes;
1138+
SELECT stxname, stxkeys, stxdependencies
1139+
FROM pg_statistic_ext
1140+
WHERE stxname = 'stts';
1141+
stxname | stxkeys | stxdependencies
1142+
---------+---------+--------------------------------------------
1143+
stts | 1 5 | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
1144+
(1 row)
1145+
</programlisting>
1146+
where it can be seen that column 1 (a zip code) fully determines column
1147+
5 (city) so the coefficient is 1.0, while city only determines zip code
1148+
about 42% of the time, meaning that there are many cities (58%) that are
1149+
represented by more than a single ZIP code.
1150+
</para>
1151+
1152+
<para>
1153+
When computing the selectivity, the planner inspects all conditions and
1154+
attempts to identify which conditions are already implied by other
1155+
conditions. The selectivity estimates from any redundant conditions are
1156+
ignored from a selectivity point of view. In the example query above,
1157+
the selectivity estimates for either of the conditions may be eliminated,
1158+
thus improving the overall estimate.
1159+
</para>
1160+
1161+
<sect4>
1162+
<title>Limitations of Functional Dependencies</title>
1163+
1164+
<para>
1165+
Functional dependencies are a very simple type of statistics, and
1166+
as such have several limitations. The first limitation is that they
1167+
only work with simple equality conditions, comparing columns and constant
1168+
values. It's not possible to use them to eliminate equality conditions
1169+
comparing two columns or a column to an expression, range clauses,
1170+
<literal>LIKE</> or any other type of conditions.
1171+
</para>
1172+
1173+
<para>
1174+
When eliminating the implied conditions, the planner assumes that the
1175+
conditions are compatible. Consider the following example, where
1176+
this assumption does not hold:
1177+
1178+
<programlisting>
1179+
EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
1180+
QUERY PLAN
1181+
-----------------------------------------------------------------------------
1182+
Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
1183+
Filter: ((a = 1) AND (b = 10))
1184+
Rows Removed by Filter: 10000
1185+
</programlisting>
1186+
1187+
While there are no rows with such combination of values, the planner
1188+
is unable to verify whether the values match &mdash; it only knows that
1189+
the columns are functionally dependent.
1190+
</para>
1191+
1192+
<para>
1193+
This assumption is related to queries executed on the database; in many
1194+
cases, it's actually satisfied (e.g. when the GUI only allows selecting
1195+
compatible values). But if that's not the case, functional dependencies
1196+
may not be a viable option.
1197+
</para>
1198+
</sect4>
1199+
</sect3>
1200+
1201+
<sect3>
1202+
<title>Multivariate N-Distinct Coefficients</title>
1203+
1204+
<para>
1205+
Single-column statistics store the number of distinct values in each
1206+
column. Estimates of the number of distinct values on more than one
1207+
column (for example, for <literal>GROUP BY a, b</literal>) are
1208+
frequently wrong when the planner only has single-column statistical
1209+
data, however, causing it to select bad plans.
1210+
In order to improve n-distinct estimation when multiple columns are
1211+
grouped together, the <literal>ndistinct</> option of extended statistics
1212+
can be used, which instructs <command>ANALYZE</> to collect n-distinct
1213+
estimates for all possible combinations of two or more columns of the set
1214+
of columns in the statistics object (the per-column estimates are already
1215+
available in <structname>pg_statistic</>).
1216+
</para>
1217+
1218+
<para>
1219+
Continuing the above example, the n-distinct coefficients in a ZIP
1220+
code table may look like the following:
1221+
<programlisting>
1222+
CREATE STATISTICS stts2 WITH (ndistinct)
1223+
ON (zip, state, city) FROM zipcodes;
1224+
ANALYZE zipcodes;
1225+
SELECT stxkeys AS k, stxndistinct AS nd
1226+
FROM pg_statistic_ext
1227+
WHERE stxname = 'stts2';
1228+
-[ RECORD 1 ]---------------------------------------------
1229+
k | 1 2 5
1230+
nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
1231+
{(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
1232+
(1 row)
1233+
</programlisting>
1234+
which indicates that there are three combinations of columns that
1235+
have 33178 distinct values: ZIP code and state; ZIP code and city;
1236+
and ZIP code, city and state (the fact that they are all equal is
1237+
expected given the nature of ZIP-code data). On the other hand,
1238+
the combination of city and state only has 27435 distinct values.
1239+
</para>
1240+
</sect3>
1241+
</sect2>
10471242
</sect1>
10481243

10491244
<sect1 id="explicit-joins">

0 commit comments

Comments
 (0)