@@ -906,6 +906,8 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
906
906
<secondary>of the planner</secondary>
907
907
</indexterm>
908
908
909
+ <sect2>
910
+ <title>Single-Column Statistics</title>
909
911
<para>
910
912
As we saw in the previous section, the query planner needs to estimate
911
913
the number of rows retrieved by a query in order to make good choices
@@ -1043,7 +1045,200 @@ WHERE tablename = 'road';
1043
1045
Further details about the planner's use of statistics can be found in
1044
1046
<xref linkend="planner-stats-details">.
1045
1047
</para>
1048
+ </sect2>
1049
+
1050
+ <sect2 id="planner-stats-extended">
1051
+ <title>Extended Statistics</title>
1052
+
1053
+ <indexterm zone="planner-stats-extended">
1054
+ <primary>statistics</primary>
1055
+ <secondary>of the planner</secondary>
1056
+ </indexterm>
1046
1057
1058
+ <indexterm>
1059
+ <primary>correlation</primary>
1060
+ <secondary>in the query planner</secondary>
1061
+ </indexterm>
1062
+
1063
+ <indexterm>
1064
+ <primary>pg_statistic_ext</primary>
1065
+ </indexterm>
1066
+
1067
+ <para>
1068
+ It is common to see slow queries running bad execution plans because
1069
+ multiple columns used in the query clauses are correlated.
1070
+ The planner normally assumes that multiple conditions
1071
+ are independent of each other,
1072
+ an assumption that does not hold when column values are correlated.
1073
+ Regular statistics, because of their per-individual-column nature,
1074
+ do not capture the knowledge of cross-column correlation;
1075
+ <firstterm>multivariate statistics</firstterm> can be used to instruct
1076
+ the server to obtain statistics across such a set of columns,
1077
+ which are later used by the query optimizer
1078
+ to determine cardinality and selectivity
1079
+ of clauses involving those columns.
1080
+ Multivariate statistics are currently the only use of
1081
+ <firstterm>extended statistics</firstterm>.
1082
+ </para>
1083
+
1084
+ <para>
1085
+ Extended statistics are created using
1086
+ <xref linkend="sql-createstatistics">, which see for more details.
1087
+ Data collection is deferred until the next <command>ANALYZE</command>
1088
+ on the table, after which the stored values can be examined in the
1089
+ <link linkend="catalog-pg-statistic-ext"><structname>pg_statistic_ext</structname></link>
1090
+ catalog.
1091
+ </para>
1092
+
1093
+ <para>
1094
+ The following subsections describe the types of extended statistics
1095
+ that are currently supported.
1096
+ </para>
1097
+
1098
+ <sect3>
1099
+ <title>Functional Dependencies</title>
1100
+
1101
+ <para>
1102
+ The simplest type of extended statistics are functional dependencies,
1103
+ a concept used in definitions of database normal forms.
1104
+ Put simply, it is said that column <literal>b</> is functionally
1105
+ dependent on column <literal>a</> if knowledge of the value of
1106
+ <literal>a</> is sufficient to determine the value of <literal>b</>.
1107
+ In normalized databases, functional dependencies are allowed only on
1108
+ primary keys and superkeys. However, many data sets are in practice not
1109
+ fully normalized for various reasons; intentional denormalization for
1110
+ performance reasons is a common example.
1111
+ </para>
1112
+
1113
+ <para>
1114
+ The existance of functional dependencies directly affects the accuracy
1115
+ of estimates in certain queries.
1116
+ The reason is that conditions on the dependent columns do not
1117
+ restrict the result set, but the query planner (lacking functional
1118
+ dependency knowledge) considers them independent, resulting in
1119
+ underestimates.
1120
+ To inform the planner about the functional dependencies, we collect
1121
+ measurements of dependency during <command>ANALYZE</>. Assessing
1122
+ the degree of dependency between all sets of columns would be
1123
+ prohibitively expensive, so the search is limited to potential
1124
+ dependencies defined using the <literal>dependencies</> option of
1125
+ extended statistics. It is advisable to create
1126
+ <literal>dependencies</> statistics if and only if functional
1127
+ dependencies actually exist, to avoid unnecessary overhead on both
1128
+ <command>ANALYZE</> and query planning.
1129
+ </para>
1130
+
1131
+ <para>
1132
+ To inspect functional dependencies on a statistics
1133
+ <literal>stts</literal>, you may do this:
1134
+ <programlisting>
1135
+ CREATE STATISTICS stts WITH (dependencies)
1136
+ ON (zip, city) FROM zipcodes;
1137
+ ANALYZE zipcodes;
1138
+ SELECT stxname, stxkeys, stxdependencies
1139
+ FROM pg_statistic_ext
1140
+ WHERE stxname = 'stts';
1141
+ stxname | stxkeys | stxdependencies
1142
+ ---------+---------+--------------------------------------------
1143
+ stts | 1 5 | [{1 => 5 : 1.000000}, {5 => 1 : 0.423130}]
1144
+ (1 row)
1145
+ </programlisting>
1146
+ where it can be seen that column 1 (a zip code) fully determines column
1147
+ 5 (city) so the coefficient is 1.0, while city only determines zip code
1148
+ about 42% of the time, meaning that there are many cities (58%) that are
1149
+ represented by more than a single ZIP code.
1150
+ </para>
1151
+
1152
+ <para>
1153
+ When computing the selectivity, the planner inspects all conditions and
1154
+ attempts to identify which conditions are already implied by other
1155
+ conditions. The selectivity estimates from any redundant conditions are
1156
+ ignored from a selectivity point of view. In the example query above,
1157
+ the selectivity estimates for either of the conditions may be eliminated,
1158
+ thus improving the overall estimate.
1159
+ </para>
1160
+
1161
+ <sect4>
1162
+ <title>Limitations of Functional Dependencies</title>
1163
+
1164
+ <para>
1165
+ Functional dependencies are a very simple type of statistics, and
1166
+ as such have several limitations. The first limitation is that they
1167
+ only work with simple equality conditions, comparing columns and constant
1168
+ values. It's not possible to use them to eliminate equality conditions
1169
+ comparing two columns or a column to an expression, range clauses,
1170
+ <literal>LIKE</> or any other type of conditions.
1171
+ </para>
1172
+
1173
+ <para>
1174
+ When eliminating the implied conditions, the planner assumes that the
1175
+ conditions are compatible. Consider the following example, where
1176
+ this assumption does not hold:
1177
+
1178
+ <programlisting>
1179
+ EXPLAIN (ANALYZE, TIMING OFF) SELECT * FROM t WHERE a = 1 AND b = 10;
1180
+ QUERY PLAN
1181
+ -----------------------------------------------------------------------------
1182
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual rows=0 loops=1)
1183
+ Filter: ((a = 1) AND (b = 10))
1184
+ Rows Removed by Filter: 10000
1185
+ </programlisting>
1186
+
1187
+ While there are no rows with such combination of values, the planner
1188
+ is unable to verify whether the values match — it only knows that
1189
+ the columns are functionally dependent.
1190
+ </para>
1191
+
1192
+ <para>
1193
+ This assumption is related to queries executed on the database; in many
1194
+ cases, it's actually satisfied (e.g. when the GUI only allows selecting
1195
+ compatible values). But if that's not the case, functional dependencies
1196
+ may not be a viable option.
1197
+ </para>
1198
+ </sect4>
1199
+ </sect3>
1200
+
1201
+ <sect3>
1202
+ <title>Multivariate N-Distinct Coefficients</title>
1203
+
1204
+ <para>
1205
+ Single-column statistics store the number of distinct values in each
1206
+ column. Estimates of the number of distinct values on more than one
1207
+ column (for example, for <literal>GROUP BY a, b</literal>) are
1208
+ frequently wrong when the planner only has single-column statistical
1209
+ data, however, causing it to select bad plans.
1210
+ In order to improve n-distinct estimation when multiple columns are
1211
+ grouped together, the <literal>ndistinct</> option of extended statistics
1212
+ can be used, which instructs <command>ANALYZE</> to collect n-distinct
1213
+ estimates for all possible combinations of two or more columns of the set
1214
+ of columns in the statistics object (the per-column estimates are already
1215
+ available in <structname>pg_statistic</>).
1216
+ </para>
1217
+
1218
+ <para>
1219
+ Continuing the above example, the n-distinct coefficients in a ZIP
1220
+ code table may look like the following:
1221
+ <programlisting>
1222
+ CREATE STATISTICS stts2 WITH (ndistinct)
1223
+ ON (zip, state, city) FROM zipcodes;
1224
+ ANALYZE zipcodes;
1225
+ SELECT stxkeys AS k, stxndistinct AS nd
1226
+ FROM pg_statistic_ext
1227
+ WHERE stxname = 'stts2';
1228
+ -[ RECORD 1 ]---------------------------------------------
1229
+ k | 1 2 5
1230
+ nd | [{(b 1 2), 33178.000000}, {(b 1 5), 33178.000000},
1231
+ {(b 2 5), 27435.000000}, {(b 1 2 5), 33178.000000}]
1232
+ (1 row)
1233
+ </programlisting>
1234
+ which indicates that there are three combinations of columns that
1235
+ have 33178 distinct values: ZIP code and state; ZIP code and city;
1236
+ and ZIP code, city and state (the fact that they are all equal is
1237
+ expected given the nature of ZIP-code data). On the other hand,
1238
+ the combination of city and state only has 27435 distinct values.
1239
+ </para>
1240
+ </sect3>
1241
+ </sect2>
1047
1242
</sect1>
1048
1243
1049
1244
<sect1 id="explicit-joins">
0 commit comments