@@ -446,6 +446,160 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
446
446
in <filename>src/backend/utils/adt/selfuncs.c</filename>.
447
447
</para>
448
448
449
+ <sect2 id="functional-dependencies">
450
+ <title>Functional Dependencies</title>
451
+
452
+ <para>
453
+ The simplest type of extended statistics are functional dependencies,
454
+ used in definitions of database normal forms. When simplified, saying that
455
+ <literal>b</> is functionally dependent on <literal>a</> means that
456
+ knowledge of value of <literal>a</> is sufficient to determine value of
457
+ <literal>b</>.
458
+ </para>
459
+
460
+ <para>
461
+ In normalized databases, only functional dependencies on primary keys
462
+ and superkeys are allowed. However, in practice, many data sets are not
463
+ fully normalized, for example, due to intentional denormalization for
464
+ performance reasons.
465
+ </para>
466
+
467
+ <para>
468
+ Functional dependencies directly affect accuracy of the estimates, as
469
+ conditions on the dependent column(s) do not restrict the result set,
470
+ resulting in underestimates.
471
+ </para>
472
+
473
+ <para>
474
+ To inform the planner about the functional dependencies, we collect
475
+ measurements of dependency during <command>ANALYZE</>. Assessing
476
+ dependency between all sets of columns would be prohibitively
477
+ expensive, so we limit our search to potential dependencies defined
478
+ using the <command>CREATE STATISTICS</> command.
479
+
480
+ <programlisting>
481
+ CREATE TABLE t (a INT, b INT);
482
+ INSERT INTO t SELECT i/100, i/100 FROM generate_series(1,10000) s(i);
483
+ CREATE STATISTICS s1 WITH (dependencies) ON (a, b) FROM t;
484
+ ANALYZE t;
485
+ EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
486
+ QUERY PLAN
487
+ -------------------------------------------------------------------------------------------------
488
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=0.095..3.118 rows=100 loops=1)
489
+ Filter: ((a = 1) AND (b = 1))
490
+ Rows Removed by Filter: 9900
491
+ Planning time: 0.367 ms
492
+ Execution time: 3.380 ms
493
+ (5 rows)
494
+ </programlisting>
495
+
496
+ The planner is now aware of the functional dependencies and considers
497
+ them when computing the selectivity of the second condition. Running
498
+ the query without the statistics would lead to quite different estimates.
499
+
500
+ <programlisting>
501
+ DROP STATISTICS s1;
502
+ EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 1;
503
+ QUERY PLAN
504
+ -----------------------------------------------------------------------------------------------
505
+ Seq Scan on t (cost=0.00..195.00 rows=1 width=8) (actual time=0.000..6.379 rows=100 loops=1)
506
+ Filter: ((a = 1) AND (b = 1))
507
+ Rows Removed by Filter: 9900
508
+ Planning time: 0.000 ms
509
+ Execution time: 6.379 ms
510
+ (5 rows)
511
+ </programlisting>
512
+ </para>
513
+
514
+ <para>
515
+ If no dependency exists, the collected statistics do not influence the
516
+ query plan. The only effect is to slow down <command>ANALYZE</>. Should
517
+ partial dependencies exist these will also be stored and applied
518
+ during planning.
519
+ </para>
520
+
521
+ <para>
522
+ Similarly to per-column statistics, extended statistics are stored in
523
+ a system catalog called <structname>pg_statistic_ext</structname>, but
524
+ there is also a more convenient view <structname>pg_stats_ext</structname>.
525
+ To inspect the statistics <literal>s1</literal> defined above,
526
+ you may do this:
527
+
528
+ <programlisting>
529
+ SELECT tablename, staname, attnums, depsbytes
530
+ FROM pg_stats_ext WHERE staname = 's1';
531
+ tablename | staname | attnums | depsbytes
532
+ -----------+---------+---------+-----------
533
+ t | s1 | 1 2 | 40
534
+ (1 row)
535
+ </programlisting>
536
+
537
+ This shows that the statistics are defined on table <structname>t</>,
538
+ <structfield>attnums</structfield> lists attribute numbers of columns
539
+ (references <structname>pg_attribute</structname>). It also shows
540
+ the length in bytes of the functional dependencies, as found by
541
+ <command>ANALYZE</> when serialized into a <literal>bytea</> column.
542
+ </para>
543
+
544
+ <para>
545
+ When computing the selectivity, the planner inspects all conditions and
546
+ attempts to identify which conditions are already implied by other
547
+ conditions. The selectivity estimates from any redundant conditions are
548
+ ignored from a selectivity point of view. In the example query above,
549
+ the selectivity estimates for either of the conditions may be eliminated,
550
+ thus improving the overall estimate.
551
+ </para>
552
+
553
+ <sect3 id="functional-dependencies-limitations">
554
+ <title>Limitations of functional dependencies</title>
555
+
556
+ <para>
557
+ Functional dependencies are a very simple type of statistics, and
558
+ as such have several limitations. The first limitation is that they
559
+ only work with simple equality conditions, comparing columns and constant
560
+ values. It's not possible to use them to eliminate equality conditions
561
+ comparing two columns or a column to an expression, range clauses,
562
+ <literal>LIKE</> or any other type of conditions.
563
+ </para>
564
+
565
+ <para>
566
+ When eliminating the implied conditions, the planner assumes that the
567
+ conditions are compatible. Consider the following example, violating
568
+ this assumption:
569
+
570
+ <programlisting>
571
+ EXPLAIN ANALYZE SELECT * FROM t WHERE a = 1 AND b = 10;
572
+ QUERY PLAN
573
+ -----------------------------------------------------------------------------------------------
574
+ Seq Scan on t (cost=0.00..195.00 rows=100 width=8) (actual time=2.992..2.992 rows=0 loops=1)
575
+ Filter: ((a = 1) AND (b = 10))
576
+ Rows Removed by Filter: 10000
577
+ Planning time: 0.232 ms
578
+ Execution time: 3.033 ms
579
+ (5 rows)
580
+ </programlisting>
581
+
582
+ While there are no rows with such combination of values, the planner
583
+ is unable to verify whether the values match - it only knows that
584
+ the columns are functionally dependent.
585
+ </para>
586
+
587
+ <para>
588
+ This assumption is more about queries executed on the database - in many
589
+ cases, it's actually satisfied (e.g. when the GUI only allows selecting
590
+ compatible values). But if that's not the case, functional dependencies
591
+ may not be a viable option.
592
+ </para>
593
+
594
+ <para>
595
+ For additional information about functional dependencies, see
596
+ <filename>src/backend/statistics/README.dependencies</>.
597
+ </para>
598
+
599
+ </sect3>
600
+
601
+ </sect2>
602
+
449
603
</sect1>
450
604
451
605
</chapter>
0 commit comments