Academia.eduAcademia.edu

Efficient Maintenance of Materialized Outer-Join Views

2007, 2007 IEEE 23rd International Conference on Data Engineering

Queries containing outer joins are common in data warehousing applications. Materialized outer-join views could greatly speed up many such queries but most database systems do not allow outer joins in materialized views. In part, this is because outer-join views could not previously be maintained efficiently when base tables are updated. In this paper we show how to efficiently maintain general outer-join views, that is, views composed of selection, projection, inner and outer joins. Foreign-key constraints are exploited to reduce maintenance overhead. Experimental results show that maintaining an outer-join view need not be more expensive than maintaining an inner-join view.

Efficient Maintenance of Materialized Outer-Join Views Per-Åke Larson Jingren Zhou Microsoft Research palarson@microsoft.com Microsoft Research jrzhou@microsoft.com Abstract Example 1 Suppose we create a view, oj view, as shown below. The view consists of outer joins of the tables part, lineitem and orders from the TPC-H database [10]. Recall that p partkey is the primary key of part and o orderkey is the primary key of orders. There is a foreign key constraint between lineitem and part, and also one between lineitem and orders. Queries containing outer joins are common in data warehousing applications. Materialized outer-join views could greatly speed up many such queries but most database systems do not allow outer joins in materialized views. In part, this is because outer-join views could not previously be maintained efficiently when base tables are updated. In this paper we show how to efficiently maintain general outer-join views, that is, views composed of selection, projection, inner and outer joins. Foreign-key constraints are exploited to reduce maintenance overhead. Experimental results show that maintaining an outer-join view need not be more expensive than maintaining an inner-join view. 1 create view oj view as select p partkey, p name, p retailprice, o orderkey, o custkey, l linenumber, l quantity, l extendedprice from part full outer join (orders left outer join lineitem on l orderkey=o orderkey) on p partkey=l partkey We first analyze what types of tuples the view may contain. The join between orders and lineitem may output tuples of two types: {orders, lineitem} and {orders}. Each lineitem tuple joins with exactly one orders tuple (because of the foreign key from l orderkey to o orderkey) and produces an {orders, lineitem} tuple. orders-only tuples are orphaned orders (no matching lineitem tuples), which occur in the result null-extended on all lineitem columns. Now consider what happens with the tuples in the second join. Because of the foreign key from lineitem to part, every {orders, lineitem} tuple will join with a part tuple, producing a {part, orders, lineitem} tuple. All orphaned order tuples are null extended on the join column l partkey so they will not join with any part tuples. However, the full outer join retains the orphaned order tuples. Similarly, a part may not join with any lineitem tuples but such orphaned part tuples are retained by the outer join. In summary, the view may contain tuples of three types:{part, orders, lineitem}, {orders}, and {part}. Now suppose we insert new tuples into the part table. The view can then be brought up to date simply by inserting the new tuples, appropriately extended with nulls, into the view. Nothing more is required because the foreign key constraint between lineitem and part guarantees that a new part tuple cannot join with any lineitem tuples. If it did, the joining lineitem tuples would have violated the foreign key constraint. Insertions into the orders table can be handled in the same way. Next consider insertions into the lineitem table. Suppose the new lineitems are contained in a table new lineitems. The view can then be updated using the following sequence of statements. Introduction Queries containing outer joins are common in OLAP applications, typically joining a fact table with some number of dimension tables followed by aggregation. Outer-join queries are also used for constructing tree-structured objects (e.g. XML) from data stored in flat tables. Outer joins are needed so we can also retain objects that lack some subobjects. Materialized views can speed up query processing greatly, but to realize the benefits two subproblems must be solved: view matching and incremental view maintenance. The goal of view matching is to determine, at optimization time, whether and how part or all of a query can be computed from a view. Incremental view maintenance is required to efficiently bring a view up to date when base tables are updated. View matching and efficient incremental view maintenance algorithms for SPJG views, that is, views composed of selection, projection and inner joins with an optional group-by on top, are well understood. Our goal is to extend view support to SPOJG view, that is, allow views to also contain outer joins. We introduced a view matching algorithm for outer-join views in a previous paper [6]. In this paper, we introduce an efficient incremental maintenance procedure. We show that maintenance can be divided into two steps: computing and applying a primary delta and a secondary delta. The first step is very similar to maintaining an inner-join view. The second step is a “clean-up” step and we show how to perform this step efficiently. We also describe how foreign key constraints can be exploited to reduce maintenance overhead. 1 select p partkey, p name, p retailprice, o orderkey, o custkey, l linenumber, l quantity, l extendedprice into #delta1 from new lineitems, orders, part where l orderkey = o orderkey and l partkey = p partkey insert into oj view -- apply primary delta -select * from #delta1 delete from oj view -- apply secondary delta -where l linenumber is null and (p partkey in (select p partkey from #delta1) or o orderkey in (select o orderkey from #delta1)) The first statement computes the set of tuples to be inserted into the view and saves them in a temporary table. The second statement adds the new tuples into the view. The new lineitem tuples may cause some orphaned part or orders tuples to be eliminated from the view. The third statement deletes all orphaned part and orders tuples, if any, that cease to be orphans because of the insertions Two earlier papers [2, 5] describe algorithms for incremental maintenance of outer join views. However, the algorithm in [2] is significantly more expensive than ours and the algorithm in [5] is incorrect. Oracle [8] supports materialized outer-join views but with many restrictions and only a limited class of outer-join views are incrementally maintainable. We discuss related work in more detail in Section 8. The rest of the paper is organized as follows. Section 2 contains preliminary material and introduces concepts used in later sections. The overall maintenance procedure is described in Section 3. We show how to efficiently compute the primary delta in Section 4 and the secondary delta in Section 5. Foreign-key constraints are considered in Section 6. We report experimental results in Section 7 and describe related work in Section 8. Due to space limitations, we have state our results without proofs. Derivations and proofs can be found in [7]. 2 Preliminaries We assume that base tables and views satisfy the following restrictions: every base table has a unique key that does not contain nulls; a view can reference the same table only once (no self-joins); every view outputs a unique key (no duplicates) ; and all predicates of a view are null-rejecting (defined below). 2.1 Definitions and Notation The selection operator is denoted in the normal way as σp where p is a predicate. A predicate p referencing some set S of columns is said to be strong or null-rejecting if it evaluates to false on a tuple as soon as one of the columns in S is null. Projection (without duplicate elimination) is denoted by πc where c is a list of columns. Borrowing from SQL, we use the shorthand T.∗ to denote all columns of table T . We also need an operator that removes duplicates, which we denote by δ. A schema S is a set of columns. Let T1 and T2 be tables with schemas S1 and S2 , respectively. The outer union, denoted by T1 ⊎ T2 , first null-extends (pads with nulls) the tuples of each operand to schema S1 ∪ S2 and then takes the union of the results (without duplicate elimination). A tuple t1 is said to subsume a tuple t2 if they are defined on the same schema, t1 agrees with t2 on all columns where they both are non-null and t1 contains fewer null values than t2 . The operator removal of subsumed tuples of T , denoted by T ↓, returns the tuples of T that are not subsumed by any other tuple in T . The minimum union of tables T1 and T2 is defined as T1 ⊕ T2 = (T1 ⊎ T1 )↓. It can be shown that minimum union is both commutative and associative. Let T1 and T2 be tables with disjoint schemas S1 and S2 , respectively, and p a predicate referencing some subset of the columns in (S1 ∪ S2 ). The left semijoin is defined as ⋉ls T1 ⋊ p T2 = {t1 |t1 ∈ T1 , (∃ t2 ∈ T2 |p (t1 , t2 ))}, that is, a tuple in T1 qualifies if it joins with some tuple in T2 . The left anti(semi)join is T1 ⋊ ⋉la p T2 = {t1 |t1 ∈ T1 , (∄ t2 ∈ T2 |p (t1 , t2 ))}, that is, a tuple in T1 qualifies if it does not ⋉lo join with any tuple in T2 . The left outer join is T1 ⋊ p T2 = ⋉p T2 ⊕ T1 . The right outer join is T1 ⋊ ⋉ro T1 ⋊ p T2 = ⋉p ⋉p T2 ⊕ T2 . The full outer join is T1 ⋊ ⋉fp o T2 = T1 ⋊ T1 ⋊ T2 ⊕ T1 ⊕ T2 . We will make use of a special predicate null(T ) that evaluates to true if a tuple is null-extended on table T . null(T ) can be implemented in SQL as “T.c is null” where c is any column of T that does not contain nulls, for example, a column of a key. When applying null and ¬null to T2 , · · · , Tn }, we use the shorta set of tables T = {T1 , hand notation n(T ) = Ti ∈T null(Ti ) and nn(T ) =  ¬null(T ). i Ti ∈T 2.2 Join-Disjunctive Normal Form Our derivation of view maintenance expressions builds on the join-disjunctive normal form for SPOJ expressions introduced by Galindo-Legaria [1]. We briefly describe the normal form by an example, but refer the reader to [1, 6] for more details and algorithms. Example 2 We will use the following view as a running example throughout the paper o o V1 = (R ⋊ ⋉fp(r,s) S) ⋊ ⋉lo ⋉fp(t,u) U ). p(r,t) (T ⋊ (1) All four tables have a unique key and all predicates are null-rejecting. The notation p(r, s) means a predicate over columns from tables R and S, and similarly for the other predicates. We convert the view expression to normal form bottom up. We first rewrite the join between R and S and the join between T and U in terms of inner joins and minimum union, which results in V1 = (σp(r,s) (R × S) ⊕ R ⊕ S) ⋊ ⋉lo p(r,t) (σp(t,u) (T × U ) ⊕ T ⊕ U ). To convert the remaining outer join, we “multiply” the two input expressions, that is, we consider every combination of a term from the left operand and a term from the right operand. Tuples from σp(t,u) (T × U ) may join with tuples from σp(r,s) (R × S) producing tuples that satisfy σp(r,s)∧p(r,t)∧p(t,u) (T × U × R × S). Similarly, tuples from σp(t,u) (T ×U ) may join with tuples from term R, producing a term σp(r,t)∧p(t,u) (T × U × R). However, tuples from σp(t,u) (T × U ) do not join with tuples from term S because tuples from the S term are null extended on R and the join predicate p(r, t) is null rejecting. We continue the “multiplication” process with other terms and, because the join is a left outer join, also add the terms from the left input. This produces the following normal form of the view V1 = σp(r,s)∧p(r,t)∧p(t,u) (T × U × R × S)⊕ σp(r,t)∧p(t,u) (T × U × R) ⊕ σp(r,t)∧p(r,s) (T × R × S)⊕ σp(r,t) (T × R) ⊕ σp(r,s) (R × S) ⊕ R ⊕ S As illustrated by this example, an SPOJ expression E over a set of tables U can be converted to a normal form consisting of the minimum union of terms composed from selections and inner joins (but no outer joins). More formally, the join-disjunctive normal form of E equals E = E1 ⊕ E2 ⊕ · · · ⊕ En where each term Ei is of the form Ei = σpi (Ti1 × Ti2 × · · · × Tim ) {T,U,R,S} {T,U,R,S}D {T,U,R} {T,R,S} {T,U,R}D {T,R,S}D {T,R} {T,R}D {R,S} {R} {S} (a) Subsumption {R,S}I {R}I (b) Maintenance (update T ) Figure 1. Subsumption and maintenance graphs for view V1 2.4 Net Contribution of a Term The minimum-union operators in the join-disjunctive normal form have two functions: to eliminate subsumed tuples and to union the remaining tuples. If we first eliminate subsumed tuples from every term, we can replace the minimum unions by outer unions. The resulting form clearly shows what terms are affected by an update and how. Subsumption among terms are not arbitrary; when checking whether a tuple of a term is subsumed, it is sufficient to check against tuples in (immediate) parent terms. The following lemma shows how to eliminate subsumed tuples from a term. Ti1 , Ti2 · · · Tim is a subset of the tables in U. Predicate pi is the conjunction of a subset of the selection and join predicates found in the original form of the query. The derivation of the normal form and a conversion algorithm, can be found in [6]. The algorithm is straightforward and traverses the operator tree once. It exploits nullrejecting predicates and foreign keys to reduce the number of terms. For a tree consisting of N full outer joins, the normal form may contain 2N + N terms in the worst case. In practice, it normally contains far fewer terms. Lemma 1 Let Ei be a term with source set Ti in the normal form E of an SPOJ expression. Then the set of tuples generated by Ei that are not subsumed by any other tuples in E can be computed as 2.3 We call Di the net contribution of term Ei because the tuples of Di are not subsumed by any other tuples and thus appear explicitly in the view result. The Subsumption Graph Suppose the complete set of operand tables for an SPOJ expression is U. Each term in the normal form is defined over a unique subset S of U and hence produces tuples that are null extended on U − S. We call the tables in subset S the term’s source tables. A tuple produced by a term with source table set S can only be subsumed by tuples produced by terms whose source set is a superset of S, see Lemma 2 in [6]. The subsumption relationships among terms can be modeled by a DAG, which we call the subsumption graph. Definition 2.1 Let E = E1 ⊕···⊕En be the join-disjunctive form of an SPOJ expression. The subsumption graph of E contains a node ni for each term Ei in the normal form and the node is labeled with the source table set Si of Ei . There is an edge from node ni to node nj , if Si is a minimal superset of Sj . Si is a minimal superset of Sj if there does not exist a node nk in the graph such that Sj ⊂ Sk ⊂ Si . The subsumption graph for V1 is shown in Figure 1(a). Di = Ei ⋊ ⋉la eq(Ti ) (Ei1 ⊎ Ei2 ⊎ · · · ⊎ Eim ) where Ei1 , Ei2 , · · ·Eim are the parent terms of Ei and eq(Ti ) is an equijoin predicate over columns forming a key of Ei . Theorem 1 Let E be an SPOJ expression with normal form E1 ⊕ E2 ⊕ · · · ⊕ En . Then E = E1 ⊕ E2 ⊕ · · · ⊕ En = D1 ⊎ D2 ⊎ · · · ⊎ Dn where each Di is computed from Ei as defined in Lemma 1. We call the form D1 ⊎ · · · ⊎ Dn the net-contribution form of the expression. Because the terms are connected by (outer) unions, there is no interaction among net contributions from different terms so each term can be maintained independently from other terms. 3 View Maintenance Procedure This section describes our overall maintenance procedure. We consider only with insertions or deletions; an update is treated as a deletion followed by an insertion. 3.1 Terms Affected by an Update Consider a view V and suppose one of the its base tables T is modified. This may change the net contribution of a term Di only if T occurs in the expression defining Di . By inspection of the expression for Di with source table set Ti , it is immediately apparent that the change may affect the result in one of three ways: 1. Directly, which occurs if T is among the tables in Ti ; 2. Indirectly, which occurs if T is not among the tables in Ti but it is among the source tables of at least one of its parent nodes; 3. No effect, otherwise. Based on this classification of how terms are affected, we create a view maintenance graph as follows. 1. Eliminate from the subsumption graph all nodes that are unaffected by the update of T . 2. Mark the remaining nodes by D or I depending on whether the node is affected directly or indirectly. The maintenance graph for view V1 when updating T is shown in Figure 1(b). A node n in the maintenance graph may have multiple parents. We denote the set of parents by par(n). Some of the parents may be directly affected, denoted by pard(n), and some of them may be indirectly affected, denoted by pari(n). If node n is directly affected, pari(n) = ∅. If node n is indirectly affected, pard(n) ≥ 1. The equality par(n) = pard(n) ∪ pari(n) holds by definition. 3.2 Maintenance Procedure Suppose table T has been updated and we need to maintain a view V that references T . We first compute the maintenance graph and classify the terms as directly affected, indirectly affected and unaffected. Without loss of generality, assume that the view has n terms, of which terms 1, 2, · · · , k are directly affected, terms k +1, k +2, · · · , k + m are indirectly affected, and terms k + m + 1, k + m + 2, · · · , n are not affected. We can then rewrite the view expression in the form V =V V D = ⊎ki=1 Di , I D V = I ⊎V ⊎V k+m Di , ⊎i=k+1 U where V U = ⊎n i=k+m+1 Di . From this form of the expression it is obvious that to update the view we need to compute two delta expressions ∆V D = ⊎ki=1 ∆Di , k+m ∆Di . ∆V I = ⊎i=k+1 We call ∆V D the primary delta and ∆V I the secondary delta. In summary, maintenance of a view V after updates to one of its underlying base tables is performed in two steps. 1. If there are directly affected terms, compute the primary delta ∆V D and apply it to the view. 2. If there are indirectly affected terms, compute the secondary delta ∆V I and apply it to the view. If the update is an insert (delete), the primary delta is inserted into (deleted from) the view and the secondary delta is deleted from (inserted into) the view. In the following sections, we describe how to efficiently compute the primary delta and the secondary delta, respectively. 3.3 Aggregation Views An aggregated outer-join view is simply an outer-join view with a group-by on top. Maintaining an aggregated outer-join view is not much more complex then maintaining a non-aggregated view. ∆V D for the non-aggregated part of the view is computed in the same way. The result is then aggregated as specified in the view definition and applied to the view in the same way as for aggregated inner-join views. The view needs to contains both a regular row count and a not-null count for every table that is null-extended in some term. New rows are created as needed. Any row whose row count becomes zero is deleted. If the not-null count for table T becomes zero, all aggregates referencing a column in T are set to null. Next we compute ∆V I , aggregate the result and apply it to the view. However, we may have to compute ∆V I from base tables as we shall see in Section 5.3 because it may not be possible to extract a required term from the aggregated view. Tuples from different terms that have been combined into the same group can no longer be separated out. Such computation may incur additional overhead. Further discussion of aggregation views and additional simplification rules are described in [7]. 4 Computing the Primary Delta Suppose table T has been updated and we need to maintain a view V that references T . Every term Ei in V D has T as one of its source tables, so no tuples in V D are null extended on T . Conversely, all tuples in V I and V U are null extended on T because the terms in V I and V U do not reference T . A tuple that is null-extended on a table T cannot subsume a tuple that is not null-extended on T . It follows that a tuple generated by a term in V D can only be subsumed by a tuple generated by another term in V D so we can rewrite V D as V D = ⊎ki=1 Di = ⊕ki=1 Ei . V D = ⊕ki=1 Ei contains all terms that produce tuples containing real tuples from table T . We now show a simple conversion of the original (non-normalized) view expression into an expression equal to V D that can then trivially be converted into an expression for computing ∆V D . Example 3 Suppose table T has been updated and V1 needs to be maintained. We derive expressions for V1D and ∆V1D through a series of transformations of expression (1). The original operator tree is shown in Figure 2(a). We traro p(r,t) lo p(r,t) fo p(r,s) R S T fo p(t,u) fo p(t,u) U T U R p(r,t) lo p(t,u) fo p(r,s) S T U S p(r,t) fo p(r,s) R fo p(r,s) lo p(t,u) ¨T U R S (d) ∆V1D (a) V1 (b) V1 (c) (original) (commuted) Figure 2. Transforming V1 to ∆V1D V1D verse the path from T to the root of the tree. On each join operator encountered we commute the inputs, if needed, so that the input expression referencing T is on the left. The only operator affected is the root operator where we swap the inputs and change the join from a left outer to a right outer join. The resulting operator tree is shown in Figure 2(b). This transformation converts the expression to o o ⋉fp(r,s) S). V1 = (T ⋊ ⋉fp(t,u) U) ⋊ ⋉ro p(r,t) (R ⋊ (2) Now consider the operators on the leftmost path in Figo ure 2(b). The join T ⋊ ⋉fp(t,u) U may produce three types of tuples: T U , T and U . All U -only tuples are null extended on T and, hence, can never become part of V1D because no tuples in V1D are null extended on T . Tuples of type T U and T will, if they survive the next join, become part of V1D because they contain “real” T tuples. We can eliminate the U -only tuples by changing the join to a left outer join. The next join on the path is ⋊ ⋉ro p(r,t) . After the modification, its left input produces only tuples destined for V1D . Its right input may produce tuples of types RS, R and S. Because the join is a right outer join, it preserves unmatched tuples from the right input. However, they are null extended on T and, hence, cannot become part of V1D . We can eliminate the unmatched tuples by changing the join from a right outer to an inner join, as shown in Figure 2(c). The modified expression produces exactly the same tuples as the original expression for all terms containing actual T tuples, that is, all terms in V1D , and no tuples that are null extended on T . Furthermore, none of the retained tuples were ever subsumed by a tuple in a term eliminated by modifying the joins. It follows that the modified expression exactly computes V1D , that is, o ⋉p(r,t) (R ⋊ ⋉fp(r,s) S). V1D = (T ⋊ ⋉lo p(t,u) U ) ⋊ (3) The leftmost path in Figure 2(c) contains only a left outer join and an inner join. As explained further below, an expression for ∆V1D can be obtained simply by substituting ∆T for T in expression (3), that is, o ∆V1D = (∆T ⋊ ⋉lo ⋉p(r,t) (R ⋊ ⋉fp(r,s) S). p(t,u) U ) ⋊ Step 1 is a normal rewrite of the view expression and does not change the result. Step 2 modifies the expression so that it discards all tuples that cannot become part of V D . After Step 2, the operators on the path from T to the root consists only of selects, inner joins and left outer joins and the delta expression is always the left input. The correctness of Step 3 follows from the following delta propagation rules. σp (e1 ± ∆e1 ) = σp e1 ± σp ∆e1 (e1 ± ∆e1 ) ⋊ ⋉p e2 = e1 ⋊ ⋉p e2 ± ∆e1 ⋊ ⋉p e2 (e1 ± ∆e1 ) ⋊ ⋉lo ⋉lo ⋉lo p e2 = e1 ⋊ p e2 ± ∆e1 ⋊ p e2 where ± stands for either a set union or a set difference. The rules for selects and inner joins are obvious. The rule for left outer join can be found in [2]. 4.1 Conversion to a Left-Deep Tree In many cases, only a few tuples are inserted or deleted at a time and only a small number of tuples are affected in the view. The expression for ∆V D produced by the algorithm above are not always efficient for such cases because it may contain subexpressions joining two or more base tables. Joining two base tables may produce a large intermediate result even though the final result is small. We show how to convert the expression to a left-deep join tree that avoids this problem. Ideally, the optimizer should consider this conversion automatically but current optimizers are deficient in this area. We introduce two new associativity rules for outer joins (rules 4 and 5 below). These additional rules make it possible to always convert the delta expression to a left-deep tree provided that all join predicates are binary, that is, reference only two tables. Example 4 When T is updated, our algorithm produces the following expression for ∆V1D o ∆V1D = (∆T ⋊ ⋉lo ⋉p(r,t) (R ⋊ ⋉fp(r,s) S). p(t,u) U ) ⋊ lo p(r,s) (4) p(r,t) The corresponding operator tree is shown in Figure 2(d). The simple procedure for constructing expressions for V D and ∆V D illustrated in this example generalizes to arbitrary SPOJ views. The general algorithm follows. Algorithm: Construct ∆V D expression Inputs: Original view expression V , updated table T . Output: Expression for computing ∆V D . 1. Traverse the operator tree for V along the path from T to the root. On any join operator encountered, apply commutativity rules to ensure that the input referencing T is on the left. 2. Traverse the path from T to the root of V . Convert any full outer join operator encountered to a left outer join and any right outer join operator to an inner join. 3. Substitute T by ∆T . (5) lo p(t,u) T U S p(r,t) fo p(r,s) R (a) Bushy tree lo p(t,u) S R ¨T U (b) Left-deep tree Figure 3. Converting ∆V1D to a left-deep tree The operator tree is shown in graphical form in Figure 3(a). This expression is potentially very expensive to compute. Suppose ∆T is very small, containing only a few tuples. Then the join ∆T ⋊ ⋉lo p(t,u) U is likely to produce a small result. This small result is the left input to the final join ⋊ ⋉p(r,t) so the final result is also likely to produce a small result. However, the right operand is a join involving o S. This join may be exbase tables only, namely, R ⋊ ⋉fp(r,s) pensive to compute and will produce a result that is at least as large as the maximum of R and S. We can eliminate the potentially large intermediate result by converting the operator tree to a left-deep tree, that is, a tree where the right operand of every join is a single base table, possibly including a select. This can be done by exploiting associativity rules for inner and outer joins. Applying this transformation to expression (5) produces ∆V1D = ((∆T ⋊ ⋉lo ⋉p(r,t) R) ⋊ ⋉lo p(t,u) U ) ⋊ p(r,s) S (6) The corresponding operator tree is shown in Figure 3(b). Using this expression, the intermediate results are likely to stay small if ∆T is small. In the expression for ∆V D , the operators on the path from the updated table to the root (the leftmost path) are limited to selection, inner join and left outer join. We convert the tree to a left-deep tree by repeatedly applying the following simple procedure: for any join operator on the leftmost path whose right operand references more than one table, pull the top operator of the right operand into the main path by applying one of the associativity rules listed below. The rules assume that all join predicates are null-rejecting and reference only two tables but they are not required to be equijoins. The notation p(1, 2) means a predicate that references columns in e1 and in e2 , and similarly for p(2, 3). We need a new but simple operator called the null-if operator and denoted by λcp where p is predicate and c a list of column. For every tuple that satisfies p, the operator sets the value of all columns in c to null; all other tuples are passed through unchanged. The operator can implemented using a project with the case statement of SQL. A null-if operator may create duplicates, which need to be eliminated. Associativity rules for left outer join (1) e1 ⋊ ⋉lo p(1,2) (σp(2) e2 ) = δ (2) e1 ⋊ ⋉lo p(1,2) o (e2 ⋊ ⋉fp(2,3) (3) e1 ⋊ ⋉lo p(1,2) (4) e1 ⋊ ⋉lo p(1,2) 2 .∗ λe¬p(2) (e1 ⋊ ⋉lo p(1,2) e2 ) e3 ) = e2 ) ⋊ ⋉lo p(2,3) e3 (e2 ⋊ ⋉lo p(2,3) e3 ) = (e1 ⋊ ⋉lo p(1,2) ⋊ ⋉lo p(2,3) (e2 ⋊ ⋉ro p(2,3) e3 ) = δ 2 .∗,e3 .∗ λe¬p(2,3) ((e1 e3 Extracting Term Deltas from ∆V D The primary delta ∆V D contains the union of the deltas for all directly affected terms. However, we need the deltas for individual terms to compute the secondary delta. Each term is defined over a unique set of tables and null extended on all others so tuples from a particular term are easily identified and can be extracted from ∆V D by a combination of null and ¬null predicates. Example 5 ∆V1D contains the deltas for four directly affected terms, see Figure 1(b). Consider, for example, the T RS-term. Non-subsumed tuples from this term are uniquely identified by the fact that they are composed of real tuples from T , R, and S but are null extended on U . Hence, ∆DT RS can be extracted from ∆V1D as follows ∆DT RS = π(T RS).∗ σnn(T RS)∧n(U ) ∆V1D where nn(T RS) = ¬null(T ) ∧ ¬null(R) ∧ ¬null(S) and n(U ) = null(U ). ∆DT RS contains only the delta of the net contribution. ∆ET RS contains the complete delta of the term, including both subsumed and non-subsumed tuples. Tuples in ∆ET RS are composed of real tuples from T , R, and S and may or may not be null extended on U . Hence, ∆ET RS can be extracted from ∆V1D as follows ∆ET RS = δ π(T RS).∗ σnn(T RS) ∆V1D Theorem 2 Consider a view V defined over tables U. Let Ei be a term in V D defined over tables Ti , and Di its net contribution. Then ∆Di and ∆Ei can be computed as ∆Di = πTi . ∗ σnn(Ti )∧n(U−Ti ) ∆V D ⋊ ⋉lo ⋉lo p(1,2) e2 ) ⋊ p(2,3) e3 ) ⋉p(2,3) e3 ) = (5) e1 ⋊ ⋉lo p(1,2) (e2 ⋊ 2 .∗,e3 .∗ δ λe¬p(2,3) ⋉lo ((e1 ⋊ ⋉lo p(1,2) e2 ) ⋊ p(2,3) e3 ) Note that all joins added to the main path are inner joins and left outer joins. The null-if operators fix up tuples that are supposed to be null-extended and duplicates, if any, thus created are eliminated. To the best of our knowledge, rules 1, 4 and 5 are new. 5 5.1 The duplicate elimination (δ) is necessary because a T RS tuple may have joined with multiple U tuples. (e1 ⋊ ⋉lo p(1,2) e2 ) Every term in a view has a unique set of source tables and is null-extended on all other tables in the view. We denote the source tables of term Ei by Ti and the set of tables on which it is is null-extended by Si . Computing the Secondary Delta The secondary delta can be computed efficiently from the primary delta and either the view or base tables – we consider both options. When possible, it is usually cheaper to use the view but the optimizer should choose in a costbased manner. Recall that the base tables have already been updated and the primary delta has been applied to the view. ∆Ei = δ πTi . ∗ σnn(Ti ) ∆V D  where nn(Ti ) = t∈Ti ¬null(t) and n(U − Ti ) =  t∈(U −Ti ) null(t). 5.2 Computing ∆V I Using the View We first consider how to compute ∆V I from the primary delta and the view. After applying the primary delta, the state of the view is V + ∆V D or V − ∆V D . ∆Di denotes the change in the net contribution of the indirectly affected term Ei . Insertions: After an insertion, ∆Di can be computed from the view and the primary delta by the expression D ∆Di = σnn(Ti )∧n(Si ) (V + ∆V D ) ⋊ ⋉ls eq(Ti ) σPi ∆V  Pi = nn(Tk ) Ek ∈pard(Ei ) Ti denotes the source table set of Ei and Si the set of tables on which Ei is null extended. Ek ranges over all directly affected parents of Ei and Tk denotes the source table set of Ek . eq(Ti ) denotes an equijoin condition between the key columns of Ti in the left operand and in the right operand. This expression makes sense intuitively. The first part selects from the view all orphaned (non-subsumed) tuples of term Ei , that is, the tuples in Di . The second part extracts from the primary delta all tuples added to a parent term of Ei . The complete expression thus amounts to finding all currently orphaned tuples of the term and deleting those that cease to be orphans because of the insert. Example 6 Continuing with our running example, we need to compute ∆DRS and ∆DR . DRS is null extended on T and U and the T RS-term is its only parent, so ∆DRS can be computed as D ∆DRS = σnn(RS)∧n(T U ) (V1 + ∆V1D ) ⋊ ⋉ls eq(RS) σnn(T RS) ∆V1 DR is null extended on S, T and U and it has one directly affected parent, the T R-term so ∆DR can be computed as D ∆DR = σnn(R)∧n(ST U ) (V1 + ∆V1D ) ⋊ ⋉ls eq(R) σnn(T R) ∆V1 Deletions: After a deletion, ∆Di can be computed from the view and the primary delta using the expression D ∆Di = (δ πTi .∗ σPi ∆V D ) ⋊ ⋉la eq(Ti ) (V − ∆V ) where Pi is the same as for the insertion case. This expression also makes sense intuitively. The first part extracts from the primary delta the tuples deleted from parents of term Ei , projects them onto the tables of term Ei , and eliminates duplicates. This produces the potentially orphaned tuples of Ei . The anti-semijoin then discards every tuple that is still included in a parent tuple. This leaves the actual new orphans to be inserted. Example 7 After deletions from table T , the delta of the indirectly affected terms of V1 can be computed as follows. D ∆DR = (δ πR.∗ σnn(T R) ∆V1D ) ⋊ ⋉la eq(R) (V1 − ∆V1 ) ∆DRS = (δ π(RS).∗ σnn(T RS) ∆V1D ) ⋊ ⋉la eq(RS) (V1 − ∆V1D ) Column availability: If a view does not output the columns required by the expressions above, then the expression cannot be used and ∆Di has to be computed using base tables. The join predicates require access to the key columns of the referenced tables. For insertions, tuples are extracted from the view using a combination of null and ¬null predicates against source tables. However, the view may not output a non-null column for each of the referenced source table. Even so, it may still be possible to extract the required tuples from the view. The exact conditions when extraction is still possible are derived in [6]. The key observation is that the view must expose enough non-null columns to uniquely distinguish the required tuples from tuples of all other terms. 5.3 Computing ∆V I from Base Tables If the view does not output all required columns, the delta of a term cannot be computed from the view and the primary delta. If so, the term delta has to be computed from base tables, ∆T , and the primary delta. As before, we assume that the update has already been applied to table T and thus only the new state of the table is available. We denote the new state of the table by T ± ; T ± = T + ∆T after an insertion and T ± = T − ∆T after a deletion. Before proceeding we need to introduce some additional notation. Let Ei = σpi (Si1 × · · · × Sim ) be an indirectly affected term under consideration. Ei has r directly affected parents pard(Ei ) = {Ei1 , · · · , Eir }, r ≥ 1 and s, s ≥ 0, indirectly affected parents pari(Ei ). Let Ek be one of the parent terms. Because Ek is a parent of Ei , we know that its source set contains Si = {Si1 , · · · , Sm }, and some other tables Rk = {Rk1 , · · · , Rkn }. Furthermore, if the parent is directly affected, it references T but not if it is indirectly affected. We split the expression for Ek into three parts: Ek = σpk (Si1 × · · · × Sim × Rk1 × · · · × Rkn × T ) ⋉q(Si ,Rk ,T ) σq(Rk ) (Rk1 × · · · × Rkn ) = σpi (Si1 × · · · × Sim ) ⋊       ⋊ ⋉q(Rk ,T ) σq(T ) (T ) ← missing for indirectly affected terms    The new predicates are constructed from pk as follows: q(Rk ) contains every conjunct of pk that references only tables in Rk ; q(T ) contains every conjunct of pk that references table T only; q(Si , Rk , T ) contains every conjunct that references at least one table among Sk and at least one table among Rk ∪ {T }; and q(Rk , T ) contains every conjunct of pk that references at least one table in Rk and T . ∆Di can be computed from the last two parts of Ek and the primary delta. Insertions After an insertion, ∆Di can be computed as ′ ′ ∆Di = (δ πTi .∗ σQi ∆V D ) ⋊ ⋉la ⋉la qi1 Ei1 · ·· ⋊ qir Eir ′ where Qi , Eip , and qip , p = 1, 2, · · · , r are defined as Qi = nn(Si ) ∧ n(∪Ek ∈pari(Ei ) Rk ) ′ Eip = σq(Rip ) (Ri1 × · · · × Rin ) ⋊ ⋉q(Rip ,T ) (σq(T ) T ± ⋊ ⋉la eq(T ) ∆T ) qip = q(Si , Rip , T ) Example 8 After an insertion into table T , the tuples to be deleted from view V1 can be computed as ± ⋊ ⋉la ∆DR = (δ π(R).∗ σnn(T R)∧n(S) ∆V1D ) ⋊ ⋉la eq(T ) ∆T ) p(r,t) (T ± ∆DRS = (δ π(RS).∗ σnn(T RS) ∆V1D ) ⋊ ⋉la ⋊ ⋉la p(r,t) (T eq(T ) ∆T ) Let’s see if the expression for ∆DR makes sense. The expression δ π(R).∗ σnn(T R)∧n(S) ∆V1D extracts all R tuples in the primary delta that that did not join with any S tuples. These R tuples are no longer orphans but some of them may have been before the insertion. The expression ⋉la T± ⋊ eq(T ) ∆T represents all tuples in T before the insertion. An extracted R tuple satisfies the anti-semijoin if it does not join with any of the old T tuples, that is, if it were an orphan. All such prior orphans should be deleted from the view. Deletions: After a deletion, ∆Di can be computed as ′ ′ ∆Di = (δ πTi .∗ σQi ∆V D ) ⋊ ⋉la ⋉la qi1 Ei1 · ·· ⋊ qir Eir ′ where Qi , Eip and qip , p = 1, 2, · · · , r are defined as Qi = nn(Si ) ∧ n(∪Ek ∈pari(Ei ) Rk ) ′ Eip = σq(Rip ) (Ri1 × · · · × Rin ) ⋊ ⋉q(Rip ,T ) (σq(T ) T ± ) qip = q(Si , Rip , T ) Example 9 Applying the formula above, we find that ∆DR and ∆DRS of our example view V1 can be computed as ∆DR = (δ ∆DRS = (δ π(R).∗ σnn(T R)∧n(S) ∆V1D ) π(RS).∗ σnn(T RS) ∆V1D ) ⋊ ⋉la p(r,t) ⋊ ⋉la p(r,t) T T ± ± Again, let’s analyze the expression for ∆DR . The expression δ π(R).∗ σnn(T R) ∆V1D extracts from the primary delta all deleted R tuples that do not join with an S tuple. These are the potential new R-only orphans. Any new orphan that does not join with a tuple remaining in T after the deletion is an actual orphan and is inserted into the view. 6 Exploiting Foreign Keys In our first example, we exploited foreign-key constraints to conclude that the view could be maintained after insertion of a part tuple simply by inserting the new tuple into the view. The techniques we have developed so far would not recognize this opportunity. In this section we show to exploit foreign-key constraints to further simplify computation of the primary delta and the secondary delta. However, the optimization described in this section cannot be applied under the following circumstances. 1. When an update is logically decomposed into a delete and an insert for the purpose of view maintenance. (The tuples in T may be only modified and there are no actual deletions and insertions.) 2. The constraint is declared with cascading deletes. 3. The constraint is deferrable and the insert/delete statement is part of a multi-statement transaction. 6.1 Simplifying ∆V D Computation Example 10 Consider our running example view V1 but with a slight modification. We add a foreign key constraint from column U.f k to column T.pk where T.pk is a primary key of T , and assume that the join predicate p(t, u) equals T.pk = U.f k. The view definition then becomes o o V1 = (R ⋊ ⋉fp(r,s) S) ⋊ ⋉lo ⋉fpk=f p(r,t) (T ⋊ k U ). and our algorithms generate the primary delta expression ∆V1D = ((∆T ⋊ ⋉lo ⋉p(r,t) R) ⋊ ⋉lo pk=f k U ) ⋊ p(r,s) S (7) Because of the foreign key constraint, no tuples in ∆T will join with tuples in U . Let t ∈ ∆T be a tuple that has been inserted into T . Tuple t has a unique pk value so there cannot exist a tuple u ∈ U that references t. If such at tuple u existed, it would violate the foreign-key constraint. The same reasoning can be applied to a tuple t that is deleted from T . If a tuple u existed, it would violate the foreignkey constraint after the deletion. As no tuples in ∆T join with U , the outer join ∆T ⋊ ⋉lo pk=f k U simply passes through the tuples from ∆T and can therefore be eliminated. Doing so reduces the expression to ∆V1D = (∆T ⋊ ⋉p(r,t) R) ⋊ ⋉lo p(r,s) S None of the other joins reference the discarded table U so no further modifications are needed. Let Fi , i = 1, · · · , m be a foreign key constraint from a table Si to the updated table T that matches a join in the expression for ∆V D . To simplify the operator tree based on the foreign key constraints, call the procedure SimplifyTree below with inputs ∆V D and S = {S1 , · · · , Sm }. Procedure: SimplifyTree(Tree DT , Set of Tables S) Traverse DT from the leftmost leaf to the root. At each operator node n, do the following 1. If n is an inner join or a select with a predicate that is null-rejecting on a table s ∈ S, set DT = ∅ and return. 2. If n is a left outer join with a predicate that is null-rejecting on a table s ∈ S, eliminate node n and connect its left input to its parent. Let R denote the set of tables of the right input expression. Add R to S. 6.2 Simplifying ∆V I Computation Foreign key constraints can also be exploited to reduce the number of affected terms and potentially reduce the cost of computing ∆V I . The following theorem summarizes how to use foreign-key constraints to detect additional terms that are unaffected by an update. Theorem 3 Consider a directly affected term with base Si in the normal form of a SPOJ view and assume that a table T ∈ Si is updated by an insertion or deletion. The net contribution of the term is unaffected if Si contains another table R with a foreign key referencing a non-null, unique key of T , and R and T are joined on this foreign key. We exploit this theorem to eliminate directly affected nodes and their edges from the maintenance graph. Elimination of directly affected nodes may leave an indirectly affected node without incoming edges, that is, without affected parents. Any such nodes can also be eliminated. We call the resulting graph the reduced maintenance graph. Example 11 This optimization does not simplify the computation of ∆V1I for our modified running example. Con- sider instead view V2 defined as follows full outer join part on l partkey=p partkey and p retailprice < 2000 o o ⋉fck=ock V2 = σpc C ⋊ ⋉fok=lok (σpo O ⋊ L) = σpc ∧po ∧ck=ock∧ok=lok (C × O × L)⊕ σpc ∧po ∧ck=ock (C × O)⊕ σpo ∧ok=lok (O × L) ⊕ σpc C ⊕ σpo O ⊕ L The maintenance graph for updates to O in Figure 4(a) shows two indirectly affected terms so ∆V2I = ∆DC ⊎ ∆DL . {C, O, L}D {C, O}D {C, O}D {O, L}D I {C} D {O} {C}I I {L} (a) Original graph {O}D (b) Reduced graph Figure 4. V2 Maintenance graphs (update O). Now assume that there is foreign key constraint from L.lok to O.ok. Then an insertion into O will not affect nodes OL, and COL. Eliminating these two nodes, leaves node L without a parent so it is also eliminated. This leaves the reduced graph shown in Figure 4(b), which has only one indirectly affected term so ∆V2I = ∆DC . 7 Experimental Results We ran a series of experiments on Microsoft SQL Server 2005 Beta2 to evaluate the cost of maintaining outer join views. The experiments were performed on a workstation with two 3.2 GHz Xeon processors, 2GB of memory and three SCSI disks. All queries were against a 10GB version (SF=1) of TPC-H database. View maintenance was implemented using insert and delete triggers that called stored procedures. The stored procedures followed the steps in our maintenance algorithm for outer-join views. We also compared the maintenance cost for a outer join view with the maintenance cost for the corresponding core view, which is the view obtained by replacing all outer joins with regular inner joins. For each experiment, we measured the maintenance costs both with a warm buffer pool and with a cold buffer pool. The trends are quite similar so we only report the results with a cold buffer pool. In the first experiment, we created an outer join view of the tables customer, orders, lineitem, part, as shown below. The corresponding core view contained inner joins of the four tables with the same join predicates. Both views had the same indexes. create view V3 as select l orderkey, l linenumber, l quantity, l extendedprice, l shipdate, l returnflag, o orderkey, o orderdate, o clerk, c custkey, c nationkey, c mktsegment, p partkey, p type, p retailprice from ((select * from lineitem, orders where l orderkey=o orderkey and o orderdate between ’1994-06-01’ and ’1994-12-31’) lo right outer join customer on c custkey = o custkey) create unique clustered index V4 clu on V4(c custkey, p partkey, l orderkey, l linenumber, o orderkey) create index V4 idx on V4(p partkey, c custkey, l orderkey, l linenumber, o orderkey) Term Cardinality Rows affected COLP 5208168 4863 COL 131702 128 C 184224 323 P 789131 346 Table 1. Terms in view V3 and rows affected when inserting 60,000 lineitem rows The normal form of view V3 contains four terms with the cardinalities shown in Table 1. Because of the foreign key constraint between lineitem and orders, insertion or deletion of order rows does not affect the view. When inserting (or deleting) customer rows, due to a foreign key constraint between orders and customer, we only need to add (or delete) the customer in the view. The resulting maintenance overhead for the view is very small and not reported here. Inserting or deleting customer rows or part rows has no effect on the core view. However, updating lineitem can affect all four terms. The last line in Table 1 shows the number of rows affected when inserting 60, 000 rows into lineitem. The maintenance steps for V3 after an insertion into lineitem are listed below. The table Inserted contains all the rows that were inserted into the lineitem table. Maintaining the view in case of deletion is similar but we omit the details due to lack of space. Q1 : Compute primary delta insert into #delta1 select l orderkey, l linenumber, l quantity, l extendedprice, l shipdate, l returnflag, o orderkey, o orderdate, o clerk, c custkey, c nationkey, c mktsegment, p partkey, p type, p retailprice from ((select * from inserted, orders, customer where l orderkey=o orderkey and c custkey = o custkey and o orderdate between ’1994-06-01’ and ’1994-12-31’) lo left outer join part on l partkey=p partkey and p retailprice < 2000 Q2 : Apply primary delta insert into V3 select * from #delta1 Q3 : Update term delete from V3 where o orderkey and p partkey is and c custkey in C is null and l orderkey is null null (select c custkey from #delta1) Q4 : Update term P delete from V3 where c custkey is null and o orderkey is null and l orderkey is null and p partkey in (select p partkey from #delta1) Figure 5 compares the maintenance cost of view V3 with those of the corresponding core view when inserting 60, 600, 6,000 and 60,000 rows, respectively. The costs for the outer-join view are virtually the same as for the core view – the overhead for fixing up the P and C terms is very low. 250 Elapsed Time (seconds) Elapsed Time (seconds) 200 160 120 80 40 0 200 150 100 50 0 60 600 6000 60000 60 LINEITEM Records Inserted Core View Outer Join View Outer Join View (GK) 600 6000 60000 LINEITEM Records Deleted Core View Outer Join View Outer Join View (GK) (a) Insertion (b) Deletion Figure 5. Maintenance costs for V3 The results for Griffin’s and Kumar’s (GK) algorithm [2] are also shown in Figure 5. Their maintenance expressions are quite complex. Their performance is similar to ours when the number of insertions is very small, but deteriorates dramatically with more insertions. For deletions their performance is much worse than ours. Gupta’s and Mumick’s algorithm [5] was not included in the experiment because it may produce an incorrect result. These experiments show that our algorithms generate very efficient maintenance expressions. As a consequence, maintaining an outer-join view need not be more expensive than maintaining an inner-join view. 8 Related Work It is well understood how to incrementally maintain views with inner joins. References [4] and [3] provide good overviews of the large body of work in this area. Much work has also been done on optimization of outer-join queries; for details see [9] and its references. We are aware of only two earlier papers that describe algorithms for incremental maintenance of outer-join views. Griffin’s and Kumar’s algorithm [2] produces maintenance expressions of the correct form but they are incomplete because the predicates of the semi and anti-semi joins used are not specified. Getting the predicates right is not trivial. The experiments reported in Section 7 showed that their approach is significantly more expensive than ours. Their algorithm consistently produces maintenance expressions that are more complex and more expensive than ours. The main reasons are that (a) their expressions may involve joins of base tables only and may produce large intermediate results; (b) their expressions never exploit the view itself, everything is computed from base tables and (c) null-rejecting predicates and foreign keys are not exploited to deduce what terms are unaffected so (empty) deltas for many terms may be computed unnecessarily. Gupta’s and Mumick’s algorithm [5] assumes than each directly affected tuple can subsume at most one indirectly affected tuple, which is incorrect. We can illustrate the problem using view oj view from the introductory section. The view contains tuples of three types only: {part, orders, lineitem}, part, and orders. Suppose we insert a new lineitem tuple. This causes insertion of a new {part, orders, lineitem} tuple into the view. However, the new tuple may force removal of both an orphaned part tuple and an orphaned orders tuple from the view. This happens if the new lineitem tuple is the first line item of the order and nobody has ordered this particular part before. Gupta’s and Mumick’s algorithm would modify one of the tuples but not delete the other one, leaving the view in an incorrect state. It wrongly assumes that that the view also contains {part, lineitem} and {orders, lineitem} terms. This flaw in the algorithm appears to be fundamental and not easily fixable. 9 Conclusion We introduced an efficient incremental maintenance procedure for materialized outer-join views. Efficient incremental maintenance expressions are constructed for such views. The expressions are composed of regular algebraic operators – no new operators are needed. Exploiting a normal form and subsumption graphs enables us to precisely identify which terms are affected and how to maintain them, and therefore avoid unnecessary work. If foreign key constraints are available, they are also exploited to simplify maintenance. Experimental results show that maintaining an outer-join view is not necessarily more expensive than maintaining an inner-join view. One direction for future work is to investigate even more efficient ways to compute ∆V I . It may be possible to combine (parts of) the computations for the different terms, for example, by exploiting outer joins or by saving and reusing partial results. References [1] C. Galindo-Legaria. Outerjoins as disjunctions. In SIGMOD Conference, 1994. [2] T. Griffin and B. Kumar. Algebraic change propagation for semijoin and outerjoin queries. SIGMOD Record, 27(3):22– 27, 1998. [3] A. Gupta and I. S. Mumick. Maintenance of materialized views: Problems,techniques, and applications. Data Engineering Bulletin, 18(2), 1995. [4] A. Gupta, I. S. Mumick, and V. S. Subrahmanian. Maintaining views incrementally. In SIGMOD Conference, 1993. [5] H. Gupta and I. S. Mumick. Incremental maintenance of aggregate and outerjoin expressions. Information Systems, 31(6), 2006. [6] P.-Å. Larson and J. Zhou. View matching for outer-join views. In VLDB Conference, 2005. [7] P.-Å. Larson and J. Zhou. Maintenance of materialized outer-join views. Technical Report (to appear), Microsoft Research, 2006. [8] Oracle Corp. Oracle Database Data Warehousing Guide 10g Release 2, 2006. http://download-west.oracle.com /docs/cd/B19306 01/server.102/b14223/basicmv.htm. [9] J. Rao, H. Pirahesh, and C. Zuzarte. Canonical abstraction for outerjoin optimization. In SIGMOD Conference, 2004. [10] Transaction Processing Performance Council. TPC Benchmark H, (Decision Support), Revision 2.3.0, 2005. http://www.tpc.org/tpch/spec/tpch2.3.0.pdf.