
Rewriting XPath queries using materialized views

2005, Very Large Data Bases

XPath is a simple XML query language with enough expressive power to have become very popular. To expedite the evaluation of XPath queries, we consider the problem of rewriting XPath queries using materialized XPath views. This problem is important and arises not only from query optimization on the server side but also from semantic caching on the client side. We consider the problem of deciding whether there exists a rewriting of a query using XPath views and the problem of finding minimal rewritings. We first consider those two problems for a very practical XPath fragment containing the descendant, child, wildcard and branch features. We show that the rewriting existence problem is coNP-hard and that the problem of finding minimal rewritings is in $\Sigma^p_3$. We also consider those two rewriting problems for three subclasses of this XPath fragment, each of which contains the child feature and two of the descendant, wildcard and branch features, and show that both rewriting problems can be solved in polynomial time. Finally, we give an algorithm for finding minimal rewritings, which is sound for the XPath fragment, and is also complete and runs in polynomial time for its three subclasses.

Introduction

Recently, more and more data are represented and exchanged as XML documents over the Internet. XPath [11], recommended by W3C, is a simple but popular language for navigating XML documents and extracting information from them. XPath is also used as a sub-language of other XML query languages such as XQuery [5] and XSLT [12].

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005

Since this language is popular, a lot of work has been done to speed up the evaluation of XPath queries, for example: index techniques [10,29], structural join algorithms [1,6] and minimization of XPath queries [2,30,28,15]. More recently, the problem of rewriting queries using materialized XML views has begun to attract more attention.

This rewriting problem was first discussed for semantic caching, because semantic caching can improve performance significantly in traditional client-server databases and Web-based information systems. Hence, [9,32] consider using cached XML views to answer XML queries and obtain a noticeable performance advantage. Moreover, the authors of [3] also consider this problem, but in XML query processing using materialized XPath views. They point out that most newly proposed indexing schemes can be modelled as materialized views, so the rewriting problem could be essential to the efficient evaluation of XPath queries. In this paper, we consider the formal theoretical aspects of this problem, which, to the best of our knowledge, have not been explored in previous work.

We begin by giving some examples to describe the motivation for studying this problem. Consider the following XML document t stored in an XML server, which partially describes enzyme information of a biological pathway:

    <Pathway name = "PA1">
      <Reaction name = "RE1">
        <Enzymes>
          <Protein name = "PR1" EC# ="1.0.0.1"/>
          <RNA name = "RN1"/>
        </Enzymes>
      </Reaction>
      <Reaction name = "RE2">
        <Enzymes>
          <RNA name = "RN2"/>
        </Enzymes>
      </Reaction>
    </Pathway>

Let's assume that a client issues to the server an XPath query v:

    /Reaction/Enzymes

which retrieves the Enzymes subelements of all Reaction elements. The server evaluates this query and sends its result back to the client as follows:

    <Enzymes>
      <Protein name = "PR1" EC# ="1.0.0.1"/>
      <RNA name = "RN1"/>
    </Enzymes>
    <Enzymes>
      <RNA name = "RN2"/>
    </Enzymes>

Suppose the client caches the above result and then issues another XPath query $p_1$:

    /Reaction/Enzymes[/Protein]

which retrieves the Enzymes subelements of all Reaction elements that have at least one Protein subelement. Obviously, the result of $p_1$ is a subset of the cached result, so we can issue an XPath query $p'_1$:

    Enzymes[/Protein]

which retrieves Enzymes elements having at least one Protein subelement, over the result of query v to compute the result of $p_1$ without sending $p_1$ to the server. We say that $p'_1$ together with query v is a rewriting of $p_1$, and $p'_1$ is a compensation query of $p_1$ using v.

Let's consider another XPath query $p_2$:

    /Reaction/Enzymes/Protein

which retrieves all Protein subelements of the Enzymes subelements of Reaction elements. The result of $p_2$ is not a subset of the cached result of query v. But, because of the nested structure of XML documents, each Protein element in the result of $p_2$ is a subelement of an Enzymes element in the cached result of v. We can still issue an XPath query $p'_2$:

    Enzymes/Protein

which retrieves all Protein subelements of Enzymes elements, over the cached result to compute the result of $p_2$.

However, for some XPath queries, we can't compute their results by using the cached result even if we know their results are a subset or subelements of the cached result. For example, consider the following XPath query $p_3$:

    /Reaction[@name = "RN2"]/Enzymes

which retrieves only the Enzymes subelements of Reaction elements with name "RN2". We know the result of $p_3$ is a subset of the cached result, but we don't know which Enzymes elements in the cached result should be included in the result of $p_3$. Thus, there is no rewriting of $p_3$ using v, i.e., we can't issue a query over the result of v to answer $p_3$.

In general, given an XPath query (view) v which is materialized (i.e., its result is pre-computed or cached) and a new XPath query p to be answered, the first problem studied in this paper is the rewriting existence problem, i.e., whether a compensation query of p using v exists such that we can evaluate the compensation query over the pre-computed or cached result of v to answer p. In case there are multiple compensation queries, we are interested in the compensation query with the minimum evaluation cost. According to the theoretical analysis of [16,17], the evaluation efficiency of XPath queries greatly depends on their size. As in [30,15,2], we use the size of an XPath query as a measure of its cost. Hence, the second problem studied in this paper is to find the compensation query with minimum size (also called the problem of finding minimal rewritings).

The rest of this paper is organized as follows. In Section 2, we introduce basic notations and definitions about tree patterns, which are simple XPath queries that are used frequently in practice. The two rewriting problems are formulated in Section 3. Sections 4 and 5 discuss the complexities of those two rewriting problems for tree patterns, and give an algorithm to find minimal rewritings. Finally, we describe related work in Section 6 and give the conclusion in Section 7.

Preliminaries

Trees and Tree Patterns

Generally, an XML database consists of a set of XML documents. We model each XML document as an unordered rooted node-labelled tree (called an XML tree) over an infinite alphabet Σ, where the label of each node corresponds to an XML element, an attribute name or a data value. We denote the set of all possible XML trees over Σ as $T_\Sigma$.

Definition 2.1 An XML document is a tree $t\langle V_t, E_t, r_t \rangle$ over Σ, called an XML tree, where

• $V_t$ is the node set and $E_t$ is the edge set;

• $r_t \in V_t$ is the root of t;

• Each node n in $V_t$ has a label from Σ (denoted as n.label).

Given an XML tree $t\langle V_t, E_t, r_t \rangle$, we say that

For any node n in t, we denote the subtree rooted at n and containing exactly all its descendants as $(t)^n_{sub}$. We let n be the root of $(t)^n_{sub}$, so that $(t)^n_{sub}$ can also be seen as an XML tree. For instance, Fig. 1(c) shows the subtree rooted at the 'd'-labelled node of an XML tree t, which is shown in Fig. 1(a).

Figure 1: (a) An XML tree t; (b) A pattern p; and (c) A subtree of t

In this paper, we discuss a fragment of XPath queries first studied in [23]. This fragment consists of label tests, child axes (/), descendant axes (//), branches ([]) and wildcards (*). It can be recursively represented by the following grammar:

$$q \rightarrow l \mid * \mid q/q \mid q//q \mid q[q]$$

where l is a node label from Σ. We denote this fragment as $XP^{\{/,//,*,[]\}}$. Three subclasses of $XP^{\{/,//,*,[]\}}$ are also specially discussed: $XP^{\{/,//,[]\}}$, $XP^{\{/,*,[]\}}$ and $XP^{\{/,//,*\}}$, each of which uses only two of the three features '//', '[]' and '*' in addition to '/'.

As said in [23], any XPath query from $XP^{\{/,//,*,[]\}}$ can be trivially represented as a labelled tree (called a tree pattern) with the same semantics. Definition 2.2 A tree pattern p is a tree $\langle V_p, E_p, r_p, o_p \rangle$ over Σ ∪ {'*'}, where $V_p$ is the node set and $E_p$ is the edge set, and:

• Each node n in $V_p$ has a label from Σ ∪ {'*'}, denoted as n.label;

• Each edge e in $E_p$ has a label from {'/','//'}, denoted as e.label. An edge with label '/' is called a child edge; otherwise it is called a descendant edge;

• $r_p, o_p \in V_p$ are the root and output node of p respectively.

For example, the XPath query a[*/b]/c//d is represented as the tree pattern shown in Fig. 1(b), where the dark node is the output node. The size of a tree pattern p, written as $|p|$, is defined as the number of its nodes. For brevity, we refer to tree patterns as patterns in the rest of this paper.
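As an illustration of Definition 2.2, a tree pattern can be held in a small recursive structure. The sketch below is not taken from the paper: the class name `PNode`, its fields and the helper `size` are hypothetical, chosen only to make the definition concrete. It builds the pattern for a[*/b]/c//d and counts its nodes.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PNode:
    """One pattern node (hypothetical representation, not the paper's)."""
    label: str                      # a label from the alphabet, or '*'
    # Outgoing edges as (edge label, child); the edge label is '/' or '//'.
    children: List[Tuple[str, "PNode"]] = field(default_factory=list)
    is_output: bool = False         # True only for the output node o_p


def size(p: PNode) -> int:
    """|p|: the number of nodes in the pattern rooted at p."""
    return 1 + sum(size(child) for _, child in p.children)


# The pattern a[*/b]/c//d, with d as the output node.
d = PNode("d", is_output=True)
c = PNode("c", [("//", d)])
b = PNode("b")
star = PNode("*", [("/", b)])
a = PNode("a", [("/", star), ("/", c)])

print(size(a))  # 5 nodes: a, *, b, c, d
```

Storing the edge label next to each child keeps child edges and descendant edges distinguishable without a separate edge table.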

Given a pattern $p\langle V_p, E_p, r_p, o_p \rangle$, we say that

then $o_p$ is also the output node of $p'$. For any node n in p, we denote as $(p)^n_{sub}$ the subpattern with n as the root and containing exactly all its descendants. As an example, the pattern $p = a[//*/d]/b[*][//d]$ is shown in (1) of Fig. 6. Let $n_*$ be the '*'-labelled node which is a child of p's root. The subpattern $(p)^{n_*}_{sub}$ is given in (2). We now define an embedding (also called a pattern match) from a pattern to an XML tree as follows: Definition 2.3 Given an XML tree $t\langle V_t, E_t, r_t \rangle$ and a pattern $p\langle V_p, E_p, r_p, o_p \rangle$, an embedding from p to t is a function $e : V_p \to V_t$ with the following properties:

Figure 6

• for each edge $(n_1, n_2) \in E_p$: if $(n_1, n_2)$ is a child edge, $e(n_2)$ is a child of $e(n_1)$ in t; otherwise, $e(n_2)$ is a descendant of $e(n_1)$ in t.

The embedding maps the output node $o_p$ of p to a node n in t. We say that the subtree $(t)^n_{sub}$ of t is the result of this embedding. As an example, the dashed lines between Fig. 1(a) and (b) show an embedding, and its result is shown in Fig. 1(c). In general, there can be more than one embedding from p to t. We define the result of p over t, denoted as p(t), as the union of the results of all embeddings, i.e.,

$$p(t) = \bigcup_{e \in EB} \{ (t)^{e(o_p)}_{sub} \}$$

where EB is the set of all embeddings from p to t, and $e(o_p)$ is the node of t to which the output node $o_p$ of p is mapped by an embedding e. In addition, we define an empty pattern, denoted as ε: the result of evaluating ε over any XML tree is empty.
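The result p(t) can be computed by a direct recursive search for embeddings. The following is a minimal sketch rather than the paper's procedure: it assumes that an embedding maps the root of p to the root of t, that a node matches when its label is '*' or equal to the tree node's label, and that child and descendant edges map to child and descendant relationships. The names `TNode`, `PNode`, `embeds`, `output_images` and `evaluate` are hypothetical, and attributes are ignored.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)
class TNode:                        # hypothetical XML tree node
    label: str
    children: List["TNode"] = field(default_factory=list)


@dataclass(eq=False)
class PNode:                        # hypothetical pattern node
    label: str
    children: List[Tuple[str, "PNode"]] = field(default_factory=list)
    is_output: bool = False


def has_output(p: PNode) -> bool:
    return p.is_output or any(has_output(c) for _, c in p.children)


def descendants(t: TNode) -> List[TNode]:
    out = []
    for c in t.children:
        out.append(c)
        out.extend(descendants(c))
    return out


def targets(edge: str, t: TNode) -> List[TNode]:
    """Tree nodes that a pattern child reached through `edge` may map to."""
    return t.children if edge == "/" else descendants(t)


def embeds(p: PNode, t: TNode) -> bool:
    """True if the subpattern rooted at p embeds with p mapped to t."""
    if p.label != "*" and p.label != t.label:
        return False
    return all(any(embeds(c, u) for u in targets(edge, t))
               for edge, c in p.children)


def output_images(p: PNode, t: TNode) -> List[TNode]:
    """Tree nodes reached by the output node over embeddings sending p to t."""
    if not embeds(p, t):
        return []
    if p.is_output:
        return [t]
    images: List[TNode] = []
    for edge, c in p.children:
        if has_output(c):           # follow the branch that holds the output
            for u in targets(edge, t):
                for n in output_images(c, u):
                    if all(n is not m for m in images):
                        images.append(n)
    return images


def evaluate(p: PNode, t: TNode) -> List[TNode]:
    """Roots of the subtrees in p(t), assuming the root maps to the root."""
    return output_images(p, t)


enz1 = TNode("Enzymes", [TNode("Protein"), TNode("RNA")])
enz2 = TNode("Enzymes", [TNode("RNA")])
doc = TNode("Pathway", [TNode("Reaction", [enz1]), TNode("Reaction", [enz2])])

# Pathway/Reaction/Enzymes[Protein], with Enzymes as the output node.
p = PNode("Pathway", [("/", PNode("Reaction", [
    ("/", PNode("Enzymes", [("/", PNode("Protein"))], is_output=True))]))])
result = evaluate(p, doc)
print(len(result), result[0] is enz1)  # 1 True
```

Results are compared by identity rather than by value, so two structurally identical subtrees at different positions of t stay distinct in p(t).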

Containment and Minimization of Patterns

For any two patterns $p_1$ and $p_2$, $p_1$ is said to be contained in $p_2$ (denoted as $p_1 \sqsubseteq p_2$) iff $\forall t \in T_\Sigma$, $p_1(t) \subseteq p_2(t)$, and $p_1$ is said to be equivalent to $p_2$ (denoted as $p_1 \equiv p_2$) iff $\forall t \in T_\Sigma$, $p_1(t) = p_2(t)$. Obviously, the equivalence problem can be seen as a two-way containment problem because $p_1 \equiv p_2$ iff $p_1 \sqsubseteq p_2$ and $p_2 \sqsubseteq p_1$.

The complexity of the pattern containment problem has been well studied for $XP^{\{/,//,[],*\}}$ and also for its three subclasses. The problem is coNP-complete [23] for $XP^{\{/,//,[],*\}}$ and in P for its three subclasses [2,30,25].

Minimizing a pattern p means finding an equivalent pattern $p'$ ($\equiv p$) with minimum size, i.e., such that no other equivalent pattern $p''$ ($\equiv p$) with $|p''| < |p'|$ exists. As shown in [15], the minimization problem is coNP-hard. However, a pattern can be minimized in polynomial time in the case of $XP^{\{/,//,[]\}}$ [2] and $XP^{\{/,*,[]\}}$ [30]. Any pattern from $XP^{\{/,//,*\}}$ is already minimized.

Problem Formulation

Let t be an XML tree. We use v to denote a materialized pattern whose result v(t) is pre-computed or cached, and we use p to denote a pattern to be answered. Our goal is to find a pattern $p'$ such that we can answer p by evaluating $p'$ over the result of v, i.e., $p'(v(t))$ is equal to $p(t)$. Note that v(t) may include a set of subtrees of t, and $p'(v(t))$ is defined as the union of the results of evaluating $p'$ over all subtrees in v(t).

By observation, for any XML tree t, the result of evaluating a pattern $p'$ over v(t) can be viewed as the result of directly evaluating a pattern over t. In fact, this pattern can be obtained from $p'$ and v. We define an asymmetric concatenation operator, denoted as ⊕, between two patterns as follows: given two patterns $p'\langle V_{p'}, E_{p'}, r_{p'}, o_{p'} \rangle$ and $v\langle V_v, E_v, r_v, o_v \rangle$, the concatenation from $p'$ to v is a pattern, denoted as $p' \oplus v$, which is constructed from $p'$ and v by merging $r_{p'}$ (the root of $p'$) and $o_v$ (the output node of v) into one node. $r_v$ and $o_{p'}$ are the root and output node of $p' \oplus v$ respectively. The merged node is denoted as $n_{p' \oplus v}$, and it has both the children of $o_v$ in v and the children of $r_{p'}$ in $p'$ as its children. When the two nodes $r_{p'}$ and $o_v$ have different labels, we choose the more "restrictive" one as the label of $n_{p' \oplus v}$. That is, the label of the merged node $n_{p' \oplus v}$ is chosen as

$$n_{p' \oplus v}.label = \begin{cases} r_{p'}.label & \text{if } o_v.label = \text{'*'} \\ o_v.label & \text{if } r_{p'}.label = \text{'*'} \text{ or } r_{p'}.label = o_v.label \end{cases}$$

If both of the two labels are from Σ and different, then we let $p' \oplus v = \varepsilon$, i.e., the concatenation is an empty pattern. We have the following result for the ⊕ operator.

Lemma 3.2 Let $p'$ and v be two patterns.

Obviously, the fragment $XP^{\{/,//,*,[]\}}$ is closed under concatenation. Notice that the construction of $p' \oplus v$ from $p'$ and v doesn't introduce new wildcards or descendant edges. Hence, the two subclasses $XP^{\{/,//,[]\}}$ and $XP^{\{/,*,[]\}}$ are also closed under concatenation. Moreover, if $p'$ and v are from $XP^{\{/,//,*\}}$, then $p'$ and v are linear patterns with their leaves as output nodes, and it's obvious that $p' \oplus v$ is in $XP^{\{/,//,*\}}$.

Lemma 3.3 The fragment $XP^{\{/,//,*,[]\}}$ and its three subclasses are closed under concatenation.
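The concatenation operator can be sketched in a few lines. This is an illustrative sketch under assumptions, not the authors' implementation: `PNode`, `find_output`, `concat` and `to_str` are hypothetical names, the empty pattern ε is represented by `None`, and v is assumed to have an output node. The printer writes branches in the paper's style, e.g. `[b]` for a child branch and `[//b]` for a descendant branch.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PNode:                        # hypothetical pattern node
    label: str
    children: List[Tuple[str, "PNode"]] = field(default_factory=list)
    is_output: bool = False


def find_output(p: PNode) -> Optional[PNode]:
    """Return the output node in the subpattern rooted at p, if any."""
    if p.is_output:
        return p
    for _, c in p.children:
        found = find_output(c)
        if found is not None:
            return found
    return None


def concat(p_prime: PNode, v: PNode) -> Optional[PNode]:
    """Concatenation from p_prime to v; None stands for the empty pattern."""
    v2, p2 = copy.deepcopy(v), copy.deepcopy(p_prime)
    merged = find_output(v2)        # o_v becomes the merged node
    if "*" not in (merged.label, p2.label) and merged.label != p2.label:
        return None                 # two different labels from the alphabet
    if merged.label == "*":
        merged.label = p2.label     # keep the more restrictive label
    merged.children.extend(p2.children)
    merged.is_output = p2.is_output  # the output node is now that of p_prime
    return v2


def to_str(p: PNode) -> str:
    """Print a pattern as an XPath-like string (branches in brackets)."""
    s, tail = p.label, ""
    for edge, c in p.children:
        if find_output(c) is None:
            s += "[" + ("" if edge == "/" else "//") + to_str(c) + "]"
        else:
            tail = edge + to_str(c)
    return s + tail


# v = Reaction/Enzymes and the compensation pattern Enzymes[Protein].
v = PNode("Reaction", [("/", PNode("Enzymes", is_output=True))])
p_prime = PNode("Enzymes", [("/", PNode("Protein"))], is_output=True)
print(to_str(concat(p_prime, v)))   # Reaction/Enzymes[Protein]

# Different labels from the alphabet give the empty pattern.
print(concat(PNode("a", is_output=True), PNode("b", is_output=True)))  # None
```

Both inputs are deep-copied so that the view and the compensation pattern can be reused after the concatenation.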

In addition, it's straightforward to show that the concatenation operator ⊕, considered as a binary operator, is associative: given three patterns v, $p'$ and $p''$, $(p'' \oplus p') \oplus v = p'' \oplus (p' \oplus v)$.

Based on the concatenation operator and Lemma 3.2, we formally define (minimal) compensation patterns and (minimal) rewritings as follows: Definition 3.4 Let v be a materialized view and p be a pattern. We say that a pattern $p'$ is a compensation pattern and $p' \oplus v$ is a rewriting of p using v if $p' \oplus v$ is equivalent to p. We also say that $p'$ is a minimal compensation pattern, and $p' \oplus v$ is a minimal rewriting of p using v, if there is no other compensation pattern $p''$ of p using v such that the size of $p''$ is less than that of $p'$.

The two problems studied in this paper can now be restated as follows:

Rewriting Existence Problem: Given a pattern v and a pattern p, decide whether there exists a compensation pattern $p'$ such that $p' \oplus v \equiv p$; and

Finding Minimal Rewritings Problem: If a rewriting of p using v exists, find a minimal compensation pattern $p'$ such that $p' \oplus v \equiv p$.

Rewriting Existence Problem

In this section, we discuss the complexity of the rewriting existence problem in the case of $XP^{\{/,//,*,[]\}}$. Our first observation is that the rewriting existence problem is closely related to the pattern containment problem. More specifically, our next result shows that for patterns with their roots as output nodes, these two problems are equivalent. Lemma 4.1 Let p and v be two patterns with output nodes as roots. Then $p \sqsubseteq v$ iff there exists a rewriting of p using v.

Hence, in the rest of this section, we first describe current techniques on containment of patterns and then discuss the complexity of rewriting existence problem.

Techniques on Containment of Patterns

Many techniques have been used to obtain the complexity results of the pattern containment problem, like homomorphisms [8], canonical models [23] and so on.

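The homomorphism technique was first used for the containment problem of conjunctive queries [8]. The existence of a homomorphism between two patterns implies the containment relationship between them. That is, for two patterns $p_1$ and $p_2$, $p_2 \sqsubseteq p_1$ if a homomorphism from $p_1$ to $p_2$ exists.

A homomorphism from $p_1$ to $p_2$ is a function $h : V_{p_1} \to V_{p_2}$ with the following properties:

• Root and output node preserving: $h(r_{p_1}) = r_{p_2}$, and $h(o_{p_1}) = o_{p_2}$;

• Label preserving: for each node n of $p_1$, n.label = '*' or n.label = h(n).label;

• Structure preserving: for each edge $(n_1, n_2)$ of $p_1$, if $(n_1, n_2)$ is a child edge, $(h(n_1), h(n_2))$ is a child edge in $p_2$; otherwise, $(h(n_1), h(n_2))$ is a path in $p_2$ including at least a child or descendant edge.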

As an example, the dashed lines in Fig. 3 show a homomorphism from $p_1$ to $p_2$. Containment of two patterns also implies the existence of a homomorphism between them in the case of $XP^{\{/,//,[]\}}$ and $XP^{\{/,*,[]\}}$, but unfortunately not in the case of $XP^{\{/,//,*,[]\}}$.

Figure 3: A homomorphism from $p_1$ (a) to $p_2$ (b)

For the case of $XP^{\{/,//,*\}}$, [25,23] propose a method to rewrite patterns in $XP^{\{/,//,*\}}$ into a new representation such that this implication still holds. To make the discussion of our rewriting problems convenient, we describe this method below with a small change, and call it pattern standardization.

The standardization works as follows. For any path consisting of a chain of nodes $(v_1, v_2, ..., v_n)$ in a pattern p of $XP^{\{/,//,*\}}$, we replace the label of the edge $(v_i, v_{i+1})$ with '//' for $i = 1, ..., n-1$ if the following conditions are satisfied: (1) $v_1$ is the root or its label is from Σ; $v_n$ is the leaf (output node) or its label is from Σ;

The following properties [25,23] hold: Lemma 4.3 (1) For a pattern p in $XP^{\{/,//,*\}}$, $p \equiv std(p)$; (2) For two patterns $p_1$ and $p_2$ in $XP^{\{/,//,*\}}$, if $p_2 \sqsubseteq p_1$, a homomorphism exists from $std(p_1)$ to $p_2$.

Finding a homomorphism between two patterns $p_1$ and $p_2$ can be done in polynomial time, specifically in $O(|p_1| \cdot |p_2|)$ [24]. Hence, the pattern containment problem is in P for the three subclasses of $XP^{\{/,//,*,[]\}}$.
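A homomorphism test can be sketched as a top-down recursive check. The conditions used here are the usual ones and are stated as assumptions: the root maps to the root, the output node maps to the output node, a node maps to a node with the same label unless its own label is '*', a child edge maps to a child edge, and a descendant edge maps to a path of at least one edge. The names `PNode`, `maps_to` and `homomorphism_exists` are hypothetical, and this naive recursion is not the $O(|p_1| \cdot |p_2|)$ algorithm of [24].

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class PNode:                        # hypothetical pattern node
    label: str
    children: List[Tuple[str, "PNode"]] = field(default_factory=list)
    is_output: bool = False


def descendants(p: PNode) -> List[PNode]:
    """All proper descendants of p, through edges of either kind."""
    out = []
    for _, c in p.children:
        out.append(c)
        out.extend(descendants(c))
    return out


def maps_to(n1: PNode, n2: PNode) -> bool:
    """True if the subpattern at n1 can be mapped under n2 with n1 -> n2."""
    if n1.label != "*" and n1.label != n2.label:
        return False                # label preserving
    if n1.is_output and not n2.is_output:
        return False                # output node preserving
    for edge, c1 in n1.children:
        if edge == "/":             # child edge must map to a child edge
            candidates = [c2 for e2, c2 in n2.children if e2 == "/"]
        else:                       # descendant edge maps to a nonempty path
            candidates = descendants(n2)
        if not any(maps_to(c1, c2) for c2 in candidates):
            return False
    return True


def homomorphism_exists(p1: PNode, p2: PNode) -> bool:
    """Root preserving: the root of p1 must map to the root of p2."""
    return maps_to(p1, p2)


# p1 = a//b and p2 = a/c/b, both with b as the output node.
p1 = PNode("a", [("//", PNode("b", is_output=True))])
p2 = PNode("a", [("/", PNode("c", [("/", PNode("b", is_output=True))]))])
print(homomorphism_exists(p1, p2))  # True, so p2 is contained in p1
print(homomorphism_exists(p2, p1))  # False
```

A positive answer is a sufficient condition for containment in the whole fragment; it is also necessary only in the subclasses discussed above.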

However, the containment problem is not known to be in P for the whole fragment $XP^{\{/,//,*,[]\}}$. [23] proposes the canonical model method to obtain its complexity. This method first introduces boolean patterns, which are patterns without specified output nodes. Given an XML tree t and a boolean pattern q, we say q(t) is true if an embedding from q to t exists, and false otherwise. For two boolean patterns $q_1$ and $q_2$, we say that $q_1 \sqsubseteq q_2$ iff $\forall t \in T_\Sigma$, $q_1(t)$ implies $q_2(t)$. Then, this method translates the containment problem of two patterns into that of two boolean patterns. Finally, it shows that the boolean pattern containment problem is coNP-complete for $XP^{\{/,//,*,[]\}}$.

Complexity

In Lemma 4.1, we have shown that the rewriting existence problem is equivalent to the pattern containment problem for patterns with their roots as output nodes. However, this special pattern containment problem is still coNP-complete, as shown below.

Notice that for a pattern p whose output node is its root, if there is an embedding from p to an XML tree t, $p(t) = \{t\}$; otherwise $p(t) = \emptyset$. Obviously, patterns with roots as output nodes behave similarly to boolean patterns. Not surprisingly, we have the following complexity result by a reduction from the boolean pattern containment problem: Lemma 4.4 In the case of $XP^{\{/,//,*,[]\}}$, the containment problem of two patterns with roots as output nodes is coNP-complete.

Since a special case of the rewriting existence problem is coNP-complete, we have:

The rewriting existence problem is coNP-hard in the case of $XP^{\{/,//,*,[]\}}$.

Not surprisingly, our first result is that a compensation pattern doesn't introduce new labels. That is, if a pattern $p'$ is a compensation pattern of a pattern p using a pattern v, then $p'$ doesn't have any label from Σ which doesn't appear in p.

Lemma 5.3 Let p and v be two patterns. A pattern $p'$ doesn't introduce new labels from Σ if $p'$ is a compensation pattern of p using v.

Our second result is that the minimal compensation pattern doesn't increase size, i.e., if $p'$ is a minimal compensation pattern of p using v, then $|p'| \le |p|$. Lemma 5.4 Let p and v be two patterns. If a compensation pattern of p using v exists, the minimal compensation pattern of p using v has size at most $|p|$.

The proof is based on the following two important claims, which follow from the definitions. Claim 5.5 Let v be a pattern and $o_v$ be its output node, and let $(v)^{o_v}_{sub}$ be the subpattern of v rooted at $o_v$. The concatenation from $(v)^{o_v}_{sub}$ to v, $(v)^{o_v}_{sub} \oplus v$, is equivalent to v. Claim 5.6 Let $p'$ and v be two patterns. Let $o_v$ be the output node of v and $n_{p' \oplus v}$ be the merged node of $p' \oplus v$. The concatenation from $p'$ to $(v)^{o_v}_{sub}$, i.e., $p' \oplus (v)^{o_v}_{sub}$, is isomorphic to the subpattern of $p' \oplus v$ rooted at $n_{p' \oplus v}$, i.e., $(p' \oplus v)^{n_{p' \oplus v}}_{sub}$.

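Proof. (Lemma 5.4): Assume that $p'$ is a compensation pattern of p using v. Our idea is to construct, based on $p'$, a compensation pattern with size not greater than $|p|$.

Let $o_v$ be the output node of v and $(v)^{o_v}_{sub}$ be the subpattern of v rooted at $o_v$. We show that $p' \oplus (v)^{o_v}_{sub}$ is also a compensation pattern of p using v. Since ⊕ is associative, $(p' \oplus (v)^{o_v}_{sub}) \oplus v = p' \oplus ((v)^{o_v}_{sub} \oplus v)$, which is equivalent to $p' \oplus v$ by Claim 5.5, and hence to p.

Since $p' \oplus (v)^{o_v}_{sub}$ is a compensation pattern, $\min(p' \oplus (v)^{o_v}_{sub})$ is a compensation pattern too. We show that the size of $\min(p' \oplus (v)^{o_v}_{sub})$ is not greater than that of p. First, $p' \oplus (v)^{o_v}_{sub}$ is isomorphic to the subpattern $(p' \oplus v)^{n_{p' \oplus v}}_{sub}$ rooted at the merged node $n_{p' \oplus v}$ of $p' \oplus v$ according to Claim 5.6. Thus, the size of $\min(p' \oplus (v)^{o_v}_{sub})$ is equal to the size of $\min((p' \oplus v)^{n_{p' \oplus v}}_{sub})$. Second, based on Lemma 5.1, $\min((p' \oplus v)^{n_{p' \oplus v}}_{sub})$ is isomorphic to $(\min(p' \oplus v))^{n_{p' \oplus v}}_{sub}$. Obviously, the size of $(\min(p' \oplus v))^{n_{p' \oplus v}}_{sub}$ is not greater than that of $\min(p' \oplus v)$, since $(\min(p' \oplus v))^{n_{p' \oplus v}}_{sub}$ is a subpattern of $\min(p' \oplus v)$. Notice that $|\min(p' \oplus v)| = |\min(p)|$ due to $p' \oplus v \equiv p$. Finally, we have that $|\min(p' \oplus (v)^{o_v}_{sub})| = |\min((p' \oplus v)^{n_{p' \oplus v}}_{sub})| \le |\min(p)| \le |p|$.

Hence, $\min(p' \oplus (v)^{o_v}_{sub})$ is a compensation pattern with size not greater than $|p|$. This implies that the minimal compensation pattern of p using v has size not greater than $|p|$.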

Based on the above lemmas, we can obtain the following complexity result: Theorem 5.7 Let p and v be two patterns. The problem of deciding whether there exists a compensation pattern $p'$ of p using v such that $p'$ has size less than k is in $\Sigma^p_3$, where $k \le |p|$.

Proof. We can guess in polynomial time a pattern $p'$ which doesn't introduce new labels and has size less than k. Testing $p' \oplus v \equiv p$ is in coNP [23] (and hence also in $\Sigma^p_2$).

Tractable Results

In this subsection, we show that the rewriting existence problem can be solved in polynomial time for the three subclasses of $XP^{\{/,//,*,[]\}}$. Our idea is based on the fact that the existence of a homomorphism is both sufficient and necessary for containment of two patterns in the case of the three subclasses of $XP^{\{/,//,*,[]\}}$.

We first consider the subclass $XP^{\{/,//,[]\}}$. The following example illustrates how to check whether a rewriting exists or not.

The above example shows us that if a compensation pattern $p'$ of p using v exists, there is a subpattern of p which is also a compensation pattern of p using v. Is this always true for any possible patterns p and v? The answer is yes. In the above example, we can construct a homomorphism from $(p)^{n_p}_{sub} \oplus v$ to $p' \oplus v$ if there is a node $n_p$ of p mapped by a homomorphism (from p to $p' \oplus v$) to the merged node of $p' \oplus v$. We can also construct a homomorphism from $p' \oplus v$ to $(p)^{n_p}_{sub} \oplus v$ if the node $n'_p$ of p, to which the merged node of $p' \oplus v$ is mapped by a homomorphism (from $p' \oplus v$ to p), is the node $n_p$. Our next discussion and result guarantee that $n_p$ always exists and that $n'_p$ must be $n_p$.

We call the path from the root of a pattern p to its output node the selection path of p. Notice that a homomorphism between two patterns always maps one pattern's root and output node to the other's root and output node respectively. We show that if two patterns are equivalent, their selection paths have the same size. (Note that this result holds not only for $XP^{\{/,//,[]\}}$ but also for $XP^{\{/,//,*,[]\}}$.) Lemma 4.7 Let $p_1$ and $p_2$ be two patterns. If $p_1 \equiv p_2$, the selection path of $p_1$ has the same size as that of $p_2$ in the case of $XP^{\{/,//,*,[]\}}$.

Since the selection paths of two equivalent patterns have the same size, any homomorphism between them must map the nodes in the selection path of one pattern to the nodes in that of the other pattern sequentially, one by one. Let p and v be two patterns, and let $p'$ be a pattern such that $p' \oplus v$ is a rewriting of p using v. Obviously, the merged node $n_{p' \oplus v}$ of $p' \oplus v$ is in the selection path of $p' \oplus v$. There is a unique node $n_p$ in the selection path of p such that any homomorphism from p to $p' \oplus v$ (or from $p' \oplus v$ to p) maps $n_p$ to $n_{p' \oplus v}$ (or $n_{p' \oplus v}$ to $n_p$). Moreover, $n_p$ has the same position in the selection path of p as the merged node $n_{p' \oplus v}$ has in that of $p' \oplus v$, i.e., if $n_{p' \oplus v}$ is the i-th node in the selection path of $p' \oplus v$ starting from the root, then $n_p$ is also the i-th node in that of p starting from the root. Since $n_{p' \oplus v}$ is merged from the output node of v and the root of $p'$, $n_p$ also has the same position as the output node of v. We have the following conclusion for the subclass $XP^{\{/,//,[]\}}$. Lemma 4.8 Let v and p be two patterns, and let $n_p$ be the node in the selection path of p with the same position as the output node of v in that of v. If a compensation pattern $p'$ of p using v exists, the subpattern $(p)^{n_p}_{sub}$ of p is a compensation pattern of p using v.

The above lemma directly implies that we only need to consider one compensation pattern candidate, $(p)^{n_p}_{sub}$, to check whether a rewriting of p using v exists, because no compensation pattern of p using v exists if $(p)^{n_p}_{sub}$ is not one. Now, we discuss the two other subclasses $XP^{\{/,//,*\}}$ and $XP^{\{/,*,[]\}}$. Notice that the existence of homomorphisms in both directions between $p' \oplus v$ and p is the only condition needed to make the above lemma work. Hence, the above lemma applies directly to $XP^{\{/,*,[]\}}$, and the following result holds: Corollary 4.9 Lemma 4.8 holds for $XP^{\{/,*,[]\}}$.

However, in the case of $XP^{\{/,//,*\}}$, $p' \oplus v$ may not be standardized, so there may be no homomorphism from $p' \oplus v$ to p even if $p'$ and v are standardized and $p' \oplus v \equiv p$. For example, let $v = a/*$ and $p' = *//b$; then $p' \oplus v = a/*//b$ is not standardized.

We still can construct a homomorphism from (p)

As illustrated by the above example, we have: Corollary 4.11 Lemma 4.8 holds for $XP^{\{/,//,*\}}$. Finally, we have the following complexity result for the rewriting existence problem in the case of the three subclasses of $XP^{\{/,//,*,[]\}}$ to conclude this subsection. Theorem 4.12 For the three subclasses of $XP^{\{/,//,*,[]\}}$, the rewriting existence problem is in P.

Proof. Directly from Lemma 4.8 and its corollaries, and the fact that testing equivalence of two patterns is in P for the three subclasses.
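The check suggested by Lemma 4.8 can be sketched as follows: take the node of p at the same selection-path position as the output node of v, use the subpattern rooted there as the only candidate, and test whether its concatenation with v is equivalent to p. This is a sketch under assumptions, not the authors' algorithm: all names are hypothetical, equivalence is tested by homomorphisms in both directions (which is complete only for the subclasses where homomorphisms characterize containment), and patterns from $XP^{\{/,//,*\}}$ would first need standardization, which is not done here.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PNode:                        # hypothetical pattern node
    label: str
    children: List[Tuple[str, "PNode"]] = field(default_factory=list)
    is_output: bool = False


def selection_path(p: PNode) -> List[PNode]:
    """Nodes from the root of p down to its output node (empty if none)."""
    if p.is_output:
        return [p]
    for _, c in p.children:
        rest = selection_path(c)
        if rest:
            return [p] + rest
    return []


def concat(p_prime: PNode, v: PNode) -> Optional[PNode]:
    """Concatenation from p_prime to v; None stands for the empty pattern."""
    v2, p2 = copy.deepcopy(v), copy.deepcopy(p_prime)
    merged = selection_path(v2)[-1]          # the output node of v
    if "*" not in (merged.label, p2.label) and merged.label != p2.label:
        return None
    if merged.label == "*":
        merged.label = p2.label
    merged.children.extend(p2.children)
    merged.is_output = p2.is_output
    return v2


def descendants(p: PNode) -> List[PNode]:
    out = []
    for _, c in p.children:
        out.append(c)
        out.extend(descendants(c))
    return out


def maps_to(n1: PNode, n2: PNode) -> bool:
    """Homomorphism test for the subpattern at n1 with n1 mapped to n2."""
    if n1.label != "*" and n1.label != n2.label:
        return False
    if n1.is_output and not n2.is_output:
        return False
    for edge, c1 in n1.children:
        if edge == "/":
            cands = [c2 for e2, c2 in n2.children if e2 == "/"]
        else:
            cands = descendants(n2)
        if not any(maps_to(c1, c2) for c2 in cands):
            return False
    return True


def find_compensation(p: PNode, v: PNode) -> Optional[PNode]:
    """Return the single candidate of Lemma 4.8 if it works, else None."""
    k = len(selection_path(v))               # position of the output of v
    path = selection_path(p)
    if len(path) < k:
        return None                          # no node of p at that position
    candidate = copy.deepcopy(path[k - 1])   # subpattern of p rooted at n_p
    rewriting = concat(candidate, v)
    if rewriting is not None and maps_to(rewriting, p) and maps_to(p, rewriting):
        return candidate
    return None


v = PNode("Reaction", [("/", PNode("Enzymes", is_output=True))])
p1 = PNode("Reaction", [("/", PNode("Enzymes", [("/", PNode("Protein"))],
                                    is_output=True))])
p3 = PNode("Reaction", [("/", PNode("@name")),
                        ("/", PNode("Enzymes", is_output=True))])
print(find_compensation(p1, v) is not None)  # True
print(find_compensation(p3, v) is not None)  # False
```

The two calls mirror the introductory examples: a branch below the view's output node can be compensated, while a condition on the Reaction element, modelled here as a child labelled `@name`, cannot.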

In this subsection, we show that the problem of finding minimal rewritings is in P for the three subclasses. In subsection 4.3, we have already shown that for the three subclasses of $XP^{\{/,//,*,[]\}}$, if a compensation pattern of pattern p using pattern v exists, then there is a node $n_p$ of p such that the subpattern $(p)^{n_p}_{sub}$ of p is also a compensation pattern.

For $XP^{\{/,//,*\}}$, it follows easily that $(p)^{n_p}_{sub}$ is the minimal compensation pattern, based on the fact that any pattern in $XP^{\{/,//,*\}}$ is minimized. Any pattern $p'$ s.t. $p' \oplus v \equiv p$ must have the same size as $(p)^{n_p}_{sub}$, because the size of $p' \oplus v$ is equal to that of $(p)^{n_p}_{sub} \oplus v$, since they are equivalent and minimized.

However, for the two other subclasses, $(p)^{n_p}_{sub} \oplus v$ may not be minimized even if both $(p)^{n_p}_{sub}$ and v are minimized. Hence, there may exist another pattern which is also a compensation pattern but has smaller size than $(p)^{n_p}_{sub}$. An interesting question arises: can the minimal compensation pattern be found among the subpatterns of $(p)^{n_p}_{sub}$? We first discuss a special case of the problem of finding minimal rewritings for the whole fragment $XP^{\{/,//,*,[]\}}$, which restricts the output node of the pattern v to be its root. In this special case, our next result shows that p itself is a compensation pattern if one exists.

Lemma 5.8 Let p be a pattern and v be a pattern whose output node is its root. If a compensation pattern of p using v exists, then p is also a compensation pattern.

The following example shows that although p may not be a minimal compensation pattern, a minimal compensation pattern can be obtained from p.

[…] is also a compensation pattern, because the concatenation from $p - n_p$ to v (i.e., $(p - n_p) \oplus v$) can be viewed as a pattern obtained from $p \oplus v$ by pruning $n_p$ (i.e., $(p \oplus v) - n_p$), which is equivalent to $p \oplus v$ and p. We say that $n_p$ is a rewriting-redundant node of p against v. In fact, $n_p$ is the only rewriting-redundant node and $p - n_p$ is a minimal compensation pattern. $p - n_p$ and $(p - n_p) \oplus v$ are shown in (b) and (e) respectively.

This example motivates obtaining the minimal compensation pattern by pruning all rewriting-redundant nodes of p against v, which leads to the following definition and the most important lemma of this work. Definition 5.10 Let p be a pattern and v be a pattern whose output node is its root. $C_p$ and $C_v$ are the two node sets including all children of the roots of p and v respectively. $n_p \in C_p$ is called a rewriting-redundant node of p against v if there exists a node

all rewriting-redundant nodes is called the rewriting-redundant node set of p against v.

Lemma 5.11 Let p be a minimized pattern and v be a pattern whose output node is its root. Let $R_p^v$ be the rewriting-redundant node set of p against v. If there exists a compensation pattern of p using v, then $p - R_p^v$ is a minimal compensation pattern of p using v.

The proof is based on the following observation. Claim 5.12 Let p be a pattern and v be a pattern whose output node is its root. Let a pattern $p'$ be a compensation pattern of p using v. $C_{p' \oplus v}$, $C_{p'}$ and $C_v$ are the three node sets including all children of the roots of $p' \oplus v$, $p'$ and v respectively. The following results hold:

• the root of $p' \oplus v$ has the same label as that of p;

Proof. (Lemma 5.11): We first show that $p - R_p^v$ is also a compensation pattern. From Lemma 5.8, p is a compensation pattern of p using v. The concatenation from $p - R_p^v$ to v, i.e., $(p - R_p^v) \oplus v$, can be viewed as the pattern obtained from $p \oplus v$ by pruning all nodes in $R_p^v$, i.e., $(p \oplus v) - R_p^v$. According to our definition of the rewriting-redundant node set, all nodes in $R_p^v$ are also redundant for $p \oplus v$, i.e., $(p \oplus v) - R_p^v \equiv p \oplus v$. Hence, $p - R_p^v$ is a compensation pattern. For simplicity, we denote $p - R_p^v$ as $p'$ in the rest of this proof. Now, we prove that $p'$ is a minimal compensation pattern by showing that any other compensation pattern $p''$ must have size at least $|p'|$. We show first that for each pattern in $P(p')$, there exists a pattern in $P(p'')$ equivalent to it.

Since $p' \oplus v$ and $p'' \oplus v$ are equivalent to p, which is minimized, we have the fact that for each pattern in $P(p)$, both $P(p' \oplus v)$ and $P(p'' \oplus v)$ have a pattern equivalent to it according to Lemma 5.2.

Notice that $C_{p'} = C_p - R_p^v$ and $p'$ is obtained from p by pruning the nodes in $R_p^v$. Then, for a node $n_{p'} \in C_{p'}$, $n_{p'} \in C_p$. We have that $(p')^{n_{p'}}$ ($\in P(p')$) and $(p)^{n_{p'}}$ are the same pattern. Because $p'' \oplus v \equiv p$, there exists a node $n_{p'' \oplus v}$ in $C_{p'' \oplus v}$ such that $(p'' \oplus v)^{n_{p'' \oplus v}}$ is equivalent to $(p)^{n_{p'}}$ (i.e., $(p')^{n_{p'}}$) based on the above fact. We show next that $(p'' \oplus v)^{n_{p'' \oplus v}}$ must also be in $P(p'')$.

$C_{p'' \oplus v} = C_{p''} \cup C_v$. The node $n_{p'' \oplus v}$ is included in either $C_{p''}$ or $C_v$. Actually, $n_{p'' \oplus v}$ must be included in $C_{p''}$. If $n_{p'' \oplus v} \in C_v$, then $n_{p'}$ is a rewriting-redundant node of p against v, so $n_{p'}$ must be included in $R_p^v$ and should not be included in $C_{p'}$, which is a contradiction. We show that $n_{p'}$ is a rewriting-redundant node of p against v as follows: $(p'' \oplus v)^{n_{p'' \oplus v}}$ is isomorphic to $(p \oplus v)^{n_{p'' \oplus v}}$, because $(p'' \oplus v)^{n_{p'' \oplus v}}$ and $(p \oplus v)^{n_{p'' \oplus v}}$ are isomorphic to $(v)^{n_{p'' \oplus v}}$. Note that, in some cases, the root of v may have a different label from the roots of $(p'' \oplus v)^{n_{p'' \oplus v}}$ and $(p \oplus v)^{n_{p'' \oplus v}}$. But the roots of $(p'' \oplus v)^{n_{p'' \oplus v}}$ and $(p \oplus v)^{n_{p'' \oplus v}}$ have the same label as the root of p according to Claim 5.12. Hence, the fact that $(p'' \oplus v)^{n_{p'' \oplus v}}$ is isomorphic to $(p \oplus v)^{n_{p'' \oplus v}}$ still holds. Then, $(p \oplus v)^{n_{p'' \oplus v}}$ is equivalent to $(p)^{n_{p'}}$. Since p is also a compensation pattern of p using v, $(p)^{n_{p'}}$ is equivalent to $(p \oplus v)^{n_{p'}}$ according to Claim 5.12. Thus, $(p \oplus v)^{n_{p'' \oplus v}}$ is equivalent to $(p \oplus v)^{n_{p'}}$. This means that $n_{p'}$ is a rewriting-redundant node.

Since $n_{p'' \oplus v}$ is included in $C_{p''}$, $(p'' \oplus v)^{n_{p'' \oplus v}}$ is isomorphic to $(p'')^{n_{p'' \oplus v}}$ according to Claim 5.12. Hence, $(p'')^{n_{p'' \oplus v}} \in P(p'')$ is equivalent to $(p)^{n_{p'}}$, i.e., $(p')^{n_{p'}}$ ($\in P(p')$). This proves that for each pattern in $P(p')$, there exists a pattern in $P(p'')$ equivalent to it.

Finally, we show that $|p'| \le |p''|$. From the above discussion, each pattern in $P(p')$ is equivalent to a corresponding pattern in $P(p'')$. Since p is minimized, each pattern in $P(p')$ is also minimized and has no greater size than the corresponding pattern in $P(p'')$. In addition, no two patterns in $P(p')$ are equivalent to the same pattern in $P(p'')$. Thus, $p'$ has no greater size than $p''$, which completes the proof.

So far, the above lemma only shows how to obtain the minimal compensation pattern for the whole fragment $XP^{\{/,//,*,[]\}}$ in the case that $v$'s output node is its root. Unfortunately, this lemma cannot be extended to the general case, in which $v$'s root and output node differ, for $XP^{\{/,//,*,[]\}}$.

However, for the two subclasses $XP^{\{/,//,[]\}}$ and $XP^{\{/,*,[]\}}$, we can reduce the problem of finding minimal rewritings for any two patterns to that for one pattern and another pattern whose root is its output node, as the following result shows.

Lemma 5.13 Let $v$ be a pattern and $o_v$ be its output node. Let $p$ be a pattern and $n_p$ be the corresponding node of $p$ at the same position as $o_v$. Assume that a compensation pattern of $p$ using $v$ exists. Then a pattern $p'$ is a minimal compensation pattern of $p$ using $v$ iff $p'$ is a minimal compensation pattern of $(p)^{n_p}_{sub}$ using $(v)^{o_v}_{sub}$, whose output node is its root.

Proof. We only need to show that $p'$ is a compensation pattern of $p$ using $v$ iff $p'$ is a compensation pattern of $(p)^{n_p}_{sub}$ using $(v)^{o_v}_{sub}$. Since a compensation pattern of $p$ using $v$ exists, the subpattern $(p)^{n_p}_{sub}$ of $p$ is a compensation pattern according to Lemma 4.8.

($\Rightarrow$) $p'$ is a compensation pattern of $p$ using $v$, so $p' \oplus v \equiv p$. Hence there are homomorphisms between $p' \oplus v$ and $p$ in both directions. Let $n_{p' \oplus v}$ be the merged node of $p' \oplus v$. The nodes $n_{p' \oplus v}$ and $n_p$ have the same position in the selection paths of $p' \oplus v$ and $p$, so these two homomorphisms map $n_{p' \oplus v}$ and $n_p$ to each other. Hence, we have homomorphisms in both directions between the subpattern $(p' \oplus v)^{n_{p' \oplus v}}_{sub}$ of $p' \oplus v$ and the subpattern $(p)^{n_p}_{sub}$ […] is a compensation pattern of $p$ using $v$. Hence, $p'$ is a compensation pattern of $p$ using $v$.

By combining the above lemmas, we obtain the following complexity result for the problem of finding minimal rewritings:

Theorem 5.14 For the three subclasses of $XP^{\{/,//,*,[]\}}$, the problem of finding minimal rewritings is in P.

Proof. We use the notation of Lemma 5.13. Finding the minimal rewriting of $p$ using $v$ is equivalent to finding the minimal rewriting of $(p)^{n_p}_{sub}$ using $(v)^{o_v}_{sub}$. Because the output node of $(v)^{o_v}_{sub}$ is its root, by Lemma 5.11 we only need to compute the rewriting-redundant node set of $(p)^{n_p}_{sub}$ against $(v)^{o_v}_{sub}$. Computing this set can be done in polynomial time, since checking the equivalence of two patterns is in P for the subclasses of $XP^{\{/,//,*,[]\}}$.

Finding Minimal Rewritings Problem

In this section, we consider the problem of finding minimal rewritings, i.e., finding the minimal compensation pattern of p using v, where p and v are two patterns.

Obviously, this problem is related to minimization of patterns. In this section, we first introduce the minimization algorithm, and then discuss this problem's complexity and finally design an algorithm for it.

Algorithm on Minimization of Patterns

We first introduce several definitions and notations on patterns. Given a pattern $p\langle V_p, E_p, r_p, o_p \rangle$ and any node $n$ in $p$, we denote as $(p)_n$ the subpattern of $p$ constructed from $(p)^{n}_{sub}$ by adding the root of $p$ and connecting it to $n$ using the same path between them in $p$. […] (3) and (4) separately.

Given a pattern $p$ and a node $n$ of $p$, we denote as $p - n$ the subpattern obtained from $p$ by pruning the subpattern $(p)^{n}_{sub}$ rooted at $n$. Moreover, for a set $N$ of nodes of $p$, we denote as $p - N$ the subpattern obtained from $p$ by pruning all subpatterns rooted at nodes in $N$. For example, $p - n_*$ is shown in (4) of Fig. 6.
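The notations $(p)_n$, $p - n$ and $p - N$ can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the `Node` class, the axis strings `"/"` and `"//"`, and the helper names `prune`, `branch` and `show` are hypothetical and do not come from the paper, and the output node of a pattern is ignored, so the patterns are treated as boolean patterns.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass(eq=False)  # eq=False: nodes are compared and hashed by identity
class Node:
    label: str
    # Each child is stored with the axis of its incoming edge: "/" or "//".
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


def prune(node: Node, pruned: Set[Node]) -> Node:
    """Return a copy of the pattern at `node` without the subpatterns
    rooted at nodes in `pruned` (p - N; use a one-element set for p - n)."""
    return Node(
        node.label,
        [(axis, prune(c, pruned)) for axis, c in node.children if c not in pruned],
    )


def branch(root: Node, n: Node) -> Optional[Node]:
    """Return (p)_n: the root of p, the path from the root to n, and the
    whole subpattern rooted at n. Returns None if n is not in the pattern."""
    if root is n:
        return prune(root, set())  # plain copy of the subpattern at n
    for axis, c in root.children:
        sub = branch(c, n)
        if sub is not None:
            return Node(root.label, [(axis, sub)])
    return None


def show(node: Node) -> str:
    """Render a pattern in an XPath-like bracket form, for inspection only."""
    parts = ("[" + ("" if axis == "/" else ".//") + show(c) + "]"
             for axis, c in node.children)
    return node.label + "".join(parts)


# Example pattern a[b[c]][.//d]
c = Node("c")
b = Node("b", [("/", c)])
d = Node("d")
p = Node("a", [("/", b), ("//", d)])

print(show(prune(p, {b})))  # a[.//d]
print(show(branch(p, c)))   # a[b[c]]
```

Both helpers return copies, so the original pattern stays unchanged and can be reused when several candidate prunings are compared.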

The above definitions can also be applied to boolean patterns in general. Using these definitions, the minimization problem is discussed below.

Given a boolean pattern $q$, [15] shows that for two nodes $n_i, n_j \in C_q$, if $(q)_{n_i} \sqsubseteq (q)_{n_j}$, then $n_j$ is redundant in $q$, because the subpattern $q - n_j$ is equivalent to $q$. Moreover, [15] also shows that $q$ is not minimized if and only if a subpattern in $P(q)$ is not minimized or there exist two subpatterns $(q)_{n_i}$ and $(q)_{n_j}$ in $P(q)$ with $(q)_{n_i} \sqsubseteq (q)_{n_j}$. This result leads to the boolean pattern minimization algorithm in [15], which works as follows. For a node $n_j$ in $C_q$, it checks whether $n_j$ is redundant, i.e., whether there exists $n_i \in C_q$ such that $(q)_{n_i} \sqsubseteq (q)_{n_j}$. If so, it prunes $(q)^{n_j}_{sub}$ and updates $q$ to $q - n_j$. It repeats this pruning procedure until $C_q$ has no redundant nodes. Finally, for every node $n_i$ in $C_q$ that has not been pruned, it recursively minimizes $(q)^{n_i}_{sub}$. The above results and the minimization algorithm for boolean patterns can be trivially extended to patterns in general. As an example, the pattern shown in (1) of Fig. 6 after minimization is given in (5).
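The pruning procedure described above can be sketched in Python for boolean patterns without wildcards. This is a minimal sketch under several assumptions, not the algorithm of [15] itself: containment between two branches is tested by searching for a homomorphism, $C_q$ is taken to be the set of children of the root, and the `Node` class and the helper names `maps_into`, `branch_maps_into`, `minimize` and `show` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass(eq=False)
class Node:
    label: str
    # Each child is stored with the axis of its incoming edge: "/" or "//".
    children: List[Tuple[str, "Node"]] = field(default_factory=list)


def descendants(node: Node) -> Iterator[Node]:
    """Yield all proper descendants of `node`."""
    for _, c in node.children:
        yield c
        yield from descendants(c)


def maps_into(a: Node, b: Node) -> bool:
    """True if there is a homomorphism from the subpattern at `a` to the
    subpattern at `b` that sends `a` to `b`."""
    if a.label != b.label:
        return False
    for axis, ca in a.children:
        if axis == "/":
            # A child edge must be mapped onto a child edge.
            targets = [cb for ax, cb in b.children if ax == "/"]
        else:
            # A descendant edge may be mapped onto any downward path.
            targets = descendants(b)
        if not any(maps_into(ca, t) for t in targets):
            return False
    return True


def branch_maps_into(bj: Tuple[str, Node], bi: Tuple[str, Node]) -> bool:
    """True if branch `bj` maps homomorphically into branch `bi`, both
    hanging under the same root; then the node of `bj` is redundant."""
    axis_j, nj = bj
    axis_i, ni = bi
    if axis_j == "/":
        return axis_i == "/" and maps_into(nj, ni)
    return maps_into(nj, ni) or any(maps_into(nj, d) for d in descendants(ni))


def minimize(q: Node) -> Node:
    """Prune redundant children of the root, then recurse into the rest."""
    kept = list(q.children)
    j = 0
    while j < len(kept):
        if any(i != j and branch_maps_into(kept[j], kept[i])
               for i in range(len(kept))):
            del kept[j]  # prune the redundant branch, re-check the same index
        else:
            j += 1
    return Node(q.label, [(axis, minimize(c)) for axis, c in kept])


def show(node: Node) -> str:
    """Render a pattern in an XPath-like bracket form, for inspection only."""
    parts = ("[" + ("" if axis == "/" else ".//") + show(c) + "]"
             for axis, c in node.children)
    return node.label + "".join(parts)


# a[b][b[c]]: the branch b is implied by the branch b[c] and is pruned.
q = Node("a", [("/", Node("b")), ("/", Node("b", [("/", Node("c"))]))])
print(show(minimize(q)))  # a[b[c]]
```

Branches are pruned one at a time, so of two mutually equivalent branches exactly one survives. The homomorphism test is only meant for patterns without wildcards; for the full fragment with `//`, `*` and `[]` together, a homomorphism search alone is not a complete containment test.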

We have the following two results for patterns based on the work in [15], which we will use later for the problem of finding minimal rewritings.

Lemma 5.1 Let $p\langle V_p, E_p, r_p, o_p \rangle$ be a pattern and $n$ be a node on the path from $r_p$ to $o_p$. Then $(\min(p))^{n}_{sub}$ is isomorphic to $\min((p)^{n}_{sub})$.

Proof. This follows directly from the minimization algorithm and the fact that $(p)^{n}_{sub}$, which includes the output node of $p$, cannot be pruned away during minimization.

Lemma 5.2 Let $p_1$ and $p_2$ be two equivalent patterns, where $p_2$ is minimized. Then, for each subpattern $(p_2)_{n_j} \in P(p_2)$, there exists a subpattern $(p_1)_{n_i} \in P(p_1)$ such that $(p_1)_{n_i} \equiv (p_2)_{n_j}$.

Proof omitted. It is based on Lemma 1 of [15], which we restate in our notation: let $q_1$ and $q_2$ be two boolean patterns such that $q_1 \sqsubseteq q_2$. Then, for each subpattern $(q_2)_{n_j} \in P(q_2)$, there exists a subpattern $(q_1)_{n_i} \in P(q_1)$ such that $(q_1)_{n_i} \sqsubseteq (q_2)_{n_j}$.

Algorithm for Finding Minimal Rewritings

In the case of the three subclasses, for two patterns $p$ and $v$, Lemma 4.8 suggests a way to decide the existence of a compensation pattern of $p$ using $v$ by testing a single candidate compensation pattern, which is a subpattern of $p$. Lemmas 5.11 and 5.13 directly suggest an algorithm that finds the minimal compensation pattern by pruning rewriting-redundant nodes. The following algorithm follows the ideas in those lemmas. It is sound and complete, and runs in polynomial time, for the three subclasses. The algorithm runs in exponential time for $XP^{\{/,//,*,[]\}}$, since checking equivalence of two patterns and minimizing patterns are coNP-hard. However, our last result shows that this algorithm is still sound for $XP^{\{/,//,*,[]\}}$.

Theorem 5.15 The algorithm for finding minimal rewritings is sound for $XP^{\{/,//,*,[]\}}$.

Proof. We use the notation of Algorithm 1. The algorithm checks whether $(p)^{n_p}_{sub}$ is a compensation pattern of $p$ using $v$. If it is, the algorithm returns the minimal compensation pattern of $(p)^{n_p}_{sub}$ using $(v)^{o_v}_{sub}$. Notice that when $(p)^{n_p}_{sub}$ is a compensation pattern of $p$ using $v$, it holds for $XP^{\{/,//,*,[]\}}$ that the minimal compensation pattern of $(p)^{n_p}_{sub}$ using $(v)^{o_v}_{sub}$ is also the minimal compensation pattern of $p$ using $v$. This follows from the proof of the necessary condition of Lemma 5.13. Hence, this algorithm is sound for $XP^{\{/,//,*,[]\}}$.

Algorithm 1: Finding minimal rewritings
Input: $p$ and $v$ (two patterns)
Output: a minimal compensation pattern of $p$ using $v$ if one exists; otherwise null.
[…]
end for
16: end for
17: return $p - R_p^v$;

However, this algorithm is not complete for $XP^{\{/,//,*,[]\}}$, because the fact that $(p)^{n_p}_{sub}$ is not a compensation pattern of $p$ using $v$ does not imply that no compensation pattern of $p$ using $v$ exists.

Related Work

The problem of rewriting queries using views has been studied in depth in the relational model [20,21]. More recently, this problem has also been explored in the semistructured data model [27] with regular path queries [7,19].

Most recently, the problem of rewriting queries using materialized XML views has attracted moderate attention. In [9], Chen et al. consider using cached results of previous XQuery queries on the client side to answer new queries. In [32], Yang et al. consider mining frequent tree patterns, materializing them, and using them to answer new queries. In [3], Balmin et al. consider using materialized XPath views on the server side, which can include XML fragments, data values, full paths, or node references, to speed up the processing of XPath queries. All of the above works use heuristics to decide the existence of rewritings, and [3] uses another heuristic to minimize compensation queries. None of them gives a theoretical analysis of this problem. However, much theoretical work has been done on the containment and minimization problems for XPath queries, which led to our theoretical study of the problem of rewriting queries using materialized XPath views.

The theoretical work most similar to ours is [14]. Its authors consider the XQuery reformulation problem for the XML publishing scenario. They reduce the XML data model to a relational data model under constraints, so that the XQuery reformulation problem can be reduced to the rewriting problem of conjunctive queries under constraints. They show that an extended Chase and BackChase (C&B) algorithm is complete for the reformulation problem of a restricted class of XQueries called Behaved XQueries. The techniques we use and the conclusions we reach in this paper are entirely different from theirs.

In addition, other work on XPath queries addresses expressive power [4,18], satisfiability [22], the time complexity of query evaluation [16,17], and containment in the presence of disjunction, DTDs, existential variables and SXICs (Simple XPath Integrity Constraints) [2,26,13,31].

Conclusion

In this paper, we have discussed two problems, the rewriting existence problem and the problem of finding minimal rewritings, for a fragment of XPath, $XP^{\{/,//,*,[]\}}$, and its three subclasses.

We have shown that the rewriting existence problem is coNP-hard and the problem of finding minimal rewritings is in $\Sigma^p_3$ for $XP^{\{/,//,*,[]\}}$, but both problems are in P for the three subclasses of $XP^{\{/,//,*,[]\}}$.

Moreover, in the case of the three subclasses, we have shown that a single subpattern of pattern $p$ suffices as the only candidate compensation pattern for testing whether a rewriting of $p$ using pattern $v$ exists. We have also shown that if this subpattern is a compensation pattern, then the minimal compensation pattern can be obtained from it by pruning all rewriting-redundant nodes. Based on these results, we have designed an algorithm for finding minimal rewritings, which is only sound for $XP^{\{/,//,*,[]\}}$. However, this algorithm is complete and also runs in polynomial time for the three subclasses.
