XML-to-SQL Query Translation Literature: The State of The Art and Open Problems
XML-to-SQL Query Translation Literature: The State of The Art and Open Problems
University of Wisconsin-Madison
{sekar,raghav,naughton}@cs.wisc.edu
1 Introduction
Beginning in 1999, the database research literature has seen an explosion of
publications with the goal of using an RDBMS to store and/or query XML data.
The problems addressed and solved in this area are diverse. Some publications
deal with using an RDBMS to store XML data; others deal with exporting
existing relational data in an XML view. The papers use a wide variety of XML
query languages, including subsets of XQuery, XML-QL, XPath, and even “one-
off” new proposals; they use a wide variety of languages or ad-hoc constructs to
map between the relational and XML schema; and they differ widely in what
they “push to SQL” and what they evaluate in middleware.
This diversity renders it difficult to know how the various results presented fit
together, and even makes it hard to know what if any open problems remain. As
a first step to rectifying this situation, we present a classification of the problem
space and discuss how almost 40 papers fit into this classification. As a result
of this study, we find that some basic questions are still open. In particular,
for the XML publishing of relational data and for “schema-based” shredding of
XML documents into relations, there is no published algorithm for translating
even simple path expression queries (with the // axis) into SQL when the XML
schema is recursive. It is our hope that this paper will stimulate others to refine
our classification and, more importantly, to improve the state of the art and to
address and solve the open problems that the classification reveals.
Technique Scenario
Subproblems Class of Class of
solved XML Schema XML Queries
considered handled
XPeranto XP/GAV VD,QT tree XQuery
SilkRoute XP/GAV VD,QT tree XML-QL
Rolex XP/GAV QT tree XSLT
[17] XP/GAV QT tree XSLT
[1] XP/GAV VD recursive -
Oracle XML DB XP/GAV, XS/SB VD,SS,QT recursive SQL/XML
restricted XPath1
SQL Server 2000 XP/GAV, XS/SB VD,SS,QT bounded depth restricted XPath2
SQLXML recursive
DB2 XML XP/GAV, XS/SB VD,QT non-recursive SQL extensions
Extender through UDFs
Agora XP/LAV QT non-recursive XQuery
MARS XP/GAV + QT non-recursive XQuery
XP/LAV
STORED XS/SO SS,QT all STORED
Edge XS/SO SS,QT all path expressions
Monet XS/SO SS all -
XRel XS/SO SS,QT all path expressions
[35] XS/SO SS,QT all order-based
queries
Dynamic XS/SO QT all XQuery
intervals [7]
[24, 32] XS/SB SS recursive -
[2, 16, 19, 21, 27] XS/SB SS tree -
XP/GAV: XML Publishing, Global-as-view XP/LAV: XML Publishing, Local-as-view
XS/SO: XML Storage, schema-oblivious XS/SB: XML Storage, schema-based
QT: Query Translation VD: View Definition SS: Storage scheme
restricted XPath1 : child and attribute axes
restricted XPath2 : child, attribute, self and parent axes
Table 1. Summary of various published techniques
scenario. We will look at each of these in more detail in the rest of the paper.
In addition to the characteristics from our broad classification, the table also
reports, for each solution, the class of schema considered, the class of XML
queries handled, whether it uses the “global as view” or “local as view” approach
(if the XML publishing problem is addressed), and what subproblems are solved.
The rest of the paper is organized as follows. We survey known algorithms
in published literature for XML-publishing, schema-oblivious XML storage and
schema-based XML storage in Sections 2, 3, and 4 respectively. For each scenario,
we first survey the solutions that have been proposed in published literature,
and discuss problems that remain open. When we look at XML support in
commercial RDBMS as part of this survey, we will restrict our discussion to those
features that are relevant to XML-to-SQL query translation. A full investigation
of XML support in commercial RDBMS is beyond the scope of this survey.
2 XML Publishing
The following tasks arise in the context of allowing applications to query existing
relational data as if it were XML:
1. With the exception of [1, 42], the above work considers only non-recursive
XML views of relational data. While Oracle XML DB [42] supports path
expression queries with the child and attribute axes over recursive views, it
does not support the descendant ( //) axis. Translating XML queries (with
the // axis) over recursive view schema remains open. In [1], the problem of
materializing recursive XML view schema is considered. However, as we have
mentioned, that work does not use SQL support for recursion, simulating
recursion in middleware instead. The reason for this given by the authors
is that the limited form of recursion supported by SQL cannot handle the
forms of recursion that arise in with recursive XML schema. We return to
this question at the end of this section. The following are open questions in
the context of SQL support for recursion:
– What is the class of queries/view schema for which the current support
for recursion in SQL are adequate?
– If there are cases for which SQL support for recursion is inadequate,
how do we best leverage this support? (Instead of completely simulating
recursion in middleware.)
2. Any query translation algorithm can be evaluated by two metrics: its func-
tionality, in terms of the class of XML queries handled; and its performance,
in terms of the efficiency of the resulting SQL query. Most of the translation
algorithms have not been evaluated thoroughly by either metric, which gives
rise to a number of open research problems.
– Functionality: Among the GAV-style approaches, except XPeranto, all
the above discussed work deals with languages other than XQuery. Even
in the case of XPeranto, the class of XQuery handled is unclear from [29].
It would be interesting to precisely characterize the class of XQuery
queries that can be translated by the methods currently in the literature.
– Performance: There has been almost no work comparing the quality of
SQL queries generated by various translation algorithms. In particular,
we are aware of no published performance study for the query translation
problem.
3. GAV vs. LAV: While for the GAV-style approaches, XML-to-SQL query
translation corresponds to view composition, for the LAV-style approaches
it corresponds to answering queries with views. It is not clear for what class of
XML views the equivalent query rewriting problem has published solutions.
As pointed out in [25], state-of-the-art query rewriting algorithms for SQL
semantics do not efficiently handle arbitrary levels of nesting, grouping etc.
Similarly, [9] works under set-semantics and so cannot handle certain classes
of XML view schema and aggregation in XML queries. Comparing across
the three different approaches — GAV, LAV and GAV+LAV, in terms of
both functionality and performance is an open issue.
Recursive XML View Schema and Linear Recursion in SQL In this
subsection we return to the problem of recursive XML view schema and whether
or not they can be handled by the support for recursion currently provided by
SQL.
Consider the problem of materializing a recursive XML view schema. In [1],
it is mentioned that even though SQL supports linear recursion, this is not
sufficient for materializing a recursive XML view. The reason for this is not
elaborated in the paper. The definition of an XML view has two main compo-
nents to it: the view definition language and the XML schema of the resulting
view. Hence, it must be the case that either the XML schema of the view or
the view definition language is more complex than what SQL linear recursion
can support. Clearly, if the view definition language is complex enough (say the
parent-child relationship is defined using non-linear recursion), linear recursion
in SQL will not suffice. However, most view definition languages proposed define
parent-child relationships through much simpler conditions (such as conjunctive
queries). The question arises whether SQL linear recursion is sufficient for these
view definition languages, for arbitrary XML schema.
In [6], the notion of linear and non-linear recursive DTDs is introduced. The
natural question here is whether the notions of linear recursion in SQL and DTDs
correspond. It turns out that the definition of non-linear recursive schema in [6]
has nothing to do with the traditional Datalog notion of linear and non-linear
recursion. For example, consider a classical part-subpart database. Suppose that
the DTD rule for a part element is: part → pname, part*.
According to [6], this is a non-linear recursive rule as a part element can
derive multiple part sub-elements. Hence, the entire DTD is non-linear recursive.
Indeed, it can be shown that this DTD is not equivalent to any linear-recursive
DTD. Now, suppose the underlying relational schema has two relations, Part and
Subpart with the columns: (partid,pname) and (partid,subpartid) respectively.
Now, the following SQL query extracts all data necessary to materialize the
XML view:
WITH RECURSIVE AllParts(partid,pname,rtolpath) as (
select partid,pname,’’
from Part(partid,pname)
union all
select P.partid,P.pname,rtolpath+A.partid
from AllParts A, Subpart S, Part P
where S.partid = A.partid and S.subpartid = P.partid)
select * from AllParts
In the above query, the root-to-leaf path is maintained for each part element
through the rtolpath column in order to extract the tree structure. Note how-
ever that the core SQL query executes the following linear-recursive Datalog
program.
AllParts(partid,pname) ← Part(partid,pname)
AllParts(subpartid,subpname) ←
AllParts(partid,pname) Subpart(partid,subpartid) Part(subpartid,subpname)
So, we see that a non-linear recursive rule in the DTD gets translated into
a linear recursive Datalog (SQL) rule. This implies that the notion of linear
recursion in DTDs and SQL (Datalog) do not have a direct correspondence.
Hence, the class of XML view schema/view definition languages for which SQL
linear recursion is adequate to materialize the resulting XML views needs to be
examined.
3 Schema-Oblivious XML Storage
Recall that in this scenario, the goal is to find a relational schema that works
for storing XML documents independent of the presence or absence of a schema.
The main problems addressed in this sub-space are:
1. Relational schema design: which generic relational schema for XML should
be used?
2. Query translation algorithms: given a decision for the relational schema, how
do we translate from XML queries to SQL queries.
More Complex XQuery queries The only published work that we are aware
of that deals with more general XQuery queries is [7]. The main focus of the
paper is on issues such as structural equality in FLWR where clauses, full com-
positionality of XML query expressions (in particular, the possibility of nesting
FLWR expressions within functions), and the need for constructed XML doc-
uments representing intermediate query results. As mentioned earlier, special
purpose relational operators are proposed for better performance. We note that
without these operators, the performance of their translation is likely to be in-
ferior even for simple path expressions. As an example, using their technique,
the path expression /site/people is translated to an SQL query involving five
temporary relations created using the With clause in SQL99, three of which in-
volve correlated subqueries. To conclude, excepting [7], all prior work has been
on translating path expression queries into SQL. Using the approach proposed
by [7], we observe that functionality-wise, a large fragment of XQuery can be
handled using dynamic intervals in a schema-oblivious fashion. However, without
modifications to the relational engine, its performance may not be acceptable.
4 Schema-Based XML Storage
In this section, we discuss approaches to storing XML in relational systems that
make use of a schema for the XML data in order to choose a good relational
schema. The main problems to be addressed in this subspace are
1. Relational schema selection — given an XML schema (or DTD), how should
we choose a good relational schema and XML-to-relational mapping.
2. Query translation — having chosen an XML-to-relational mapping, how
should we translate XML queries into SQL.
6 7
title5 section figure Figure
top−section.title nested−section
9 10 Figure
* e2 caption image id parentid parentcode caption image
8 Figure.caption Figure.image
title
nested−section.title
Fig. 3. Sample XML-to-Relational mapping schema
problem is open when the input XML schema is recursive. Even for a non-
recursive XML schema, a lot of interesting questions arise when the XML schema
is not a tree. For example, if there is recursion in an XPath query through //, the
straightforward approach of enumerating all satisfying paths using the schema
and handling them one at a time is no longer an efficient approach. If we wish to
reduce the problem to XML publishing, the only way to use an existing solution
is to unfold the DAG schema into an equivalent tree schema.
We now examine the translation problem from a performance perspective.
References
1. M. Benedikt, C. Y. Chan, W. Fan, R. Rastogi, S. Zheng, and A. Zhou. DTD-Directed
Publishing with Attribute Translation Grammars. In VLDB, 2002.
2. P. Bohannon, J. Freire, P. Roy, and J. Simeon. From XML schema to relations: A cost-
based approach to XML storage. In ICDE, 2002.
3. P. Bohannon, H. Korth, P.P.S. Narayan, S. Ganguly, and P. Shenoy. Optimizing view
queries in ROLEX to support navigable tree results. In VLDB, 2002.
4. Y. Chen, S. B. Davidson, and Y. Zheng. Constraint preserving XML Storage in Relations.
In WebDB, 2002.
5. S. Chien, Z. Vagena, D. Zhang, V. J. Tsotras, and C. Zaniolo. Efficient Structural Joins
on Indexed XML Documents. In VLDB, 2002.
6. B. Choi. What Are Real DTDs Like. In WebDB, 2002.
7. D. DeHaan, D. Toman, M. P. Consens, and T. Ozsu. A Comprehensive XQuery to SQL
Translation using Dynamic Interval Encoding. In SIGMOD, 2003.
8. A. Deutsch, M. Fernández, and D. Suciu. Storing semistructured data with STORED. In
SIGMOD, 1999.
9. A. Deutsch and V. Tannen. MARS: A System for Publishing XML from Mixed and
Redundant Storage. In VLDB, 2003.
10. A. Deutsch and V. Tannen. Reformulation of XML Queries and Constraints. In ICDT,
2003.
11. A. Eisenberg and J. Melton. SQL/XML is Making Good Progress. SIGMOD Record,
31(2), 2002.
12. M. Fernandez, A. Morishima, and D. Suciu. Efficient Evaluation of XML Middle-ware
Queries. In SIGMOD, 2002.
13. M. Fernández, D. Suciu, and W.C. Tan. SilkRoute: Trading Between Relations and XML.
In WWW9, 2000.
14. D. Florescu and D. Kossman. Storing and Querying XML Data using an RDBMS. Data
Engineering Bulletin, 22(3), 1999.
15. T. Grust. Accelerating XPath location steps. In SIGMOD, 2002.
16. S. Hongwei, Z. Shusheng, Z. Jingtao, and W. Jing. Constraints-Preserving Mapping Al-
gorithm from XML-Schema to Relational Schema. In Engineering and Deployment of
Cooperative Information Systems (EDCIS), 2002.
17. S. Jain, R. Mahajan, and D. Suciu. Translating XSLT Programs to Efficient SQL Queries.
In WWW, 2002.
18. H. Jiang, H. Lu, W. Wang, and B. C. Ooi. XR-Tree: Indexing XML Data for Efficient
Structural Joins. In ICDE, 2003.
19. M. Klettke and H. Meyer. XML and Object-Relational Database Systems - Enhancing
Structural Mappings Based on Statistics. In WebDB, 2000.
20. R. Krishnamurthy, R. Kaushik, and J. F. Naughton. On the difficulty of finding optimal
relational decompositions for xml workloads: A complexity theoretic perspective. In ICDT,
2003.
21. D. Lee and W.W. Chu. Constraints-preserving Transformation from XML Document Type
Definition to Relational Schema. In ER, 2000.
22. C. Li, P. Bohannon, H. Korth, and P.P.S. Narayan. Composing XSL Transformations with
XML Publishing Views. In SIGMOD, 2003.
23. Q. Li and B. Moon. Indexing and querying XML data for regular path expressions. In
Proceedings of VLDB, 2001.
24. M. Mani and D. Lee. XML to Relational Conversion Using Theory of Regular Tree
Grammars. In Efficiency and Effectiveness of XML Tools and Techniques and Data
Integration over the Web (EEXTT), 2002.
25. I. Manolescu, D. Florescu, and D. Kossman. Answering XML queries over heterogeneous
data sources. In VLDB, 2001.
26. N.Bruno, N.Koudas, and D. Srivastava. Holistic Twig Joins: Optimal XML Pattern Match-
ing. In SIGMOD, 2002.
27. K. Runapongsa and J. M. Patel. Storing and Querying XML Data in Object-Relational
DBMSs. In EDBT Workshops, 2002.
28. A. Schmidt, M. Kersten, M. Windhouwer, and F. Waas. Efficient Relational Storage and
Retrieval of XML Documents. In WebDB, 2000.
29. J. Shanmugasundaram, J. Kiernan, E. J. Shekita, C. Fan, and J. Funderburk. Querying
XML Views of Relational Data. In VLDB, 2001.
30. J. Shanmugasundaram, E. Shekita, R. Barr, M. Carey, B. Lindsay, H. Pirahesh, and
B. Reinwald. Efficiently Publishing Relational Data as XML Documents. In VLDB,
2000.
31. J. Shanmugasundaram, E. Shekita, J. Kiernan, R. Krishnamurthy, S. D. Viglas,
J. Naughton, and I. Tatarinov. A general technique for querying xml documents using a
relational database system. SIGMOD Record, 30(3), 2001.
32. J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Re-
lational Databases for Querying XML Documents: Limitations and Opportunities. In
VLDB, 1999.
33. D. Srivastava, S. Al-Khalifa, H.V. Jagadish, N. Koudas, J.M. Patel, and Y. Wu. Structural
Joins: A Primitive For Efficient XML Query Pattern Matching. In ICDE, 2002.
34. J. Teubner T. Grust, M. V. Keulen. Staircase Join: Teach a Relational DBMS to Watch
its (Axis) Steps. In VLDB, 2003.
35. I. Tatarinov, S. D. Viglas, K. Beyer, J. Shanmugasundaram, E. Shekita, and C. Zhang.
Storing and querying ordered XML using a relational database system. In SIGMOD, 2002.
36. W. Wang, H. Jiang, H. Lu, and J. X. Yu. PBiTree Coding and Efficient Processing of
Containment Joins. In ICDE, 2003.
37. M. Yoshikawa, T. Amagasa, T. Shimura, and S. Uemura. XRel: a path-based approach to
storage and retrieval of XML documents using relational databases. ACM Transactions
on Internet Technology (TOIT), 1(1):110–141, 2001.
38. C. Zhang, J.F. Naughton, D.J. DeWitt, Q. Luo, and G.Lohman. On Supporting Contain-
ment Queries in Relational Database Management Systems. In Proceedings of SIGMOD,
2001.
39. X. Zhang, B. Pielech, and E. Rundesnteiner. Honey, I Shrunk the XQuery! – An XML Al-
gebra Optimization Approach. In Workshop on Web Information and Data Management
(WIDM), 2002.
40. DB2 XML Extender. http://www-3.ibm.com/software/data/db2/extenders/xmlext/index.html.
41. INCITS H2.3 Task Group. http://www.sqlx.org.
42. Oracle9i XML Database Developer’s Guide - Oracle XML DB Release 2 (9.2).
http://otn.oracle.com/tech/xml/xmldb/content.html.
43. SQLXML and XML Mapping Technologies. http://msdn.microsoft.com/sqlxml/default.asp.
44. XML for Tables. http://http://www.alphaworks.ibm.com/tech/xtable.