Abstract
The automatic processing and management of XML-based data are ever more popular research issues due to the increasing abundant use of XML, especially on the Web. Nonetheless, several operations based on the structure of XML data have not yet received strong attention. Among these is the process of matching XML documents with XML grammars, useful in various applications such as documents classification, retrieval and selective dissemination of information. In this paper, we propose an algorithm for measuring the structural similarity between an XML document and a Document Type Definition (DTD) considered as the simplest way for specifying structural constraints on XML documents. We consider the various DTD operators that designate constraints on the existence, repeatability and alternativeness of XML elements/attributes. Our approach is based on the concept of tree edit distance, as an effective and efficient means for comparing tree structures, XML documents and DTDs being modeled as ordered labeled trees. It is of polynomial complexity, in comparison with existing exponential algorithms. Classification experiments, conducted on large sets of real and synthetic XML documents, underline our approach effectiveness, as well as its applicability to large XML repositories and databases.
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Aho, A., Hirschberg, D., Ullman, J.: Bounds on the Complexity of the Longest Common Subsequence Problem. ACM J. 23(1), 1–12 (1976)
Akatsu, T.: Approximate String Matching with Don’t Care Characters. Information Processing Letters 55, 235–239 (1995)
Bertino E., et al.: Measuring the Structural Similarity among XML Documents and DTDs, Technical Report, University of Genova (2002), http://www.disi.unige.it/person/MesitiM
Bertino, E., Guerrini, G., Mesiti, M.: A Matching Algorithm for Measuring the Structural Similarity between an XML Documents and a DTD and its Applications. Elsevier Computer Science, Amsterdam (2004)
Buttler, D.: A Short Survey of Document Structure Similarity Algorithms. In: Proc. of the 5th International Conference on Internet Computing, pp. 3–9 (2004)
Chamberlin, D., et al.: XQuery: A Query Language for XML (2001), http://www.w3.org/TR/2001/WD-xquery-20010215
Chawathe, S., et al.: Change Detection in Hierarchically Structured Information. In: SIGMOD, pp. 493–504 (1996)
Chawathe, S.: Comparing Hierarchical Data in External Memory. In: Proceedings of the 25th Int. Conf. on Very Large Data Bases, pp. 90–101 (1999)
Cobéna, G., et al.: Detecting Changes in XML Documents. In: Proc. of the 18th ICDE Conference, pp. 41–52 (2002)
Dalamagas, T., Cheng, T., Winkel, K., Sellis, T.: A methodology for clustering XML documents by structure. Inf. Syst. 31(3), 187–228 (2006)
Deutsch, A., et al.: XML-QL: A Query Language for XML. Computer Networks 31(11-16), 1155–1169 (1999)
Do, H.H., Rahm, E.: COMA: A System for Flexible Combination of Schema Matching Approaches. In: The 28th VLDB Conf., Honk Kong, pp. 610–621 (2002)
Doan, A., Domingos, P., Halevy, A.Y.: Reconciling Schemas of Disparate Data Sources: A Machine Learning Approach. In: ACM SIGMOD, pp. 509–520 (2001)
Flesca, S., et al.: Detecting Structural Similarities between XML Documents. In: WebDB, pp. 55–60 (2002)
Goldman, R., Windom, J.: Dataguides: Enabling Query Formulation and Optimization in Semi-structured Databases. In: Proc. of the 23rd VLDB Conference, pp. 436–445 (1997)
Grahne, G., Thomo, A.: Approximate Reasoning in Semi-structured Databases. In: Proc. of the 8th International Workshop on Knowledge Representation meets Databases, vol. 45 (2001), http://CEUR-WS.org/Vol-45/03-thomo.ps
Landau, G.M., Vishkin, U.: Fast Parallel and Serial Approximate String Matching. Journal of Algorithms 10, 157–169 (1989)
Lee, J.H., Kim, M.H., Lee, Y.J.: Information Retrieval Based on Conceptual Distance in IS-A Hierarchies. Journal of Documentation 49(2), 188–207 (1993)
Lee, M., et al.: XClust: Clustering XML Schemas for Effective Integration. In: CIKM, pp. 292–299 (2002)
Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and Reversals. Sov. Phys. Dokl. 6, 707–710 (1966)
Madhavan, J., et al.: Generic Schema Matching With Cupid. In: 27th VLDB Conference, pp. 49–58 (2001)
Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching. In: Proc. of the 18th ICDE Conference, pp. 117–128 (2002)
Myers, E.: An O(N D) Difference Algorithm and Its Variations. Algorithmica 1(2), 251–266 (1986)
Nierman, A., Jagadish, H.V.: Evaluating structural similarity in XML documents. In: Proc. of the 5th Int. Workshop on the Web and Databases, pp. 61–66 (2002)
Pereira, F.: Technologies for Digital Multimedia Communications: An Evolution Analysis of MPEG Standards. China Communications Journal, 8–19 (2006)
van Rijsbergen, C.J.: Information Retrieval. Butterworths, London (1979)
Sanz, I., Mesiti, M., Guerrini, G., Berlanga Lavori, R.: Approximate Subtree Identification in Heterogeneous XML Documents Collections. In: Bressan, S., Ceri, S., Hunt, E., Ives, Z.G., Bellahsène, Z., Rys, M., Unland, R. (eds.) XSym 2005. LNCS, vol. 3671, pp. 192–206. Springer, Heidelberg (2005)
Schlieder, T.: Similarity Search in XML Data Using Cost-based Query Transformations. In: WebDB, pp. 19–24 (2001)
Shasha, D., Zhang, K.: Approximate Tree Pattern Matching. In: Pattern Matching in Strings, Trees and Arrays, Oxford University Press, Oxford (1995)
Tai, K.C.: The Tree-to-Tree correction problem. ACM J. 26, 422–433 (1979)
Wagner, J., Fisher, M.: The String-to-String correction problem. J. of the ACM (21), 168–173 (1974)
Wong, C., Chandra, A.: Bounds for the String Editing Problem. Journal of the Association for Computing Machinery 23(1), 13–16 (1976)
WWW Consortium, The Document Object Model (August 10, 2006), http://www.w3.org/DOM
Zhang, K., Shasha, D.: Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM J. 18(6), 1245–1262 (1989)
Zhang, K., Shasha, D., Wang, J.: Approximate Tree Matching in the Presence of Variable Length Don’t Cares. J. of Algorithms 16(1), 33–66 (1994)
Zhang, Z., Li, R., Cao, S., Zhu, Y.: Similarity Metric for XML Documents. In: Knowledge Management and Experience Management Workshop, Karlsruhe, Germany, pp. 255–261 (2003)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Tekli, J., Chbeir, R., Yetongnon, K. (2007). Structural Similarity Evaluation Between XML Documents and DTDs. In: Benatallah, B., Casati, F., Georgakopoulos, D., Bartolini, C., Sadiq, W., Godart, C. (eds) Web Information Systems Engineering – WISE 2007. WISE 2007. Lecture Notes in Computer Science, vol 4831. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-76993-4_17
Download citation
DOI: https://doi.org/10.1007/978-3-540-76993-4_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-76992-7
Online ISBN: 978-3-540-76993-4
eBook Packages: Computer ScienceComputer Science (R0)