XRules: An effective algorithm for structural classification of XML data

Zaki, Mohammed J.; Aggarwal, Charu C.

doi:10.1007/s10994-006-5832-2

XRules: An effective algorithm for structural classification of XML data

Published: 27 January 2006

Volume 62, pages 137–170, (2006)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

XRules: An effective algorithm for structural classification of XML data

Download PDF

1001 Accesses
Explore all metrics

An Erratum to this article was published on 01 May 2006

Abstract

XML documents have recently become ubiquitous because of their varied applicability in a number of applications. Classification is an important problem in the data mining domain, but current classification methods for XML documents use IR-based methods in which each document is treated as a bag of words. Such techniques ignore a significant amount of information hidden inside the documents. In this paper we discuss the problem of rule based classification of XML data by using frequent discriminatory substructures within XML documents. Such a technique is more capable of finding the classification characteristics of documents. In addition, the technique can also be extended to cost sensitive classification. We show the effectiveness of the method with respect to other classifiers. We note that the methodology discussed in this paper is applicable to any kind of semi-structured data.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

References

Abiteboul, S., Kaplan, H., & Milo, T. (2001). Compact labeling schemes for ancestor queries. ACM Symp. on Discrete Algorithms.
Abiteboul, S., & Vianu, V. (1997). Regular path expressions with constraints. ACM Int'l Conf. on Principles of Database Systems.
Aggarwal, C. C. (2002). On effective classification of strings with wavelets. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Aggarwal, C., Gates, S., & Yu, P. (1999). On the merits of using supervised clustering to build categorization systems. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Int'l Conf. on Very Large Data Bases.
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. IEEE Int'l Conf. on Data Engineering.
Alsabti, K., Ranka, S., & Singh, V. (1998). CLOUDS: A decision tree classifier for large datasets. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Andersen, R., Birbeck, M., Kay, M., Livingstone, S., Loesgen, B., Martin, D., Mohr, S., Ozu, N., Peat, B., Pinnock, J., Stark, P., & Williams, K. (2002). Professional XML. Wrox Press Ltd.
Ankerst, M., Ester, M., & Kriegel, H.-P. (2000). Towards an effective cooperation of the computer and the user for classification. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Asai, T., Abe, K., Kawasoe, S., Arimura, H., Satamoto, H., & Arikawa, S. (2002). Efficient substructure discovery from large semi-structured data. SIAM Int'l Conf. on Data Mining.
Asai, T., Arimura, H., Uno, T., & Nakano, S. (2003). Discovering frequent substructures in large unordered trees. Int'l Conf. on Discovery Science.
Boz, O. Cost-sensitive learning bibliography. http://home.ptd.net/ olcay/cost-sensitive.html.
Chi, Y., Yang, Y., & Muntz, R. R. (2003). Indexing and mining free trees. IEEE Int'l Conf. on Data Mining.
Chi, Y., Yang, Y., & Muntz, R. R. (2004). Hybridtreeminer: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. Int'l Conf. on Scientific and Statistical Database Management.
Cole, R., Hariharan, R., & Indyk, P. (1999). Tree pattern matching and subset matching in deterministic \(o(n \log^3 n)\)-time. 10th ACM Symp. on Discrete Algorithms.
Cook, D., & Holder, L. (1994). Substructure discovery using minimal description length and background knowledge. Journal of Artificial Intelligence Research, 1, 231–255.
Google Scholar
Cohen, W. W. (1995). Fast effective rule induction. Int'l Conf. on Machine Learning.
Doan, A., Domingos, P., & Halevy, A. (2001). Reconciling schemas of disparate data sources: A machine learning approach. ACM SIGMOD Conf.
Dehaspe, L., Toivonen, H., & King, R. (1998). Finding frequent substructures in chemical compounds. ACM Int'l. Conf. on Knowledge Discovery and Data Mining.
Domingos, P. (1999). MetaCost: A general method for making classifiers cost sensitive. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Dong, G., Zhang, X., Wong, L., & Li, J. (1999). CAEP: Classification by aggregating emerging patterns. Int'l Conf. on Discovery Science.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.
Google Scholar
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1988). Inductive learning algorithms and representations for text categorization. ACM Int'l Conf. on Information and Knowledge Management.
Dzeroski, S., & Lavrac, N. (Eds.) (2001). Relational data mining. Springer-Verlag.
Friedman, J. H. (1977). A recursive partitioning decision rule for non-parametric classifiers. IEEE Transactions on Computers, C-26, 404–408.
Google Scholar
Garofalakis, M., Hyun, D., Rastogi, R., & Shim, K. (2000). Efficient algorithms for constructing decision trees with constraints. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Gehrke, J., Ganti, V., Ramakrishnan, R., & Loh, W.-Y. (1999). BOAT: Optimistic decision tree construction. ACM Int'l Conf. on Management of Data.
Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest: A framework for fast decision tree construction of large data sets. Int'l Conf. on Very Large Data Bases Proceedings.
Gehrke, J., Loh, W.-Y., & Ramakrishnan, R. (1999). Data mining with decision trees. ACM ACM Int'l Conf. on Knowledge Discovery and Data Mining Tutorial.
Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. IEEE Int'l Conf. on Data Mining.
Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. European Conf. on Principles of Knowledge Discovery and Data Mining.
James, M. (1985). Classification algorithms. John Wiley & Sons.
Joachims T. (2002). Learning to classify text using support vector machines. Kluwer Academic Publishers.
Kilpelainen, P., & Mannila, H. (1995). Ordered and unordered tree inclusion. SIAM J. of Computing, 24(2), 340–356.
Google Scholar
Kramer, S., De Raedt, L., & Helma, C. (2001). Molecular feature mining in HIV data. Int'l Conf. on Knowledge Discovery and Data Mining.
Kudenko, D., & Hirsh, H. (1998). Feature generation for sequence categorization. AAAI National Conf. on Artificial Intelligence.
Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. IEEE Int'l Conf. on Data Mining.
Lesh, N., Zaki, M. J., & Ogihara, M. (2000). Scalable feature mining for sequential data. IEEE Intelligent Systems, 15(2), 48–56.
Article Google Scholar
Li, Q., & Moon, B. (2001). Indexing and querying XML data for regular path expressions. Int'l Conf. on Very Large Data Bases.
Li, W., Han, J., & Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple class-association rules. IEEE Int'l Conf. on Data Mining.
Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Masciari, E. Personal Communication.
Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. Int'l Conf. on Extending Data Base Technology.
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, (2)4, 345–389.
Neville, J., & Jensen, D. (2003). Collective classification with relational dependency networks. Multi-Relational Data Mining Workshop.
Neville, J., Jensen, D., & Gallagher, B. (2003). Simple estimators for relational bayesian classifers. Int'l Conf. on Data Mining.
Nigam, K., McCallum, A. K., Thrum, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103–134.
Article MATH Google Scholar
Nijssen, S., & Kok, J. N. (2003). Efficient discovery of frequent unordered trees. Int'l Workshop on Mining Graphs, Trees and Sequences.
Provost, F., & Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42, 203–231.
Article MATH Google Scholar
Punin, J., Krishnamoorthy, M., & Zaki, M. (2001). LOGML: Log markup language for web usage mining. WEBKDD Workshop (with SIGKDD).
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Rastogi, R., & Shim, K. (1998). PUBLIC: A decision tree classifier that integrates building and pruning. Int'l Conf. on Very Large Data Bases.
Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Int'l Conf. on Very Large Data Bases.
Shamir, R., & Tsur, D. (1999). Faster subtree isomorphism. Journal of Algorithms, 33, 267–280.
Article MATH MathSciNet Google Scholar
Termier, A., Rousset, M-C., & Sebag, M. (2002). TreeFinder: A first step towards XML data mining. IEEE Int'l Conf. on Data Mining.
Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Int'l Conf. on Machine Learning.
Wang, K., & Liu, H. Q. (1998). Discovering typical structures of documents: A road map approach. ACM Int'l Conf. on Information Retrieval.
Xiao, Y., Yao, J.-F., Li, Z., & Dunham, M. H. (2003). Efficient data mining for maximal frequent subtrees. Int'l Conf. on Data Mining.
Yan, X., & Han, J. (2002). Gspan: Graph-based substructure pattern mining. IEEE Int'l Conf. on Data Mining.
Yan, X., & Han, J. (2003). CloseGraph: Mining closed frequent graph patterns. ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining.
Yoshida, K., & Motoda, H. (1995). CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1), 63–92.
Article Google Scholar
Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Zaki, M. J., & Aggarwal, C. C. (2003). Xrules: An effective structural classifier for xml data. ACM Int'l Conf. Knowledge Discovery and Data Mining.
Zhang, C., Naughton, J., DeWitt, D., Luo, Q., & Lohman, G. (2001). On supporting containment queries in relational database management systems. ACM Int'l Conf. on Management of Data.

Download references

Author information

Authors and Affiliations

Rensselaer Polytechnic Institute, Troy, NY, 12180
Mohammed J. Zaki
IBM T.J. Watson Research Center, Yorktown Heights, NY, 10598
Charu C. Aggarwal

Authors

Mohammed J. Zaki
View author publications
You can also search for this author in PubMed Google Scholar
Charu C. Aggarwal
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mohammed J. Zaki.

Additional information

Editors: Hendrik Blockeel, David Jensen and Stefan Kramer

An erratum to this article is available at http://dx.doi.org/10.1007/s10994-006-8633-8.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Zaki, M.J., Aggarwal, C.C. XRules: An effective algorithm for structural classification of XML data. Mach Learn 62, 137–170 (2006). https://doi.org/10.1007/s10994-006-5832-2

Download citation

Received: 01 August 2004
Revised: 31 January 2005
Accepted: 28 June 2005
Published: 27 January 2006
Issue Date: February 2006
DOI: https://doi.org/10.1007/s10994-006-5832-2

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

XRules: An effective algorithm for structural classification of XML data

Abstract

Article PDF

Similar content being viewed by others

Machine learning techniques for XML (co-)clustering by structure-constrained phrases

Clustering XML Documents Using Frequent Edge-Sets

Information Retrieval in XML Document: State of the Art

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

XRules: An effective algorithm for structural classification of XML data

Abstract

Article PDF

Similar content being viewed by others

Machine learning techniques for XML (co-)clustering by structure-constrained phrases

Clustering XML Documents Using Frequent Edge-Sets

Information Retrieval in XML Document: State of the Art

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation