Abstract
XML documents have recently become ubiquitous because of their varied applicability in a number of applications. Classification is an important problem in the data mining domain, but current classification methods for XML documents use IR-based methods in which each document is treated as a bag of words. Such techniques ignore a significant amount of information hidden inside the documents. In this paper we discuss the problem of rule based classification of XML data by using frequent discriminatory substructures within XML documents. Such a technique is more capable of finding the classification characteristics of documents. In addition, the technique can also be extended to cost sensitive classification. We show the effectiveness of the method with respect to other classifiers. We note that the methodology discussed in this paper is applicable to any kind of semi-structured data.
Article PDF
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
References
Abiteboul, S., Kaplan, H., & Milo, T. (2001). Compact labeling schemes for ancestor queries. ACM Symp. on Discrete Algorithms.
Abiteboul, S., & Vianu, V. (1997). Regular path expressions with constraints. ACM Int'l Conf. on Principles of Database Systems.
Aggarwal, C. C. (2002). On effective classification of strings with wavelets. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Aggarwal, C., Gates, S., & Yu, P. (1999). On the merits of using supervised clustering to build categorization systems. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. Int'l Conf. on Very Large Data Bases.
Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. IEEE Int'l Conf. on Data Engineering.
Alsabti, K., Ranka, S., & Singh, V. (1998). CLOUDS: A decision tree classifier for large datasets. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Andersen, R., Birbeck, M., Kay, M., Livingstone, S., Loesgen, B., Martin, D., Mohr, S., Ozu, N., Peat, B., Pinnock, J., Stark, P., & Williams, K. (2002). Professional XML. Wrox Press Ltd.
Ankerst, M., Ester, M., & Kriegel, H.-P. (2000). Towards an effective cooperation of the computer and the user for classification. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Asai, T., Abe, K., Kawasoe, S., Arimura, H., Satamoto, H., & Arikawa, S. (2002). Efficient substructure discovery from large semi-structured data. SIAM Int'l Conf. on Data Mining.
Asai, T., Arimura, H., Uno, T., & Nakano, S. (2003). Discovering frequent substructures in large unordered trees. Int'l Conf. on Discovery Science.
Boz, O. Cost-sensitive learning bibliography. http://home.ptd.net/ olcay/cost-sensitive.html.
Chi, Y., Yang, Y., & Muntz, R. R. (2003). Indexing and mining free trees. IEEE Int'l Conf. on Data Mining.
Chi, Y., Yang, Y., & Muntz, R. R. (2004). Hybridtreeminer: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. Int'l Conf. on Scientific and Statistical Database Management.
Cole, R., Hariharan, R., & Indyk, P. (1999). Tree pattern matching and subset matching in deterministic \(o(n \log^3 n)\)-time. 10th ACM Symp. on Discrete Algorithms.
Cook, D., & Holder, L. (1994). Substructure discovery using minimal description length and background knowledge. Journal of Artificial Intelligence Research, 1, 231–255.
Cohen, W. W. (1995). Fast effective rule induction. Int'l Conf. on Machine Learning.
Doan, A., Domingos, P., & Halevy, A. (2001). Reconciling schemas of disparate data sources: A machine learning approach. ACM SIGMOD Conf.
Dehaspe, L., Toivonen, H., & King, R. (1998). Finding frequent substructures in chemical compounds. ACM Int'l. Conf. on Knowledge Discovery and Data Mining.
Domingos, P. (1999). MetaCost: A general method for making classifiers cost sensitive. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Dong, G., Zhang, X., Wong, L., & Li, J. (1999). CAEP: Classification by aggregating emerging patterns. Int'l Conf. on Discovery Science.
Duda, R., & Hart, P. (1973). Pattern classification and scene analysis. New York: Wiley.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1988). Inductive learning algorithms and representations for text categorization. ACM Int'l Conf. on Information and Knowledge Management.
Dzeroski, S., & Lavrac, N. (Eds.) (2001). Relational data mining. Springer-Verlag.
Friedman, J. H. (1977). A recursive partitioning decision rule for non-parametric classifiers. IEEE Transactions on Computers, C-26, 404–408.
Garofalakis, M., Hyun, D., Rastogi, R., & Shim, K. (2000). Efficient algorithms for constructing decision trees with constraints. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Gehrke, J., Ganti, V., Ramakrishnan, R., & Loh, W.-Y. (1999). BOAT: Optimistic decision tree construction. ACM Int'l Conf. on Management of Data.
Gehrke, J., Ramakrishnan, R., & Ganti, V. (1998). RainForest: A framework for fast decision tree construction of large data sets. Int'l Conf. on Very Large Data Bases Proceedings.
Gehrke, J., Loh, W.-Y., & Ramakrishnan, R. (1999). Data mining with decision trees. ACM ACM Int'l Conf. on Knowledge Discovery and Data Mining Tutorial.
Huan, J., Wang, W., & Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. IEEE Int'l Conf. on Data Mining.
Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data. European Conf. on Principles of Knowledge Discovery and Data Mining.
James, M. (1985). Classification algorithms. John Wiley & Sons.
Joachims T. (2002). Learning to classify text using support vector machines. Kluwer Academic Publishers.
Kilpelainen, P., & Mannila, H. (1995). Ordered and unordered tree inclusion. SIAM J. of Computing, 24(2), 340–356.
Kramer, S., De Raedt, L., & Helma, C. (2001). Molecular feature mining in HIV data. Int'l Conf. on Knowledge Discovery and Data Mining.
Kudenko, D., & Hirsh, H. (1998). Feature generation for sequence categorization. AAAI National Conf. on Artificial Intelligence.
Kuramochi, M., & Karypis, G. (2001). Frequent subgraph discovery. IEEE Int'l Conf. on Data Mining.
Lesh, N., Zaki, M. J., & Ogihara, M. (2000). Scalable feature mining for sequential data. IEEE Intelligent Systems, 15(2), 48–56.
Li, Q., & Moon, B. (2001). Indexing and querying XML data for regular path expressions. Int'l Conf. on Very Large Data Bases.
Li, W., Han, J., & Pei, J. (2001). CMAR: Accurate and efficient classification based on multiple class-association rules. IEEE Int'l Conf. on Data Mining.
Liu, B., Hsu, W., & Ma, Y. (1998). Integrating classification and association rule mining. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Masciari, E. Personal Communication.
Mehta, M., Agrawal, R., & Rissanen, J. (1996). SLIQ: A fast scalable classifier for data mining. Int'l Conf. on Extending Data Base Technology.
Murthy, S. K. (1998). Automatic construction of decision trees from data: A multi-disciplinary survey. Data Mining and Knowledge Discovery, (2)4, 345–389.
Neville, J., & Jensen, D. (2003). Collective classification with relational dependency networks. Multi-Relational Data Mining Workshop.
Neville, J., Jensen, D., & Gallagher, B. (2003). Simple estimators for relational bayesian classifers. Int'l Conf. on Data Mining.
Nigam, K., McCallum, A. K., Thrum, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103–134.
Nijssen, S., & Kok, J. N. (2003). Efficient discovery of frequent unordered trees. Int'l Workshop on Mining Graphs, Trees and Sequences.
Provost, F., & Fawcett, T. (2001). Robust classification for imprecise environments. Machine Learning, 42, 203–231.
Punin, J., Krishnamoorthy, M., & Zaki, M. (2001). LOGML: Log markup language for web usage mining. WEBKDD Workshop (with SIGKDD).
Quinlan, J. R. (1993). C4.5: Programs for machine learning. Morgan Kaufmann.
Rastogi, R., & Shim, K. (1998). PUBLIC: A decision tree classifier that integrates building and pruning. Int'l Conf. on Very Large Data Bases.
Shafer, J., Agrawal, R., & Mehta, M. (1996). SPRINT: A scalable parallel classifier for data mining. Int'l Conf. on Very Large Data Bases.
Shamir, R., & Tsur, D. (1999). Faster subtree isomorphism. Journal of Algorithms, 33, 267–280.
Termier, A., Rousset, M-C., & Sebag, M. (2002). TreeFinder: A first step towards XML data mining. IEEE Int'l Conf. on Data Mining.
Tsochantaridis, I., Hofmann, T., Joachims, T., & Altun, Y. (2004). Support vector machine learning for interdependent and structured output spaces. Int'l Conf. on Machine Learning.
Wang, K., & Liu, H. Q. (1998). Discovering typical structures of documents: A road map approach. ACM Int'l Conf. on Information Retrieval.
Xiao, Y., Yao, J.-F., Li, Z., & Dunham, M. H. (2003). Efficient data mining for maximal frequent subtrees. Int'l Conf. on Data Mining.
Yan, X., & Han, J. (2002). Gspan: Graph-based substructure pattern mining. IEEE Int'l Conf. on Data Mining.
Yan, X., & Han, J. (2003). CloseGraph: Mining closed frequent graph patterns. ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining.
Yoshida, K., & Motoda, H. (1995). CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1), 63–92.
Zaki, M. J. (2002). Efficiently mining frequent trees in a forest. ACM Int'l Conf. on Knowledge Discovery and Data Mining.
Zaki, M. J., & Aggarwal, C. C. (2003). Xrules: An effective structural classifier for xml data. ACM Int'l Conf. Knowledge Discovery and Data Mining.
Zhang, C., Naughton, J., DeWitt, D., Luo, Q., & Lohman, G. (2001). On supporting containment queries in relational database management systems. ACM Int'l Conf. on Management of Data.
Author information
Authors and Affiliations
Corresponding author
Additional information
Editors: Hendrik Blockeel, David Jensen and Stefan Kramer
An erratum to this article is available at http://dx.doi.org/10.1007/s10994-006-8633-8.
Rights and permissions
About this article
Cite this article
Zaki, M.J., Aggarwal, C.C. XRules: An effective algorithm for structural classification of XML data. Mach Learn 62, 137–170 (2006). https://doi.org/10.1007/s10994-006-5832-2
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10994-006-5832-2