Abstract
Supervised classification draws on many heuristics as bases for induction algorithms, including decision trees, k-nearest neighbour (k-NN), pattern frequency, neural networks, and Bayes' rule. In this paper, we propose a new instance-based induction algorithm that combines the strengths of pattern frequency and distance. We define a neighbourhood of a test instance. If the neighbourhood contains training data, we use k-NN to make the decision. Otherwise, we examine the support (frequency) of certain types of subsets of the test instance and calculate support summations for prediction. This scheme is intended to deal with outliers: when no training data lie near a test instance, distance is not a proper predictor for classification. We present an effective method for choosing an “optimal” neighbourhood factor for a given data set, guided by part of the training data. We find that our algorithm maintains (and sometimes exceeds) the outstanding accuracy of k-NN on data sets containing purely continuous attributes, and that it greatly improves on the accuracy of k-NN on data sets containing a mixture of continuous and categorical attributes. In general, our method is far superior to C5.0.
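The two-stage decision described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the neighbourhood factor is treated as a plain distance radius, the "certain types of subsets" are taken to be all attribute-value subsets of the test instance up to a small size, support is measured per class as a relative frequency, and the factor is chosen by accuracy on a held-out part of the training data. All function and parameter names are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def distance(a, b):
    """Euclidean distance; categorical attributes contribute 0 or 1 (assumption)."""
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2
        else:
            total += 0.0 if x == y else 1.0
    return math.sqrt(total)


def knn_vote(neighbours, k):
    """Majority label among the k closest (distance, label) pairs."""
    nearest = sorted(neighbours, key=lambda pair: pair[0])[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def support_sum_vote(train, test_x, max_size=2):
    """Sum, per class, the relative support of subsets of the test instance."""
    items = list(enumerate(test_x))  # (attribute index, value) pairs
    class_sizes = Counter(label for _, label in train)
    scores = dict.fromkeys(class_sizes, 0.0)
    for size in range(1, max_size + 1):
        for subset in combinations(items, size):
            for x, label in train:
                if all(x[i] == v for i, v in subset):
                    scores[label] += 1.0 / class_sizes[label]
    return max(scores, key=scores.get)


def classify(train, test_x, factor, k=3, max_size=2):
    """Use k-NN if any training instance lies within `factor`, else support sums."""
    near = [(distance(x, test_x), label) for x, label in train]
    near = [(d, label) for d, label in near if d <= factor]
    if near:
        return knn_vote(near, k)
    return support_sum_vote(train, test_x, max_size)


def choose_factor(train, candidates, holdout_fraction=0.3, k=3):
    """Pick the candidate factor with the best accuracy on a held-out part."""
    split = int(len(train) * (1 - holdout_fraction))
    fit, held = train[:split], train[split:]
    if not fit or not held:
        raise ValueError("need data on both sides of the hold-out split")

    def accuracy(factor):
        hits = sum(classify(fit, x, factor, k) == y for x, y in held)
        return hits / len(held)

    return max(candidates, key=accuracy)


# Toy data: one continuous and one categorical attribute per instance.
train = [((1.0, "a"), "pos"), ((5.0, "b"), "neg"),
         ((1.2, "a"), "pos"), ((5.1, "b"), "neg")]
near_label = classify(train, (1.1, "a"), factor=0.5)       # neighbours exist: k-NN
outlier_label = classify(train, (100.0, "b"), factor=0.5)  # no neighbours: support sums
```

In this toy run the first test instance has training data inside its neighbourhood and is decided by k-NN, while the second is an outlier whose only matching pattern is the categorical value "b", so the support summation decides. The exact-match test on continuous values is only useful after discretization, which the sketch leaves out.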
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, J., Ramamohanarao, K., Dong, G. (2001). Combining the Strength of Pattern Frequency and Distance for Classification. In: Cheung, D., Williams, G.J., Li, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2001. Lecture Notes in Computer Science, vol 2035. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45357-1_48
DOI: https://doi.org/10.1007/3-540-45357-1_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41910-5
Online ISBN: 978-3-540-45357-4
eBook Packages: Springer Book Archive