Abstract
We present a novel algorithm for document clustering. The approach is based on distributional clustering: subject-related words that have a narrow context are identified and used as meta-tags for that subject. These contextual words form the basis for creating thematic clusters of documents. Following common practice in document clustering research, we evaluate the quality of the approach on document categorization problems and show that it outperforms the information-theoretic sequential information bottleneck method.
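The idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It assumes that the "narrowness" of a word's context is measured by the entropy of the word distribution over the documents containing that word, and that each document is assigned to the narrow context closest to it in Jensen-Shannon divergence. The function names (`context_distribution`, `entropy`, `js_divergence`, `cluster`) and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def context_distribution(word, docs):
    """Word distribution over all documents that contain `word` (assumed context model)."""
    counts = Counter()
    for doc in docs:
        if word in doc:
            counts.update(doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def entropy(dist):
    """Shannon entropy in bits; lower entropy is treated here as a narrower context."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def cluster(docs, n_contexts):
    """Pick the n_contexts narrowest word contexts, then assign each document to one."""
    vocab = {w for doc in docs for w in doc}
    contexts = {w: context_distribution(w, docs) for w in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    narrow = sorted(vocab, key=lambda w: (entropy(contexts[w]), w))[:n_contexts]
    labels = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        doc_dist = {w: c / total for w, c in counts.items()}
        labels.append(
            min(narrow, key=lambda w: (js_divergence(doc_dist, contexts[w]), w))
        )
    return narrow, labels


# Toy corpus of pre-tokenized documents (illustrative only).
docs = [
    ["goal", "match", "team"],
    ["team", "match", "score"],
    ["stock", "market", "trade"],
    ["market", "trade", "bank"],
]
narrow, labels = cluster(docs, n_contexts=2)
print(narrow, labels)
```

In this sketch the selected words act as the meta-tags: each cluster is named by the word whose context it was matched to. A real system would need stop-word handling, frequency thresholds, and a more careful choice of the number of contexts.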
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dobrynin, V., Patterson, D., Rooney, N. (2004). Contextual Document Clustering. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_13
DOI: https://doi.org/10.1007/978-3-540-24752-4_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4