Abstract
Large volumes of text are produced continuously in large-scale applications such as social networks, resulting in massive streams of data. Such text streams are typically created either by very large-scale interactions among individuals or by the structured creation of particular kinds of content by dedicated organizations; the streams produced by news-wire services are an example of the latter. These streams pose unprecedented efficiency challenges for data mining algorithms. In this chapter, we review text stream mining algorithms for a wide variety of data mining problems, such as clustering, classification, and topic modeling. We also discuss a number of future challenges in this area of research.
Copyright information
© 2012 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Aggarwal, C.C. (2012). Mining Text Streams. In: Aggarwal, C., Zhai, C. (eds) Mining Text Data. Springer, Boston, MA. https://doi.org/10.1007/978-1-4614-3223-4_9
DOI: https://doi.org/10.1007/978-1-4614-3223-4_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4614-3222-7
Online ISBN: 978-1-4614-3223-4
eBook Packages: Computer Science, Computer Science (R0)