Abstract
Recently proposed non-parametric Bayesian techniques model short textual documents through a multinomial distribution over the bag-of-words (BOW) representation, also known as the mixture model-based approach. Although existing models can effectively deal with topic/concept drift and textual sparsity, they are unable to exploit the sequential semantic representation of text or the co-occurrence relationships between words. To meet these challenges, we propose a novel approach called GOWSeqStream. Our proposed model jointly integrates graph-of-words (GOW) and deep sequential encoding within the Dirichlet Process Mixture Model (DPMM) framework to improve performance on the text clustering task. Extensive experiments on real-world benchmark datasets demonstrate the effectiveness of the proposed GOWSeqStream model in comparison with recent state-of-the-art baselines. Experimental results in terms of the standard normalized mutual information (NMI) metric show that GOWSeqStream outperforms recent well-known text stream clustering baselines such as MStream, NPMM and OSDM.
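To make the contrast between the BOW and GOW representations concrete, the following minimal sketch builds both for a single tokenized short document. It is only an illustration under stated assumptions: the function name `build_graph_of_words`, the sliding-window size of 2, the undirected edge weighting, and the example sentence are all hypothetical choices and are not taken from the GOWSeqStream model itself.

```python
from collections import Counter


def build_graph_of_words(tokens, window=2):
    """Return undirected co-occurrence edge weights for one tokenized document.

    An edge (u, v) is counted each time v appears within `window`
    positions after u. The window size is an illustrative assumption.
    """
    edges = Counter()
    for i, u in enumerate(tokens):
        # Look ahead at most `window` tokens from position i.
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:
                # Sort the pair so that (u, v) and (v, u) share one key.
                edges[tuple(sorted((u, v)))] += 1
    return edges


# Hypothetical short document, already lower-cased and tokenized.
doc = "stream clustering of short text stream".split()

bow = Counter(doc)                          # BOW: word frequencies only
gow = build_graph_of_words(doc, window=2)   # GOW: word co-occurrence edges

for (u, v), weight in sorted(gow.items()):
    print(f"{u} -- {v}: {weight}")
```

The BOW counter records only how often each word occurs, whereas the GOW edges additionally record which words appear near each other, which is the co-occurrence information that BOW-based mixture models cannot use.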










Notes
20-Newsgroups dataset: http://qwone.com/~jason/20Newsgroups/.
Tweet-Set dataset: http://trec.nist.gov/data/microblog.html.
Google News website: https://news.google.com/news/
NLP-Toolkit: https://www.nltk.org/.
Word2Vec & pretrained word embeddings data: https://code.google.com/archive/p/word2vec/.
DTM model (C/C++): https://github.com/blei-lab/dtm.
MStream model (Python): https://github.com/jackyin12/MStream.
OSDM model (Python): https://github.com/JayKumarr/OSDM.
VNTC dataset: https://github.com/duyvuleo/VNTC.
Acknowledgement
This research is funded by Thu Dau Mot University, Binh Duong, Vietnam under grant number DT21.1-069.
Cite this article
Vo, T. GOWSeqStream: an integrated sequential embedding and graph-of-words for short text stream clustering. Neural Comput & Applic 34, 4321–4341 (2022). https://doi.org/10.1007/s00521-021-06563-w