Abstract
Traditional clustering algorithms are inapplicable to many real-world problems where limited knowledge from domain experts is available. Incorporating the domain knowledge can guide a clustering algorithm, consequently improving the quality of clustering. In this paper, we propose SS-NMF: a semi-supervised non-negative matrix factorization framework for data clustering. In SS-NMF, users are able to provide supervision for clustering in terms of pairwise constraints on a few data objects specifying whether they “must” or “cannot” be clustered together. Through an iterative algorithm, we perform symmetric tri-factorization of the data similarity matrix to infer the clusters. Theoretically, we show the correctness and convergence of SS-NMF. Moveover, we show that SS-NMF provides a general framework for semi-supervised clustering. Existing approaches can be considered as special cases of it. Through extensive experiments conducted on publicly available datasets, we demonstrate the superior performance of SS-NMF for clustering.
Similar content being viewed by others
References
Bansal N, Blum A, Chawla S (2002) Correlation clustering. Proceedings of the 43rd symposium on foundations of computer science, pp 238–247
Bar-Hillel A, Hertz T, Shental N, Weinshall D (2003) Learning distance functions using equivalence relations. Proceedings of the 20th international conference on machine learning, pp 11–18
Basu S, Banerjee A, Mooney RJ (2002) Semi-supervised clustering by seeding. Proceedings of the 19th international conference on machine learning, pp 27–34
Basu S, Bilenko M, Mooney RJ (2004) A probabilistic framework for semi-supervised clustering. Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 59–68
Blum A, Mitchell TM (1998) Combining labeled and unlabeled data with co-training. Annual workshop on computational learning theory, Proceedings of the 11th annual conference on Computational learning theory, pp 92–100
Boley D (1998). Principal direction divisive partitioning. Data Mining Knowledge Discovery 2(4): 325–344
Charikar M, Guruswami V, Wirth A (2003) Clustering with qualitative information. Proceedings of the 44th IEEE symposium on foundations of computer science, pp 524–533
Chung FRK (1997) Spectral Graph Theory, American Mathematical Society
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the em algorithm. J R Stat Soc B pp 1–38
Ding C, He X, Simon HD (2005) On the equivalence of nonnegative matrix factorization and spectral clustering. Proceedings of SIAM international conference on data mining, pp 606–610
Ding C, Li T, Peng W, Park H (2006) Orthogonal nonnegative matrix tri-factorizations for clustering. Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, ACM Press, New York, pp 126–135
Dubnov S, EI-Yaniv R, Gdalyahu Y, Schneidman E, Tishby N and Yona G (2002). A new nonparametric pairwise clustering algorithm based on iterative estimation of distance profiles. Machine Learning 47(1): 35–61
Duda RO, Hart PE and Stork DG (2000). Pattern Classification. Wiley, New York
Fiedler M (1973). Algebraic connectivity of graphs. Czechoslovak Math J 23: 298–305
Fiedler M (1975a). Eigenvectors of acyclic matrices. Czechoslovak Math J 25: 607–618
Fiedler M (1975b). A property of eigenvectors of nonnegative symmetric matrices and its application to graph theory. Czechoslovak Math J 25: 619–633
Fisher D (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning 2: 139–172
Godbole S, Harpale A, Sarawagi S, Chakrabarti S (2004) Document classification through interactive supervision of document and term labels. Proceedings of the 8th European conference on principles and practice of knowledge discovery in databases, pp 185–196
Gondek D and Hofmann T (2007). Non-redundant data clustering. Knowledge Information Systems 12(1): 1–24
Hagen L and Kahng AB (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Trans CAD Integrated Circuits Systems 11(9): 1074–1085
Han E-H, Karypis G (2000) Centroid-based document classification: Analysis and experimental results. Proceedings of the 4th European conference on principles of data mining and knowledge discovery, pp 424–431
Hersh W, Buckley C, Leone T, Hickam D (1994) Ohsumed: an interactive retrival evaluation and new large test collection for research. Proceedings of 17th ACM SIGIR conference on research and development in information retrieval, pp 192–201
Hinneburg A and Keim D (2003). A general approach to clustering in large databases with noise. Knowledge Information Systems 5(4): 387–415
Hinneburg A, Keim DA (1998) An efficient approach to clustering in large multimedia databases with noise. Proceedings of the 4th international conference on knowledge discovery and data mining, pp 58–65
Hotho A, Staab S, Stumme G (2003) Text clustering based on background knowledge, Technical report 425. University of Karlsruhe, Institute AIFB, Karlsruhe
Huang Y, Mitchell TM (2006) Text clustering with extended user feedback. Proceedings of the 29th ACM SIGIR conference on research and development in information retrieval, pp 413–420
Jain AK, Murty MN and Flynn PJ (1999). Data clustering: a review. ACM computing surveys 31(3): 264–323
Ji X, Xu W (2006) Document clustering with prior knowledge. Proceedings of the 29th ACM SIGIR conference on research and development in information retrieval, pp 405–412
Joachims T (1999) Transductive inference for text classification using support vector machines. Proceedings of the 16th international conference on machine learning, pp 200–209
Jones R, McCallum A, Nigam K, Riloff E (1999) Bootstrapping for text learning tasks. Workshop on text mining: foundations, techniques and applications, proceedings of international joint conference on artifical intelligence, pp 52–63
Kamvar SD, Klein D, Manning CD (2003) Spectral learning. Proceedings of the 18th international joint conference on artificial intelligence, pp 561–566
Kaufman L and Rousseeuw P (1990). Finding Groups in Data: an introduction to cluster analysis. Wiley, New York
Kim H and Lee S (2004). An intelligent information system for organizing online text documents. Knowledge Information Systems 6(2): 125–149
Klein D, Kamvar S, Manning C (2002) From instance-level constraints to space-level constraints: Making the most of prior knowledge in data clustering. Proceedings of the 19th international conference on machine learning, pp 307–314
Koga H, Ishibashi T and Watanabe T (2007). Fast agglomerative hierarchical clustering algorithm using locality-sensitive hashing. Knowledge Information Systems 12(1): 25–53
Kulis B, Basu S, Dhillon I, Mooney R (2005) Semi-supervised graph clustering: a kernel approach. Proceedings of the 22nd international conference on machine learning, pp 457–464
Lee DD and Seung HS (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401: 788–791
Lee D, Seung H (2001) Algorithms for non-negative matrix factorization. Proceedings of annual conference on neural information processing systems 13, pp 556–562
Lewis DD (1999) Reuters-21578 text categorization test collection distribution 1.0, http://www.research.att/lewis
Li T, Ding C (2006) The relationships among various nonnegative matrix factorization methods for clustering. Proceedings of the 6th IEEE international conference on data mining, pp 362–371
Liu B, Li X, Lee WS, Yu PS (2004) Text classification by labeling words. Proceedings of AAAI conference on artificial intelligence, pp 425–430
Long B, Zhang Z, Yu PS (2005) Co-clustering by block value decomposition. Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining, pp 635–640
Long B, Zhang Z, Yu PS (2007) Relational clustering by symmetric convex coding. Proceedings the 24th annual international conference on machine learning, pp 569–576
MacQueen J (1967) Some methods for classsification and analysis of multivariate observations. Proceedings of 5th Berkeley symposium on mathematical statistics and probability, pp 281–297
Newman C, Hettich S, Merz C (1998) UCI repository of machine learning databases
Nigam K, McCallum AK, Thrun S, Mitchell TM (1998) Learning to classify text from labeled and unlabeled documents. Proceedings of AAAI conference on artificial intelligence, pp 792–799
Raghavan H, Madani O, Jones R (2005) Interactive feature selection. Proceedings of international joint conference on artificial intelligence, pp 841–846
Sheikholesami G, Chatterjee S, Zhang A (1998) Wavecluster: a multi-resolution clustering approach for very large spatial databases. Proceedings of the international conference on very large databases, pp 428–439
Shi J and Malik J (2000). Normalized cuts and image segmentation. IEEE Trans on PAMI 22(8): 888–905
TREC (n.d.) Text retrieval conference, http://trec.nist.gov
Wagstaff K, Cardie C, Rogers S, Schroedl S (2001) Constrained k-means clustering with background knowledge. Proceedings of the 18th international conference on machine learning, pp 577–584
Xing EP, Ng AY, Jordan M, Russell S (2002) Distance metric learning, with application to clustering with side-information. Advances in neural information processing systems 15, pp 502–512
Xu W, Liu X, Gong Y (2003) Document clustering based on non-negative matrix factorization. Proceedings of the 26th ACM SIGIR conference on research and development in information retrieval, pp 267–273
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. Proceedings of the 14th international conference on machine learning, pp 412–420
Zhang T, Ramakrishnan R, Livny M (1996) Birch: An efficient data clustering method for very large databases. Proceedings of the ACM SIGMOD international conference on management of data, pp 103–114
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Chen, Y., Rege, M., Dong, M. et al. Non-negative matrix factorization for semi-supervised data clustering. Knowl Inf Syst 17, 355–379 (2008). https://doi.org/10.1007/s10115-008-0134-6
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-008-0134-6