Abstract
In practice, noise is generally perceived as harmful. In data analytics, most existing techniques assume noise-free data, and without data pre-processing it is hard for these models to discover reliable patterns. This is also true for k-means, one of the best-known algorithms for cluster analysis. Several works in the literature suggest that the ensemble approach can deliver accurate results from multiple clusterings of data that contain noise completely at random. Motivated by this, the paper studies the use of different consensus clustering techniques to analyze noisy data, with k-means used to generate the base clusterings. The empirical investigation reveals that the ensemble approach can be robust to low levels of noise, and that in some cases it even improves on the noise-free results. This finding is in line with recently published work that underlines the benefit of small amounts of noise to centroid-based clustering methods. In addition, the outcome of this research provides a guideline for analyzing a new data collection of uncertain quality.


















Acknowledgements
This work is funded by IAPP1-100077 (Newton RAE-TRF): Anomaly Traffic Identification through Artificial Intelligence, Cyber Security and Big Data Analytics Technologies. It is also partly supported by Mae Fah Luang University.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Iam-On, N. Clustering data with the presence of attribute noise: a study of noise completely at random and ensemble of multiple k-means clusterings. Int. J. Mach. Learn. & Cyber. 11, 491–509 (2020). https://doi.org/10.1007/s13042-019-00989-4
DOI: https://doi.org/10.1007/s13042-019-00989-4