International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(3): 1041-1050
The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)
Dimension Reduction of Health Data Clustering
Rahmat Widia Sembiring1, Jasni Mohamad Zain2, Abdullah Embong3
1,2 Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300, Kuantan, Pahang Darul Makmur, Malaysia
3 School of Computer Science, Universiti Sains Malaysia, 11800 Minden, Pulau Pinang, Malaysia
1 rahmatws@yahoo.com, 2 jasni@ump.edu.my, 3 ae@cs.usm.my
ABSTRACT
Current data tend to be more complex than conventional data and need dimension reduction. Dimension reduction is important in cluster analysis: it creates a representation that is smaller in volume yet yields the same analytical results as the original data. A clustering process needs dimension reduction to obtain efficient processing time and to mitigate the curse of dimensionality. This paper proposes a model for extracting multidimensional data clustering of health databases. We implemented four dimension reduction techniques: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Self-Organizing Map (SOM), and FastICA. The results show that dimension reduction significantly reduces dimensionality, shortens processing time, and increases cluster performance on several health datasets.
KEYWORDS
DBSCAN, dimension reduction, SVD, PCA,
SOM, FastICA.
1 Introduction
Current data tend to be multidimensional, high dimensional, and more complex than conventional data. Many clustering algorithms have been proposed, but they often produce clusters that are less meaningful. The use of multidimensional data results in more noise, more complex data, and the possibility of unconnected data entities. This problem can be addressed by clustering algorithms; several clustering algorithms are grouped into cell-based clustering, density-based clustering, and clustering-oriented approaches. To obtain efficient processing time and mitigate the curse of dimensionality while clustering, a clustering process needs data reduction. Dimension reduction is a technique widely used in various applications to address the curse of dimensionality.
Dimension reduction is important in cluster analysis: it not only makes high-dimensional data addressable and reduces the computational cost, but also provides users with a clearer picture and visual examination of the data of interest [6]. Many dimension reduction techniques have been proposed. Local Dimensionality Reduction (LDR) tries to find local correlations in the data and performs dimensionality reduction on the locally correlated clusters of data individually [3]; dimension reduction can also be a dynamic process, adaptively adjusted and integrated with the clustering process [4].
Sufficient Dimensionality Reduction (SDR) is an iterative algorithm [8] that converges to a local minimum of its objective and hence solves the Max-Min problem as well. A number of optimizations can solve this minimization problem, and a reduction algorithm based on a Bayesian inductive cognitive model can be used to decide which dimensions are advantageous [11]. Developing an effective and efficient clustering method to process multidimensional and high-dimensional datasets is a challenging problem.
This paper is organized as follows. Section 2 presents related work. Section 3 explains the materials and method. Section 4 elucidates the results, followed by discussion in Section 5. Section 6 deals with the concluding remarks.
2 Related Work
The functions of data mining are association, correlation, prediction, clustering, classification, analysis, trends, outliers and deviation analysis, and similarity and dissimilarity analysis. Clustering techniques are applied when there is no class to predict but rather when the instances divide into natural groups [20]. Clustering multidimensional data has many challenges: noise, complexity of data, and data redundancy. To mitigate these problems, dimension reduction is needed. In statistics, dimension reduction is the process of reducing the number of random variables under consideration. The process is classified into feature selection and feature extraction [18]; the taxonomy of dimension reduction problems [16] is shown in Figure 1. Dimension reduction is the ability to identify a small number of important inputs (for predicting the target) from a much larger number of available inputs, and it is effective in cases where there are more inputs than cases or observations.
Figure 1. Taxonomy of dimension reduction problems (dimension reduction splits into records reduction and attribute/dimension reduction, covering record selection, feature selection, variable selection, and attribute, function, and simple decomposition, with the aims of increasing learning performance and reducing irrelevant and redundant dimensions)
Dimension reduction methods are associated with regression, additive models, neural network models, and Hessian-based methods [6]. One of these is Local Dimension Reduction (LDR), which looks for correlations in the dataset, reduces the dimensions of each correlated group individually, and then uses a multidimensional index structure [3]. Nonlinear algorithms give better performance than PCA for sound and image data [14]; other studies mention that a Principal Component Analysis (PCA)-based dimension reduction and texture classification scheme can be applied within a manifold statistical framework [3].
In most applications, dimension reduction is performed as a pre-processing step [5], carried out with traditional statistical methods that parse an increasing number of observations [6]. Reducing the dimensions creates a more effective domain characterization [1]. Sufficient Dimension Reduction (SDR) is a generalization of nonlinear regression problems, where the extraction of features is as important as the matrix factorization [8], while SSDR (Semi-Supervised Dimension Reduction) is used to maintain the original structure of high-dimensional data [27].
The goals of dimension reduction methods are to reduce the number of predictor components and to help ensure that these components are independent. The methods are designed to provide a framework for interpretability of the results, and to find a mapping F that maps the input data from the space R^d to a lower-dimensional feature space R^p, denoted F: R^d → R^p with p < d [26, 15].
Dimension reduction techniques such as Principal Component Analysis (PCA) and Partial Least Squares (PLS) can be used to reduce the dimension of microarray data before a certain classifier is applied [25]. We compared four dimension reduction techniques, each embedded in DBSCAN. These techniques are:
A. SVD
The Singular Value Decomposition (SVD) is a factorization of a real or complex matrix. The equation for the SVD of X is X = USV^T [24], where U is an m x n matrix, S is an n x n diagonal matrix, and V^T is also an n x n matrix. The columns of U are called the left singular vectors, {u_k}, and form an orthonormal basis for the assay expression profiles, so that u_i · u_j = 1 for i = j, and u_i · u_j = 0 otherwise. The rows of V^T contain the elements of the right singular vectors, {v_k}, and form an orthonormal basis for the gene transcriptional responses. The elements of S are nonzero only on the diagonal and are called the singular values; thus S = diag(s_1, ..., s_n). Furthermore, s_k > 0 for 1 ≤ k ≤ r, and s_k = 0 for (r+1) ≤ k ≤ n. By convention, the ordering of the singular vectors is determined by high-to-low sorting of the singular values, with the highest singular value in the upper-left index of the S matrix.
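As an illustration, the following minimal Python/NumPy sketch performs rank-k reduction via the truncated SVD described above; the function name, the synthetic data, and the choice of k are our own assumptions rather than part of the original RapidMiner experiments.

    import numpy as np

    def svd_reduce(X, k):
        """Project the rows of X onto its top-k singular directions."""
        # Thin SVD: X = U S V^T, singular values sorted high-to-low.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        # Keep the k leading singular triplets; U_k * S_k is the reduced data.
        return U[:, :k] * s[:k]

    # Example: reduce an 8-attribute dataset to a single dimension,
    # mirroring the SVD row of Table 1.
    X = np.random.rand(100, 8)
    print(svd_reduce(X, k=1).shape)  # (100, 1)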
B. PCA
PCA is a dimension reduction technique that uses variance as a measure of interestingness and finds orthogonal vectors (principal components) in the feature space that account for the most variance in the data [19]. Principal component analysis is probably the oldest and best known of the techniques of multivariate analysis; it was first introduced by Pearson and developed independently by Hotelling [12].
The advantages of PCA are identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. It is a powerful tool for analysing data: once these patterns are found, the data can be compressed by dimension reduction without much loss of information [23]. The PCA algorithm [7] is as follows:
a. Recover basis: calculate XX^T = Σ_{i=1..t} x_i x_i^T and let U = eigenvectors of XX^T corresponding to the top d eigenvalues.
b. Encode training data: Y = U^T X, where Y is a d x t matrix of encodings of the original data.
c. Reconstruct training data: X̂ = UY = UU^T X.
d. Encode test example: y = U^T x, where y is a d-dimensional encoding of x.
e. Reconstruct test example: x̂ = Uy = UU^T x.
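The steps above translate directly into NumPy; the following is a minimal sketch under the tutorial's convention of a features x samples data matrix, with the centering step and the synthetic data being our own additions for a runnable example.

    import numpy as np

    def pca_basis(X, d):
        """Recover a d-dimensional PCA basis from X (n features x t samples)."""
        Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
        # Eigen-decomposition of XX^T; eigh returns eigenvalues ascending.
        eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
        return eigvecs[:, ::-1][:, :d]           # top-d eigenvectors as columns

    X = np.random.rand(8, 100)   # 8 attributes, 100 samples
    U = pca_basis(X, d=2)
    Y = U.T @ X                  # encode: d x t matrix of encodings
    X_hat = U @ Y                # reconstruct: UU^T X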
C. SOM
A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map [14]. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen and is sometimes called a Kohonen map.
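A compact NumPy sketch of SOM training is given below; the grid size, decay schedules, and Gaussian neighborhood are illustrative assumptions, not the exact settings of the RapidMiner SOM operator used in our experiments.

    import numpy as np

    def train_som(X, rows=10, cols=10, iters=1000, lr0=0.5, sigma0=3.0):
        """Train a rows x cols self-organizing map on X (n_samples x n_features)."""
        rng = np.random.default_rng(0)
        W = rng.random((rows, cols, X.shape[1]))      # codebook (weight) vectors
        grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                    indexing="ij"), axis=-1)
        for t in range(iters):
            x = X[rng.integers(len(X))]
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)),
                                   (rows, cols))
            lr = lr0 * np.exp(-t / iters)             # decaying learning rate
            sigma = sigma0 * np.exp(-t / iters)       # shrinking neighborhood
            # Gaussian neighborhood around the BMU preserves topology.
            h = np.exp(-((grid - np.array(bmu)) ** 2).sum(axis=2)
                       / (2 * sigma ** 2))[..., None]
            W += lr * h * (x - W)                     # pull neighborhood toward x
        return W

After training, each sample is represented by the two-dimensional grid coordinates of its best-matching unit, which gives the low-dimensional map.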
D. FastICA
Independent Component Analysis (ICA) was introduced by Jeanny Hérault and Christian Jutten in 1986 and later clarified by Pierre Comon in 1994 [22]. FastICA is one of the extensions of ICA; it is based on a fixed-point iteration scheme for finding a maximum of the nongaussianity [9], and can also be derived as an approximate Newton iteration. FastICA uses the following update formula:

w+ = E{x g(w^T x)} − E{g'(w^T x)} w

where g is a nonquadratic nonlinearity (for example g(u) = tanh(u)) and g' is its derivative; in the multi-unit case the matrix W needs to be orthogonalized after each iteration.
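For a runnable illustration, scikit-learn provides a FastICA implementation; the sketch below is our own minimal usage example (the component count is arbitrary, and the paper's experiments ran FastICA within RapidMiner rather than scikit-learn).

    import numpy as np
    from sklearn.decomposition import FastICA

    X = np.random.rand(100, 8)                  # 100 samples, 8 attributes
    ica = FastICA(n_components=3, random_state=0)
    S = ica.fit_transform(X)                    # estimated independent components
    print(S.shape)                              # (100, 3)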
3 Material and Method
This study is designed to find the most efficient dimension reduction technique. To achieve this objective, we implemented a model in which cluster efficiency is evaluated by first reducing the dimensions of the datasets [21]. Four dimension reduction techniques were tested in the proposed model, namely SVD, PCA, SOM, and FastICA.
Figure 2. Proposed model compared based on dimension reduction and DBSCAN clustering (pipeline: original datasets → dimension reduction (DR) → clustering technique → cluster model; performance-1 is taken from the clustering output, and performance-2 after the data-to-similarity and filtering steps)
The dimension reduction result is processed by the DBSCAN clustering technique. DBSCAN needs ε (eps) and the minimum number of points required to form a cluster (minPts), here with mixed Euclidean distance as the distance measure. The DBSCAN clustering result is passed to a Data to Similarity operator, which calculates a similarity measure from the given data (attribute based); another measured output of DBSCAN is performance-1, which simply provides the number of clusters as a value.
The result of Data to Similarity is taken as an ExampleSet input by Filter Examples, which returns a new ExampleSet including only the examples that fulfil a condition. By specifying a condition implementation and a parameter string, arbitrary filters can be applied, and performance-2 is derived directly as a measure from specific data or statistics values. An expectation maximization clustering is then run with parameters k=2, max runs=5, max optimization steps=100, quality=1.0E-10, and k-means initial distribution.
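As a rough outline of this pipeline, the sketch below pairs a dimension reduction step with DBSCAN in Python, using scikit-learn as a stand-in for the RapidMiner operators; eps=1 and minPts=5 follow the settings reported in Section 4, while the reducer and the synthetic data are illustrative assumptions.

    import time
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import TruncatedSVD

    def reduce_and_cluster(X, reducer=None):
        """Optionally reduce X, then cluster with DBSCAN(eps=1, minPts=5)."""
        start = time.time()
        X_red = reducer.fit_transform(X) if reducer is not None else X
        labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_red)
        # performance-1: the number of clusters found (noise label -1 excluded).
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        return n_clusters, time.time() - start

    X = np.random.rand(748, 5)   # e.g., 748 x 5, like the blood transfusion data
    print(reduce_and_cluster(X, TruncatedSVD(n_components=1)))
    print(reduce_and_cluster(X))                 # without dimension reduction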
4 Results
Testing of model performance was conducted on four datasets: E-coli, acute implant, blood transfusion, and prostate cancer. Dimension reduction used SVD, PCA, SOM, and FastICA. Using RapidMiner, we conducted the testing process without dimension reduction before clustering, and then compared it with the results of the clustering process using dimension reduction. The results of attribute dimension reduction are shown in Table 1.
Table 1. Attribute dimension reduction (number of attributes for each dataset)

Dimension reduction            E-coli  Acute implant  Blood transfusion  Prostate cancer
with SVD                          1          1                1                 1
with PCA                          5          4                1                 3
with SOM                          2          2                1                 2
with FastICA                      8          8                5                18
without dimension reduction       8          8                5                18
By implementing the four different reduction techniques (SVD, PCA, SOM, and FastICA) and then applying the density-based clustering method, we obtained the following results. Figure 3a presents the E-coli dataset clustered by DBSCAN without dimension reduction; Figure 3b is the clustering result for the E-coli dataset based on DBSCAN with SVD; Figure 3c is the cluster of the E-coli dataset based on DBSCAN and PCA; Figure 3d is based on DBSCAN and SOM; while Figure 3e is the clustering result using DBSCAN with FastICA dimension reduction.
To assess efficiency, we conducted the tests and recorded the processing time, as shown in Table 2.
Table 2. Processing time for each dataset

Dimension reduction            Acute implant  Blood transfusion  Prostate cancer  E-coli
with SVD                             19               9                61           39
with PCA                             27              14                47           35
with SOM                             34              22                51           41
with FastICA                         67              12                58          148
without dimension reduction          22              11               188           90
Figure 3a. E-coli dataset based on DBSCAN without dimension reduction
Figure 3b. E-coli dataset based on DBSCAN and SVD
Figure 3c. E-coli dataset based on DBSCAN and PCA
Figure 3d. E-coli dataset based on DBSCAN and SOM
Figure 3e. E-coli dataset based on DBSCAN and FastICA

Using SVD, PCA, SOM, and FastICA, we also conducted the testing process and measured the performance in terms of the number of clusters, as shown in Table 3.
Table 3. Performance (number of clusters) for each dataset

Dimension reduction            E-coli  Acute implant  Blood transfusion  Prostate cancer
with SVD                          2         10               13                 1
with PCA                          2          2                2                 2
with SOM                          2          7               17                 1
with FastICA                      1          1               51                 2
without dimension reduction       8         10               13                 1
Further results were obtained for the acute implant dataset; Figures 4a-e present the acute implant dataset based on DBSCAN without dimension reduction and with the various dimension reduction techniques.
Figure 4a. Acute implant dataset based on DBSCAN without dimension reduction
Figure 4b. Acute implant dataset based on DBSCAN and SVD
Figure 4c. Acute implant dataset based on DBSCAN and PCA
Figure 4d. Acute implant dataset based on DBSCAN and SOM
Figure 4e. Acute implant dataset based on DBSCAN and FastICA
The third dataset tested was blood transfusion. Some of the results are presented in Figures 5a-e: the blood transfusion dataset based on DBSCAN without dimension reduction and with the various dimension reduction techniques.
Figure 5a. Blood transfusion dataset based on DBSCAN without dimension reduction
Figure 5b. Blood transfusion dataset based on DBSCAN and SVD
Figure 5c. Blood transfusion dataset based on DBSCAN and PCA
Figure 5d. Blood transfusion dataset based on DBSCAN and SOM
Figure 5e. Blood transfusion dataset based on DBSCAN and FastICA
Using the same dimension reduction techniques, we clustered the prostate cancer dataset; the results are presented in Figures 6a-e, based on DBSCAN without dimension reduction and with the various dimension reduction techniques.
Figure 6a. Prostate cancer dataset based on DBSCAN without dimension reduction
Figure 6b. Prostate cancer dataset based on DBSCAN and SVD
Figure 6c. Prostate cancer dataset based on DBSCAN and PCA
Figure 6d. Prostate cancer dataset based on DBSCAN and SOM
Figure 6e. Prostate cancer dataset based on DBSCAN and FastICA
For each cluster process, the values ε=1 and MinPts=5 were determined in advance, and the number of clusters (k=2) to be produced by the EM step was also fixed beforehand.
5 Discussion
Dimension reduction before the clustering process is performed to obtain efficient processing time and to increase the accuracy of cluster performance. Based on the results in the previous section, dimension reduction can shorten processing time while yielding the lowest number of attributes. Figure 7 shows that DBSCAN with SVD has the lowest number of attributes after reduction.
Figure 7. Reduction in the number of attributes
Another evaluation of the model implementation is the comparison of processing time. In general, dimension reduction decreased the processing time; for several datasets we found that DBSCAN with SVD had the lowest processing time.
Figure 8. Processing time for each dataset
The cluster process with FastICA dimension reduction has the highest cluster performance for the blood transfusion dataset (Figure 9) but the lowest for the other datasets, while PCA has the lowest performance over the datasets overall.
Figure 9. Performance (number of clusters) for each dataset
6 Conclusion
The discussion above has shown that applying a dimension reduction technique shortens the processing time. Dimension reduction before the clustering process is performed to obtain efficient processing time and to increase cluster performance. DBSCAN with SVD has the lowest processing time for several datasets, and SVD also produces the lowest number of reduced attributes. In general, dimension reduction shows an increased cluster performance.
References
1. Bi, Jinbo, Kristin Bennett, Mark Embrechts, Curt Breneman, Minghu Song: "Dimensionality Reduction via Sparse Support Vector Machines", Journal of Machine Learning Research 3, pp. 1229-1243 (2003)
2. Chakrabarti, Kaushik, Sharad Mehrotra: "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces", Proceedings of the 26th VLDB Conference, Cairo, Egypt, pp. 89-100 (2000)
3. Choi, S. W., Martin, E. B., Morris, A. J., Lee, I.-B.: "Fault Detection Based on a Maximum Likelihood PCA Mixture", Ind. Eng. Chem. Res. 44, pp. 2316-2327 (2005)
4. Ding, Chris, Tao Li: "Adaptive Dimension Reduction Using Discriminant Analysis and K-means Clustering", International Conference on Machine Learning, Corvallis, OR (2007)
5. Ding, Chris, Xiaofeng He, Hongyuan Zha, Horst Simon: "Adaptive Dimension Reduction for Clustering High Dimensional Data", Lawrence Berkeley National Laboratory, pp. 1-8 (2002)
6. Fodor, I.K.: "A Survey of Dimension Reduction Techniques", LLNL Technical Report, UCRL-ID-148494, pp. 1-18 (2002)
7. Ghodsi, Ali: "Dimensionality Reduction: A Short Tutorial", Technical Report 2006-14, Department of Statistics and Actuarial Science, University of Waterloo, pp. 5-6 (2006)
8. Globerson, Amir, Naftali Tishby: "Sufficient Dimensionality Reduction", Journal of Machine Learning Research 3, pp. 1307-1331 (2003)
9. Hyvärinen, Aapo, Erkki Oja: "Independent Component Analysis: Algorithms and Applications", Neural Networks, pp. 411-430 (2002)
10. Hyvärinen, A., Oja, E.: "Independent Component Analysis: Algorithms and Applications", Neural Networks 13, pp. 411-430 (2000)
11. Jin, Longcun, Wanggen Wan, Yongliang Wu, Bin Cui, Xiaoqing Yu, Youyong Wu: "A Robust High-Dimensional Data Reduction Method", The International Journal of Virtual Reality 9(1), pp. 55-60 (2010)
12. Jolliffe, I.T.: "Principal Component Analysis", Springer Verlag New York Inc., New York, pp. 7-26 (2002)
13. Kambhatla, Nanda, Todd K. Leen: "Fast Non-Linear Dimension Reduction" (1994)
14. Kohonen, T., Kaski, S., Lappalainen, H.: "Self-Organized Formation of Various Invariant-Feature Filters in the Adaptive-Subspace SOM", Neural Computation 9, pp. 1321-1344 (1997)
15. Larose, Daniel T.: "Data Mining Methods and Models", John Wiley & Sons Inc., New Jersey, pp. 1-15 (2006)
16. Maimon, Oded, Lior Rokach: "Data Mining and Knowledge Discovery Handbook", Springer Science+Business Media Inc., pp. 94-97 (2005)
17. Maimon, Oded, Lior Rokach: "Decomposition Methodology for Knowledge Discovery and Data Mining", World Scientific Publishing Co. Pte. Ltd., Danvers, MA, pp. 253-255 (2005)
18. Nisbet, Robert, John Elder, Gary Miner: "Statistical Analysis & Data Mining Applications", Elsevier Inc., California, pp. 111-269 (2009)
19. Poncelet, Pascal, Maguelonne Teisseire, Florent Masseglia: "Data Mining Patterns: New Methods and Applications", Information Science Reference, Hershey, PA, pp. 120-121 (2008)
20. Sembiring, Rahmat Widia, Jasni Mohamad Zain, Abdullah Embong: "Clustering High Dimensional Data Using Subspace and Projected Clustering Algorithms", International Journal of Computer Science & Information Technology (IJCSIT) 2(4), pp. 162-170 (2010)
21. Sembiring, Rahmat Widia, Jasni Mohamad Zain, Abdullah Embong: "Alternative Model for Extracting Multidimensional Data Based on Comparative Dimension Reduction", ICSECS (2), pp. 28-42 (2011)
22. Sembiring, Rahmat Widia, Jasni Mohamad Zain: "Cluster Evaluation of Density Based Subspace Clustering", Journal of Computing 2(11), pp. 14-19 (2010)
23. Smith, Lindsay I.: "A Tutorial on Principal Components Analysis", http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, pp. 12-16 (2002)
24. Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha: "Singular Value Decomposition and Principal Component Analysis", in A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow (eds.), Kluwer, Norwell, MA, LANL LA-UR-02-4001, pp. 91-109 (2003)
25. Wang, John: "Encyclopedia of Data Warehousing and Data Mining", Idea Group Reference, Hershey, PA, p. 812 (2006)
26. Xu, Rui, Donald C. Wunsch II: "Clustering", John Wiley & Sons, Inc., New Jersey, pp. 237-239 (2009)
27. Zhang, Daoqiang, Zhi-Hua Zhou, Songcan Chen: "Semi-Supervised Dimensionality Reduction", 7th SIAM International Conference on Data Mining, pp. 629-634 (2008)