International Journal on New Computer Architectures and Their Applications (IJNCAA) 1(3): 1041-1050
The Society of Digital Information and Wireless Communications, 2011 (ISSN: 2220-9085)
Dimension Reduction of Health Data Clustering
Rahmat Widia Sembiring1, Jasni Mohamad Zain2, Abdullah Embong3
1,2 Faculty of Computer Systems and Software Engineering, Universiti Malaysia Pahang, Lebuhraya Tun Razak, 26300, Kuantan, Pahang Darul Makmur, Malaysia
3 School of Computer Science, Universiti Sains Malaysia, 11800 Minden, Pulau Pinang, Malaysia
1 rahmatws@yahoo.com, 2 jasni@ump.edu.my, 3 ae@cs.usm.my
ABSTRACT
Current data tend to be more complex than conventional data and need dimension reduction. Dimension reduction is important in cluster analysis: it creates a representation that is smaller in volume yet yields the same analytical results as the original data. A clustering process needs dimension reduction to obtain efficient processing time and to mitigate the curse of dimensionality. This paper proposes a model for extracting multidimensional data clustering of health databases. We implemented four dimension reduction techniques: Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Self-Organizing Map (SOM), and FastICA. The results show that dimension reduction significantly reduces dimensionality, shortens processing time, and increases cluster performance on several health datasets.
KEYWORDS
DBSCAN, dimension reduction, SVD, PCA,
SOM, FastICA.
1 Introduction
Current data tend to be multidimensional, high dimensional, and more complex than conventional data. Many clustering algorithms have been proposed, but they often produce clusters that are less meaningful. The use of multidimensional data results in more noise, more complex data, and the possibility of unconnected data entities. This problem can be addressed by clustering algorithms; several clustering algorithms are grouped into cell-based clustering, density-based clustering, and clustering-oriented approaches. To obtain efficient processing time and mitigate the curse of dimensionality while clustering, a clustering process needs data reduction. Dimension reduction is a technique widely used in various applications to address the curse of dimensionality.
Dimension reduction is important in cluster analysis: it not only makes high-dimensional data addressable and reduces the computational cost, but also provides users with a clearer picture and visual examination of the data of interest [6]. Many dimension reduction techniques have been proposed. Local Dimensionality Reduction (LDR) tries to find local correlations in the data and performs dimensionality reduction on the locally correlated clusters of data individually [3]; dimension reduction can also be a dynamic process, adaptively adjusted and integrated with the clustering process [4].
Sufficient Dimensionality Reduction (SDR) is an iterative algorithm [8] that converges to a local minimum of its objective and hence solves the Max-Min problem as well. A number of optimizations can solve this minimization problem, and a reduction algorithm based on a Bayesian inductive cognitive model can be used to decide which dimensions are advantageous [11]. Developing an effective and efficient clustering method to process multidimensional and high-dimensional datasets is a challenging problem.
This paper is organized as follows. Section 2 presents related work. Section 3 explains the materials and method. Section 4 elucidates the results, followed by discussion in Section 5. Section 6 deals with the concluding remarks.
2 Related Work
The functions of data mining are association, correlation, prediction, clustering, classification, analysis, trends, outliers and deviation analysis, and similarity and dissimilarity analysis. Clustering techniques are applied when there is no class to predict but rather when the instances divide into natural groups [20]. Clustering multidimensional data has many challenges: noise, complexity of data, and data redundancy. To mitigate these problems, dimension reduction is needed. In statistics, dimension reduction is the process of reducing the number of random variables under consideration. The process is classified into feature selection and feature extraction [18]; the taxonomy of dimension reduction problems [16] is shown in Figure 1. Dimension reduction is the ability to identify a small number of important inputs (for predicting the target) from a much larger number of available inputs, and it is effective in cases where there are more inputs than cases or observations.
Figure 1. Taxonomy of dimension reduction problems (dimension reduction splits into records reduction and attribute/dimension reduction, covering record selection, feature selection, variable selection, and attribute, function, and simple decomposition, with the aims of increasing learning performance and reducing irrelevant and redundant dimensions)
Dimension reduction methods are associated with regression, additive models, neural network models, and Hessian-based methods [6]. One of these is Local Dimension Reduction (LDR), which looks for correlations in the dataset, reduces the dimensions of each correlated group individually, and then uses a multidimensional index structure [3]. Nonlinear algorithms give better performance than PCA for sound and image data [14]; other studies mention that a Principal Component Analysis (PCA)-based dimension reduction and texture classification scheme can be applied within a manifold statistical framework [3].
In most applications, dimension reduction is performed as a pre-processing step [5], carried out with traditional statistical methods that parse an increasing number of observations [6]. Reducing the dimensions creates a more effective domain characterization [1]. Sufficient Dimension Reduction (SDR) is a generalization of nonlinear regression problems, where the extraction of features is as important as the matrix factorization [8], while SSDR (Semi-Supervised Dimension Reduction) is used to maintain the original structure of high-dimensional data [27].
The goals of dimension reduction methods are to reduce the number of predictor components and to help ensure that these components are independent. The methods are designed to provide a framework for interpretability of the results, and to find a mapping F that maps the input data from the space R^d to a lower-dimensional feature space R^p, denoted F: R^d → R^p with p < d [26, 15].
Dimension reduction techniques such as Principal Component Analysis (PCA) and Partial Least Squares (PLS) can be used to reduce the dimension of microarray data before a certain classifier is applied [25]. We compared four dimension reduction techniques, each embedded in DBSCAN. These techniques are:
A. SVD
The Singular Value Decomposition (SVD) is a factorization of a real or complex matrix. The equation for the SVD of X is X = USV^T [24], where U is an m x n matrix, S is an n x n diagonal matrix, and V^T is also an n x n matrix. The columns of U are called the left singular vectors, {u_k}, and form an orthonormal basis for the assay expression profiles, so that u_i · u_j = 1 for i = j, and u_i · u_j = 0 otherwise. The rows of V^T contain the elements of the right singular vectors, {v_k}, and form an orthonormal basis for the gene transcriptional responses. The elements of S are nonzero only on the diagonal and are called the singular values; thus S = diag(s_1, ..., s_n). Furthermore, s_k > 0 for 1 ≤ k ≤ r, and s_k = 0 for (r+1) ≤ k ≤ n. By convention, the ordering of the singular vectors is determined by high-to-low sorting of the singular values, with the highest singular value in the upper-left index of the S matrix.
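As an illustration, the following minimal Python/NumPy sketch performs rank-k reduction via the truncated SVD described above; the function name, the synthetic data, and the choice of k are our own assumptions rather than part of the original RapidMiner experiments.

    import numpy as np

    def svd_reduce(X, k):
        """Project the rows of X onto its top-k singular directions."""
        # Thin SVD: X = U S V^T, singular values sorted high-to-low.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        # Keep the k leading singular triplets; U_k * S_k is the reduced data.
        return U[:, :k] * s[:k]

    # Example: reduce an 8-attribute dataset to a single dimension,
    # mirroring the SVD row of Table 1.
    X = np.random.rand(100, 8)
    print(svd_reduce(X, k=1).shape)  # (100, 1)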
B. PCA
PCA is a dimension reduction technique that uses variance as a measure of interestingness and finds orthogonal vectors (principal components) in the feature space that account for the most variance in the data [19]. Principal component analysis is probably the oldest and best known of the techniques of multivariate analysis; it was first introduced by Pearson and developed independently by Hotelling [12].
The advantages of PCA are identifying patterns in data and expressing the data in such a way as to highlight their similarities and differences. It is a powerful tool for analysing data: once these patterns are found, the data can be compressed by dimension reduction without much loss of information [23]. The PCA algorithm [7] is as follows:
a. Recover basis: calculate XX^T = Σ_{i=1..t} x_i x_i^T and let U = eigenvectors of XX^T corresponding to the top d eigenvalues.
b. Encode training data: Y = U^T X, where Y is a d x t matrix of encodings of the original data.
c. Reconstruct training data: X̂ = UY = UU^T X.
d. Encode test example: y = U^T x, where y is a d-dimensional encoding of x.
e. Reconstruct test example: x̂ = Uy = UU^T x.
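The steps above translate directly into NumPy; the following is a minimal sketch under the tutorial's convention of a features x samples data matrix, with the centering step and the synthetic data being our own additions for a runnable example.

    import numpy as np

    def pca_basis(X, d):
        """Recover a d-dimensional PCA basis from X (n features x t samples)."""
        Xc = X - X.mean(axis=1, keepdims=True)   # center each feature
        # Eigen-decomposition of XX^T; eigh returns eigenvalues ascending.
        eigvals, eigvecs = np.linalg.eigh(Xc @ Xc.T)
        return eigvecs[:, ::-1][:, :d]           # top-d eigenvectors as columns

    X = np.random.rand(8, 100)   # 8 attributes, 100 samples
    U = pca_basis(X, d=2)
    Y = U.T @ X                  # encode: d x t matrix of encodings
    X_hat = U @ Y                # reconstruct: UU^T X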
C. SOM
A self-organizing map (SOM) is a type of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map [14]. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space. This makes SOMs useful for visualizing low-dimensional views of high-dimensional data, akin to multidimensional scaling. The model was first described as an artificial neural network by the Finnish professor Teuvo Kohonen and is sometimes called a Kohonen map.
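A compact NumPy sketch of SOM training is given below; the grid size, decay schedules, and Gaussian neighborhood are illustrative assumptions, not the exact settings of the RapidMiner SOM operator used in our experiments.

    import numpy as np

    def train_som(X, rows=10, cols=10, iters=1000, lr0=0.5, sigma0=3.0):
        """Train a rows x cols self-organizing map on X (n_samples x n_features)."""
        rng = np.random.default_rng(0)
        W = rng.random((rows, cols, X.shape[1]))      # codebook (weight) vectors
        grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                    indexing="ij"), axis=-1)
        for t in range(iters):
            x = X[rng.integers(len(X))]
            # Best-matching unit: the node whose weight vector is closest to x.
            bmu = np.unravel_index(np.argmin(np.linalg.norm(W - x, axis=2)),
                                   (rows, cols))
            lr = lr0 * np.exp(-t / iters)             # decaying learning rate
            sigma = sigma0 * np.exp(-t / iters)       # shrinking neighborhood
            # Gaussian neighborhood around the BMU preserves topology.
            h = np.exp(-((grid - np.array(bmu)) ** 2).sum(axis=2)
                       / (2 * sigma ** 2))[..., None]
            W += lr * h * (x - W)                     # pull neighborhood toward x
        return W

After training, each sample is represented by the two-dimensional grid coordinates of its best-matching unit, which gives the low-dimensional map.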
D. FastICA
Independent Component Analysis (ICA) was introduced by Jeanny Hérault and Christian Jutten in 1986 and later clarified by Pierre Comon in 1994 [22]. FastICA is one of the extensions of ICA; it is based on a fixed-point iteration scheme for finding a maximum of the nongaussianity [9], and can also be derived as an approximate Newton iteration. FastICA uses the following update formula:

w+ = E{x g(w^T x)} − E{g'(w^T x)} w

where g is a nonquadratic nonlinearity (for example g(u) = tanh(u)) and g' is its derivative; in the multi-unit case the matrix W needs to be orthogonalized after each iteration.
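For a runnable illustration, scikit-learn provides a FastICA implementation; the sketch below is our own minimal usage example (the component count is arbitrary, and the paper's experiments ran FastICA within RapidMiner rather than scikit-learn).

    import numpy as np
    from sklearn.decomposition import FastICA

    X = np.random.rand(100, 8)                  # 100 samples, 8 attributes
    ica = FastICA(n_components=3, random_state=0)
    S = ica.fit_transform(X)                    # estimated independent components
    print(S.shape)                              # (100, 3)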
3 Material and Method
This study is designed to find the most efficient dimension reduction technique. To achieve this objective, we implemented a model in which cluster efficiency is evaluated by first reducing the dimensions of the datasets [21]. Four dimension reduction techniques were tested in the proposed model, namely SVD, PCA, SOM, and FastICA.
Figure 2. Proposed model compared based on dimension reduction and DBSCAN clustering (pipeline: original datasets → dimension reduction (DR) → clustering technique → cluster model; performance-1 is taken from the clustering output, and performance-2 after the data-to-similarity and filtering steps)
The dimension reduction result is processed by the DBSCAN clustering technique. DBSCAN needs ε (eps) and the minimum number of points required to form a cluster (minPts), here with mixed Euclidean distance as the distance measure. The DBSCAN clustering result is passed to a Data to Similarity operator, which calculates a similarity measure from the given data (attribute based); another measured output of DBSCAN is performance-1, which simply provides the number of clusters as a value.
The result of Data to Similarity is taken as an ExampleSet input by Filter Examples, which returns a new ExampleSet including only the examples that fulfil a condition. By specifying a condition implementation and a parameter string, arbitrary filters can be applied, and performance-2 is derived directly as a measure from specific data or statistics values. An expectation maximization clustering is then run with parameters k=2, max runs=5, max optimization steps=100, quality=1.0E-10, and k-means initial distribution.
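As a rough outline of this pipeline, the sketch below pairs a dimension reduction step with DBSCAN in Python, using scikit-learn as a stand-in for the RapidMiner operators; eps=1 and minPts=5 follow the settings reported in Section 4, while the reducer and the synthetic data are illustrative assumptions.

    import time
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import TruncatedSVD

    def reduce_and_cluster(X, reducer=None):
        """Optionally reduce X, then cluster with DBSCAN(eps=1, minPts=5)."""
        start = time.time()
        X_red = reducer.fit_transform(X) if reducer is not None else X
        labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_red)
        # performance-1: the number of clusters found (noise label -1 excluded).
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        return n_clusters, time.time() - start

    X = np.random.rand(748, 5)   # e.g., 748 x 5, like the blood transfusion data
    print(reduce_and_cluster(X, TruncatedSVD(n_components=1)))
    print(reduce_and_cluster(X))                 # without dimension reduction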
4 Results
Testing of model performance was conducted on four datasets: E-coli, acute implant, blood transfusion, and prostate cancer. Dimension reduction used SVD, PCA, SOM, and FastICA. Using RapidMiner, we conducted the testing process without dimension reduction before clustering, and then compared it with the results of the clustering process using dimension reduction. The results of attribute dimension reduction are shown in Table 1.
Table 1. Attribute dimension reduction (number of attributes for each dataset)

Dimension reduction            E-coli  Acute implant  Blood transfusion  Prostate cancer
with SVD                          1          1                1                 1
with PCA                          5          4                1                 3
with SOM                          2          2                1                 2
with FastICA                      8          8                5                18
without dimension reduction       8          8                5                18
By implementing the four different reduction techniques (SVD, PCA, SOM, and FastICA) and then applying the density-based clustering method, we obtained the following results. Figure 3a presents the E-coli dataset clustered by DBSCAN without dimension reduction; Figure 3b is the clustering result for the E-coli dataset based on DBSCAN with SVD; Figure 3c is the cluster of the E-coli dataset based on DBSCAN and PCA; Figure 3d is based on DBSCAN and SOM; while Figure 3e is the clustering result using DBSCAN with FastICA dimension reduction.
To assess efficiency, we conducted the tests and recorded the processing time, as shown in Table 2.
Table 2. Processing time for each dataset

Dimension reduction            Acute implant  Blood transfusion  Prostate cancer  E-coli
with SVD                             19               9                61           39
with PCA                             27              14                47           35
with SOM                             34              22                51           41
with FastICA                         67              12                58          148
without dimension reduction          22              11               188           90
Figure 3a. E-coli dataset based on DBSCAN without dimension reduction
Figure 3b. E-coli dataset based on DBSCAN and SVD
Figure 3c. E-coli dataset based on DBSCAN and PCA
Figure 3d. E-coli dataset based on DBSCAN and SOM
Figure 3e. E-coli dataset based on DBSCAN and FastICA

Using SVD, PCA, SOM, and FastICA, we also conducted the testing process and measured the performance in terms of the number of clusters, as shown in Table 3.
Table 3. Performance (number of clusters) for each dataset

Dimension reduction            E-coli  Acute implant  Blood transfusion  Prostate cancer
with SVD                          2         10               13                 1
with PCA                          2          2                2                 2
with SOM                          2          7               17                 1
with FastICA                      1          1               51                 2
without dimension reduction       8         10               13                 1
Further results were obtained for the acute implant dataset; Figures 4a-e present the acute implant dataset based on DBSCAN without dimension reduction and with the various dimension reduction techniques.
Figure 4a. Acute implant dataset based on DBSCAN without dimension reduction
Figure 4b. Acute implant dataset based on DBSCAN and SVD
Figure 4c. Acute implant dataset based on DBSCAN and PCA
Figure 4d. Acute implant dataset based on DBSCAN and SOM
Figure 4e. Acute implant dataset based on DBSCAN and FastICA
The third dataset tested was blood transfusion. Some of the results are presented in Figures 5a-e: the blood transfusion dataset based on DBSCAN without dimension reduction and with the various dimension reduction techniques.
Figure 5a. Blood transfusion dataset based on DBSCAN without dimension reduction
Figure 5b. Blood transfusion dataset based on DBSCAN and SVD
Figure 5c. Blood transfusion dataset based on DBSCAN and PCA
Figure 5d. Blood transfusion dataset based on DBSCAN and SOM
Figure 5e. Blood transfusion dataset based on DBSCAN and FastICA
Using the same dimension reduction techniques, we clustered the prostate cancer dataset; the results are presented in Figures 6a-e, based on DBSCAN without dimension reduction and with the various dimension reduction techniques.
Figure 6a. Prostate cancer dataset based on DBSCAN without dimension reduction
Figure 6b. Prostate cancer dataset based on DBSCAN and SVD
Figure 6c. Prostate cancer dataset based on DBSCAN and PCA
Figure 6d. Prostate cancer dataset based on DBSCAN and SOM
Figure 6e. Prostate cancer dataset based on DBSCAN and FastICA
For each cluster process, the values ε=1 and MinPts=5 were determined in advance, and the number of clusters (k=2) to be produced by the EM step was also fixed beforehand.
5 Discussion
Dimension reduction before the clustering process is performed to obtain efficient processing time and to increase the accuracy of cluster performance. Based on the results in the previous section, dimension reduction can shorten processing time while yielding the lowest number of attributes. Figure 7 shows that DBSCAN with SVD has the lowest number of attributes after reduction.
Figure 7. Reduction in the number of attributes
Another evaluation of the model implementation is the comparison of processing time. In general, dimension reduction decreased the processing time; for several datasets we found that DBSCAN with SVD had the lowest processing time.
Figure 8. Processing time for each dataset
The cluster process with FastICA dimension reduction has the highest cluster performance for the blood transfusion dataset (Figure 9) but the lowest for the other datasets, while PCA has the lowest performance over the datasets overall.
Figure 9. Performance (number of clusters) for each dataset
6 Conclusion
The discussion above has shown that applying a dimension reduction technique shortens the processing time. Dimension reduction before the clustering process is performed to obtain efficient processing time and to increase cluster performance. DBSCAN with SVD has the lowest processing time for several datasets, and SVD also produces the lowest number of reduced attributes. In general, dimension reduction shows an increased cluster performance.
References
1. Bi, Jinbo, Kristin Bennett, Mark Embrechts, Curt Breneman, Minghu Song: "Dimensionality Reduction via Sparse Support Vector Machines", Journal of Machine Learning Research 3, pp. 1229-1243 (2003)
2. Chakrabarti, Kaushik, Sharad Mehrotra: "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces", Proceedings of the 26th VLDB Conference, Cairo, Egypt, pp. 89-100 (2000)
3. Choi, S. W., Martin, E. B., Morris, A. J., Lee, I.-B.: "Fault Detection Based on a Maximum Likelihood PCA Mixture", Ind. Eng. Chem. Res. 44, pp. 2316-2327 (2005)
4. Ding, Chris, Tao Li: "Adaptive Dimension Reduction Using Discriminant Analysis and K-means Clustering", International Conference on Machine Learning, Corvallis, OR (2007)
5. Ding, Chris, Xiaofeng He, Hongyuan Zha, Horst Simon: "Adaptive Dimension Reduction for Clustering High Dimensional Data", Lawrence Berkeley National Laboratory, pp. 1-8 (2002)
6. Fodor, I.K.: "A Survey of Dimension Reduction Techniques", LLNL Technical Report, UCRL-ID-148494, pp. 1-18 (2002)
7. Ghodsi, Ali: "Dimensionality Reduction: A Short Tutorial", Technical Report 2006-14, Department of Statistics and Actuarial Science, University of Waterloo, pp. 5-6 (2006)
8. Globerson, Amir, Naftali Tishby: "Sufficient Dimensionality Reduction", Journal of Machine Learning Research 3, pp. 1307-1331 (2003)
9. Hyvärinen, Aapo, Erkki Oja: "Independent Component Analysis: Algorithms and Applications", Neural Networks, pp. 411-430 (2002)
10. Hyvärinen, A., Oja, E.: "Independent Component Analysis: Algorithms and Applications", Neural Networks 13, pp. 411-430 (2000)
11. Jin, Longcun, Wanggen Wan, Yongliang Wu, Bin Cui, Xiaoqing Yu, Youyong Wu: "A Robust High-Dimensional Data Reduction Method", The International Journal of Virtual Reality 9(1), pp. 55-60 (2010)
12. Jolliffe, I.T.: "Principal Component Analysis", Springer Verlag New York Inc., New York, pp. 7-26 (2002)
13. Kambhatla, Nanda, Todd K. Leen: "Fast Non-Linear Dimension Reduction" (1994)
14. Kohonen, T., Kaski, S., Lappalainen, H.: "Self-Organized Formation of Various Invariant-Feature Filters in the Adaptive-Subspace SOM", Neural Computation 9, pp. 1321-1344 (1997)
15. Larose, Daniel T.: "Data Mining Methods and Models", John Wiley & Sons Inc., New Jersey, pp. 1-15 (2006)
16. Maimon, Oded, Lior Rokach: "Data Mining and Knowledge Discovery Handbook", Springer Science+Business Media Inc., pp. 94-97 (2005)
17. Maimon, Oded, Lior Rokach: "Decomposition Methodology for Knowledge Discovery and Data Mining", World Scientific Publishing Co. Pte. Ltd., Danvers, MA, pp. 253-255 (2005)
18. Nisbet, Robert, John Elder, Gary Miner: "Statistical Analysis & Data Mining Applications", Elsevier Inc., California, pp. 111-269 (2009)
19. Poncelet, Pascal, Maguelonne Teisseire, Florent Masseglia: "Data Mining Patterns: New Methods and Applications", Information Science Reference, Hershey, PA, pp. 120-121 (2008)
20. Sembiring, Rahmat Widia, Jasni Mohamad Zain, Abdullah Embong: "Clustering High Dimensional Data Using Subspace and Projected Clustering Algorithms", International Journal of Computer Science & Information Technology (IJCSIT) 2(4), pp. 162-170 (2010)
21. Sembiring, Rahmat Widia, Jasni Mohamad Zain, Abdullah Embong: "Alternative Model for Extracting Multidimensional Data Based on Comparative Dimension Reduction", ICSECS (2), pp. 28-42 (2011)
22. Sembiring, Rahmat Widia, Jasni Mohamad Zain: "Cluster Evaluation of Density Based Subspace Clustering", Journal of Computing 2(11), pp. 14-19 (2010)
23. Smith, Lindsay I.: "A Tutorial on Principal Components Analysis", http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, pp. 12-16 (2002)
24. Wall, Michael E., Andreas Rechtsteiner, Luis M. Rocha: "Singular Value Decomposition and Principal Component Analysis", in A Practical Approach to Microarray Data Analysis, D.P. Berrar, W. Dubitzky, M. Granzow (eds.), Kluwer, Norwell, MA, LANL LA-UR-02-4001, pp. 91-109 (2003)
25. Wang, John: "Encyclopedia of Data Warehousing and Data Mining", Idea Group Reference, Hershey, PA, p. 812 (2006)
26. Xu, Rui, Donald C. Wunsch II: "Clustering", John Wiley & Sons, Inc., New Jersey, pp. 237-239 (2009)
27. Zhang, Daoqiang, Zhi-Hua Zhou, Songcan Chen: "Semi-Supervised Dimensionality Reduction", 7th SIAM International Conference on Data Mining, pp. 629-634 (2008)