An Approach of Hybrid Clustering Technique For Maximizing Similarity of Gene Expression



SYNOPSIS
Introduction
Cluster analysis or clustering is the task of grouping a set of objects in such a
way that objects in the same group (called a cluster) are more similar (in some
sense or another) to each other than to those in other groups (clusters). It is a
main task of exploratory data mining, and a common technique for
statistical data analysis, used in many fields, including machine learning, pattern
recognition, image analysis, information retrieval, bioinformatics, data
compression, and computer graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be
solved. It can be achieved by various algorithms that differ significantly in their
notion of what constitutes a cluster and how to efficiently find them. Popular
notions of clusters include groups with small distances among the cluster
members, dense areas of the data space, intervals or particular statistical
distributions. Clustering can therefore be formulated as a multi-objective
optimization problem. The appropriate clustering algorithm and parameter
settings (including values such as the distance function to use, a density
threshold or the number of expected clusters) depend on the individual data set
and intended use of the results. Cluster analysis as such is not an automatic task,
but an iterative process of knowledge discovery or interactive multi-objective
optimization that involves trial and error. It is often necessary to modify data
preprocessing and model parameters until the result achieves the desired
properties.
Clustering is an effective technique for extracting biologically significant information from gene datasets. Many sub-clustering methods are used on gene expression data to analyze gene functionalities and the aggregation between pairs of genes. The primary objective of this research work is to analyze microarray gene expression data by finding the similarity between genes and grouping them into clusters. Many clustering algorithms have been proposed to recognize subsets of co-regulated gene clusters, thereby maximizing the clusters for both positively and negatively correlated genes. HI-Clustering is one such algorithm for maximal subspace co-regulated genes; it finds the maximum number of biclusters based on a sub-cluster constraint. ERRFCM (Enhanced Robust Rough Fuzzy C-Means) is another method that increases the probability membership of the clusters and also handles overlapping gene clusters effectively. It is also useful for dealing with both probabilistic and possibilistic lower approximations. The proposed methods are used to identify strong groups of co-expressed genes and to produce the best result. Since a microarray records thousands of gene expression levels and a large number of genes must be processed, the one-way ANOVA model produces better results than other tests such as the two-sample t-test and the paired t-test, which rely on common parameters such as the mean, covariance and standard deviation for each pair of genes; the various algorithms are evaluated using bicluster analysis. The scope of this thesis work is to analyze gene expression data using statistical methods and to find the maximum number of clusters in a given dataset.
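The group comparison described above can be illustrated with a small, hedged sketch. The expression values below are synthetic and scipy is assumed to be available; one-way ANOVA tests all sample groups at once, while a two-sample t-test compares only two groups at a time.

```python
# Hedged sketch: comparing expression of one gene across sample groups.
# Values are synthetic; scipy is assumed to be available.
from scipy import stats

# Hypothetical expression levels of a single gene in three sample groups
control = [2.1, 2.3, 1.9, 2.2]
treated_a = [3.8, 4.1, 3.9, 4.0]
treated_b = [2.0, 2.2, 2.1, 1.8]

# One-way ANOVA tests whether any of the group means differ
f_stat, p_anova = stats.f_oneway(control, treated_a, treated_b)

# A two-sample t-test can only compare two groups at a time
t_stat, p_ttest = stats.ttest_ind(control, treated_a)

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4g}")
print(f"t-test (control vs treated_a): t={t_stat:.2f}, p={p_ttest:.4g}")
```

Here the small ANOVA p-value indicates that at least one group mean differs, without requiring a separate pairwise test for each pair of groups.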
Cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. In most cases, a post hoc analysis is done to see whether the genes in the same clusters can be functionally correlated. While past successes of such analyses have often been reported in a number of microarray studies (most of which used standard hierarchical clustering, UPGMA, with one minus Pearson's correlation coefficient as the measure of dissimilarity), such groupings can often be misleading. More importantly, a systematic evaluation of the entire set of clusters produced by such unsupervised procedures is necessary, since they also contain genes that are seemingly unrelated or may have more than one common function. Here we quantify the performance of a given unsupervised clustering algorithm applied to a given microarray study in terms of its ability to produce biologically meaningful clusters using a reference set of functional classes. Such a reference set may come from prior biological knowledge specific to a microarray study or may be formed using the growing Gene Ontology (GO) databases for the annotated genes of the relevant species.

Literature Review
This section discusses various works carried out by existing researchers on data mining techniques, gene expression data, microarrays, statistical methods used for measuring similarity, the advantages and limitations of existing clustering techniques, and the different data types and data repositories used for mining knowledge. A cluster is a group of objects related to one another based on the similarity between the objects. The correlation is calculated from the microarray gene expression data to form the clusters. The performance of each work is compared with the existing work.

Alter et al. (2009)1 applied the principal component analysis (PCA) to capture
the majority of the variations within the genes by a small set of principal
components (PCs), called eigen-genes. The samples were then projected on
the new lower-dimensional PC space. However, the eigen-genes did not
necessarily have a strong correlation with informative genes.
Ding et al. (2012)2 used a statistical method to select the genes that show large
variance in the expression matrix. Then a min-max cut hierarchical divisive
clustering approach was applied to cluster samples. Finally, the samples were
ordered such that adjacent samples were similar and samples far away were
different.

Danasingh Asir Antony Gnana Singh et al. (2015)3 analyzed the performance of density-based clustering, expectation maximization (EM) clustering, and K-means clustering in terms of the sum of squared error (SSE) and log likelihood on various gene expression data.
Arifovic et al (2010)4 explained that the genetic algorithm would have better convergence for a wider range of parameters.
Hommes et al (2011)5 suggested that the cobweb model can show consistent
rational behaviour for non-linear dynamic models. The similarity measure used
in the cobweb is the distance measure.
Alejos et al (2005)6 presented a technique to calculate the magnetization of the
simulated system with improved accuracy by means of the Preisach model.

Noteworthy Contributions
Zhechong Zhao et al (2013)7 presented a cobweb plot which is used to
illustrate graphically the iterative procedure and to analyse stability.
Yuni Xia et al (2012)8 proposed a conceptual clustering algorithm which can
explicitly handle the uncertainties in the values of the dataset. Total utility (TU)
index is introduced to measure the quality of the clustering.
Moon et al (2011)9 stated that the expectation-maximization algorithm is suitable for outcomes that are clumped together.
Brankov et al (2002)10 proposed the normalized cross correlation that has better
performance than the traditionally used Euclidean distance which is used as the
similarity measure for the expectation maximization. The EM algorithm is
mainly suitable for the analysis of the image data.
Lagendijk et al (2011)11 applied the maximum likelihood approach to identify
and restore noisy data in the blurred images. This EM method can facilitate
maximizing likelihood functions that arise in statistically estimating problems.
Figueiredo et al (2003)12 presented the algorithm for the restoration of the
images using the penalized likelihood.

Fessler et al (2014)13 presented a new update of sequentially alternating the
parameters between the several small hidden data spaces which are defined by
the algorithm designer.
Manoj et al (2013)14 suggested that the farthest first algorithm is suitable for the
large dataset and the clusters produced are non-uniform. So they developed an
optimized farthest first clustering algorithm to produce uniform clusters.
Chung-Ming et al (2012)15 proposed a farthest first forwarding algorithm to
reduce the transmission delay in the vehicular adhoc networks (VANETs).
H. K. Yogish et al (2014)16 proposed a strategy of farthest first traversal for
finding the frequent traversal path in the navigation and reorganization of the
website structure. This clustering algorithm can eventually speed up the
clustering since there are only few adjustments in the data.
According to Bilenko et al (2004)17, constraint-based methods and distance-function learning methods provide the similarity metrics used in the algorithm. The major advantage is that it is a heuristic-based method that is fast, scalable and appropriate for large datasets, but it is difficult to compare the quality of the clusters produced. The method does not hold for non-globular clusters and is very sensitive to outliers.
Jiang-She Zhang et al (2000)18 proposed a clustering algorithm for the
processing of the images. They are computationally stable and insensitive to
initialization. They also produce consistent clusters.
Thomas et al (2005)19 proposed a collaborative filtering which is a combination
of the correlation and singular value decomposition (SVD) to improve accuracy.
A weighted co-clustering algorithm is designed in incremental and parallel
versions and the results are empirically evaluated.
Lagendijk et al (2010)20 proposed two different methods to estimate the
performances of individual classifiers and then combine them based on the
weight of the individual classifiers.

Edward J. Coyle et al (2003)21 proposed a randomized algorithm which is
mainly applicable for the sensors for the generation of the cluster heads in a
hierarchical manner.
Michael Dittenbach et al (2000)22 presented a growing hierarchical self-
organizing map that evolves on the input data during the unsupervised training
process.
Guangyu et al (2011)23 developed a comparative analysis and suggested that hierarchical clustering is better than conventional clustering. It produces an extensive hierarchy of clusters that merge with one another at certain distances.
Li Tu et al (2007)24 proposed a framework called D-stream for clustering using
the density based approach.
Mitra et al (2002)25 suggested a nonparametric data reduction scheme. The procedure followed here separates the objects in dense areas from those in less dense areas with the aid of an arbitrary object.
Tapas et al (2002)26 identified that the algorithm works faster as the separation between the clusters increases. This algorithm is applicable for the segmentation of images and for data compression.
Kanungo et al (2002)27 proposed that the K-means algorithm runs faster as the separation between the clusters increases.
Jakob J. Verbeek et al (2003)28 suggested a solution to reduce the
computational load without affecting the quality of the solution significantly.
The algorithm is robust, fast and easy to understand. It also yields better results
when the datasets are well separated or distinct from each other. It does not work
efficiently for non-linear and categorical data.
Yeung et al. (2010)29 extended the idea of prediction strength and proposed
an approach to cluster validation for gene clusters. Intuitively, if a cluster of
genes has possible biological significance, then the expression levels of the
genes within that cluster should also be similar to each other in test samples
that were not used to form the cluster.
Jakt et al. (2011)30 integrated the assessment of the potential functional
significance of both gene clusters and the corresponding postulated regulatory
motifs (common DNA sequence patterns), and developed a method to estimate
the probability (p-value) of finding a certain number of matches to a motif in all
of the gene clusters. A smaller probability indicates a higher significance of the
clustering results.
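The kind of motif-match p-value described above can be sketched under a simple binomial null model. This is an illustration only, not the exact method of Jakt et al.; the gene counts and the background match rate below are hypothetical.

```python
# Hedged sketch: probability of seeing at least k motif matches among n
# sequences when each sequence matches with background probability p.
# This is a simple binomial null model, not Jakt et al.'s exact method.
from math import comb

def motif_pvalue(n: int, k: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical cluster of 20 genes, 5% background match rate, 6 matches seen
pv = motif_pvalue(n=20, k=6, p=0.05)
print(f"p-value: {pv:.2e}")  # a small value indicates motif enrichment
```

As the text notes, a smaller probability indicates a higher significance of the clustering result: six matches where only one is expected by chance is very unlikely under the null model.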

Research Methodology
The evolution of technologies employed to collect and store data makes handling large dumps of data inevitable, and the information of interest has to be extracted from these files of data. This task of preprocessing the data to provide the desired information becomes extremely challenging with traditional data analysis techniques, so the field of data mining grows more significant as the data grows bigger every day. Its aim is to uncover the relationships among the data, which in turn explores new ways to generate an appropriate affine set of the expected final outcome of the mining. The affine set is refined with every new relation uncovered in each step of the mining, namely classification, clustering, association, sequence analysis and regression. For every sub-task, the choice of processing algorithms and techniques from statistics, machine learning, fuzzy systems, artificial intelligence, data warehousing, etc., depends on the complexity and nature of the data involved in the process. Bioinformatics is a field that employs various tools and techniques from computer science and its applications to analyze and integrate biological information, such as genetic information in molecular and atomic protein structures. This information is vital in discovering new and efficient drugs for genetic disorders and some new diseases. Some such analyses are protein structure prediction, classification of genes, classification of tumor cells, and clustering of gene expression data. Microarray data is significant for interpreting this biological data into computable data. Integrating efficient procedures and techniques from data mining and bioinformatics may thereby open up exciting chances of discovering efficient medicines and treatment procedures for deadly diseases like cancer.

Several obvious aims of this data analysis are the following:

1. Identify genes whose expression levels reflect biological processes of interest (such as the development of cancer).

2. Group the tumors into classes that can be differentiated on the basis of their expression profiles, possibly in a way that can be interpreted in terms of clinical classification. For example, one hopes to use the expression profile of a tumor to select the most effective therapy.

3. Finally, the analysis can provide clues and hypotheses about the specific genes involved.

HI-Clustering for Maximizing Sub Clusters in the Gene Expression Data

Clustering is a popular technique for extracting biologically significant information from genetic datasets. Many sub-clustering methods are used on gene expression data to analyze gene functionalities and the aggregation between pairs of genes. The relationship between a pair of genes is measured using a rank correlation technique, which yields either a positive or a negative correlation value based on the organ functionalities of the gene pair. Since genes are responsible for several biological processes, they should belong to different modules for each such process, response or action. A DNA microarray is used to analyze genes under a set of experimental conditions. Various clustering algorithms have been proposed for recognizing subsets of co-regulated gene clusters and maximizing the clusters for both positively and negatively regulated genes. Most existing methods emphasize individual gene ranking and do not consider any correlation among genes. The proposed method uses subset ranking methods that determine the discerning ability of a gene subset by measuring the linear relationship and correlation value among the set of genes; using this idea, the partitioned genes are mutually clustered with one another. Conventional clustering algorithms do not generate the maximum number of sub-clusters from a given gene dataset. In the proposed work, the HI-Clustering algorithm is introduced and evaluated through extensive experimental studies on a yeast dataset in order to maximize the sub-clusters and produce better performance.
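The pairwise rank-correlation step can be sketched as follows, using Spearman's rank correlation as a representative rank correlation technique (the exact measure used by HI-Clustering is not specified here). The expression vectors are synthetic and scipy is assumed to be available.

```python
# Sketch of pairwise rank correlation for gene pairs, yielding positive
# values for co-regulated pairs and negative values for negatively
# regulated pairs. Expression vectors are synthetic.
from scipy.stats import spearmanr

gene_x = [1.2, 2.4, 3.1, 4.8, 5.0]   # expression across 5 conditions
gene_y = [0.9, 1.8, 2.7, 4.1, 4.6]   # co-regulated: same rank ordering
gene_z = [5.1, 4.0, 3.2, 2.1, 1.0]   # negatively regulated: reversed ordering

rho_pos, _ = spearmanr(gene_x, gene_y)   # near +1: positive correlation
rho_neg, _ = spearmanr(gene_x, gene_z)   # near -1: negative correlation

print(rho_pos, rho_neg)
```

Because rank correlation depends only on the ordering of the values, it captures co-regulation even when two genes' absolute expression levels differ, which is why both strongly positive and strongly negative pairs can be grouped.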

Maximizing Gene Pairs Using an Enhanced Fuzzy Clustering


The important tasks of clustering in data mining include effective functional analysis of gene clusters, which can be performed using various algorithms. Clustering techniques are used to understand gene functions, cellular processes, gene regulation and subtypes of cells. The techniques traditionally used to measure the performance of gene overlapping are CLICK, SOM and rRFCM. The proposed ERRFCM increases the probability membership of the clusters and also handles overlapping gene clusters efficiently. It is very useful for dealing with both probabilistic and possibilistic lower approximations. The core idea is to identify strong groups of co-expressed genes to produce a better result. The gene clusters can be produced using any of the HCM, FCM, RFCM, SOM, CLICK and rRFCM algorithms and can be visualized with the TreeView software for 14 microarray datasets.

Expected Outcomes of the Study
Gene expression data analysis is an important field in the biological area, which follows different techniques to group related genes. Currently, clustering is a tool broadly used in gene expression data analysis to obtain biological information. A primary aim of such an analysis is the detection of groups of genes that demonstrate similar expression patterns, which can help the scientist to conduct suitable diagnosis and treatment of patients. Clustering techniques are very useful for finding co-regulated genes, but each method has its own limitations and a predefined number of clusters. To overcome this, a new HI-Clustering method will be developed and evaluated with pattern recognition.

References

[1] Alter O., Brown P.O. and Bostein D. Singular value decomposition for
genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci.
USA, Vol. 97(18):10101-10106, August 2009.
[2] Ding, Chris. Analysis of gene expression profiles: class discovery and leaf
ordering. In Proc. of International Conference on Computational Molecular
Biology (RECOMB), pages 127-136, Washington, DC, April 2012.
[3] Danasingh Asir Antony Gnana Singh, Subramanian Appavu Alias
Balamurugan, Epiphany Jebamalar Leavline, Improving the Accuracy of the
Supervised Learners using Unsupervised based Variable Selection, Asian
Journal of Information Technology, 13.9 (2014): 530-537.
[4] Arifovic, Jasmina, Genetic algorithm learning and the cobweb model,
Journal of Economic dynamics and Control, 18.1 (2010): 3- 28.
[5] Hommes, Cars H, On the consistency of backward-looking expectations:
The case of the cobweb, Journal of Economic Behavior & Organization, 33.3
(2011): 333-362.
[6] Alejos, Óscar, and Edward Della Torre, The generalized cobweb method,
Magnetics, IEEE Transactions on 41, 5 (2005): 1552-1555.
[7] Zhao, Zhechong, and Lei Wu, Stability analysis for power systems with
pricebased demand response via Cobweb Plot, Proc. IEEE PES General
Meeting, 2013.
[8] Yuni Xia,Bowei Xi, Conceptual Clustering Categorical Data with
Uncertainty, 19th IEEE International Conference on Tools with Artificial
Intelligence.
[9] Moon, Todd K, The expectation-maximization algorithm, Signal
processing magazine, IEEE 13, 6 (2011): 47-60.

[10] Brankov, Jovan G., et al. Similarity based clustering using the expectation
maximization algorithm, Image Processing, 2002, Proceedings, 2002
International Conference, Vol. 1, IEEE, 2002.
[11] Lagendijk, Reginald L., Jan Biemond, and Dick E. Boekee, Identification
and restoration of noisy blurred images using the expectation-maximization
algorithm, IEEE Transactions on Acoustics, Speech, and Signal Processing
[see also IEEE Transactions on Signal Processing], 38 (7) (2011).
[12] Figueiredo, Mário A. T., and Robert D. Nowak, An EM algorithm for
wavelet-based image restoration, Image Processing, IEEE Transactions on 12.8
(2003): 906-916.
[13] Fessler, Jeffrey, and Alfred O. Hero, Space-alternating generalized
expectation-maximization algorithm, Signal Processing, IEEE Transactions on
42, 10 (2014): 2664-2677.
[14] Kumar, Manoj, An optimized farthest first clustering algorithm,
Engineering (NUiCONE), 2013, Nirma University International Conference on
IEEE, 2013.
[15] Huang, Chung-Ming, et al. A farthest-first forwarding algorithm in
VANETs, ITS Telecommunications (ITST), 2012, 12th International
Conference on IEEE, 2012.
[16] Vadeyar, Deepshree A., and H. K. Yogish, Farthest First Clustering in
Links Reorganization, International Journal of Web & Semantic Technology, 5,
3 (2014): 17.
[17] Bilenko, Mikhail, Sugato Basu, and Raymond J. Mooney, Integrating
constraints and metric learning in semi-supervised clustering, Proceedings of
the twenty-first international conference on Machine learning, ACM, 2004.
[18] Leung, Yee, Jiang-She Zhang, and Zong-Ben Xu, Clustering by scale-
space filtering, Pattern Analysis and Machine Intelligence, IEEE Transactions
on, 22,12 (2000): 1396-1410.

[19] George, Thomas, and Srujana Merugu, A scalable collaborative filtering
framework based on co-clustering, Data Mining, Fifth IEEE International
Conference on IEEE, 2005.
[20] Lagendijk, Reginald L., Jan Biemond, and Dick E. Boekee, Identification
and restoration of noisy blurred images using the expectation-maximization
algorithm, IEEE Transactions on Acoustics, Speech, and Signal Processing
[see also IEEE Transactions on Signal Processing], 38 (7) (2010).
[21] Bandyopadhyay, Seema, and Edward J. Coyle, An energy efficient
hierarchical clustering algorithm for wireless sensor networks, INFOCOM
2003, Twenty-Second Annual Joint Conferences of the IEEE Computer and
Communications, IEEE Societies, Vol. 3, IEEE, 2003.
[22] Dittenbach, Michael, Dieter Merkl, and Andreas Rauber, The growing
hierarchical self-organizing map, IJCNN, IEEE, 2000.
[23] Pei, Guangyu, et al. A wireless hierarchical routing protocol with group
mobility, Wireless Communications and Networking Conference, 1999,
WCNC, IEEE, 2011.
[24] Chen, Yixin, and Li Tu, Density-based clustering for real-time stream
data, Proceedings of the 13th ACM SIGKDD international conference on
Knowledge discovery and data mining, ACM, 2007.
[25] Mitra, Pabitra, C. A. Murthy, and Sankar K. Pal, Density-based multiscale
data condensation, Pattern Analysis and Machine Intelligence, IEEE
Transactions on 24, 6 (2002): 734-747.
[26] Kanungo, Tapas, et al. An efficient k-means clustering algorithm:
Analysis and implementation, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 24, 7 (2002): 881-892.
[27] Kanungo, Tapas, et al. An efficient k-means clustering algorithm:
Analysis and implementation, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 24, 7 (2002): 881-892.

[28] Likas, Aristidis, Nikos Vlassis, and Jakob J. Verbeek, The global k-means
clustering algorithm, Pattern recognition, 36, 2 (2003): 451-461.
[29] Yeung, K.Y., Haynor, D.R. and Ruzzo, W.L., Validating Clustering for
Gene Expression Data, Bioinformatics, Vol 17(4):309-318, 2010.
[30] Jakt, L.M., Cao, L., Cheah, K.S.E., Smith, D.K., Assessing clusters and
motifs from gene expression data, Genome research, 11:112-123, 2011.
[31] Sara C. Madeira and Arlindo L. Oliveira Biclustering Algorithms for
Biological Data Analysis: A Survey IEEE Transactions on computational
biology and bioinformatics, vol. 1, no. 1, january-march 2004, 24-45
[32] Heather L. Turner, Trevor C. Bailey, Wojtek J. Krzanowski, and Cheryl A.
Hemingway Biclustering Models for Structured Microarray Data IEEE/ACM
Transactions on computational biology and bioinformatics, vol. 2, no. 4,
october-december 2005,316-329
[33] Wen-Hui Yang, Dao-Qing Dai, Member, IEEE, and Hong Yan, Fellow,
IEEE, Finding Correlated Biclusters from Gene Expression Data IEEE
Transactions on knowledge and data engineering, vol. 23, no. 4, april 2011, 568-
584.
[34] Banu Dost, Chunlei Wu, Andrew Su, and Vineet Bafna, TCLUST: A Fast
Method for Clustering Genome-Scale Expression Data, IEEE/ACM
transactions on computational biology and bioinformatics, vol. 8, no. 3,
may/june 2011,808-818.
[35] D. Jiang, C. Tang, and A. Zhang, Cluster Analysis for Gene Expression
Data: A Survey, IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp.
1370-1386, Nov. 2004.
[36] P. Maji and S.K. Pal, Rough Set Based Generalized Fuzzy CMeans
Algorithm and Quantitative Indices, IEEE Trans. System, Man, and
Cybernetics, Part B: Cybernetics, vol. 37, no. 6, pp. 1529-1540, Dec. 2007.
[37] D. Dembele and P. Kastner, Fuzzy C-Means Method for Clustering
Microarray Data, Bioinformatics, vol. 19, no. 8, pp. 973-980, 2003.
