Discretization of Gene Expression Data Revised
Abstract
Gene expression measurements represent the most important source of biological data used to unveil the interaction and
functionality of genes. In this regard, several data mining and machine learning algorithms have been proposed that
require, in a number of cases, some kind of data discretization to perform the inference. The choice of discretization process has a major impact on the design and outcome of these inference algorithms, as several relevant issues must be taken into account. This study presents a review of the current state-of-the-art discretization techniques, together with the key issues that need to be considered when designing or selecting a discretization approach for
gene expression data.
Key words: discretization; data preprocessing; gene expression data; gene expression analysis; data mining; machine
learning
Introduction
Recent developments in mRNA quantification techniques enable the simultaneous measurement of the expression level of a
large number of genes for a given experimental condition. Both
microarray and RNA-seq technologies are providing an unprecedented amount of meaningful biological data. In this regard,
numerous machine learning methods have been extensively
used in the analysis of gene expression data (GED) obtained
from these experiments [17]. In several cases, the input data are required to be discrete, as with any modeling algorithm that uses discrete-state models [3, 7, 8].
Data discretization is a technique used in computer science
and statistics, frequently applied as a preprocessing step in the
analysis of biological data. In general, the aim of GED discretization is to allow the application of algorithms for the inference
Cristian A. Gallo has a PhD in Computer Science and holds a postdoctoral position at the Laboratory of Research and Development in Scientific Computing. His
research interests are in the area of machine learning applied to bioinformatics optimization problems.
Rocío L. Cecchini has a PhD in Computer Science. She is a researcher at the Laboratory of Research and Development in Scientific Computing. Her research
interests focus on data-mining and text-mining methods with application to bioinformatic problems.
Jessica Carballido has a PhD in Computer Science and is a Professor and a Researcher at the Universidad Nacional del Sur. Her research group focuses on
machine learning techniques applied to bioinformatic optimization problems.
Sandra Micheletto has a PhD in Agronomy with a minor in Bioinformatics. Her research lies in microarray data analysis and computational methods to
study plant gene expression under drought stress conditions.
Ignacio Ponzoni has a PhD in Computer Science. He is the current president of the Argentinean Association on Computational Biology and Bioinformatics
(A2B2C) and he is also a member of the ISCB. His research interests focus on evolutionary computing and machine learning methods applied to systems
biology and molecular informatics.
Submitted: 22 May 2015; Received (in revised form): 26 July 2015
© The Author 2015. Published by Oxford University Press. For Permissions, please email: journals.permissions@oup.com
Corresponding author. Ignacio Ponzoni, Laboratorio de Investigación y Desarrollo en Computación Científica (LIDeCC), Dpto. de Cs. e Ing.
de la Computación, UNS, 1253 Av. Alem, Bahía Blanca, Argentina 8000. Tel.: +54 291 4595135; Fax: +54 291 4595136. E-mail: ip@cs.uns.edu.ar.
Gallo et al.
that best suits a particular case of interest. This review revisits the classical and state-of-the-art approaches for the discretization of GED, covering the most recent research in the field and including supervised discretization for GED, which has not been addressed previously. Additionally, an in-depth analysis of the main features of the discretization of GED is presented, providing a valuable tool for guidance in the selection of an appropriate discretization approach. Finally, a software package that implements most of the GED discretization methods is also provided.
In the following section, the problem definition is introduced, followed by the key issues to be considered when dealing with the discretization of GED. Next, the state-of-the-art approaches used for GED are summarized. Finally, a discussion is presented.
In this section, the main features of gene expression discretization are presented and analyzed as follows:
Supervision
The two high-level features of the discretization of GED, namely
supervised and unsupervised, define whether the particular
approach relies on class label information to perform the discretization. In supervised discretization, the values of gi are assigned according to the class label information of a specific
knowledge domain. The manner in which the discretization
method considers the class information depends on the relation
between the data and the class labels, and on the heuristic
measure used to determine the best cut points. However, most
discretization approaches proposed in the literature for GED are
unsupervised and follow the problem definition explained in
the previous section.
Level of discretization
Another characteristic to consider about the discretization of GED is the level of discretization. In the simplest case, the alphabet R may contain only two symbols (see Figure 1), where one symbol is used for upregulation (or activation) and the other is used for downregulation (or inhibition). Thereby, the expression matrix is usually transformed into a binary matrix, where 1 means upregulation and 0 means downregulation. Another widely used scheme is to consider a ternary set of discretization symbols, {−1, 1, 0}, meaning downregulated, upregulated or no change. Nevertheless, the values in matrix A may be discretized in a multilevel manner to an arbitrary number of expression levels.
Sample type
The sample type is related to the previous feature, i.e. the type of experiment that determines the meaning of the columns in the data matrix. There are basically two types of samples: the equilibrium (steady-state) expression levels, which correspond to a static situation, and the time-series expression levels, which are gathered during a phenotypic phase, such as in the cell cycle [23]. Generally, in steady-state expression data, different experimental conditions refer to different tissues, temperatures, chemical compounds or any other condition that may produce different regulatory behavior among the sampled genes. Each element aij of A contains the expression value of gene gi in the sample or experimental condition j. On the other hand, the time
Figure 1. Workflow of the discretization process with two discrete states (R = {0, 1}) for the gene gi. In the example, the GED A and the discretized GED A′ are composed of N genes and four experimental conditions. The discretization algorithm fd takes A and gi and infers the cut point P = {7} and the discretization scheme D = {[0.4, 7], (7, 10]} to obtain the discretized expression profile of gi in the GED A′. Then, the discretized GED A′ is the input of an inference algorithm that will extract some kind of information of interest.
Figure 2. Main features of gene expression discretization with their multiple variants.
Data scope
The data scope refers to the portion of the data considered when inferring the discretization scheme: a single expression value, a gene expression profile, a condition profile or the whole matrix. When the goal is to discretize the matrix A with a level of discretization of two, these approaches follow the basic formulation given in Equation 1 to determine a′ij, where d represents the metric used in the computation:
a'_{ij} =
\begin{cases}
1 & \text{if } a_{ij} \ge d \\
0 & \text{otherwise}
\end{cases}
\qquad (1)

For a level of discretization of three, the analogous formulation assigns the ternary symbols {−1, 1, 0}:

a'_{ij} =
\begin{cases}
-1 & \text{if } a_{ij} < -d \\
1 & \text{if } a_{ij} > d \\
0 & \text{otherwise}
\end{cases}
\qquad (2)

\mathrm{WCSS}(D) =
\sum_{a_{ij} \in [p_0,\, p_1]} |a_{ij} - \mu_0|^2 +
\sum_{r=1}^{k-1} \sum_{a_{ij} \in (p_r,\, p_{r+1}]} |a_{ij} - \mu_r|^2
\qquad (3)

where \mu_r is the mean of the expression values falling in the r-th interval of the discretization scheme D.
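The threshold schemes of Equations 1 and 2 can be sketched in a few lines. This is an illustrative sketch, not code from any of the reviewed methods: the function names are hypothetical, and the threshold d would in practice come from the chosen metric (e.g. the mean or median of the profile).

```python
def discretize_binary(profile, d):
    """Eq. (1): 1 (upregulated) if the value reaches the threshold d,
    0 (downregulated) otherwise."""
    return [1 if a >= d else 0 for a in profile]

def discretize_ternary(profile, d):
    """Eq. (2): 1 above d (upregulated), -1 below -d (downregulated),
    0 in between (no change)."""
    out = []
    for a in profile:
        if a > d:
            out.append(1)
        elif a < -d:
            out.append(-1)
        else:
            out.append(0)
    return out

# A toy expression profile of one gene across four conditions.
print(discretize_binary([0.4, 2.0, 7.0, 10.0], 7.0))        # [0, 0, 1, 1]
print(discretize_ternary([-3.0, -0.2, 0.1, 5.0], 1.0))      # [-1, 0, 0, 1]
```

With the cut point d = 7 from Figure 1, the binary scheme maps the profile (0.4, 2, 7, 10) to (0, 0, 1, 1).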
Figure 3. Simple clustering approach for a level of discretization of two, where S represents the expression values to be discretized. After sorting the expression values in a list L, the WCSS of all the discretization schemes Di such that 1 ≤ p ≤ 4 is calculated. Then, the best scheme is the one with the lowest WCSS, thus giving the discretized expression values of S as shown.
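The exhaustive search sketched in Figure 3 follows directly from Equation 3. Below is a minimal sketch for a level of discretization of two; the function names are illustrative assumptions, not part of the reviewed methods.

```python
def wcss(groups):
    # Within-cluster sum of squares (Eq. 3): squared deviation of each
    # value from the mean of its interval, summed over all intervals.
    total = 0.0
    for g in groups:
        mu = sum(g) / len(g)
        total += sum((x - mu) ** 2 for x in g)
    return total

def best_single_cut(values):
    # Try every cut position p between consecutive sorted values and
    # keep the two-interval scheme with the lowest WCSS.
    l = sorted(values)
    best_score, best_cut = None, None
    for p in range(1, len(l)):
        score = wcss([l[:p], l[p:]])
        if best_score is None or score < best_score:
            best_score, best_cut = score, l[p - 1]
    return best_score, best_cut

score, cut = best_single_cut([0.4, 1.0, 6.8, 7.0, 10.0])
print(cut)  # 1.0 -> values up to 1.0 become 0, the rest become 1
```

The chosen cut separates the two low values from the three high ones, which is the partition with the smallest within-cluster spread.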
Table 1. An example of the Bikmeans discretization with k = 3, combining the discretizations obtained with the (k + 1)-means algorithm in the gene profile (a′gij) and in the condition profile (a′cij). The discrete state a′ij (bold), with 1 ≤ a′ij ≤ k, is assigned to aij if (a′ij)² ≤ a′gij · a′cij < (a′ij + 1)². Each cell shows the product a′gij · a′cij and the resulting state a′ij.

a′gij · a′cij    a′cij = 1    a′cij = 2    a′cij = 3    a′cij = 4
a′gij = 1        1 → 1        2 → 1        3 → 1        4 → 2
a′gij = 2        2 → 1        4 → 2        6 → 2        8 → 3
a′gij = 3        3 → 1        6 → 2        9 → 3        12 → 3
a′gij = 4        4 → 2        8 → 3        12 → 3       16 → 3
a'_{ij} =
\begin{cases}
1 & \text{if } a_{ij} - a_{i,j-1} > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad (4)
Figure 4. An example of the discretization of a gene expression profile gi with six experimental conditions using the SSD algorithm. First, a complete weighted graph for the gene gi is constructed, where each vertex is an expression value and each edge is the Euclidean distance between the vertexes. Then, the graph is disconnected until three components are obtained, discretizing the values according to the alphabet R = {0, 1, 2}.
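For one-dimensional expression values, disconnecting the complete weighted graph into three components amounts to cutting the two largest gaps between consecutive sorted values. The sketch below illustrates that idea only; it is a simplified stand-in for the SSD procedure, and the function name is hypothetical.

```python
def gap_split(values, k=3):
    # Sort the values and locate the k-1 largest gaps between
    # consecutive sorted values; each gap corresponds to a heavy edge
    # whose removal splits off one component of the complete graph.
    order = sorted(range(len(values)), key=lambda i: values[i])
    svals = [values[i] for i in order]
    gaps = [(svals[i + 1] - svals[i], i) for i in range(len(svals) - 1)]
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:k - 1])

    # Assign one discrete symbol (0, 1, ..., k-1) per component.
    labels = [0] * len(values)
    level, c = 0, 0
    for rank, idx in enumerate(order):
        if c < len(cuts) and rank > cuts[c]:
            level += 1
            c += 1
        labels[idx] = level
    return labels

print(gap_split([1.0, 2.0, 9.0, 10.0, 20.0, 21.0]))  # [0, 0, 1, 1, 2, 2]
```

The two wide gaps (2 to 9 and 10 to 20) are cut, yielding the three discrete states of the alphabet {0, 1, 2}.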
a'_{ij} =
\begin{cases}
1 & \text{if } \ldots \\
0 & \text{otherwise}
\end{cases}
\qquad (5)

\ldots \qquad (6)

a'_{ij} =
\begin{cases}
\ldots & \text{if } a_{i,j-1} \ne 0 \\
\ldots & \text{if } \ldots \\
\ldots & \text{if } a_{i,j-1} = 0 \wedge a_{ij} = 0
\end{cases}
\qquad (7)

a'_{ij} =
\begin{cases}
1 & \text{if } a_{ij} \ge t \\
0 & \text{if } a_{ij} < t
\end{cases}
\qquad (8)
i) Calculate the score of each interval in Dek as the entropy of the target values (cij's). For an interval L[e_{k-1} : e_k] derived from the values of a gene expression matrix A, and a target class label c belonging to C, which can take k = |C| values, the entropy can be defined as:

e(L[e_{k-1}:e_k]) = -\sum_{j=1}^{|C|} P(c = c_j) \log_2 P(c = c_j)
\qquad (9)

where P(c = cj) is the probability that an instance in the current interval takes the value cj.
ii) Discretize each interval recursively into two new subintervals. Given an interval L[e_{k-1} : e_k] and its entropy e(L[e_{k-1} : e_k]), a cut point p can be greedily specified by minimizing the joint entropy of the subintervals defined by p in L[e_{k-1} : e_k]:

e(p; L[e_{k-1}:e_k]) =
\frac{|L[e_{k-1}:e_p]|}{|L[e_{k-1}:e_k]|}\, e(L[e_{k-1}:e_p]) +
\frac{|L[e_{p+1}:e_k]|}{|L[e_{k-1}:e_k]|}\, e(L[e_{p+1}:e_k])
\qquad (10)

where L[e_{k-1} : e_p] and L[e_{p+1} : e_k] are the two new subintervals defined by the cut point p in L[e_{k-1} : e_k].
The FI method does not guarantee that the cut point with minimum entropy will be discovered, because it does not perform an exhaustive search. However, when dealing with only two classes, the optimal cut point with minimum entropy can be computed in a greedy manner [7, 27]. Gallo et al. [7] and Ponzoni et al. [27] used this approach to compute the discretization of the regulator genes in the inference of gene association rules from time-series GED.
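For a single cut, the two steps above reduce to evaluating the class entropy of Equation 9 on each candidate split and keeping the split that minimizes the joint entropy of Equation 10. A minimal sketch under that single-cut assumption (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    # Class entropy of an interval (Eq. 9).
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    # Sort instances by expression value, then test every boundary p
    # and keep the one minimizing the weighted joint entropy (Eq. 10).
    pairs = sorted(zip(values, labels))
    y = [label for _, label in pairs]
    n = len(y)
    best_p, best_h = None, None
    for p in range(1, n):
        joint = p / n * entropy(y[:p]) + (n - p) / n * entropy(y[p:])
        if best_h is None or joint < best_h:
            best_p, best_h = p, joint
    return best_p, best_h

# Two well-separated classes: the cut lands exactly between them.
vals = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
cls = ['low', 'low', 'low', 'high', 'high', 'high']
p, h = best_cut(vals, cls)
print(p)  # 3
```

Because both resulting subintervals are class-pure, the joint entropy at the chosen boundary is zero.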
Another supervised discretization approach for GED, called Efficient Bayesian Discretization (EBD), was proposed by Lustgarten et al. [38]. It is based on the method developed by Boullé [39], which uses dynamic programming to search all the possible discretizations and a Bayesian measure to score each one of them, thus ensuring the optimal one is found. EBD improves the method proposed by Boullé [39] by allowing the incorporation of prior knowledge and by reducing the algorithm's time complexity [38]. The EBD algorithm consists of two principal steps:
a) Evaluate each discretization by means of the score:

\mathrm{Score}(M) = P(M)\, P(S \mid M)
\qquad (11)

P(S \mid M) =
\prod_{i=1}^{I} \frac{(|C| - 1)! \; \prod_{j=1}^{|C|} n_{ij}!}{(n_i + |C| - 1)!}
\qquad (12)

\frac{n_1(v_i)}{N_1} \;\ldots\; \frac{n_2(v_i)}{N_2}
\qquad (13)
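The Bayesian score of step (a) can be evaluated directly from the per-interval class counts. Below is a sketch of the P(S|M) term as reconstructed in Equation 12; the function name and the example counts are hypothetical, and the prior P(M) is omitted.

```python
from math import factorial

def likelihood_term(interval_counts, n_classes):
    # P(S|M) as in Eq. (12): for each interval i with per-class counts
    # n_ij and total n_i, multiply (|C|-1)! * prod_j(n_ij!) / (n_i+|C|-1)!.
    p = 1.0
    for counts in interval_counts:
        n_i = sum(counts)
        num = factorial(n_classes - 1)
        for n_ij in counts:
            num *= factorial(n_ij)
        p *= num / factorial(n_i + n_classes - 1)
    return p

# Two intervals, two classes, six instances: a class-pure discretization
# scores higher than a mixed one.
pure = likelihood_term([[3, 0], [0, 3]], 2)    # 0.0625
mixed = likelihood_term([[2, 1], [1, 2]], 2)   # ~0.0069
```

The score rewards discretizations whose intervals are dominated by a single class, which is what the dynamic-programming search optimizes over all cut-point placements.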
[Table: summary of the reviewed GED discretization methods (Bikmeans, EBD, EFD, entropy-based discretization, Erdal et al.'s method, EWD, FI, k-means clustering, Max − X%Max, Mean, MeanPlusStdDev, Median, Qu et al.'s method, Soinov's change state, SSD, Top X% and TSD), each characterized by supervision (supervised or unsupervised), sample type (steady state and/or time series; microarray and/or RNA-seq), level of discretization (binary, ternary or multilevel), data scope (data point, row, column and/or matrix), evaluation measure (e.g. partition entropy, equal frequency, Bayesian score, Euclidean distance, mean, max value), whether an input parameter is required, and application examples on GED.]
Discussion
Key Points
In gene expression data analysis, the discretization of
Funding
This work was supported by Consejo Nacional de Investigaciones Científicas y Técnicas [Grant number 112-2012-0100471] and Secretaría de Ciencia y Tecnología (UNS) [Grant numbers 24/N032, 24/ZN26].
References
1. Friedman N, Goldszmidt M. Discretization of continuous attributes while learning Bayesian networks. In: Saitta L (ed).
ICML96. Proceedings of the 13th International Conference on
Machine Learning; 1996 July 3-6; Bari, Italy. San Francisco CA:
Morgan Kaufmann Publishers, 1996, 157–65.
2. Dougherty J, Kohavi R, Sahami M. Supervised and unsupervised discretization of continuous features. In: Prieditis A, Russell S (eds). ICML95. Proceedings of the 12th International Conference on Machine Learning; 1995 July 9-12; Tahoe City, United States. San Francisco CA: Morgan Kaufmann Publishers, 1995, 194–202.
3. Karlebach G, Shamir R. Modelling and analysis of gene regulatory networks. Nat Rev Mol Cell Biol 2008;9:770–80.
4. Alves R, Rodriguez-Baena DS, Aguilar-Ruiz JS. Gene association analysis: a survey of frequent pattern mining from gene expression data. Brief Bioinform 2009;11(2):210–24.
5. Vignes M, Vandel J, Allouche D, et al. Gene regulatory network reconstruction using Bayesian networks, the Dantzig selector, the Lasso and their meta-analysis. PLoS One 2011;6(12):e29165.
6. Vijesh N, Chakrabarti SK, Sreekumar J. Modeling of gene regulatory networks: a review. J Biomed Sci Eng 2013;6:223–31.
7. Gallo CA, Carballido JA, Ponzoni I. Discovering time-lagged
rules from microarray data using gene profile classifiers. BMC
Bioinformatics 2011;12:123.
8. Madeira SC, Oliveira AL. An evaluation of discretization methods for non-supervised analysis of time-series gene expression data [Internet]. Lisboa: Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa (INESC-ID); 2008 December [cited 2014 Dec 29]. Report No.: 42/2005. http://algos.inesc-id.pt/jpa/InscI/poisson/varwwwhtml/portal/ficheiros/publicacoes/3369.pdf
27. Ponzoni I, Azuaje F, Augusto J, et al. Inferring adaptive regulation thresholds and association rules from gene expression data through combinatorial optimization learning. IEEE/ACM Trans Comp Biol Bioinform 2007;4(Suppl 4):624–34.
28. Becquet C, Blachon S, Jeudy B, et al. Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data. Genome Biol 2002;3(12):research0067.
29. Pensa RG, Leschi C, Besson J, et al. Assessment of discretization techniques for relevant pattern discovery from gene expression data. In: Zaki MJ, Morishita S, Rigoutsos I (eds). BIOKDD 2004. Proceedings of the 4th Workshop on Data Mining in Bioinformatics; 2004 Aug. 22; Seattle, United States. New York: ACM, 2004, 24–30.
30. Lonardi S, Szpankowski W, Yang Q. Finding biclusters by random projections. In: Sahinalp SC, Muthukrishnan S, Dogrusoz U (eds). CPM 2004. Proceedings of the 15th Annual Symposium on Combinatorial Pattern Matching; 2004 Jul 5-7; Istanbul, Turkey. Berlin: Springer Berlin Heidelberg, 2004, 102–16.
31. Aloise D, Deshpande A, Hansen P, et al. NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 2009;75:245–9.
32. MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; 1965 Jun 21-July 18, 1965 Dec 27, 1966 Jan 7; Berkeley, United States. Berkeley: University of California Press, 1967, 281–97.