Thesis AI - Carlos Fenollosa Bielsa
Thesis AI - Carlos Fenollosa Bielsa
Thesis AI - Carlos Fenollosa Bielsa
Master Thesis
Advisors:
Ramon Goñi Macià, MMB, PCB
Javier Vázquez-Salceda, LSI, UPC
DNA sequencing projects have started the race to fully annotate complete
genomes, including the human one. Despite that, little is known about ge-
netic regulation, the mechanisms that control where and when the genes are
expressed, and promoters are maybe the most important of these mechanisms.
An increasing number of studies have been focused on the DNA molecule
and its structure. This has lead to a set of physical properties which can be com-
puted from mathematical models, and describe some aspects of this molecule.
Unfortunately, the existing tools are scattered through the different web sites of
many research groups, and extracting data with them is still very unpleasant.
The first part of this thesis presents DNAlive, a new platform to calculate DNA
physical properties, showing the results in a visual and useful way for genetic
researchers, cross-linking the data with external databases.
For the second part, a full study of DNA physical descriptors has been
performed, revealing significative similarities between them. Using that data,
a set of neural networks has been developed to detect promoters on a DNA
sequence. The resulting software is the second version of ProStar, the MMB
group’s1 latest promoter predictor.
group hosted at the Parc Cientı́fic de Barcelona. I have been working there since 2007, in a
joint programme with the Barcelona Supercomputing Center, and this MSc Thesis has been
developed as one of the group’s projects.
Acknowledgements
This Thesis could be summed up as Structural biology and genetics for com-
puter scientists: From nothing to “a lot” in just a year. That is why these
acknowledgements are the most important part of this thesis.
First of all, I need to thank Ramon Goñi for offering me a career in the
Life Sciences, for his invaluable help and guidance since day one, for his interest
and dedication, for calling me from Madrid whenever it was needed and sending
me e-mails at late night hours. You reviewed my work as if it were your own
responsibility. Thanks for that.
On the other side, I’m sure that Javier Vázquez has also learned a lot of
biology by brute force. It’s not easy to understand genes and promoters, let
alone if the teacher is me. Everything would have been easier if I had chosen
the blue pill named Contract, but I love challenges. Thanks, and sorry!
I’d also like to thank Modesto Orozco for getting time from nowhere to meet
with me and teach me concepts from Biology 101. Maybe he would get the
Nobel Prize if he finally admitted that he has discovered 30-hour-long days.
At the LSI, Karina Gibert and Lluı́s Belanche offered me great help on the
PCA and the neural network. I still keep the napkin from the cafeteria where
Karina drafted the matrix I needed to build to run the PCA.
Thanks to José Antonio Alcántara for his fast help with all my server issues.
A special acknowledgement also for the people at the INB, and I mean all of
them, for explaining me the internals of BioMoby, which are not always nice.
Many eyes have been upon DNAlive and have helped me to debug the whole
platform, so thanks to the people in the MMB group in general and the web
testers in particular, especially David Torrents at the BSC.
To all my workmates, I’m glad I’ve met you! It is a pleasure to work with
you and a privilege to have those profound conversations from time to time.
In no particular order, but ladies first: Montse, Laura, Jordi and Felix, with
occasional outsiders.
Finally, as I don’t want to leave anybody out of these acknowledgements,
here’s a placeholder: thanks to $you2 too!
To my mom: it’s okay to be stressed when there’s only a week left to hand
the manuscript in. Don’t try to calm me down, I need the stress to stay awake.
You are guilty for teaching me curiosity and ambition, paying my University so
that I could study on weekends, I don’t want to screw it up in the last minute.
2 PHP joke. Not Perl. PHP.
i
I’m fine now. I’m sure that grandpas will be very happy because I have finished,
at last.
To my friends: It’s okay not to go out saturday nights if you need to stay
until five o’clock AM running batches in the computer nodes. I don’t want to
disconnect and relax for a couple of hours because I’ll feel guilty afterwards, but
I want to tell you that I appreciate the effort. Really. Remember how you were
so worried that you didn’t allow me to attend your PFC defense? We can party
afterwards (and we will).
This Thesis also accounts as my PFC for Computer Engineering. Seven long
years, the latter two attending the MSc, have allowed me to meet a lot of cool
people at the FIB. Some carried on, some dropped out, some switched to the
technical engineering and finished in only three years. As I love writing —I
think you have already noticed it by the length of this section— on day zero
I joined l’Oasi, a student group at the FIB that publishes a magazine every
semester. This was the best decision I made during my studies, because I found
the boldest people in college there. l’Oasi has been my second home for five
years.
ii
As an example, the gene expression is so tricky that there could be a gene
hidden in some obscure region of the DNA and we don’t know of its existence
because it expresses only in the liver, and only in the first month of pregnancy.
Then, it becomes silent forever. How are we supposed to discover it? Do we
need to make a map of all the expressed DNA in all organs for every 10 minutes
since the first second of gestation? The answer is even worse than the question:
yes, we need; and no, we don’t know how to generate this data. Or where to
store and process it.
But it turns out that a computer guy can learn biology. Fortunately, all the
institutions that form the MMB group have weekly seminars —this sometimes
means more than five seminars a week— where everybody can learn from every
field. Some are boring, some are interesting, some are so specific that you
only understand them some months later, after working a bit on that field and
reading a lot of Wikipedia articles4 . But, in the end, you learn something.
It’s funny because biologists and chemists also tell you their own challenge:
learn to use a computer and program software. They might have a PhD, but
some only know enough computer science to run MS Word in Windows, because
they’ve always worked in a lab. The solution? Collaborate. They explain me
what a propeller is, and I explain them how to Perl5 . Sadly, from time to time I
am asked a question on FORTRAN, and I can’t help. In the end, it is all about
mixing experiences, and it is very rewarding.
iii
Contents
1 Introduction 2
1.1 Definition of the problem and objectives . . . . . . . . . . . . . . 3
1.2 Application of the AI . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Structure of this thesis . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Biology 6
2.1 Central dogma of the molecular biology . . . . . . . . . . . . . . 6
2.2 Genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Promoters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Physical properties of the DNA . . . . . . . . . . . . . . . . . . . 12
2.5 Nucleosomes and Chromosomes . . . . . . . . . . . . . . . . . . . 15
2.6 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 15
iv
5.7 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . 44
6 Results 47
6.1 DNAlive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6.2 ProStar 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
7 Conclusions 56
7.1 DNAlive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.2 ProStar 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
7.3 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
A DNAlive publication 67
v
List of Figures
vi
List of Tables
1
Chapter 1
Introduction
Sequencing projects have provided the genome sequence of many evolved organ-
isms, including mammals. Unfortunately, less information exists on the detailed
mechanisms controlling gene expression.
Promoter1 and gene detection is one of today’s biggest bioinformatic prob-
lems because of its complexity. Genes control all the functions of the human
body, but the models found to date do not work properly for some kind of genes.
Furthermore, it is a recent field of study, and very important research groups are
trying to tackle this problem with bioinformatic tools. The use of AI techniques
over statistical models can help researchers to find genetic models, even if they
cannot be fully understood.
Despite the impressive success of high-throughput experimental techniques,
the determination of promoters is still a big challenge in the post-genomic era.
Promoter prediction would be straightforward if all genes were perfectly anno-
tated and promoters were always placed at the same relative location respective
to the gene.
Unfortunately, annotation is often subject of large uncertainties and recent
genomic analysis has shown that the naive concept of a gene having a single start
point located near the gene is not correct. Furthermore, one gene region might
have several Transcription Start Sites, one promoter might induce transcription
start at different sites and promoter regions often overlap [Cea05]. On the other
hand, sequence signals associated to promoters are rather unspecific, generat-
ing a large percentage of false positives [BSC+ 02]. In other words, theoretical
prediction of promoters is still one of the greatest challenges in bioinformatics.
Bioinformatics is a recent field of study, and involves the use of computers
and algorithmic knowledge to solve biological problems. In the last 30 years,
thanks to the huge improvements in computers and databases, biologists have
started to use them to cross biologic data with the hope of finding new informa-
tion. Later on, with the introduction of multidisciplinary researchers, the data
was converted to information structures and now almost every biological field
can use computers to enhance their work.
On the interface part, software tools for bioinformatics range from simple
command-line tools, to more complex graphical programs and standalone web-
services. With the increasing availability of Internet tools and the invention of
1 The reader might find in this introduction some biological terms. All of them are explained
in chapter 2.
2
remote data exchanging protocols, it is now possible to run software in remote
servers, obtain the results, parse them, and send the results to another machine
which will display them in a graphical window.
Databases are huge and scattered, and there is room for improvement in data
display interfaces as well as new methodologies to extract more information from
that data. The trend is moving from using only one algorithm and one interface
—a terminal running a script— to the use multiple sources of information and
multiple interfaces.
3
• Analyze the DNA descriptors and determine which are more valuable to
detect promoters, using data mining techniques.
• Promoters will be classified in four groups, each one representing a dif-
ferent promoter type. Then, each one will be treated separately to gain
specificity.
• A neural network for each group will be trained and then used to predict
the presence of a promoter in a new DNA sequence.
The promoter detection task has usually been labeled as a symbolic problem.
The DNA sequence in its naked form was the only information used, and each
element of the sequence is a symbol (A, C, G, T). Classic techniques include
pattern matching or stemming, or even Hopfield networks to reduce noise.
By using DNA numerical descriptors instead of the plain sequence, the prob-
lem turns to be subsymbolic. After computing the descriptor, the original se-
quence is discarded, only the numerical values are kept. The use of this method
allows to plot a graph of the descriptor signal along the sequence, and some
important DNA elements can be discovered by looking at the signal. Unfortu-
nately, this is not yet the case for promoters.
The descriptors analysis will help to reduce the search space and input noise.
These values have biological meaning, but it is unknown which one of them best
describes a promoter. Using a comparison, looking for promoters by using the
DNA temperature value could be like trying to define a musical composition by
the color of the speaker it plays on.
Statistical methods will be used to compute the relevance of each descriptor,
starting with a Pearson correlation and then running a Principal Components
Analysis. Correlations will tell which descriptors are redundant, and the PCA
will determine the weight (i.e. relevance) of each descriptor.
Promoters can be clustered in different ways, based on some of their proper-
ties. In this thesis, they will be classified by analyzing the presence or absence
of the two most common elements, forming four groups. The classification is
quite simple, and one interesting conclusion will be whether these groups are
differentiated by physical descriptors or not.
In order to get a class predictor (promoter or non-promoter), many classifiers
could be used. For this problem, neural networks were chosen, as it is a very
flexible method, specially suited to treat sets of inputs as signals and to build
models that are, to some extent, noise-resistant.
Other alternatives, such as support vector machines, were discarded because
the lack of a kernel function. Self organizing maps are used in [ASRdP08] but, as
we are going to cluster the promoter instances beforehand, it makes no difference
compared to a neural network.
Other AI techniques were discarded because the lack of application. Some
types of learning, like reinforcement learning, cannot be applied, as the answer
space is not big enough. In a same manner, it is impossible to use deductive
learning or rule-based learning, because there is little knowledge about the so-
lution, no model has been found to define a promoter, and thus the model has
to be created.
This thesis is a clear example of the application of the AI in many research
fields. In bioinformatics, promoter prediction is a main problem which will lead
to better genetic research.
4
The only way to validate the predictions is to compare them to an annotated
DNA database, but these databases are very small. When a predictor performs
well, research groups use it to find places where it detects unknown promoters
with the most confidence value, and then look for them in vitro, in a biology
laboratory.
Not only is it important to discover promoters, but also improve the confi-
dence values and reduce the number of false positives. This thesis’ results will
provide light on the application of new methodologies to a current problem.
5
Chapter 2
Biology
DNA was first isolated by the Swiss physician Friedrich Miescher who, in 1869,
discovered a microscopic substance in the pus of discarded surgical bandages.
Since then, much effort has been put in the investigation of this element and its
role in the living beings. It was not until 1953 when, based on X-ray diffraction
images taken by Rosalind Franklin and the information that the bases were
paired, James D. Watson and Francis Crick suggested what is now accepted as
the first accurate model of DNA structure in the journal Nature [CW53].
This chapter defines the terms that will be used in the rest of the Thesis,
and provides a basic but detailed introduction to the concepts that are needed
to understand this work.
Figure 2.1: Simplified models for the cells of eukaryotes and prokaryotes.
6
Figure 2.2: A chunk of DNA in its stable 3D form (Source: Wikipedia)
For this Thesis, unless specified, all references to cells are meant to be for
eukaryotes, the ones present in human organisms. As stated, every cell of the
human body contains a nucleus, where the DNA is surrounded by a liquid
composed of water and other molecules.
It is well known that the DNA is a double helix (see fig. 2.2) that contains
genetic information. However, this big molecule has very intricate physical and
chemical structure, and plays a critical role in genetics.
An easy metaphor to understand the DNA structure is to view it as a long
string, composed of character sequences. Each of the four letters that form the
DNA (A, C, G, T) represent a small molecule that has a name, with its atoms,
its bonds and its physical forces. These letters are named Adenine, Cytosine,
Guanine and Thymine, and their generic name is base. Each one has a sugar
molecule which acts as a scaffold, the backbone. The combination of a base
with a sugar forms a nucleotide1 .
A chunk of DNA expressed as GGCAATTACGACGGTATAACT means that there
are two Guanine molecules, followed by a Cytosine, followed by two Adenines,
two Thymines, and so on.
Furthermore, every nucleotide is facing another nucleotide, forming the afore-
mentioned double helix. The nucleotides always pair themselves in the following
fashion: C pairs with G, while A pairs with T (or U, in RNA), and vice versa. For
example, AACT always faces TTGA.
By convention, one of the strands is read from the top to the bottom, while
the other is meant to be read from the bottom to the top. The “top” end is
called the 5’ end, and the “bottom” is called 3’ end. A picture of the chemical
representation of the DNA can be seen in figure 2.3.
Today, it is known that the DNA encodes the information that expresses all
the human body. It has information on the hair color, the gender, the number
1 However, in this thesis, base and nucleotide are used as synonyms.
7
Figure 2.3: The DNA chemical structure. The left strand is read from top (5’
end) to bottom (3’ end), while the right strand is considered to be inverted. As
presented here, each nucleotide is composed of different atoms, and the structure
is held by atomic bonds. The “backbone” is the scaffold which subjects the
nucleotides. (Source: Wikipedia)
of toes per feet, and so on. Many things about the DNA have already been
discovered, like the translation process, but some others still remain unknown,
like the gene regulation.
The central dogma of the molecular biology [Cri70] enunciates the
normal flow of information that regulates genetics on living beings. Basically, it
states that the DNA holds the genetic information, and the different means of
transferring it. This flow can be interpreted as a graph, as depicted in Figure
2.4.
Each DNA sequence encodes some function in the body. One of the most
important, however, is the production of proteins. Proteins are very complex
molecules that, in turn, are formed by smaller molecules called amino acids.
In every three DNA nucleotides the code for a protein molecule is expressed; for
example, nine DNA nucleotides are equivalent to three amino acids.
In fact, there is another molecule involved in this process, the RNA, which
is very similar to DNA but usually appears in a single strand form. It acts
as a mold, being the intermediary between the DNA chain and the protein.
Furthermore, in RNA, the nucleotide uracil (U) substitutes the DNA’s thymine.
The DN A → P rotein process consists of two phases:
The Central Dogma also explains a third information transfer: the DNA
replication, used when the DNA duplicates, for example, when transferring
8
Figure 2.4: Central dogma of the molecular biology. The blue arrows determine
the normal information flow, while the red arrows point to special cases. (Source:
Wikipedia)
2.2 Genes
Genetics is the area of biology that studies genes, which is the basic unit of
DNA information. All these are, in fact, DNA sequences placed strategically
in the body cells. Using naive examples, one can say that, if the sequence
CCTTACAAAATAGGGTG is present at about the 12,413,574th position of someone’s
21st Chromosome, then that person is going to have green eyes.
In the last paragraph there were three remarkable pieces of information:
• the DNA sequence,
• the position and the Chromosome where it is placed (the Transcription
Start Site, or TSS), and
9
• its function.
Today, many of the human genes rest still undiscovered, mostly because one
—or more— of the above information is unknown. The most usual case is that
scientists know that between the 12,400,000th and the 12,450,000th position
of the human’s 21st chromosome there is some sequence that regulates some
unknown function, nothing else.
Nowadays there are very powerful computers that are capable to handle large
character sequences and work with them. The whole Human Genome lengths
about 3 Gigabases, three thousand million letters. But, how can we extract any
data from just plain letters?
Given the fact that the DNA has physical properties that define its flexibility,
and that a protein is not a language parser that reads the whole 3 Gigabases
before attaching somewhere, the conclusion is clear: all interactions that happen
in the human body and, in extension, everything regarding DNA and proteins,
are ruled only by physical forces (e.g. electro-chemical forces).
Different families of genes are expressed by different proteins. For instance,
Polymerase II (Pol-II), a very large protein with more than 10 subunits, is
the main player in the transcription of genes encoding for messenger RNA
(mRNA), the type of RNA that transports the information from the DNA to the
proteins. However, it is also clear that Pol-II is not able to start transcription
by itself, but needs of a large number of additional proteins.
Those proteins, named transcription factors (TF), create a large protein
cluster bound to DNA, the pre-initiation complex, which precedes the tran-
scription of the corresponding gene [SK03]. The region 200-300 bp2 upstream
the core promoter (see sec. 2.3), where the TF binds, defines the proximal
promoter area, where multiple transcription factor binding sites are located.
Additional signals are received from enhancers which can be bound far —even
thousands of bases away— in the sequence of DNA, but that the DNA structure
probably locates close enough as to allow interaction with the pre-initiation
complex, the proteins involved in gene translation.
Genes can be divided in two big groups: coding and non-coding. The
coding sequence determines what the gene produces, while the non-coding se-
quences are known to be genes, but somehow they do not seem to be translated.
Coding genes account for about 30% of the total, while non-coding account for
70% of all genes.
2.3 Promoters
A promoter is a small region (hundreds of bases) located just before a gene,
and it is the place where the transcription factors attach to start expressing a
gene.
They are of extreme relevance because of several reasons. First of all, they
are strictly related to genes, and finding a new promoter on a relatively deserted
DNA sequence usually leads to find a new gene. Second, promoters regulate gene
expression, which looks like Figure 2.5.
The core promoter is the region immediately upstream of the TSS, where
the transcription initiation complex assembles. It is located upstream of the
2 Base Pairs, a size unit meaning a base A, C, G, T and its complementary T, G, C, A
10
Figure 2.5: The process of gene expression. A Transcription Factor (blue, thin
and curly) has bound to the DNA (orange, double strand) and is producing
the RNA (green, in between the top opened DNA strands) which will, in turn,
express the gene in the final Protein. (Source: Wikipedia)
coding part of the gene, sometimes up to several thousand base pairs, and is
responsible for basal transcription as well as transcriptional regulation of the
gene it is linked with.
Biologically speaking, there are many signals that help us detect promoters
[AD07]. One is the Initiator Element (Inr), located around the TSS and
can work independently or synergistically with the TATA box. However, the
prevalence of the Inr element in mammalians is not clear, but it seems to be
quite abundant in Drosophila3 . TATA boxes and CpG islands are very clear
signals that have been studied thoroughly, and they will be reviewed.
Still, the presence of a TATA box or a CpG island is not a requirement for
the presence of a promoter, since there exist promoters without them. Similarly,
their absence does not imply that there is no promoter.
recombination and mutations are common, and they share a lot of DNA with humans.
11
deviations from this consensus have been found in different genes [SK03].
It is generally found in the 25-30 upstream of the transcription start site
(TSS), but again this distance can change depending on the organism [HS89].
Traditionally, the TATA box was supposed to be present in around 30% of
genes, but it is not present in oncogenes, growth factors and house-keeping
genes [SS03]. Furthermore, recent experiments performed within the ENCODE
project4 strongly suggest that the number ot TSS linked to TATA-box can be
even lower [BTC+ 06].
launched by the US National Human Genome Research Institute in September 2003. The
goal is to find all functional elements in the human genome, one of the most critical projects
after the successful completion of the Human Genome Project.
5 In this document, the terms DNA physical property, DNA descriptor and DNA mechanical
12
depicted in figure 2.6. B-DNA is the standard form, and the B name is used
only for description purposes, as it is commonly referred just as “DNA”. A-DNA
and Z-DNA differ significantly in their geometry and dimensions to B-DNA,
although they still form helical structures.
The A form is likely to occur only in dehydrated samples of DNA, such as
those used in crystallographic experiments, and possibly in hybrid pairings of
DNA and RNA strands. It is a right-handed double helix fairly similar to the
more common and well-known B-DNA form, but with a shorter, more compact
helical structure.
Segments of DNA that cells have methylated (see 2.3.2) for regulatory pur-
poses may adopt the Z geometry. It is a left-handed double helical structure in
which the double helix winds to the left in a zig-zag pattern, instead of to the
right, like the more common B-DNA form. There is also evidence of protein-
DNA complexes forming Z-DNA structures.
Other conformations are possible; (C)ovalent mitomycin-DNA [SFL+ 95],
(D)elta-DNA [SMHK01], (E)ccentric-DNA [VEH00], (L)ambda-DNA [SCHH82],
(P)auling-DNA [ABLC98] and S-DNA [CLH+ 96] have been described so far, al-
though they only appear in some specific organisms and are not common.
Figure 2.6: The structures of A-DNA, B-DNA and Z-DNA. B-DNA is the canon-
ical form, discovered by Watson and Crick 50 years ago. A-DNA is typical on
RNA, but also present in DNA sometimes. Notice how the Z-DNA winds to the
left, whereas the A- and B-DNA forms wind to the right. (Source: Wikipedia)
At another structural level, we can also find DNA triplexes and quadruplexes.
These DNA structures are composed of three or even four strands, instead of
the typical two-stranded model.
The ability of the DNA to form these structures is known since the 50s-
60s, but in the last years many studies have brought back the interest for them
[SCH96], due to their possible implication in biological processes such as the
transcription and to their potential biotechnological applications, like a viable
information-encoding system. [LGK05]. Expanded DNA, especially quadru-
13
Figure 2.7: A triplex structure, side and top images. The two original strands
are depicted in red, and the extra strand appears in green.
Figure 2.8: A quadruplex structure, side and top images. The whole structure
is stabilized by a N a+ or K + ion in the center.
14
Figure 2.9: A nucleosome (green) bound to the DNA (red to blue scale).
15
Figure 2.10: This picture shows how the DNA is compacted many times, forming
chromosomes. The resulting complex, composed of DNA and proteins, is called
chromatin. (Source: Wikipedia)
4), respectively, and more on these concepts will be explained in those chapters.
Next, we will overview some of the available tools which apply these concepts
in bioinformatics software.
16
Chapter 3
17
Figure 3.1: The UCSC Genome Browser plotting many DNAlive descriptors.
On the top, there are the genomic coordinates; on the left, the name of each
track.
Figure 3.2: Managing tracks in the UCSC Genome Browser. The interface
provides many alternatives to display the data, allowing a full customization of
each track.
18
3.2 Sequence search
BLAST (Basic Local Alignment Search Tool) is an algorithm for comparing
primary biological sequence information, such as the aminoacid sequences of
different proteins or the nucleotides of DNA sequences. A BLAST search enables
a researcher to compare a query sequence with a library or database of sequences,
and identify library sequences that resemble the query sequence. For example,
following the discovery of a previously unknown gene in the mouse, a scientist
will typically perform a BLAST search of the human genome to see if humans
carry a similar gene.
BLAT is a BLAST-like tool designed to quickly find DNA sequences of 95%
and greater similarity of length 25 bases or more. It keeps only an index of the
entire genome in memory, which takes up a bit less than a gigabyte of RAM. The
genome itself is not kept in memory, allowing BLAT to deliver high performance
on a reasonably priced box. Then, a user can input a DNA sequence, and the
BLAT server will quickly (5-6 seconds) return the genomic coordinates where
that sequence is located; that is very useful to display annotated DNA from a
sequence which was thought to be unknown.
19
Because of the complexity and expensiveness of classic MD, a simulation of
the simulation technique is performed to determine basic molecular dynamics
with a Monte Carlo (MC) method.
The MC approach relies on the distribution of random numbers to approx-
imate expensive physical or mathematical calculations. Instead of trying to
reproduce the dynamics of a system, it generates states according to appro-
priate Boltzmann probabilities. It employs a Markov chain procedure in order
to determine a new state for a system from a previous one. According to its
stochastic nature, this new state is accepted at random. The avoidance of
dynamics —in Markov processes, each state is independent from the previous
one— restricts the method to studies of static quantities only, but the freedom
to choose moves makes the method very flexible.
20
3.5 ProStar 1
After this exhaustive review of the promoter detection techniques, the MMB
group decided to implement a mechanism to derive all the known physical prop-
erties and then analyze possible differential properties of DNA in regulatory
regions in vertebrates. As expected, parameters are often correlated and the
informative content is not always easy to determine [OcLNR02]. Beyond the
idea that these parameters with stronger signals near TSS are supposed to play
a determinant role in regulatory mechanism, they searched for this reduced set
of parameters that were supposed to be able to cooperatively predict a promoter
region.
In ProStar’s case, first of all, the six helical force constants were calcu-
lated by means of MD simulations [PMSS07], as depicted in figure 3.3. Four
duplexes containing several replicas of every possible combination of two nu-
cleotides were used, plus four extra sequences related to CpG islands and TATA
boxes: GCCTATAAACGCCTATAA, CTAGGTGGATGACTCATT, CACGGAACCGGTTCCGTG and
GGCGCGCACCACGCGCGG. The trajectories were manipulated to represent the de-
formability of a given step along the three rotations and the three translations.
The dataset was gathered from protein-coding genes annotated by the HA-
VANA group, following the EGASP workshop rules[BBB+ 06]. Then, the method
was trained for promoter recognition with a collection of 500-nts4 sequences
that comprised intervals of 250 nucleotides upstream and 250 downstream of
the training TSS set. For the negative set, random 500-nts sequences from
transcribed regions were selected, making sure that they did not overlap with
the positive set. For the recognition of the strand, the method was trained with
a collection of DNA sequences that comprised 900-nts upstream and 900 down-
stream of the same TSS. The reverse complementary sequences were taken as
the negative set.
Using the MD derived parameters, any DNA sequence of size n can be de-
scribed as a 6-dimensional deformation vector
v = (twist, tilt, roll, shif t, slide, rise). For a given deformation, the values asso-
ciated to every dinuecleotide step in the sequence are summed and then divided
by n − 1. For example, the twist deformation score of sequence ACGC would be
(0.036[AC]+0.014[CG]+0.025[GC])/3 = 0.025. The 6-dimensional vector of the
same sequence would then v(ACGT ) = (0.025, 0.033, 0.022, 1.200, 2.547, 8.230)
(see fig. 3.4)
To classify the input sequences into the promoter class (kx ) or the non-
promoter class (ky ), the Mahalanobis distance is used. Computing the physical
properties of every sequence of the dataset a set of vectors for every class is
concluded (X for class kx and Y for ky ). The Mahalanobis distance DM
between the set of vectors X and Y is defined in eq. 3.1:
21
Every possible
combination of
CTGA
MD simulations
6 helical
force
constants
Negative set:
Positive set:
Random
Coding genes
sequences from
from HAVANA
HAVANA
Promoter
database
22
6 helical
User input
force
constants
Mahalanobis
distance classifier
Promoter, No
strand Promoter
g(s, kx ) − g(s, hy )
score(s) = (3.3)
g(µx , kx ) − g(µx , ky )
The strand is predicted by recognizing the upstream/downstream signal
asymmetry using a statistically discriminator based, again, on Mahalanobis met-
rics. Finally, very close clusters of predictions (using a 1000-nts window) are
unified in a single hit. A graphical view of the system is depicted in fig. 3.5.
ProStar’s method is conceptually and computationally simpler than any
other general promoter prediction algorithm as it does not require any addi-
tional information, such as conservation of gene structure across species, pres-
ence of CpG islands, TATA-boxes, Inr elements or any other sequence specific
signals. Due to its simplicity ProStar can be in principle applied even in cases
where promoters are located in unusual genomic positions.
23
Figure 3.6: Physical properties study for EP3
The results of these comparisons show that despite its simplicity, ProStar
behaved better than most of the other methods, similar to two algorithms that
use gene structure for prediction (fpom and firstef ) and only nscan, that is based
also in multi-species homology, provides more accurate results for the reference
set of genes. (see [GPTO07], pages 5–8)
3.6 EP3
Abeel et al. developed EP3 [ASB+ 08], a novel approach for predicting pro-
moters in whole-genome sequences by using large-scale structural properties of
DNA. EP3 requires no training, is applicable to many eukaryotic genomes, and
performs extremely well in comparison with the best available promoter predic-
tion programs to the date of publishing. Moreover, it is fast, simple in design,
has no size constraints, and the results are easily interpretable. Their method
also has been tested on 12 additional eukaryotic genomes, including vertebrates,
invertebrates, plants, fungi, and protists.
EP3 is very relevant for this Thesis because they analyze the relationship
between different physical properties and the known promoter groups. Specif-
ically, the variation of RNAP II is plot for each promoter descriptor: TATA
boxes, CpG islands, TFIIB and Inr (see sec. 2.3). This graphic is depicted in
figure 3.6
While EP3 does not outperform its peers by much, the program has sev-
eral additional advantages compared with other promoter predictors. EP3 re-
quires no training or parameter tuning, unlike other programs that need exten-
sive amounts of experimentally determined data for the training of their model
([OSHN00]; [SKW00]; [DGZ01]; [DH02]; [BSC+ 03]). When working on a ge-
nomic scale, speed and memory requirements also are of importance. EP3 is
very fast (for instance, it takes less than 1 h to annotate the complete human
24
genome), requires little memory, and can thus be run on a home computer; in
contrast, some programs require a computer cluster of 80 machines for nearly
a week to process the human genome. Besides performing very well, especially
in light of its simplicity, EP3 can handle many eukaryotic genomes without
modifications.
25
Chapter 4
In this chapter, the whole process for constructing DNAlive will be presented.
A typical software engineering process was used since the beginning, composed
of the following steps:
1. Define an objective
2. Gather the requirements
26
every major internet browser. The web services must be compatible with other
services and exchange data seamlessly.
The web page must be easy to use, work in real time and provide enough
feedback so that the user always knows what to do next. For that, the perfor-
mance should be tweaked to the maximum for all the data to be calculated in
a few seconds. When the system finds that some calculus is going to take more
than a minute, the user should be warned and asked for confirmation1 .
Regarding reliability, even though we will ensure that no data is lost, it is not
a main requirement, as all the data can be regenerated and all the steps can be
reproduced in a few seconds. Security is also important for the server part, and
we must ensure that the software has no exploits. For the client part, however,
no critical or confidential data is handled, so no extra security measures (SSL,
passwords) need to be taken.
Finally, one of the most important requirements was modularity. As we
needed to construct two interfaces for accessing DNAlive, every operation must
be modularized.
27
Two types of inputs. The user can switch between them
Genomes BLAT
database
Proteins
Draw 3D
Bind proteins
Structure
DNA 3D DNA-Protein
Structure complex
Paint the
properties in the Animate DNA
3D structure
3D Plot 4D Plot
Figure 4.1: Architecture of the system. The user can input either genomic
coordinates or a DNA sequence, while being able to change between them at
any moment. The outputs (2D, 3D, 4D plot) are shown in blue, and all the
data structures are painted in red. Databases are painted in yellow, and all the
other processes are depicted in white.
28
then produces another protein. These proteins do not actually bind to a spe-
cific DNA sequence, but to a sequence with specific physical properties or,
computationally speaking, its attributes.
In Chapter 3 we presented a list of different works on DNA descriptors. For
this phase we chose 29 of them, described in the tables in appendix B.
Some examples of physical properties are:
• The melting temperature. Some parts of the DNA are more easily melted
than others; just like the edge of a paper burns more easily than the middle
part.
• Twist. The DNA can be physically twisted more easily in some specific
areas.
• Nucleosome preference. The probability that this DNA segment will bind
to a big protein named Nucleosome (see section 2.5)
AT TA GC TT GG CA AC
Melting temperature 0.023 0.102 0.073 0.009 0.069 0.104 0.044
Twist 0.033 0.017 0.025 0.026 0.026 0.016 0.036
Nucleosome preference 0.012 0.038 0.081 0.011 0.012 0.021 0.054
So, how can these properties be calculated? There are various methods, but
the main idea is to compute the value for each couple of nucleotides, and then
average them if needed. In table 4.1 we can see how the DNA “alphabetical”
sequence can be translated to plain numbers.
Let us imagine that we have the following DNA sequence:
CGTTAATATCGGCGTAGCTAGTGTTTTTCCGATATATCAGTCCGGGCCGCG
That sequence can be interpreted as follows:
XXXPPPPPPPPPPXXXXGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGXXXX
being P a promoter, G a gene, and X nucleotides with no useful information.
We can transform this sequence to plain numeric values (again, see table
4.1) and, if needed, plot a histogram of the resulting values, as shown in figure
4.2
Most of the descriptors use the Nearest Neighbor (NN) model. This approach
assumes that the stability of a given base pair depends on the identity and
orientation of neighboring base pairs.
The energy values are retrieved experimentally, in vitro. Scientists calculate
the energy values for 500 DNA sequences [BD98] and, then, using a linear
equation system, the values for each couple of base pairs is determined.
29
Once the energy values for each couple of base pairs and each property is
calculated, the algorithm needs to read the DNA sequence, aggregate the values
every couple of base pairs, and apply some correction factors. Usually, a sliding
window of 25 base pairs is used, so that every 25 bases the value is averaged and
the signal noise is reduced. These algorithms are, in fact, physical properties
predictors, as they can only estimate the actual value of the descriptors with
a mathematical model which is not perfect. However, this approach is widely
used and drops results which are very similar to the experimental ones.
To implement the scripts we used the Perl programming language. The main
reason is that there are already plenty of libraries for bioinformatic use. It also
interacts easily with web pages, and each script could be transformed into a web
page if needed.
30
if the bugs where fixed and provide feedback on the interface changes. This
system also keeps backups of previous versions and deploys the development
page to production instantly, so no downtime is noticed.
31
Figure 4.3: Physical properties web services
32
4.5.1 Physical properties services
There is no need to dig deep into the implementation of these services, because
they are only a wrapper for the scripts. They receive the input data in the
standard BioMOBY format, call the scripts, reformat the output into BioMOBY
data, and return it to the user.
The services are grouped by descriptor types: DNA Flexibility, PARMBSC0
helical force constants, DNA Disruption energy, DNA non-linear dynamics,
DNA 3DNA structure, DNA stability, Unusual DNA conformation, ProStar
and Nucleosome potential4 .
The user can build a simple workflow, like the one in fig. 4.3, and launch
batch calculations for different DNA sequences, automating the process.
• There is one new use case: “show3DFeature”, whose inputs are a DNA
3D structure and information about the CpG islands (“runEmbossCp-
GReport GFF FromFasta”). This service outputs the DNA and a script
to draw the CpG information in the VMD software. The user then needs
to open this software, open the DNA molecule and paste the script code.
• Another service is provided: “runEmbossTFscan GFF FromFasta”, which
outputs information about the transcription factors on a DNA. It is a
wrapper for the program tfscan from EMBOSS, a bioinformatics soft-
ware package. The results are similar to those of “runSimpleTfbs”, and is
offered as an alternative to it.
Looking again at the system architecture, there are some process which have
no equivalent as a web service. First of all, we decided not to implement those
which are already provided as a web service by somebody else. In this case,
“BLAT” and “Get Sequence”.
Animating a DNA is also a web-only feature. We think that it is important
to offer services which can be useful to run as a batch, but watching a DNA
movie is an interactive process and it makes no sense to return a long file. With
4 In the web page, these two services appear as “Regulation parameters”, but we decided
33
3D DNA it makes sense to return a PDB file, because they are widely used and
there is plenty of software to derive data from them.
34
The web page needs to exchange data with external sites, thus using their avail-
able APIs.
Layered software design is essential in this case, since the data access and the
system logic need to be independent from the interface. Changing the algorithm
to compute a physical property, for example, cannot require modifications in the
interface code.
Finally, to develop research tools which can be computationally expensive,
one must also analyze the available hardware. Some of the scripts require expo-
nential algorithms, so all the batches need to be run through a queuing system,
to ensure hardware stability.
35
Chapter 5
Applying AI techniques to
ProStar
While Phase 1 was mainly devoted to the development of the DNAlive envi-
ronment, in this chapter we will present the new methodology used to improve
ProStar, whose original version was presented in section 3.5.
In order to follow the EGASP rules [BBB+ 06] and to be able to directly
compare the results with those of ProStar1, we got the same list of positive
promoter sequences and the set of non-promoter sequences as in the previous
version.
Positive sequences are the collection of 885 genes, annotated by the Havana
group in the ENCODE region (1% of the human genome) belonging to coding
genes. This information is stored on an index file havana.gene.
havana.gene contains the name of the gene, the name of the ENCODE
region where it belongs, the strand, the Transcription Start Site (TSS) and the
end of the transcribed region. A sample follows:
Havana1 ENm001 + 354319 390710
Considering the TSS as position 0, we selected the 2000 bases around it
[−1000.. + 1000] on the ENCODE database, resulting on 885 sequences, 2000
base pairs each. Further data calculations required us to drop 11 elements,
remaining 874 of them.
For the negative set, we got sequences composed of 2000 non-overlapping
bases (to genes and to themselves) lying in the same strand of transcribed
regions of the Havana genes. Using the example above, where 354319 is the
TSS, we retrieved the bases 353319 to 355318 for the positive set, and then, in
groups of 2000 bases, the rest of the data from 355319 to 390710, is added to
the negative set. Later on, we refined the negative set, deleting the elements
which were too close to a gene, as some of their properties were too similar to
those of the genes.
The strand information is also important. As the ENCODE database only
contains the positive strand, if the gene is located in the negative strand, the
resulting sequence needs to be complementary reversed. This means that AAGT
is reversed to TGAA and then complemented to ACTT.
36
5.1 Proposed solution
We are presented a classification problem which cannot be described by any
existent model. So, a new model needs to be discovered, and we propose to
learn it by using a neural network.
There are 29 DNA descriptors which we believe can be used to build this
model. However, we cannot use all of them to train the network, so we will
apply some filters beforehand, to the data set and also to the descriptors.
First, we will split the promoter database in four groups to gain specificity.
Then, an analysis of the DNA physical properties will be performed to reduce
the input noise on the network (fig. 5.2). After these techniques have been
applied, we will train a network for each group and study its performance as a
promoter classifier (fig. 5.3).
37
Pos. -3 -2 -1 0 1 2 3 4 5
A -1.02 -3.05 0 -4.61 0 0 0 0 -0.01
C -0.28 -2.06 -5.22 -3.49 -5.17 -4.63 -4.12 -3.74 -1.13
G 0 -2.74 -4.28 -4.61 -3.77 -4.73 -2.65 -1.5 0
T -1.68 0 -2.28 0 -2.34 -0.52 -3.65 -0.37 -1.4
Table 5.1: Partial weight matrix for the TATA box, the full version can be found
at [Buc90]. Values for the “TATAA” sequence are rendered in bold.
Table 5.2: The dataset, divided into the four promoter groups.
This way, from now on, every cluster will have each its own positive and
negative set and will be treated independently.
The EGASP rules establish that 13 of the 44 ENCODE regions are used for
training and the others for testing. Unfortunately, after clustering the promoters
in four groups, a further division between a training set and a test set would
leave the smallest group empty, and the rest with few elements. Therefore it
was decided to use the whole dataset for training, performing a 10-fold cross-
validation.
deciding whether a CpG island is present, we will use all the resulting values as a descriptor.
38
is just a composition of the 6 helical properties, which are going to be analyzed
again.
We then run the descriptors on all the dataset, positive and negative. The
computing time took about 48 hours in 10 multi-core machines of the MMB grid
system; an estimated time of 1000 CPU-hours. Even though the process could
have been run through DNAlive web services, the naked scripts were used, so
that the software could be run through batch queues, avoid network and web-
services protocol overhead, and speed up the calculation time in general.
The descriptor analysis consists of two steps, depicted in figure 5.1. The
first step is to compute the Pearson correlation matrix for all descriptors, in
each promoter group. The results will show if there are significant differences
between promoter groups.
Then, a Principal Components Analysis per group was performed. The role
of the PCA is to gather the most important properties that define promoters,
and reduce the weight of those which do not add important information.
Table 5.3: Sample descriptors matrix I before applying the eigenvectors func-
tion.
The first task was to run a Pearson correlation between all descriptors:
39
Positive set
Compute the
physical Has TATA or CpG
properties
Physical
TATA/CpG
properties for
group
each sequence
Correlation
PCA
analysis
Correlation data
between Eigenvectors
properties
40
In general, there should be correlation between some physical properties,
either direct or inverse correlation. For example, the stability, by definition, is
inversely correlated to the free energy (fig. 6.5).
The results (fig. 6.4), indeed, show that some of them reach correlation
values even higher than 90% (colored in red in the aforementioned figures).
We decided then to run a PCA with all descriptors. As the algorithm builds
a correlation matrix to extract the eigenvectors, it should ignore (i.e. assign low
weights) to highly correlated descriptors and positions. Some of the blatantly
correlated properties could have been removed by hand, but we decided to test
the PCA analysis power by running it with all the data.
That way, we can transform the 1080 input parameters to only a few dozens
(Results, table 6.1), while conserving 80% of the original information. With less
parameters, the neural network to be trained (see next section) will be faster
and it should not overfit.
41
signal for each physical property is different. Furthermore, the PCA reduced
the number of inputs, weighting the most important descriptors and ignoring
the others.
It is important to note that, even though we apply the positive set eigen-
vectors function to the input data, the neural network is fed with both sets to
provide a better profiling of each class. They were balanced in the following
manner:
Table 5.4: Elements of the positive and negative set for each promoter group.
The positive set always kept all elements, dropping instances from the negative
set if necessary.
The inputs for the neural networks are the four vectors R (eq. 5.3). The
number of elements in each vector is the number of eigenvectors for that group.
Namely, 72 for the tata-cpg-, 69 for the tata-cpg+, 28 for tata+cpg- and
16 for tata+cpg+. As can be noticed, the number of eigenvectors increases
consistently with the number of elements in the group. Fig. 5.2 contains a
diagram with the data transformations that are performed in order to build the
classifier.
Weka 3.4.13 was used for this task, run in commandline mode. The classi-
fier was set to MultilayerPerceptron, and the other parameters were left with
the default values except: autobuild True, decay False, nominalToBinaryFil-
ter False, normalizeAttributes False, reset False. The training data was set to
cross-validate in 10 folds.
An initial combination of parameters for each network was obtained by brute
force. Combinations were computed by varying the learning rate between 0.1
and 1, step 0.1; the momentum between 0.1 and 1, step 0.1; and the number
of hidden layers between 1 and 3. That methodology allows us to assert that
our network quality is not far away from the best possible network for the given
data. Networks with more than 3 hidden layers were discarded because of the
computational cost and the marginal (if any) improvement in accuracy they
presented.
As the quality measure we looked at the absolute number of correctly classi-
fied instances. The results for these networks were analyzed and the best three
networks for each group were tweaked by hand, observing the impact of each
variable in the resulting. Besides these variables, also the number of neurons in
each layer and the training time were tested.
The four neural networks compose ProStar2’s classifier, acting as the class
predictor for promoter sequences. The models for each network can be found in
the results section (6.2.4).
42
Positive set Negative set
Eigenvectors
Apply the
eigenvectors
function
TATA/CpG
group
Neural network
NN
Promoter
classifier
For each TATA/CpG group
43
5.5 ProStar 2.0
The new version of ProStar scans genomes in 2000-nts windows, moving forward
500-nts at each step. For every window we assume that the TSS would be placed
in position +1000, the position where the TSS was located in the training set.
We first classify this target sequence as CpG+/CpG- and TATA+/TATA-.
Once the promoter type is assigned to our target, we compute the appropriate
physical descriptors with DNAlive scripts. Then the eigenvectors function is
applied, and the resulting array is classified with the specific Neural Network.
A graphical image of this process is depicted in fig. 5.3.
ProStar 2 only requires the user to input a DNA sequence, and it automat-
ically performs all the calculations. Afterwards, it outputs the predicted class
(promoter or non-promoter) and the confidence value for its prediction.
Our method is not available to predict the strand of the promoter, so the
software will add an extra prediction with a negative strand to every promoter,
doubling the amount of predictions.
44
Compute the
Input sequence physical
properties
Physical
properties
Physical
Has TATA or CpG properties for the
current window
Data model
NN model
for that Run the classifier
group
Promoter
prediction
Iterate on a 2000-bps window, step 500 bps
List of all
predictions
45
the number of descriptors with the aid of two statistical techniques. Finally, a
neural network was built for each group. Each network was fed with the same
number of instances for both classes, when possible.
As a result, the ProStar 2 software scans the input sequence in windows of
the same size as the training set, stepping every 500 bases. It then classifies each
window into the promoter or non-promoter class, and outputs each prediction
with the confidence value.
To evaluate the results, we will compare the performance of each of the four
networks to ProStar 1 (Results, table 6.4).
46
Chapter 6
Results
6.1 DNAlive
DNAlive has had a big impact in researchers, corroborated by the publication
of a paper and two oral communications in international conferences1 . The web
page is accessed every day, especially for crossing ProStar’s predictions with
gene information at the UCSC Genome Browser.
DNAlive was developed to give a complete description of the physical prop-
erties of genomic DNA in a simple way, thus providing data that can be easily
understood by non-structural experts. Among others, DNAlive allows the user
to:
Figure 6.1 depicts a screen shot of a DNAlive session in progress, and ap-
pendix A contains the published version of the manuscript.
6.2 ProStar 2
Many steps were involved in the new predictor methodology for ProStar 2. The
next sections present the individual results for each of the phases.
1 DNAlive: A tool for a realistic representation of the DNA at a genomic scale. VII
Jornadas de Bioinformtica, Valencia, February 2007. Representing the genomic DNA. Grand
challenges in computational biology, Barcelona, June 2008
47
Figure 6.1: Screen shot of a DNAlive session. The user has selected a 3D view
of a DNA sequence, and the Curvature descriptor is painted in a blue-to-red
scale.
48
Figure 6.2: Plot for the CpG property on the positive (top) and negative (bot-
tom) sets. The images on the left are the average for all promoter groups, while
the images on the right have each group separately. A clear signal can be ap-
preciated for the promoters, while the negative set only contains noise. The
negative tata+cpg+ (pink) group has higher variations because it contains less
elements.
49
Figure 6.3: Plot for the Curvature property on the positive (top) and negative
(bottom) sets. The images on the left are the average for all promoter groups,
while the images on the right have each group separately. Again, a clear signal
can be appreciated for the promoters, but not in the negative set.
50
TSS signals with that many properties. With this data, we can conclude that
the DNA descriptors can carry promoter signals, and that this signal behaves
differetly for each promoter group.
After looking at the graphs, we performed the numerical analysis of the
signals.
Table 6.2 contains part of the first three eigenvectors for the tata-cpg+
group. In general, the first three eigenvectors carry about 20% of the original
51
Figure 6.4: Pearson correlation matrix for the TATA+ CpG- group.
52
Figure 6.5: Weka plotting the correlation between Stability and Duplex Stability
Free Energy.
# Weight Elements
1 10.478% 6.4% ∗ denaturation1000 − 6.4% ∗ stacking1000 +
+6.3% ∗ meltingtm1000 − 6.3% ∗ duplstabf reeen1000 + . . .
2 8.204% −6.4% ∗ denaturation1850 − 6.4% ∗ meltingtm1850 +
+6.4% ∗ stacking1850 − 6.4% ∗ stability1850 + . . .
3 4.026% −6.7% ∗ meltingtm800 − 6.7% ∗ stability800 +
+6.7% ∗ zdna800 + 6.6% ∗ stacking800
Table 6.2: The first three eigenvectors extracted from the tata-cpg+’s PCA
53
Group Neurons / HL Learning rate Momentum Training time
tata+cpg+ 17, 17 0.5 0.2 500
tata-cpg+ 70 0.2 0.1 500
tata-cpg- 73 0.2 0.4 500
tata+cpg- 29, 29, 29 0.3 0.4 500
Table 6.3: Neural network models for each promoter group. The groups are
sorted by their accuracy (see tab. 6.4). The second column “Neurons per
hidden layer” represents both the number of layers and the neurons in each one,
separated by a comma. Note: in all cases, the number of neurons corresponds
to the number of eigenvectors for that group.
accuracy of 79%. The other groups’ accuracy is near 60% with very high values
for the relative error.
The relative absolute error (column RA Error) is about 50% in the best
case, and near 80% in the rest. This means that, in the worst case, the prediction
power is only a bit better than a ZeroR prediction (i.e. the mode of all values).
Table 6.4: Accuracy results for ProStar 2 versus ProStar 1 . The first column
indicates ProStar2’s promoter group, and the aggregated for ProStar 1. Pro-
Star1’s results were gathered with a different methodology, which did not output
the RA Error.
54
Figure 6.6: Screen shot of a ProStar 2 session. The user has input a DNA
sequence and the software has predicted a promoter at the position 1500.
55
Chapter 7
Conclusions
In chapter 1.1 we presented the dual objectives for this thesis: creating a multi-
interface platform to compute DNA descriptors, and implementing a new version
for the ProStar promoter predictor, using AI techniques.
In the next two sections we will analyze the results for each project and
compare them in detail to the expected objectives.
7.1 DNAlive
Huge databases are of no use without graphical tools to visualize the data and
modular algorithms that can process jobs in batches.
Cross-referencing data is a very difficult work, sometimes even impossible,
and at the current data growth rate researchers are now realizing that this
might be the last opportunity to classify it, now that we still have the required
computational power.
DNAlive was meant to be a simple interface for DNA description algorithms,
but it has turned out to be a step forward into building tools whose purpose is to
help users manage the available information. The implementation of the physi-
cal properties was derived from published work, data sources from the internet
are used to cross-reference data, and if it is available, we use the annotated
data from the UCSC Genome Browser to plot our tracks. 3D molecules and
animations are displayed in a Jmol applet, and our web services call other web
services to retrieve data and launch calculations.
To sum up, even if DNAlive has little Artificial Intelligence in it, it turns to
be an interesting project for the degree in Computer Engineering, and surely
will have more impact than ProStar 2.
7.2 ProStar 2
ProStar 2 has undergone a lot of changes from ProStar 1. Analyzing the method-
ology, it can be considered a different software rather than an evolution of the
first one. The three-step process has forced us to make some decisions, which
undoubtably have influenced the final results.
Some conclusions can be drawn from the obtained results:
56
• Promoter division in four groups. TATA boxes and CpG islands are the
most popular features for defining promoters, and the resulting clusters’
physical properties behaved different. This might mean that these features
are good for promoter clustering, although maybe not the best ones.
• Sequence extraction. We arbitrarily decided to select the 2000 nucleotides
surrounding each TSS. Looking at the signal for many properties, it is
clear that selecting about 1000-nts sequences might have reduced noise,
as the 500 initial and final nucleotides seem to carry no signal.
• Descriptors signal analysis. Most of the descriptors have a very clear
peak at the TSS. However, not all promoter groups benefit from the same
descriptors. In the Results chapter we described how the tata+cpg- group
had a very distinctive signal for the CpG descriptor, but this signal was
not that clear for the Curvature descriptor.
• Parameter correlation. In order to be computationally useful, 2000-nts
sequences were reduced to 27 descriptors of 40 values. We do not believe
that there is much information loss, as nearby nucleotides have similar
physical behaviors. Despite the easily visible correlations, all the descrip-
tors were used to train the neural network.
• Principal components analysis. We let the PCA decide the weight for each
descriptor and position, hoping that it would be smart enough to select
the most relevant ones. Looking at the resulting eigenvectors by hand,
some positions far away from the TSS —that is, with no relevant signal—
weighted more than what was expected.
• Predictor training. A MultilayerPerceptron was used as a classifier, and
because it is very versatile, after analyzing the results, it seems that this
was not a bad choice. Nonetheless, they tend to overfit and they are
difficult to tune. Again, we performed a brute force initial approach, and
the final results have been manually tuned based on the best networks.
However, as might have been expected, the networks with less inputs
didn’t work better (see table 6.4).
Having descriptors which plot such clear signals, we should think why Pro-
Star2’s predictions are not that accurate. We wanted to automate the weight
calculus for each descriptor with the PCA, so we did not filter them manually,
exchanging automation for accuracy. Parameters should have been filtered by
hand, keeping only the descriptors which had a clear signal (see sec. 6.2.1).
Still, we believe that the methodology presented in this thesis can be fine tuned
in order to increase other groups’ accuracy.
For all that, we think that the area which can be more improved on is
the Principal Components Analysis. The PCA cannot be blindly trusted to
reduce noise. Instead of letting it extract 72 eigenvectors from 1,080 posi-
tion/descriptors, it would be better to reduce the descriptors to about 100 and
then get more specific eigenvectors. Another approach to reduce the number
of eigenvectors is to keep less than 80% of the original information, say, about
50%.
The initial objective for ProStar2, improving ProStar1 predictive power, has
been reached only for the small tata+cpg+ promoter group (table 6.4). This
57
also suggests that clustering the promoters beforehand improves accuracy, but
another clustering based on different properties could be more appropriate.
1 http://evolutionarygenomics.imim.es/
58
Bibliography
59
[BSC+ 03] Vladimir B Bajic, Seng Hong Seah, Allen Chong, S P T Krishnan,
Judice L Y Koh, and Vladimir Brusic. Computer model for recog-
nition of functional transcription start sites in RNA polymerase II
promoters of vertebrates. J Mol Graph Model, 21(5):323–32, Mar
2003.
[BSSP95] I Brukner, R Sanchez, D Suck, and S Pongor. Trinucleotide mod-
els for DNA bending propensity: comparison of models based on
DNaseI digestion and . . . . J Biomol Struct Dyn, Jan 1995.
[BTC+ 06] Vladimir B Bajic, Sin Lam Tan, Alan Christoffels, Christian
Schönbach, Leonard Lipovich, Liang Yang, Oliver Hofmann,
Adele Kruger, Winston Hide, Chikatoshi Kai, Jun Kawai, David A
Hume, Piero Carninci, and Yoshihide Hayashizaki. Mice and men:
their promoter properties. PLoS Genet, 2(4):e54, Apr 2006.
60
[GGF87] M Gardiner-Garden and M Frommer. CpG islands in vertebrate
genomes. J Mol Biol, Jan 1987.
[GPTO07] J Goñi, A Pérez, D Torrents, and M Orozco. Determining pro-
moter location based on DNA structure first-principles calcula-
tions. Genome Biology, Dec 2007.
[GT81] O Gotoh and Y Tagashira. Stabilities of nearest-neighbor dou-
blets in double-helical DNA determined by fitting calculated . . . .
Biopolymers, Jan 1981.
[GZW95] A Gorin, V Zhurkin, and K Wilma. B-DNA twisting correlates
with base-pair morphology. J Mol Biol, Jan 1995.
[HB05] Julian L Huppert and Shankar Balasubramanian. Prevalence of
quadruplexes in the human genome. Nucleic Acids Research,
33(9):2908–16, Jan 2005.
[HC96] M El Hassan and C Calladine. Propeller-twisting of base-pairs
and the conformational mobility of dinucleotide steps in DNA. J
Mol Biol, Jan 1996.
[Hen06] C Henderson. Building scalable web sites. page 330, Jan 2006.
[HS89] P A Harbury and K Struhl. Functional distinctions between yeast
TATA elements. Mol Cell Biol, 9(12):5298–304, Dec 1989.
[HZC90] P S Ho, G W Zhou, and L B Clark. Polarized electronic spectra
of Z-DNA single crystals. Biopolymers, 30(1-2):151–63, Jan 1990.
[IM94] V I Ivanov and L E Minchenkova. The A-form of DNA: in search
of the biological role. Mol Biol (Mosk), 28(6):1258–71, Jan 1994.
[KB05] Aditi Kanhere and Manju Bansal. A novel method for prokaryotic
promoter prediction based on DNA stability. BMC Bioinformat-
ics, 6:1, Jan 2005.
[Ken02] W Kent. . . . BLAT—The BLAST-like alignment tool. Genome
Res, Jan 2002.
[KKBB08] D Karolchik, R Kuhn, R Baertsch, and G Barber. The UCSC
genome browser database: 2008 update. Nucleic Acids Research,
Jan 2008.
[KSF+ 02] W James Kent, Charles W Sugnet, Terrence S Furey, Krishna M
Roskin, Tom H Pringle, Alan M Zahler, and David Haussler. The
human genome browser at UCSC. Genome Res, 12(6):996–1006,
Jun 2002.
[KSW06] Edward Kawas, Martin Senger, and Mark D Wilkinson. BioMoby
extensions to the taverna workflow management and enactment
software. BMC Bioinformatics, 7:523, Jan 2006.
[LGK05] Haibo Liu, Jianmin Gao, and Eric T Kool. Helix-forming proper-
ties of size-expanded DNA, an alternative four-base genetic form.
J Am Chem Soc, 127(5):1396–402, Feb 2005.
61
[LGLP92] F Larsen, G Gundersen, R Lopez, and H Prydz. CpG islands as
gene markers in the human genome. Genomics, 13(4):1095–107,
Aug 1992.
62
[SCHH82] F Sanger, A Coulson, G Hong, and D Hill. Nucleotide sequence
of bacteriophage lambda DNA. J Mol Biol, Jan 1982.
[VAM+ 01] J Venter, M Adams, E Myers, P Li, and R Mural. The sequence
of the human genome. Science, Jan 2001.
[vECLHP05] Titus S van Erp, Santiago Cuesta-Lopez, Johannes-Geert Hag-
mann, and Michel Peyrard. Can one predict DNA transcription
start sites by studying bubbles? Phys Rev Lett, 95(21):218104,
Nov 2005.
63
[VEH00] J M Vargason, B F Eichman, and P S Ho. The extended and
eccentric E-DNA structure induced by cytosine methylation or
bromination. Nat Struct Biol, 7(9):758–61, Sep 2000.
64
Index
65
Taverna, 31
Transcription Factor, 10
True Negative, 44
True Positive, 44
TWINSCAN, 20
Workflow, 31
ZeroR, 54
66
Appendix A
DNAlive publication
In this appendix appears the publication for DNAlive and three pages of the
supplementary data where DNAlive is applied to find new genetic information.
67
Vol. 24 no. 15 2008, pages 1731–1732
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btn259
Structural bioinformatics
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org 1731
vertebrate genome. The program retrieves parameters from their between conformation, functional annotations and physico-chemical
internal databases (Supplementary Table 1) to determine physical properties (Fig. 1).
profiles and to create a 3D structure of the naked DNA. The server also includes unique tools for a rapid representation
Given a DNA sequence, the program determines potentially of chromatin dynamics, which, in extensive analysis performed
bound transcription factor binding sites (TFBS) by scanning the in our laboratory on our database of more than 100 trajectories,
public TRANSFAC database (http://www.gene-regulation.com/) showed a surprisingly high accuracy of the essential deformation
linked to PDB (http://www.rcsb.org/) and Uniprot databases pattern of DNA. The method uses a mesoscopic Metropolis Monte
(http://www.ebi.uniprot.org/). The selection of the complex of Carlo algorithm, where the geometry of each base pair is defined
interest can be monitored externally by the user, who can force by three local rotations (roll, tilt and twist) and translations (slide,
the generation of specific complexes (for example, nucleosomes, shift and rise), and the conformational energy is estimated from the
protein-multicomplexes, etc.). deformation matrix using a harmonic model (Equation 1), where the
index ‘i’ stands for one of the M base pair steps and the index ‘j’
stands for the six unique helical parameters (ξ ) for each step. The
2.2 Server workflow
equilibrium values for one helical parameter in a given base pair step
Once a DNA sequence is entered (Fig. 1), the program computes type and (ξij0 ) and the associated deformation constant (Ki,j ) were
the profile for the 29 physical properties available for the fiber previously determined from molecular dynamics simulations (Pérez,
(Supplementary Table 1). All properties are represented in a 2D plot 2007). Once a movement in helical coordinates is accepted by the
using either the UCSC Genome Browser (http://genome.ucsc.edu) Metropolis test, the corresponding Cartesian structure of the fiber
in combination with annotated genes whenever genomic coordinates is generated using an adaptation of X3DNA (Lu and Olson, 2003)
for the genome are provided, or Gnuplot (Fig. 1 and Supplementary for VIDEO visualization using JMOL Java applets in the HTML
Fig. 1). page (Supplementary Fig. 3). Basic manipulation and analysis
To combine the visualization of DNA physical properties of the trajectories and structure (rotations, translations, distance
with public annotations of the genome, coordinates of the measurements,…) are allowed by the Jmol interface, which allows
input DNA sequence can be matched by running a search the determination of potential DNA-mediated protein-clusters.
in our local Blat server (Kent, 2002). Although the user is
able to annotate transcription factor PDB structures on specific
M
6 2
positions of the DNA input sequence, we have implemented E= Ki,j ξij −ξij0 (1)
an automatic method to perform this step using the TFBS Perl i=1 j=1
library (Lenhard and Wasserman, 2002). The reconstruction of the
average 3D structure of DNA is achieved using sequence-dependent ACKNOWLEDGEMENTS
base step parameters derived from accurate atomistic molecular
We thank the help of Agnes Noy, David Piedra, Henrique Proença
dynamics (Pérez, 2007) and making use of a local adaptation of
and Joaquín Panadero as β-testers of the server.
X3DNA (Lu and Olson, 2003) script (Fig. 1 and Supplementary
Fig. 2). When structural information on protein–DNA complexes Funding: This work has been supported by the Spanish Ministry
is available, modeled structures in the corresponding segment of Education and Science (BIO2006-01602 and BIO2006-
are substituted by the experimental geometries, and junctions 15036), the Spanish Ministry of Health (COMBIOMED network),
are refined if required. The visualization of 3D structures is the Fundación Marcelino Botín and the National Institute of
performed by integrating Jmol Java applets (http://www.jmol.org/) Bioinformatics (Structural Bioinformatics Node).
in the HTML page. All physical descriptors can be mapped into
the 3D structure to favor the detection of potential correlations Conflict of Interest: none declared.
REFERENCES
Abeel,T. et al. (2008) Generic eukaryotic core promoter prediction using structural
features of DNA. Genome Res., 18, 310–323.
Goñi,J.R. et al. (2007) Determining promoter location based on DNA structure first-
principles calculations. Genome Biol., 8, R263.
Kent,W.J. (2002) BLAT- the BLAST-like alignment tool. Genome Res., 12, 656–664.
Lenhard,B. and Wasserman,W.W. (2002) TFBS: computational framework for
transcription factor binding site analysis. Bioinformatics, 18, 1135–1136.
Lu,X.J. and Olson,W.K. (2003) 3DNA: a software package for the analysis, rebuilding
and visualization of three-dimensional nucleic acid structures. Nucleic Acids Res.,
31, 5108–5121.
Ohler,U. et al. (2001) Joint modeling of DNA sequence and physical properties
to improve eukaryotic promoter recognition. Bioinformatics, 17 (Suppl. 1),
S199–S206.
Pedersen,A.G. et al. (2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol.,
299, 907–930.
Pérez,A. et al. (2007) Refinement of the AMBER force field for nucleic acids. Improving
the description of α/γ conformers. Biophys. J., 92, 3817–3829.
Singhal,P. et al. (2008) Prokaryotic gene finding based on physicochemical
characteristics of codons calculated from molecular dynamics simulations.
Fig. 1. DNAlive web server workflow diagram. Biophys. J. [EPub ahead of print; DOI:10.1529/biophysj.107.116392].
1732
Appendix B
The following tables contain the description for each of the 29 physical properties
and their references.
73
B.1 Unusual DNA conformation
74
B.2 DNA disruption energy
75
B.3 DNA 3DNA structure
76
B.4 DNA flexibility
77
B.6 DNA non-linear dynamics
78
B.8 Regulation parameters
79
This document was printed on September 3, 2008