Key Points
- With genome sequences available for an increasing number of metazoan organisms, the data now exist to carry out large-scale comparative genomic studies.
- Sequenced metazoan genomes range in size from 100 Mb to 3 Gb, which makes their comparison a real challenge. Several new approaches have been developed to solve some of the associated problems.
- The alignment of whole genomes extends the genome regions that are available to analyse evolutionary mechanisms such as neutral, negative and positive selection, and the history of large insertions and deletions.
- The potential to compare closely related, and even more distantly related, metazoans provides new opportunities to identify conserved functional sequences, such as genes or regulatory regions, that are not easily predictable by conventional approaches on a single genome.
- Hidden Markov model-based programs have been developed, mainly in the field of gene prediction, to make the most of genome-comparison alignments.
- The ab initio identification of regulatory regions on a single genome often gives sensitive, but not highly specific, results. Comparative genomic data allow a significant increase in the specificity of such processes.
Abstract
The increasing number of complete and nearly complete metazoan genome sequences provides a significant amount of material for large-scale comparative genomic analysis. Finding new effective methods to analyse such enormous datasets has been the object of intense research. Three main areas in comparative genomics have recently shown important developments: whole-genome alignment, gene prediction and regulatory-region prediction. Each of these areas improves the methods of deciphering long genomic sequences and uncovering what lies hidden in them.
Acknowledgements
We thank M. Brudno, W. Miller and L. Pachter for providing their respective manuscripts before publication and W. J. Kent, W. Miller, M. Brudno and L. Bentolila for helpful discussion and comments on the manuscript. We also thank the anonymous reviewers for many helpful suggestions. A.U.-V. is funded by the Wellcome Trust. E.B. and L.E. are funded by the European Molecular Biology Laboratory.
Related links
DATABASES
LocusLink
FURTHER INFORMATION
Comparative analysis of the rat genome
Glossary
- RATE MATRIX
-
Denotes the probability of mutation from one amino acid to another (or from one nucleotide to another) for a given period of evolution. The most well known rate matrices are BLOSUM and PAM.
- FUNCTIONAL SEQUENCE
-
A genomic sequence that provides a function that is under selection and tends to be conserved between species. For example, a protein-coding region or transcription-factor binding site.
- SEEDS
-
A short exact, or nearly exact, matching string of characters aligning between two sequences.
- PARALOGUES
-
Sequences, or genes, that have originated from a common ancestral sequence, or gene, by a duplication event.
- ORTHOLOGUES
-
Sequences, or genes, that have originated from a common ancestral sequence, or gene, by a speciation event.
- SYNTENIC REGION
-
A genomic region that is collinear in the order of genes (or of other DNA sequences) in a chromosomal region of two species.
- SYNTENIC ANCHORS
-
Short aligned segments between genome sequences from two species, which are believed to define an orthologous relationship.
- DOT PLOT MATRIX
-
A visualization technique that allows the easy identification of matching nucleotides or amino acids (letters) between two sequences. For example, for two sequences X and Y, each letter has a unique coordinate on the x axis and the y axis respectively. When two letters are the same at a specified coordinate, a dot is plotted in the matrix at that position.
- HIDDEN MARKOV MODEL
-
(HMM). A probabilistic model that is applied to protein- and DNA-sequence pattern recognition. HMMs represent a system as a set of discrete states and as transitions between those states. Each transition has an associated probability. HMMs are valuable because they enable a search or alignment algorithm to be built on firm probabilistic bases, and the parameters (transition probabilities) can be easily trained on a known data set.
- DISCRIMINANT FUNCTIONS
-
Classical statistical pattern-recognition methods that are used to categorize samples into two classes of data.
- NEURAL NETWORKS
-
Mathematical models inspired by analogy with biological neurons to distinguish two or more classes of data.
- SCORE MAXIMIZATION PROCESS
-
Many algorithms attempt to find the solution, under a scoring scheme, that is believed to best reflect reality. For 'simple' models, including hidden Markov models, precise mathematical formulae can be used that will guarantee to find the highest score.
- NEUTRAL DRIFT
-
The process by which a DNA sequence acquires many mutations over time that have no phenotypic effect, and are not acted on by Darwinian selection.
- STABILIZING SELECTION
-
Selection that favours intermediate phenotypes over extreme phenotypes.
- GAP PENALTY
-
Alignment programs deal with insertions and deletions (indels) by introducing a 'gap' in the sequence that contains the deletion. The introduction of gaps and their extension decreases the overall alignment score by a certain value. This value is defined by a gap-opening penalty and a gap-extension penalty, both of which are used as parameters in alignment programs.
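The gap-penalty entry above can be illustrated with a minimal Python sketch that scores an existing pairwise alignment. The function name, the match and mismatch scores, and the penalty values are all hypothetical choices made for illustration and are not taken from any particular alignment program. The sketch also assumes one common convention: the first column of a gap costs the gap-opening penalty and each further column of the same gap costs the gap-extension penalty.

```python
def affine_alignment_score(row_a, row_b, match=1, mismatch=-1,
                           gap_open=-5, gap_extend=-1):
    """Score two equal-length alignment rows that use '-' for gap columns.

    All scoring values are illustrative defaults, not values from a real tool.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have the same length")
    score = 0
    gap_in = None  # row holding the currently open gap: 'a', 'b' or None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # Continuing a gap in the same row is cheaper than opening a new one.
            score += gap_extend if side == gap_in else gap_open
            gap_in = side
        else:
            score += match if x == y else mismatch
            gap_in = None
    return score


# 8 matching columns (+8) and one gap of length 2 (-5 - 1) give a total of 2.
print(affine_alignment_score("ACGT--ACGT", "ACGTTTACGT"))
```

Because extension is cheaper than opening, a single long gap is penalized less than several short gaps of the same total length, which is the behaviour the two separate parameters are meant to give.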
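The dot plot matrix described in the glossary can be sketched in a few lines of Python. This is a toy illustration under simple assumptions: the helper names `dot_plot` and `render` are hypothetical, every pair of letters is compared directly, and no windowing or filtering is applied, so it is only practical for short sequences.

```python
def dot_plot(seq_x, seq_y):
    """Return a matrix in which cell [j][i] is True when seq_x[i] == seq_y[j]."""
    return [[cx == cy for cx in seq_x] for cy in seq_y]


def render(matrix, seq_x, seq_y):
    """Draw the matrix as text: seq_x along the top, seq_y down the side."""
    lines = ["  " + seq_x]
    for cy, row in zip(seq_y, matrix):
        lines.append(cy + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Hypothetical example sequences; shared stretches show up as diagonal runs.
x, y = "ACGTACGT", "ACGTTCGT"
print(render(dot_plot(x, y), x, y))
```

Runs of dots along a diagonal correspond to stretches where the two sequences match letter for letter, which is what makes the plot useful for spotting similar regions by eye.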
Cite this article
Ureta-Vidal, A., Ettwiller, L. & Birney, E. Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat Rev Genet 4, 251–262 (2003). https://doi.org/10.1038/nrg1043
DOI: https://doi.org/10.1038/nrg1043