Bioinformatics glossary (https://www.ncbi.nlm.nih.gov/books/NBK470040/)
accession number
The accession number is a unique identifier assigned to a record in sequence
databases such as GenBank. Several NCBI databases use the format
[alphabetical prefix][series of digits]. A change in the record in some
databases (e.g. GenBank) is tracked by an integer extension of the accession
number, an Accession.version identifier. The initial version of a sequence has
the extension “.1”. When a change is made to a sequence in a GenBank
record, the version extension of the Accession.version identifier is
incremented. For the sequence NM_000245.3, “.3” indicates that the record
has been updated twice. The accession number for the record as a whole
remains unchanged, and will always retrieve the most recent version of the
record; the older versions remain available under the original
Accession.version identifiers.
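The Accession.version convention described above can be sketched in a few lines of Python. This is an illustrative sketch only: the regular expression and the function names `split_accession` and `updates_since_first` are assumptions made for this example, not an NCBI API, and the pattern covers only the simple [alphabetical prefix][series of digits] form (with an optional underscore, as in NM_000245).

```python
import re

# Assumed pattern for illustration: uppercase letters, an optional underscore,
# digits, and an optional ".version" suffix. Real accession formats vary more.
ACCESSION_RE = re.compile(r"^([A-Z]+_?\d+)(?:\.(\d+))?$")


def split_accession(identifier):
    """Return (accession, version); version is None when no suffix is given."""
    match = ACCESSION_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised accession: {identifier!r}")
    accession, version = match.groups()
    return accession, (int(version) if version is not None else None)


def updates_since_first(identifier):
    """Number of sequence updates implied by the version suffix (".1" means 0)."""
    _, version = split_accession(identifier)
    if version is None:
        raise ValueError("identifier carries no version suffix")
    return version - 1


print(split_accession("NM_000245.3"))
print(updates_since_first("NM_000245.3"))
```

For NM_000245.3 the helper reports two updates, matching the example in the definition.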
alignment
An alignment is a representation of the similarity between 2 or more
nucleotide or protein sequences. In the case of protein sequences, amino
acids conserved from ancestral sequences are taken into consideration in
the alignment. A pairwise alignment involves 2 sequences and a multiple
alignment involves 3 or more sequences. A global alignment involves
aligning the entire sequences whereas a local alignment involves aligning
subsequences. The optimal alignment is the one with the highest score under
a given scoring system. In a structural alignment, 3-dimensional structures
of proteins under consideration are superimposed (Koonin and Galperin
2003a).
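The difference between global and local pairwise alignment can be illustrated with a small dynamic-programming sketch. The scoring system used here (match +1, mismatch -1, gap -1) and the function name `alignment_score` are assumptions chosen for illustration; the glossary does not prescribe a scoring system, and production tools use substitution matrices and more elaborate gap penalties.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best pairwise alignment score of sequences a and b.

    local=False scores the two entire sequences (global alignment).
    local=True scores the best-matching pair of subsequences (local alignment).
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Two sequences sharing the subsequence GATTACA but with unrelated flanks.
print(alignment_score("TTTGATTACATTT", "CCGATTACACC", local=True))
print(alignment_score("TTTGATTACATTT", "CCGATTACACC", local=False))
```

The local score reflects only the shared subsequence, while the global score is pulled down by the mismatched flanking residues and the length difference.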
bioinformatics
Bioinformatics is an interdisciplinary field that applies computational
approaches for the collection, storage, manipulation, and analysis of
biological data including large datasets, to make biological discoveries or
predictions. At a minimum, it encompasses computer science, biology,
genetics, genomics, statistics, mathematics and engineering to interpret
biological data. It is closely related to computational biology.
byte
In computer terms, a unit of storage that is equal to 8 bits.
CD
Conserved Domain. CD refers to a domain (a distinct functional and/or
structural unit of a protein) that has been conserved during evolution. During
evolution, changes at specific positions of an amino acid sequence in the
protein may have occurred in a way that preserves the physico-chemical
properties of the original residues, and hence the structural and/or functional
properties of that region of the protein.
cDNA
complementary DNA. A DNA sequence obtained by reverse transcription of a
messenger RNA (mRNA) sequence.
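At the level of sequence, the relationship between an mRNA and its first-strand cDNA is simple base pairing, which can be sketched as below. The function name `first_strand_cdna` is hypothetical, and the sketch models only the complementarity of the two strands, not the enzymatic reaction itself; it assumes an unambiguous A/C/G/U input written 5' to 3'.

```python
# RNA base -> complementary DNA base.
COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (written 5'->3') for an mRNA given 5'->3'."""
    # The complementary strand runs antiparallel, so the mRNA is read in reverse.
    return "".join(COMPLEMENT[base] for base in reversed(mrna.upper()))


print(first_strand_cdna("AUGGCC"))
```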
CDS
Coding region, coding sequence. CDS refers to the portion of a
genomic DNA sequence that is translated, from the start codon to the stop
codon, inclusively, if complete. A partial CDS lacks part of the sequence (it
may lack either or both the start and stop codons). Successful translation of
a CDS results in the synthesis of a protein.
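The idea of reading a CDS from the start codon to the stop codon can be sketched as follows. This is a simplified illustration: it assumes the standard genetic code, an unambiguous A/C/G/T input on the coding strand, and that the first ATG found is the start codon. The name `translate_cds` is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns (protein, complete). complete is False when no start codon is
    found or when the sequence ends before a stop codon (a partial CDS).
    """
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return "", False
    protein = []
    for i in range(start, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":  # stop codon reached: the CDS is complete
            return "".join(protein), True
        protein.append(amino_acid)
    return "".join(protein), False


print(translate_cds("GGATGGCCTAAGG"))  # start and stop codon both present
print(translate_cds("ATGGCC"))         # partial CDS: no stop codon
```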
consensus sequence
A representative or most typical nucleotide or amino acid sequence in which
each position holds the nucleotide or amino acid most often found at that
position in a group of related sequences.
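A minimal sketch of deriving a consensus from sequences that are already aligned (equal length, one column per position) is shown below. The function name `consensus` is hypothetical, and the tie-breaking rule (the residue seen first in the column wins) is an assumption of this sketch; real tools often report ambiguity codes instead.

```python
from collections import Counter


def consensus(aligned_sequences):
    """Return the most frequent residue at each column of an alignment."""
    if not aligned_sequences:
        raise ValueError("at least one sequence is required")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    # zip(*...) walks the alignment column by column.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_sequences)
    )


print(consensus(["ACGT", "ACGA", "TCGT"]))
```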
conserved domains
A conserved domain of a protein is a discrete, independently folding
three-dimensional structure that is composed of one or more protein
sequence motifs. Protein sequence motifs are conserved amino acid
sequences that form a combination of secondary structures (for example,
helix-loop-helix) and that have been shown to be important for protein function
(Koonin and Galperin 2003b).
database
A store of logically related data, or a collection of files, amenable to
retrieval by scripts or other computer programs.
dataset
Permanent store of an organized collection of data, for sharing,
redistribution, processing, and analysis.
Entrez
Entrez is a retrieval system at NCBI for searching several linked databases,
such as PubMed, GenBank, and PMC.
EST
Expressed Sequence Tag. ESTs are short (usually approximately 300–500
base pairs), single-pass sequence reads from cDNA. Typically, they are
produced in large batches. They represent the genes expressed in a given
tissue and/or at a given developmental stage. They are tags (some coding,
others not) of expression for a given cDNA library. They are useful in
identifying full-length genes and in mapping.
FASTA
The first widely used algorithm for similarity searching of protein
and DNA sequence databases. The program looks for optimal local
alignments by scanning the sequence for small matches called “words”.
Initially, the scores of segments in which there are multiple word hits are
calculated (“init1”). Later, the scores of several segments may be summed
to generate an “initn” score. An optimized alignment that includes gaps is
shown in the output as “opt”. The sensitivity and speed of the search are
inversely related and controlled by the “k-tup” variable, which specifies the
size of a “word” (Pearson and Lipman 1988). FASTA also refers to a text
format for a nucleic acid or protein sequence.
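The role of the word size can be illustrated with a small sketch of the word-matching step only. It indexes every word of length `ktup` in one sequence, looks up the words of the other, and counts hits per diagonal (the offset between the two positions). The function name `word_hits_by_diagonal` is hypothetical, and this is not the FASTA program's implementation: the scoring of segments (init1, initn, opt) is omitted.

```python
from collections import Counter, defaultdict


def word_hits_by_diagonal(query, subject, ktup=2):
    """Count exact word matches of length ktup, grouped by diagonal (j - i)."""
    # Index the start positions of every word in the subject sequence.
    positions = defaultdict(list)
    for j in range(len(subject) - ktup + 1):
        positions[subject[j:j + ktup]].append(j)
    # Look up each query word; hits on the same diagonal suggest one segment.
    diagonals = Counter()
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], ()):
            diagonals[j - i] += 1
    return diagonals


print(dict(word_hits_by_diagonal("ACGT", "TTACGT", ktup=2)))
print(dict(word_hits_by_diagonal("ACGT", "TTACGT", ktup=4)))
```

A larger word size produces fewer hits to examine, which speeds up the search but can miss weaker similarities; a smaller word size does the opposite.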
FTP
File Transfer Protocol. A method of retrieving files over a network directly to
the user's computer or to their home directory using a set of protocols that
govern how the data are to be transported.
GB
Gigabytes; 10^9 bytes.
genome
The genome is the complete genetic material of an organism. For eukaryotic
organisms, it is the DNA in all chromosomes and in mitochondria or
chloroplasts; for prokaryotes, it includes the circular double-stranded DNA
molecule. For viruses, it comprises DNA or RNA.
genomics
A field of study in genetics that applies molecular tools such as
recombinant DNA technology and high-throughput sequencing, and
bioinformatics approaches such as genome alignment and assembly, towards
the analysis of genome structure and function.
mapping
In genomics, mapping refers to the various techniques for determining the
position and relative order of markers or genes (loci) on a chromosome and
relative distance between them based on recombination frequency (genetic
map), the absolute position of genes and the distance between them in
nucleotide base pairs (physical map), or the position of markers or genes on
the chromosome based on hybridization (cytogenetic map).
MapViewer
MapViewer is a genome browsing tool used to view and search an organism's
genome and display chromosome maps.
metagenome
A metagenome is a collective genome representative of the community of
organisms, for example, microorganisms, many of which cannot be
cultivated outside of their environment.
transcriptome
The transcriptome refers to the full set of transcripts in a cell. It can be
assembled by a method called RNA-seq, in which RNA from cells is collected,
sampled, and sequenced. It includes alternative splice variants, variants
created by alternative transcription initiation and alternative
transcription termination, and transcripts of noncoding RNA genes.