Significant Similarity and Dissimilarity
in Homologous Proteins 1
Samuel Karlin, Volker Brendel, and Philipp Bucher
Department of Mathematics, Stanford University
Downloaded from http://mbe.oxfordjournals.org/ by guest on June 2, 2013
Common practice emphasizes significant sequence similarities between different
members of protein families. These similarities presumably reflect evolutionary
conservation of structurally and functionally essential residues. The nonconserved
regions, on the other hand, may be either selectively neutral or differentiated. We
propose several distributional sequence statistics (e.g., clustering of charged residues,
compositional biases, and repetitive patterns) as indicators of differentiation events.
These ideas are illustrated with various examples, including comparisons among
G protein-coupled receptors, herpesvirus proteins, and GTPase-activating proteins.
Introduction
Members of a protein family are typically composed of subregions of three types:
(1) conserved regions that generally reflect a common function or structure maintained
by functional constraints (negative selection), (2) freely evolved regions, generally
nonfunctional, changed by random drift (selectively neutral), and (3) functionally
differentiated regions adapted to new roles (positive selection).
A common premise asserts that sequences important in the functioning of the
proteins are those conserved for species of a broad evolutionary range. In free regions
among homologous proteins, amino acid replacements are basically randomly generated, conforming to the neutral theory of molecular evolution. This course of variation can often be calibrated as a molecular clock, leading to phylogenetic reconstructions within and between species (Zuckerkandl and Pauling 1962; Wilson et al. 1977;
Kimura 1983; Gillespie 1986). A prototype case of freely varying regions occurs with
the extended globin family. In other protein families, there occur among protein members significant (nonrandom) sequence variations that putatively reflect functional
differentiation in relation to organism, tissue and cellular specificities, variant mechanisms, and degree of oligomerization. In particular, functional diversification through
positive selection has been implicated for the variable regions of immunoproteins
(e.g., see Hughes et al. 1990).
The present paper elaborates the definition, detection, and analysis of functionally
differentiated subregions by using protein sequence statistics. Sequence features that
distinguish among similar proteins could be based on the distribution of charged residues,
hydropathy arrangements, amino acid usage, repetitive patterns, and spacings of amino
acid types (see Methods; also Karlin et al. 1991). Sequence contrasts can be important
in understanding the norm and variability of a protein function class.
We will consider three examples underscoring statistically significant features
which discriminate between members within various classes of similar protein sequences: (1) G protein-coupled receptors (GCRs), (2) homologous proteins from
human herpesviruses, and (3) the recently identified human neurofibromatosis gene
product NF1, compared with bovine GAP (GTPase activator protein) and yeast IRA1
protein (Xu et al. 1990).
1. Key words: protein families, molecular evolution, sequence, statistics.
Address for correspondence and reprints: Samuel Karlin, Department of Mathematics, Stanford University, Stanford, California 94305.
Mol. Biol. Evol. 9(1):152-167. 1992.
© 1992 by The University of Chicago. All rights reserved.
0737-4038/92/0901-0011$02.00
It is useful to briefly review relevant background information on the function
classes to be considered:
1. The family of GCR proteins is a large and diverse group including receptors
for neurotransmitters, hormones, and photons. The projected common structure of
these proteins consists of an extracellular N-terminus, seven membrane-spanning helices, and a cytoplasmic C-terminus. The N- and C-terminals and the third cytoplasmic
loop of GCRs vary greatly in length, whereas the transmembrane segments are the
most conserved regions among different receptors. Various residues in the transmembrane regions are involved in ligand binding, and residues in the third cytoplasmic
loop in some cases are responsible for selective G protein coupling (e.g., see Jackson 1990).
2. The herpesviruses are morphologically similar DNA viruses with large double-stranded linear genomes found widely distributed in vertebrate species. After initial
infection they commonly persist in the host in a latent state from which they can
reactivate. However, with respect to many other properties, such as cell tropism, route
of infection, and host pathology, the herpesvirus family exhibits much diversity. On
the basis of a variety of biological criteria, herpesviruses have been classified into α-,
β-, and γ-subgroups (Roizman 1990). Members of the α-group, including herpes
simplex virus 1 (HSV1) and varicella-zoster virus (VZV), typically infect epithelial
cells and establish latency in neural tissue. The β-group prototype human cytomegalovirus (HCMV) appears to have a broad tissue distribution and a slow growth cycle.
γ-Herpesviruses [e.g., Epstein-Barr virus (EBV)] are lymphotropic viruses that generally
persist as circular episomes in dividing cells. Four have been completely sequenced
[HSV1 (McGeoch et al. 1988), VZV (Davison and Scott 1986), HCMV (Chee et al.
1990a), EBV (Baer et al. 1984)]. The gross organization of these genomes is generally
similar in that each consists of a long and a short unique region (UL and US, respectively) flanked and separated from each other by repeated segments. However, there
is substantial variation in both size (from ≈125 kb in VZV to ≈230 kb in HCMV)
and base composition (from 46% G+C in VZV to 68% G+C in HSV1; for review,
see Chee and Barrell 1990). Comparative genome analysis has shown that the UL
region contains a core segment of ~40 genes that are conserved to various degrees
across all herpesviruses (Chee et al. 1990a). Homology between genes has been established either by significant sequence similarity, positional correspondence (allowing
for some rearrangements in gene order and orientation), or genetic complementation.
3. The strongest subalignments comparing the neurofibromatosis gene NF1 to
~15,000 protein sequences were found with IRA1 and GAP, confined to the 360-residue putative catalytic domains (Xu et al. 1990). The IRA1 and GAP polypeptides
are presumed to interact with p21 ras, a protein implicated in neoplasia and growth
control (Parson 1990). It has been established that the IRA1 protein down-regulates
yeast ras proteins (Tanaka et al. 1989). Mammalian GAP, found ubiquitously in cells,
can curtail the wild-type mammalian H-ras when expressed in yeast (Ballester et al.
1989). On this basis, NF1 is projected as a negative oncogene.
For comparison, we end the present paper with a brief discussion of representative
globin proteins from mammalian, avian, and amphibian species. This is the seminal
protein family of common established function, structure, and origin. There is considerable literature dissecting conserved and variable regions of the globin genes (Lesk
and Chothia 1980; Dickerson and Geis 1983). By our criteria, we find no functionally
differentiated regions among the globin homologues.
Methods
We describe the following four classes of distinctive amino acid sequence configurations that may underlie functional differences among similar proteins:
1. Charge Distribution
Systematic studies to identify and characterize (a) distributional features of
charged residues in protein sequences, and (b) their association with protein structure,
function, and organism type were initiated by Karlin et al. (1989, 1990a, 1991). Three
categories of charge configurations have been investigated. A charge cluster refers to
a short (25-75 residues) protein segment with significantly high specific charge content
relative to the amino acid composition of the whole protein. In particular, a positive
(negative) charge cluster is a segment with high positive (negative) net charge, and a
mixed charge cluster is a segment high in charge residues of both signs. An acidic,
basic, or mixed charge run signifies a stretch (typically, 7-12 residues long) of successive
charged residues of the proper sign that allow for, at most, one or two intermittent
errors. A charge run is said to be statistically significant if the probability of observing
the indicated run length is <0.01 in a corresponding random sequence of the same
amino acid composition. A residue segment is called a hyper-charge run under the
following two conditions: (1) the run is statistically significant relative to its own
composition, and (2) the probability of observing a run of this length or longer would
be $<10^{-5}$ for a random protein sequence of the same length but with charge frequencies
of an average protein (i.e., ~11.5% acidic and ~11.5% basic). Generally, for a protein
of 400 amino acids, a pure hyper-charge run would need to exceed eight charged
residues. Periodic charge patterns are repetitive arrangements of charged and uncharged
amino acids. Rigorous definitions, implementations, and statistical assessments of
these charge configurations have been elaborated by Karlin et al. (1990a).
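The run-significance criterion above can be illustrated with a small simulation. The following is only a minimal sketch and not the procedure of Karlin et al. (1990a): it assumes that a "random sequence of the same amino acid composition" can be approximated by shuffling the observed sequence, it counts only uninterrupted runs (the one or two permitted errors are ignored), and it treats K and R as the basic residues and D and E as the acidic ones. The function names `longest_run` and `run_p_value` are hypothetical.

```python
import random

# Assumption for illustration: histidine is not counted as a charged residue.
ACIDIC = set("DE")
BASIC = set("KR")


def longest_run(seq, residues):
    """Length of the longest uninterrupted stretch of letters drawn from `residues`."""
    best = current = 0
    for aa in seq:
        current = current + 1 if aa in residues else 0
        best = max(best, current)
    return best


def run_p_value(seq, residues, trials=2000, seed=0):
    """Monte Carlo estimate of P(longest run >= observed) over shuffled sequences.

    Shuffling keeps the amino acid composition fixed, which stands in for the
    'random sequence of the same composition' in the text.
    """
    rng = random.Random(seed)
    observed = longest_run(seq, residues)
    letters = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if longest_run(letters, residues) >= observed:
            hits += 1
    return observed, hits / trials


if __name__ == "__main__":
    # Toy sequence, not a real protein: ten lysines in a row within 100 residues.
    toy = "A" * 45 + "K" * 10 + "A" * 45
    print(run_p_value(toy, BASIC))
```

A run would be called significant in this sketch when the estimated probability falls below 0.01; with a finite number of shuffles, very small probabilities such as the hyper-charge threshold cannot be resolved and would need an analytic formula instead.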
Our previous studies have revealed several associations between significant charge
configurations and species/protein function. Thus, (1) charge clusters are frequent in
certain classes of eukaryotic proteins but are scarce in prokaryotic proteins of all kinds
(Karlin and Brendel 1988); (2) multiple charge clusters in conjunction with distinctive
periodic charge patterns and long DNA repeats characterize the EBV proteins expressed in the latent state (Blaisdell and Karlin 1988); (3) charge clusters are common
in both eukaryotic nuclear transcription and replication factors (Brendel and Karlin
1989a) but are uncommon among cytoplasmic enzymes and housekeeping proteins
(Karlin 1990); (4) Drosophila developmental control proteins often contain multiple
charge clusters and special charge patterns; and (5) very long anionic and mixed
charge runs are prominent among nuclear autoantigens (Brendel et al. 1991).
Differences in charge distribution may correspond to diverse protein folded structure and function. Electrostatic interactions are known to facilitate protein sorting,
translocation, orientation, and binding to DNA and other proteins (e.g., see Kalderon
et al. 1984; von Heijne 1986; Ma and Ptashne 1987). Charge clusters of opposite sign
may contribute to the formation of multiprotein complexes. Charge clusters of like
sign may help maintain separation between certain protein assemblages. Charge clusters
should increase the solubility of the protein in aqueous media. Multiple charge clusters
within one protein might contribute to intramolecular folding or cooperative protein-protein and protein-nucleic acid interactions.
2. Amino Acid Usage and Quantiles
The quantile distribution $Q(x)$ for a residue type for a given set of proteins
describes the percent of proteins in which that residue type occurs with frequency
$\le x$. For most residue types, the medians (50% quantile) and interquartile range
(25%-75%) are quite similar across prokaryote and eukaryote species (Karlin et al.
1991). The 1%, 5%, 95%, and 99% quantile distributional values of amino acids for
mammalian species provide standards by which to affirm extremes of amino acid
usage for any particular mammalian protein or protein family. For example, cysteine
is used, on average, ~1% among Escherichia coli and yeast proteins with a narrow
quantile distribution, but is used, on average, >2% among mammalian proteins with
a broad quantile distribution.
3. Repetitive Structures
Repetitive structures occur at several levels and in several forms. One category of repeats pertains to the aggregate
of multiplets comprising all homodipeptides XX ($=X_2$), homotripeptides XXX ($=X_3$),
etc., where X denotes any amino acid. A statistical assessment of the counts and locations of these multiplets is attained by comparing the observed multiplet set to the
multiplet distribution in a random reconstruction of the protein sequence. The count
of multiplets, $\omega$, provides a measure of the homopeptide density of the protein sequence.
When $\omega$ is excessively large compared with expectation, a significant multiplet realization is inferred. The significance test is done as follows: Let $f_i$ be the frequency of
the $i$th amino acid in the protein sequence. For a random sequence the probability
of seeing at a given position the start of a reiteration of length 2, 3, or 4, etc. of amino
acid type $i$ is $f_i^2(1-f_i)^2 + f_i^3(1-f_i)^2 + f_i^4(1-f_i)^2 + \cdots = f_i^2(1-f_i)$. Therefore,
the probability of encountering a multiplet (i.e., any homopeptide) at a given position
is
$$f_0 = \sum_i f_i^2 (1 - f_i).$$
When the observed $\omega$ exceeds $N f_0 + 3\sqrt{N f_0 (1 - f_0)}$, where $N$ is the sequence length,
the protein is deemed to carry a significantly high multiplet count $\omega$. It is further
possible to discern anomalous characteristics of the spacings distribution induced by
multiplets (see next section).
There is also interest in proteins containing one or more substantial homopeptides,
i.e., a multiplet of the form $X_n$ for some $n \ge 5$ (see our discussion of the HCMV
proteins, below). It is widely known that many Drosophila developmental proteins
contain long homopeptides, particularly of glutamine, asparagine, histidine, alanine,
glycine, and proline, the functional role of which is unclear.
Multiple repeated peptides in a protein not restricted to homopolymeric form
can also be a distinguishing characteristic among similar proteins. Prominent examples
include the proteins of the EBV latent cycle, various eukaryotic extracellular cytoskeletal
proteins (e.g., collagens and fibronectins), and the major cofactors of the coagulation
cascade.
Periodic repeats such as the leucine zipper and its generalizations to other amino
acid types can further reflect on mechanisms of oligomerization and amphipathic
structures (see Landschulz et al. 1988; Brendel and Karlin 1989b; Karlin et al. 1991).
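The multiplet test of this section can be sketched in a few lines. This is an illustrative reading only: a multiplet is counted as one maximal run of two or more identical residues, the per-position probability is taken as $f_0 = \sum_i f_i^2(1-f_i)$ with frequencies estimated from the sequence itself, and the three-standard-deviation margin is assumed here to use the binomial standard deviation $\sqrt{N f_0 (1-f_0)}$. The function names are hypothetical.

```python
from collections import Counter
from itertools import groupby
from math import sqrt


def count_multiplets(seq):
    """Number of maximal homopeptide runs of length >= 2 (XX, XXX, ...)."""
    return sum(1 for _, run in groupby(seq) if len(list(run)) >= 2)


def multiplet_site_probability(seq):
    """f0 = sum over residue types of f_i^2 * (1 - f_i)."""
    n = len(seq)
    return sum((c / n) ** 2 * (1 - c / n) for c in Counter(seq).values())


def multiplet_threshold(seq):
    """Expected count plus three standard deviations (binomial sd assumed)."""
    n = len(seq)
    f0 = multiplet_site_probability(seq)
    return n * f0 + 3 * sqrt(n * f0 * (1 - f0))


def has_significant_multiplets(seq):
    """True when the observed multiplet count exceeds the threshold."""
    return count_multiplets(seq) > multiplet_threshold(seq)


if __name__ == "__main__":
    toy = "MAAGSSSLKKPQQT"  # toy sequence, not a real protein
    print(count_multiplets(toy), round(multiplet_threshold(toy), 2))
```

For short sequences the threshold is dominated by the standard-deviation term, so the test is only informative for proteins of realistic length.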
4. Spacings Distribution
Consider a sequence of length $N$ and a specified word type (amino acid or peptide)
or general marker with $n$ occurrences randomly distributed in the sequence. These
induce $n + 1$ spacings $(U_0, U_1, \ldots, U_n)$, where $U_i$ is the distance (number of units)
from the $i$th occurrence of the marker to the $(i+1)$st occurrence. Our statistical analysis
centers on the extremal spacings $m^* = \min\{U_0, U_1, \ldots, U_n\}$ and $M^* = \max\{U_0, U_1, \ldots, U_n\}$. To detect clustering of sites, we check whether the minimum spacing
is significantly small for the postulated random distribution of sites. Similarly, to
decide whether some adjacent marker locations are excessively far apart, we check
whether the maximum $M^*$ is especially large (for applications of these ideas to the
E. coli Kohara physical map data, see Karlin and Macken 1991). The distributional
properties of $m^*$ and $M^*$ are classical (e.g., see Feller 1971).
The criterion for an extreme minimum at the 0.01 significance level involves the
determination of $a^*$ such that the probability that $m^* \ge a^*$ is 0.99. In a parallel way
the criterion for a significantly large $M^*$ can be established.
5. Sequence Comparisons and Similarity Scores
We use two methods. The first is a multiple sequence-matching algorithm elaborated by Karlin et al. (1988). A repeat segment is an aggregate of exactly repeated
words, each of length $\ge K$ and separated by error blocks, each of length $\le e$ letters. $K$
should be chosen such that $m^K \le n \le m^{K+1}$, where $n$ is the average length of the
sequences considered and $m$ is the alphabet size. Then, in a random sequence of the
same composition and length, each $K$ word would be expected to occur, at most, $m$
times (Leung et al., accepted). Identification and statistical significance of matching
segments follow the analysis of Karlin et al. (1990b). The second method works via
pairwise comparisons involving a general score matrix for amino acid matching (for
details, see Altschul et al. 1990; Karlin and Altschul 1990). Both procedures take into
account the compositional bias of the compared proteins.
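The choice of word length in the first method can be made concrete. The sketch below takes the constraint on $K$ to be $m^K \le n \le m^{K+1}$ and, as a simplifying assumption, treats all $m$ letters as equally frequent when estimating how often a given $K$-word is expected to occur; the published procedure accounts for actual composition. The function names are hypothetical.

```python
def word_length(avg_len, alphabet_size=20):
    """Largest K with alphabet_size**K <= avg_len.

    Integer arithmetic is used instead of logarithms to avoid rounding
    problems at exact powers. When avg_len equals m**(K+1) exactly, both K
    and K+1 satisfy the constraint; this sketch returns the larger value.
    """
    if avg_len < 1:
        raise ValueError("average sequence length must be at least 1")
    k = 0
    while alphabet_size ** (k + 1) <= avg_len:
        k += 1
    return k


def expected_word_count(avg_len, k, alphabet_size=20):
    """Rough expected occurrences of one K-word under uniform letter usage."""
    return avg_len / alphabet_size ** k


if __name__ == "__main__":
    for n in (350, 400, 1200):  # illustrative average protein lengths
        k = word_length(n)
        print(n, k, expected_word_count(n, k))
```

Because $n \le m^{K+1}$, the expected count $n / m^K$ never exceeds $m$, which is the bound quoted in the text.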
Results and Discussion
I. GCR Proteins
Several new members of the GCR family have recently been promulgated on the
basis of sequence similarities. Thus, Chee et al. (1990b) suggested three open reading
frames (ORFs) (US27, US28, and UL33) of HCMV as likely to encode GCRs. Amino
acid segment identities were proffered among US27, US28, and UL33 and with several
known or putative GCRs, including human β2-adrenergic receptor (ADR), human
serotonin receptor (STR), human MAS oncoprotein (MAS), bovine substance-K
receptor (SKR), and bovine rhodopsin (RHOD). Ross et al. (1990) isolated a cDNA
clone from a rat thoracic-aorta library encoding a 343-amino-acid protein (i.e., RTA)
with sequence similarity to MAS. It is interesting that several experimental criteria
applied by these authors argue that RTA is not an angiotensin receptor, a function
associated with MAS. Two different endothelin receptors bear sequence similarity to
GCR proteins. One of them is specific for endothelin-1 (Arai et al. 1990), whereas
the other binds all three endothelins (Sakurai et al. 1990).
FIG. 1.-Similarities and differences among representative mammalian G protein-coupled receptors.
The sequences are drawn approximately to scale (lengths are indicated on the right). The locations of
putative transmembrane segments (boxes 1-7) and of charge clusters (ovals indicating the charge sign: + for
basic clusters, - for acidic clusters, and ± for mixed charge clusters; see Karlin et al. 1990a) are shown on
the sequence lines. Patterned bars underneath the sequence lines delineate segments of significant similarity
between any two of the proteins (different patterns are used for each pair of sequences, e.g., --- for ADR/
STR, 000 for ADR/SKR, etc.). Our significance criterion required that the probability of observing an
equal matching score (with the PAM-250 amino acid substitution matrix) be <0.01 when random sequences
of the same respective lengths and compositions are compared (see Karlin and Altschul 1990).
Significant sequence differences among the GCRs may reflect determinants of
ligand or target protein specificities, variations of regulation and transport, or oligomerization mechanisms. We shall focus on ADR, STR, SKR, RHOD, MAS, and RTA
as representative GCR sequences. All these proteins are in the upper 10% quantile
with respect to the frequency of hydrophobic residues (i.e., <10% of mammalian
proteins in the data banks are equally high in the aggregate frequency of L, V, I, F,
and M) and are in the lower 10% quantile in the aggregate of charged residues. Such
a compositional bias can easily interfere with proper assessments of sequence similarities. Using the statistical methods for establishing pairwise and multiple sequence
subalignments (see Methods), we found that all pairs of the representative GCRs
share one or more similarity segments, with the exceptions of (a) MAS, which has no
region of significant similarity to either STR or RHOD (also, the MAS-ADR comparison reveals only a short segment of similarity, and the MAS-SKR comparison
aligns MAS transmembrane segment 6 with SKR transmembrane segment 5, out of
order; see fig. 1), and (b) RTA, which only matches (very strongly) with MAS. Less
than half of the comparisons involve two or more separate similarity segments. Most
of the pairwise similarities and the significant multiple alignments are confined to
transmembrane domains, while the N- and C-terminals and the third cytoplasmic
loop are generally distinct and often significantly distinct (see below).
US27 and US28 show mutual similarity over the seven putative transmembrane
segments, whereas UL33 only has a short N-terminal piece of similarity with US27
and no similarity to US28 (fig. 2). Of the 18 comparisons between the three cytomegalovirus sequences and the six representative GCRs, only eight yielded significant
similarity matches, and only one of these comparisons involved more than a single
matching segment. In detail the similarities between the sequences are of quite varied
extent and location (figs. 1 and 2).
FIG. 2.-Similarities and differences among putative HCMV G protein-coupled receptors. Symbols
are as in fig. 1. Labeled boxes indicate segments of significant similarity to the mammalian GCRs for which
the corresponding segments occur in analogous positions with respect to the transmembrane domains (see
fig. 1).
There are salient differences in the nonsimilar regions. This is particularly manifest
in the distribution of charged residues within each sequence. We have ascertained
charge clusters (see Methods) in US27, ADR, STR, MAS, and SKR but none in any
of the other sequences (figs. 1 and 2). In US27, MAS, and SKR a mixed charge cluster
occurs at the C-terminus, while ADR and STR feature a mixed charge cluster in the
third cytoplasmic loop. In MAS the charge cluster involves 11 basic and four acidic
residues in a stretch of 28 residues. RTA at the same location has only six basic and
five acidic residues, a concentration which is statistically not significant. SKR has a
second (basic) charge cluster in the third cytoplasmic loop. The four human opsins,
like their sequenced bovine counterpart (i.e., RHOD), are devoid of distinguished
charge configurations, while the Drosophila opsins, like ADR and STR, have a mixed
charge cluster in the third cytoplasmic loop (data not shown). This charged region is
the most conserved segment among the Drosophila opsins and is the site of interaction
with cytoplasmic proteins (Applebury and Hargrave 1986). The only other amino
acid cluster in all these proteins occurs with ADR, a proline cluster (positions 269-292) of 14 prolines in a segment of 24 residues, including the iterated pattern (PX)9.
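The residue counts quoted in this paragraph (for example, 15 charged residues in a 28-residue stretch of MAS against 11 in the corresponding stretch of RTA) can be placed on a rough probability scale with a binomial tail. This is a deliberately naive sketch and not the cluster-scoring procedure of Karlin et al. (1990a): it assumes independent positions, a single prespecified window, and a background frequency supplied by the user. The background value used below is hypothetical, since the overall charge content of these proteins is not given here.

```python
from math import comb


def window_tail_probability(window, hits, p):
    """P(X >= hits) for X ~ Binomial(window, p).

    `p` is the background frequency of the residue class in the whole
    protein; `window` is the segment length; `hits` is the observed count.
    """
    if not 0 <= hits <= window:
        raise ValueError("hits must lie between 0 and window")
    return sum(
        comb(window, k) * p ** k * (1 - p) ** (window - k)
        for k in range(hits, window + 1)
    )


if __name__ == "__main__":
    background = 0.20  # hypothetical charged-residue fraction, for illustration
    for label, hits in (("MAS-like window", 15), ("RTA-like window", 11)):
        print(label, window_tail_probability(28, hits, background))
```

Because a real scan examines every window position and many window lengths, a tail probability of this kind overstates significance unless it is corrected for the number of windows tested.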
The considered GCR sequences are unusual in amino acid usage in several respects, relative to a set of some 550 distinct human protein sequences. The GCRs are
beyond the 90% quantile point for hydrophobic residues (i.e., >90% of the control
set displays a lower fraction of hydrophobic residues) and below the 10% quantile
point for acidic residues and total charge. MAS, RHOD, US27, and US28 are particularly rich (above the 95% quantile points) in tyrosine; ADR, MAS, SKR, RHOD,
and US28 in phenylalanine; and MAS, RHOD, US27, and US28 in valine. UL33
alone is high in threonine (in fact, above the 99% quantile point). MAS and US28
are in the low 5% quantile with respect to glutamine and glycine usage. Leucine,
although generally the most abundant residue, is not at an extreme quantile for any
of the sequences.
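The quantile statements above amount to locating one protein's residue frequency within the distribution of the same frequency over a reference set. A minimal sketch follows, with the hydrophobic class taken as L, V, I, F, and M as defined earlier in this section; the reference values are placeholders, since the 550-protein control set is not reproduced here, and the function names are hypothetical.

```python
from bisect import bisect_right

HYDROPHOBIC = set("LVIFM")


def aggregate_frequency(seq, residues):
    """Fraction of positions in `seq` occupied by any residue in `residues`."""
    return sum(aa in residues for aa in seq) / len(seq)


def quantile_position(value, reference_values):
    """Fraction of the reference set with a frequency <= `value`."""
    ordered = sorted(reference_values)
    return bisect_right(ordered, value) / len(ordered)


if __name__ == "__main__":
    # Placeholder reference frequencies, NOT the human control set of the text.
    reference = [0.22, 0.25, 0.27, 0.28, 0.30, 0.31, 0.33, 0.35, 0.36, 0.41]
    toy = "MLLVIFAGLLVVKDEFLIMS"  # toy sequence, not a real receptor
    freq = aggregate_frequency(toy, HYDROPHOBIC)
    print(freq, quantile_position(freq, reference))
```

A protein would be described as beyond the 90% quantile point when `quantile_position` returns a value above 0.90 against the chosen reference set.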
The two endothelin receptors are very similar to each other over most of their
lengths (not including some 80 amino-terminal and about 40 carboxy-terminal residues) and share one or two segments of similarity with all the mammalian GCRs,
with the exception of RTA. No anomalies are evident in their charge distributions.
One of the receptors (Arai et al. 1990) contains a run of four consecutive cysteines
in the C-terminal region, a pattern not found in any other protein in the data bases.
In summary, the GCRs are mainly conserved in the transmembrane regions (the
similarity is, to a great extent, consequent on the strong compositional bias toward
hydrophobic residues), while the third cytoplasmic loop and C-terminus are generally
discriminating in charge distribution (presumably in relation with target protein specificities). The N-terminus and the second extracellular loop are mostly neither conserved nor imbued with statistically significant sequence differences.
II. Herpesvirus Homologues
A high proportion of the homologous genes shared by the α-, β-, and γ-herpesviruses show significantly different charge residue distributions. Specifically, among
the 26 genes that exhibit detectable sequence similarities between HSV-1, HCMV,
and EBV (Chee et al. 1990a) (HSV-1 is chosen as representative of the α-subgroup),
14 triplets include at least one gene product that carries a significant charge cluster,
but presence, type, and/or location of the charge clusters are in no case invariant over
all three homologues. Below we compare charge configurations in
three principal examples from different function classes: (1) the large subunit
of the enzyme ribonucleotide reductase, (2) the HSV-1 transactivator protein UL54
(IE63) and its homologues, and (3) the putative gene products related to HSV-1
UL10, speculated to be membrane associated. These examples also represent different
levels of sequence similarity in the conserved regions, from highly significant in ribonucleotide reductase to borderline in the transactivator proteins and with the UL10
group intermediate.
The large subunit of the ribonucleotide reductase (RNR1) of HSV-1 is ~300
amino acids longer than its counterpart in EBV (BORF2), from which it is further
distinguished by a negative charge cluster near the N-terminus (fig. 3A). The corresponding protein in HCMV (UL45) is intermediate in size but resembles HSV-1 in
also having an N-terminal negative charge cluster. Experimental studies have shown
that the N-terminal domain of HSV-2 RNR1 (which is very similar to HSV-1 RNR1)
has cell-transforming activity (Jones et al. 1986). The HCMV homologue is poorly
conserved if the similarity between the HSV-1 and EBV proteins is taken as a standard
(the evolutionary distances between the three viruses are approximately the same).
Moreover, two sequence motifs found among ribonucleotide reductases, i.e., GXGX2G,
presumed to interact with the substrate (Nikas et al. 1986), and GX2NSX3AXMP,
proposed as a diagnostic signature of this enzyme subunit (Bairoch 1990), are corrupted in the HCMV protein. These findings, together with the observation that the HCMV genome lacks
the small subunit of ribonucleotide reductase, suggest that the UL45 gene product has
lost its original function. The shared charge cluster with the HSV-1 and HSV-2 counterparts perhaps provides a different activity in virus multiplication and host interactions.
[Fig. 3 panels: A, HCMV UL45, EBV BORF2, HSV-1 UL39; B, HCMV UL69, EBV BMLF1, HSV-1 UL54; C, EBV BBRF3, HSV-1 UL10, HCMV UL100]
FIG. 3.-Occurrence of charge clusters in three groups of homologous proteins from human herpesviruses
HSV-1, EBV, and HCMV. Symbols are as in fig. 1. Also indicated are the locations of homopeptides and
of repetitive patterns. The locations of segments of significant sequence similarity within each group (for
definition, see legend to fig. 1) are shown beneath the sequence lines.
The HSV-1 UL54 protein and its homologues in the other herpesviruses display
strong divergence in charge configurations (fig. 3B). HSV-1 UL54 carries a negative
charge cluster in the N-terminal quartile. Its counterpart in EBV, BMLF1 (the major
transactivator protein of the lytic cycle), has two distinct charge clusters in the N-terminal quartile, one of negative sign followed closely by a mixed cluster. This protein
also features a striking periodic arginine pattern (RXZ)10, where Z is predominantly
proline (eight times) and X is mostly alanine (six times). The proline residues make
it likely that this segment is part of an open coil. This unusual periodic pattern may
underlie a novel three-dimensional structure associated with a special regulatory function. The (RXZ)10 sequence is immediately followed by seven arginines at displacements of three or four (average 3.43), without intervening prolines, that could easily
form an α-helix with a line of positive charge on one side. The corresponding HCMV
protein is completely different, featuring a negative charge cluster at the other end, in
the C-terminal quartile. In addition, it has several significant homopeptides of specific
uncharged residues, including iterations of five threonines and nine prolines. The
UL54 polypeptide of HSV-1, like EBV BMLF1, is known to be a potent transcriptional
activator, a class of regulatory proteins that is generally rich in charge configurations
(Brendel and Karlin 1989a). The foregoing contrasts suggest that these homologues
of the general herpesvirus protein inventory might contribute to some of the physiological differences between the three subgroups.
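A strictly periodic pattern such as (RXZ)10 can be located with a simple pattern search. The sketch below is an illustration only, not the authors' method for periodic charge patterns: it reports stretches in which a chosen residue recurs at a fixed period for at least a minimum number of consecutive repeats, and it does not handle the variable displacements of three or four mentioned above. The function name and the toy sequence are hypothetical.

```python
import re


def periodic_runs(seq, residue="R", period=3, min_repeats=5):
    """Find stretches where `residue` recurs every `period` positions.

    Returns a list of (start_position, repeat_count) tuples, with positions
    numbered from 1 as is usual for protein sequences.
    """
    pattern = re.compile(
        r"(?:%s.{%d}){%d,}" % (re.escape(residue), period - 1, min_repeats)
    )
    return [
        (match.start() + 1, len(match.group()) // period)
        for match in pattern.finditer(seq)
    ]


if __name__ == "__main__":
    # Toy sequence built to contain ten RXZ units; it is not the BMLF1 sequence.
    toy = "MG" + "RAP" * 8 + "RSP" + "RAG" + "GGGS"
    print(periodic_runs(toy))
```

Allowing a mixture of periods, as in the run of seven arginines at displacements of three or four, would require a more flexible pattern such as an alternation over both spacings.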
The third example consists of three homologous proteins of unknown function:
HSV-1 UL10, EBV BBRF3, and HCMV UL100. The high content of hydrophobic
residues and a potential signal peptide sequence in HSV UL10 gave rise to the speculation that the corresponding gene products are membrane associated (McGeoch et
al. 1988). All three proteins contain one or more charge clusters in the C-terminal
half of the sequence (fig. 3C). The HSV-1 homologue has a positive charge cluster
and, sequentially, two distinct negative charge clusters; the EBV homologue has a
skewed mixed charge cluster, involving five acidic followed by 11 basic residues, over
a length of 42; the HCMV homologue has a mixed charge cluster with an excess of
acidic residues, including an exceptionally long run of length 10 at the C-terminus of
the cluster. In all three viruses, the charge clusters are statistically highly significant.
Table 1
Global Statistics on Proteins with Distinctive Features in Human Herpesvirus Genomes
                      No. of      No. of                        No. of Proteins  No. of       No. of Proteins with Single-Charge
          No. of      Cysteine    Cysteine    No. of            with Significant Hypercharge  Clusters/No. of Proteins with
Organism  Proteins^a  Triplets^b  Doublets^b  Homopeptides^b,c  Multiplets       Runs^b       Multiple-Charge Clusters
HCMV      142         5 (5)       53 (35)     58 (35)           14               8 (7)        32/8
HSV-1     70          1 (1)       14 (13)     13 (9)            2                1 (1)        16/3
VZV       67          0 (0)       15 (13)     13 (5)            1                0 (0)        7/1
EBV       84          1 (1)       13 (13)     15 (14)           3                0 (0)        11/7^d
a ORFs < 140 codons were excluded from the HCMV protein set.
b The numbers in parentheses are the number of proteins with at least one occurrence of the specified feature (e.g., HCMV
contains 53 cysteine doublets in a total of 35 proteins).
c Runs of five or more identical amino acids.
d Six of seven EBV proteins with multiple charge clusters are expressed only in the latent state (Blaisdell and Karlin
1988).
A number of additional examples of contrasts in the charge configurations of
homologous herpesvirus proteins can be added to this list. The DNA polymerase has
in HSV-1 a negative charge cluster (conserved in VZV) missing in EBV and HCMV.
The replication gene BSLF1 of EBV is lacking a negative charge cluster that occurs
in the other two homologues of this well-conserved protein. Different types of charge
clusters occur in the huge virion proteins of HSV and EBV, while corresponding
configurations are absent in the homologous protein of HCMV. The strongly conserved
triplet UL25 (HSV-1), BVRF1 (EBV), UL77 (HCMV) exhibits a mixed charge-cluster
polymorphism near the N-terminus, HSV-1 not carrying this feature. HCMV UL50
has a negative charge cluster in the C-terminal region, while its putative counterpart
in EBV, BFRF1, has a positively biased mixed charge cluster at the corresponding
location. As in the three main examples, these differences are potential indicators of
functional divergence.
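As a rough illustration of what scanning for charge clusters involves, the sketch below slides a fixed window along a sequence and reports windows that are unusually rich in charged residues, recording the acidic and basic counts separately so that negative, positive, and mixed clusters can be told apart. The window length, the threshold, the exclusion of histidine, and the toy sequence are all assumptions made for this example; the significance criteria actually used for the clusters discussed above (see Karlin et al. 1989) are not reproduced here.

```python
ACIDIC = set("DE")  # aspartate, glutamate
BASIC = set("KR")   # lysine, arginine (histidine is left out by assumption)

def charged_windows(seq, window=10, min_charged=7):
    """Return (start, n_acidic, n_basic) for each window with >= min_charged charged residues."""
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        n_acidic = sum(res in ACIDIC for res in segment)
        n_basic = sum(res in BASIC for res in segment)
        if n_acidic + n_basic >= min_charged:
            hits.append((start, n_acidic, n_basic))
    return hits

# Toy sequence (not a real protein) with an acidic stretch followed by one lysine.
toy = "MGSAAGLLPEEDDEEDEDKAGSTLLVAGS"
hits = charged_windows(toy)
for start, n_acidic, n_basic in hits:
    print(start, n_acidic, n_basic)
```

Overlapping hits would normally be merged into a single cluster, and a fixed count threshold is only a stand-in for a composition-dependent significance test.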
The combined analysis of similarities and dissimilarities can be expanded to deal
with complete genomes of homologous organisms. Such studies may reveal global
evolutionary trends associated with speciation and radiation within higher-order taxa.
In the following analysis of four human herpesvirus genomes, we will show that HCMV
is in many respects an outlier within the herpesvirus realm, a conclusion that could
not have been reached by evaluating sequence similarity alone.
The HCMV genome is among the largest of herpesviruses, with more than twice
as many ORFs as compared with other herpesviruses. The HCMV proteins tend to
be much richer in distinctive sequence features of the kind associated with differentiated
functions, as suggested in the previous examples. Table 1 displays several comparative
analyses for the aggregate protein sets of human herpesviruses. Note the abundance
of cysteine doublets and triplets in HCMV (cysteine triplets are generally scarce, not
only in herpesviruses of the α- and γ-subgroups but also in cellular organisms: there
are three occurrences in 748 distinct human proteins, and there is one in 856
Escherichia coli proteins). The HCMV proteins also carry a plethora of homopeptides and
more than a dozen polypeptides containing significantly many multiplets (see table
1). Perhaps the most striking finding is the occurrence of eight hyper-charge runs in
a total of 142 HCMV proteins, as compared with only one in 221 pooled proteins
from the other three genomes. A similar contrast, though less extreme, is provided by
the numbers of proteins having charge clusters.
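One way to gauge such a contrast is a one-sided Fisher exact test on the number of proteins carrying at least one hyper-charge run: seven of 142 in HCMV versus one of 221 in the pooled set (the parenthesized counts in table 1). The choice of test is ours, made for illustration only; it is not a calculation reported in the text, and it treats proteins as independent units, which is itself an assumption.

```python
from math import comb

def fisher_one_sided(k, n, K, N):
    """P(X >= k) for a hypergeometric X: n items drawn without replacement
    from a population of N that contains K 'successes'."""
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

hcmv_with, hcmv_total = 7, 142     # HCMV proteins with a hyper-charge run / all HCMV proteins
other_with, other_total = 1, 221   # same counts for HSV-1, VZV, and EBV pooled

p = fisher_one_sided(
    hcmv_with,
    hcmv_total,
    hcmv_with + other_with,
    hcmv_total + other_total,
)
print(f"one-sided p = {p:.4f}")
```

The same function can be applied to any other column of table 1 that gives per-protein counts.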
What could be the biological significance of the high incidence of distinctive
sequence features in HCMV? We propose that it reflects on a correspondingly high
proportion of specialized proteins in the genome, accounting for the broad tissue- and
cell-type range of this species, especially in the latent state. The discovery of gene
families in HCMV further supports this notion, suggesting the existence of tissue-specific variants for certain protein functions as a coadaptation to a corresponding
diversity in the host genome. It is also compatible with the narrow species range of
cytomegaloviruses, because functionally differentiated protein sequences are generally
subject to more rapid evolutionary change than are those carrying out housekeeping
functions. In summary, the frequent occurrence of distinctive sequence configurations
in HCMV reflects its survival and propagation strategy: maintenance of a high infection
rate through diversified mechanisms of latency rather than through rapid lytic multiplication.
Homologous proteins among herpesviruses highlight both similar and dissimilar
sequence features. Distinctions with respect to charge distribution are paramount in
the terminal portions of the sequences. The striking pattern (RXZ)10 (R = arginine,
X = mostly alanine, and Z = predominantly proline), present among herpesviruses
only in the potent transactivator gene product BMLF1 of EBV, may portend a novel
functional three-dimensional structure. A global analysis of herpesvirus genomes reveals
an unusually high incidence of distinctive sequence features in HCMV, probably related
to diversified mechanisms of latency.
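A periodic pattern of the (RXZ)n kind can be located by looking for the longest stretch in which a given residue recurs at every third position. The sketch below checks only this skeleton (the recurring arginine) and places no constraint on the X and Z positions; the function name and the toy sequence are hypothetical, and no significance assessment is attempted.

```python
def longest_periodic_run(seq, residue="R", period=3):
    """Return (start, repeats) of the longest stretch in which `residue`
    recurs at every `period`-th position; (-1, 0) if the residue is absent."""
    best_start, best_repeats = -1, 0
    for phase in range(period):
        run_start, repeats = None, 0
        for pos in range(phase, len(seq), period):
            if seq[pos] == residue:
                if repeats == 0:
                    run_start = pos
                repeats += 1
                if repeats > best_repeats:
                    best_start, best_repeats = run_start, repeats
            else:
                repeats = 0  # the periodic stretch is broken in this phase
    return best_start, best_repeats

# Toy sequence (not the BMLF1 sequence): ten R-A-P units flanked by unrelated residues.
toy = "MSTL" + "RAP" * 10 + "GGEL"
start, repeats = longest_periodic_run(toy)
print(f"longest period-3 arginine pattern: {repeats} repeats starting at {start}")
```

Calling the function with `residue="K"` would scan in the same way for lysine-based patterns.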
III. NF1, GAP, and IRA1
It is somewhat surprising that the similarities between the human NF1 gene and
the yeast IRA1 protein sequence are stronger and extend well beyond the region of
similarities between NF1 and GAP. How do NF1, GAP, and IRA1 compare or differ
in other sequence attributes? IRA1 and GAP both have several significant homopeptides
and a significantly long stretch of uncharged residues in their N-terminal regions (fig.
4), whereas corresponding sequence structures are absent from NF1. However, the
complete sequence of the N-terminal end of NF1 is not yet available. The homopeptides
and uncharged regions emphasize different residue types in the two proteins. GAP
contains substantial runs of alanine and proline, while the corresponding homopeptides
of IRA1 consist of serine and asparagine. Protein-specific amino acid preferences are
manifest in the charge-free segments where GAP carries an excess of the small nonpolar
residues glycine and alanine, whereas IRA1 contains an abundance of serine and
threonine. The latter also has a significant periodic charge pattern (KXX)5 (the X
components include two prolines) upstream from the conserved region. The GAP
protein, on the other hand, exhibits a significant positive charge cluster in its
N-terminal domain and contains over its entire sequence a statistically high number of
amino acid doublets and triplets.
FIG. 4.-Similarities and differences among yeast IRA1, human NF1, and bovine GAP. The regions
of high (black) and weak (stippled) similarity are according to Xu et al. (1990). Also indicated are the
locations of homopeptides, of a basic charge cluster in GAP, and of long uncharged regions in IRA1
(S/T-rich) and GAP (A/G-rich).
The sequence features described above fall into a region where no similarity has
been documented by sequence-alignment methods and therefore might reflect functional
or structural characteristics that require compositional properties rather than a
precise arrangement of certain residues. The similarly positioned runs and uncharged
regions involving different amino acids suggest two possible roles (not exclusive),
depending on whether the common aspects or the differences are paramount. Concerning
the homopeptides, for instance, one might propose either (1) that they assume
functionally equivalent extended structures by virtue of being repetitions of identical
elements or, alternatively, (2) that they mediate interactions with diverse subcellular
targets, through the different physicochemical properties of their building blocks.
Closer to the catalytic domain but still on its N-terminal side, IRA1 carries a run
of five isoleucines, a phenomenon which is extremely rare in nature (only four
independent examples are in the current NBRF data base). This occurrence falls into
the region where the sequence has been aligned with NF1 (but not with GAP). The
strongly hydrophobic character of this short segment is not conserved in the human
protein. It is of note that all the significant homopeptides reported (with the exception
of A5 in GAP) derive from varied synonymous codon usage and thus do not reflect
simple repeats on the DNA level.
A different picture emerges from a similar analysis of the C-terminal domains.
The C-terminus of GAP falls immediately after the conserved catalytic domain,
rendering this protein less than half the size of the other two (fig. 4). NF1 and IRA1
display similarity over an additional 800 amino acids covering most of their C-terminal
regions. Only at the very ends do sequence-alignment procedures fail to detect
similarity. The sequences carboxyl-terminal to the catalytic domain are amorphous both
in the types of sequence features discussed above and in those seen in the previous
examples. Over a cumulative length of ~2,300 amino acids, there is not a significant
run, cluster, or repetitive pattern to report.
The human NF1, bovine GAP, and yeast IRA1 proteins are similar over the
catalytic domain. However, GAP and IRA1 involve significant differences in the
N-terminal third, emphasizing a surfeit of amino acid doublets and other repetitive
structures.
IV. Globin Families
There is a vast literature on globin-protein primary sequences and crystal structures
from a wide range of phylogeny (e.g., see Dickerson and Geis 1983). Moreover, complete
α- and β-globin DNA sequences (exons, introns, and flanks) are available for
many mammalian, avian, amphibian, and fish species. Several conserved structural
and functional features of the underlying DNA and of the hemoglobin molecule are
well established (e.g., see Lesk and Chothia 1980; Dickerson and Geis 1983). Goodman
(1981) has studied, in the quaternary structure of α- and β-globins, the correlation
between the degree of amino acid conservation and the functional type of amino acid
site. He found that, for mammals, amino acids contacting the heme and those
participating in the cooperativity that facilitates oxygen transportation, movable interchain
[α-β (CJ) sites] contacts, and DPG (diphosphoglycerate) contacts are most strongly and
approximately equally conserved. The fixed contacts α-β (x sites), the interhelix
positions, and the otherwise unclassified sites are relatively and approximately equally
variable.
The strongest similarity at the DNA level occurs about the exon-intron interface
(see Karlin and Ghandour 1985). If the Dayhoff (1978, p. 230) alignments for the
globins are adopted, degree of conservation across species and between multigenes
[e.g., human (β, δ, γ, and ε) and (α and ζ)] can be classified as no changes (codons
conserved), some silent substitutions, one replacement substitution, and multiple
replacements. In this classification at the DNA level, the strength of conservation (the
same for α- and β-globins), in decreasing order, is exon-intron interfaces > heme
> movable β-α contact sites > buried sites > fixed α-β contact sites > interhelix contacts
> others (data not shown). In the primary sequences the most variable region is
around the N-terminus, and the next most variable is at the C-terminus; this is true
even for such close phylogenetic species as human versus mouse.
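The four-way classification of aligned codon positions can be sketched as follows. The mapping used here is an assumption about how the categories might be operationalized, not the authors' stated procedure: a column of aligned codons is called "conserved" if all codons are identical, "silent" if the codons differ but encode one amino acid, "one replacement" if exactly two amino acids occur, and "multiple replacements" otherwise. The standard genetic code is assumed, and the function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_site(codons):
    """Classify one aligned codon position across several sequences."""
    distinct_codons = set(codons)
    amino_acids = {CODON_TABLE[codon] for codon in distinct_codons}
    if len(distinct_codons) == 1:
        return "conserved"
    if len(amino_acids) == 1:
        return "silent"
    if len(amino_acids) == 2:
        return "one replacement"
    return "multiple replacements"

# Hypothetical columns, one codon per sequence in the alignment.
print(classify_site(["CTG", "CTG", "CTG"]))
print(classify_site(["CTG", "CTC", "TTA"]))
print(classify_site(["GAA", "GAG", "GAT"]))
print(classify_site(["GAA", "GAT", "AAA"]))
```

Tallying these labels separately for each functional class of site (heme contacts, interchain contacts, and so on) would give a ranking of the kind stated above.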
By contrast to the GCRs, the herpesvirus ORF homologues, and the NF1-IRA
pairing, there appear to be no functionally differentiated regions among the globins,
in the sense of their having distinctive amino acid sequence features. In particular,
with respect to the α- and β-type globins, across and within species, our sequence
analysis revealed no significant charge configurations of any kind, no distinguished
repetitive sequence structures, and no unusual spacings of any amino acid types (data
not shown). The histidine frequencies are uniformly high, and arginine tends to be
low in most globin proteins; however, it is difficult to reliably discriminate extremes
in amino acid usages in relatively small proteins such as globins.
Although there are no functionally differentiated sequences across the globin
proteins, there are well-known differences in temporal expression (fetal vs. adult). But
this appears to be under separate genetic and environmental controls on transcription
and expression rather than due to differences within the protein sequences (Dickerson
and Geis 1983).
Perspectives
Sequence similarity can shed light on evolutionary relationships and on common
structure and function. Molecular biologists commonly search a large sequence data
base for similarity to a query sequence. Important considerations concern the sensitivity
of the search and the ability to assess statistical significance of observed matches. A
formidable statistical problem arises because of the numerous comparisons inherent
in such a search. With multiple comparisons, the probability of a chance match increases. The level of acceptance must accordingly be adjusted. Conservative statistical
methods (e.g., the Bonferroni inequalities, maximal range procedures, and nonparametric methods; see Miller 1980) are available to deal with this problem. However,
the difficulties of multiple comparisons grow concomitantly with the accelerating
growth of the data bases.
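The effect of multiple comparisons, and the Bonferroni adjustment mentioned above, can be made concrete with a few lines of arithmetic. The data base size and the significance levels below are hypothetical, and the second function assumes independent comparisons, which real data base searches only approximate.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance level that keeps the family-wise error rate <= alpha."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons

def chance_match_probability(p_single, n_comparisons):
    """Probability of at least one chance match among independent comparisons."""
    return 1.0 - (1.0 - p_single) ** n_comparisons

n_sequences = 20000  # hypothetical data base size
alpha = 0.05         # desired family-wise error rate

unadjusted = chance_match_probability(alpha, n_sequences)
per_comparison = bonferroni_threshold(alpha, n_sequences)
adjusted = chance_match_probability(per_comparison, n_sequences)

print(f"chance of a spurious match without adjustment: {unadjusted:.4f}")
print(f"Bonferroni per-comparison level: {per_comparison:.2e}")
print(f"chance of a spurious match with adjustment: {adjusted:.4f}")
```

Because the per-comparison level shrinks in proportion to the number of sequences searched, a match of fixed strength becomes harder to declare significant as the data base grows.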
While significant similarity points to evolutionarily conserved domains, significant
dissimilarity may help to delineate differentiated regions that serve functions unique
to a particular member of a family of homologous proteins. Thus, the GCRs discussed
above are similar in the transmembrane domains but very different with respect to
charge distribution in the third cytoplasmic loop and in the cytoplasmic C-terminus,
a fact presumably reflecting their target specificities. For the NF1/IRA1/GAP comparison, similarity is confined to the catalytic domain, the C-terminal sequences of
NF1 and IRA1 appear to be unconstrained, and the N-terminal sequences highlight
several distinguishing sequence patterns. We suggest that varied statistical sequence
features (such as charge configurations, segments of unusual composition, runs and
periodic patterns of certain amino acids, etc.) locate attractive breakpoints for interdomain swapping experiments and generally should promote appreciation of idiosyncratic attributes of individual members of protein families.
Acknowledgments
We express our thanks to Drs. E. B. Blaisdell, A. Campbell, E. B. Mocarski, and
L. Stryer for discussions and useful comments on the manuscript.
LITERATURE CITED
ALTSCHUL, S. F., W. GISH, W. MILLER, E. W. MYERS, and D. J. LIPMAN. 1990. Basic local
alignment search tool. J. Mol. Biol. 215:403-410.
APPLEBURY, M. L., and P. A. HARGRAVE. 1986. Molecular biology of the visual pigments.
Vision Res. 26:1881-1895.
ARAI, H., S. HORI, I. ARAMORI, H. OHKUBO, and S. NAKANISHI. 1990. Cloning and expression
of a cDNA encoding an endothelin receptor. Nature 348:730-732.
BAER, R., A. T. BANKIER, M. D. BIGGIN, P. L. DEININGER, P. J. FARRELL, T. J. GIBSON, G.
HATFULL, G. S. HUDSON, S. C. SATCHWELL, C. SÉGUIN, P. S. TUFFNELL, and B. G. BARRELL.
1984. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature
310:207-211.
BAIROCH, A. 1990. PROSITE: a dictionary of protein sites and patterns, 5th release. University
of Geneva, Geneva.
BALLESTER, R. T., T. MICHAELI, K. FERGUSON, H.-P. XU, F. MCCORMICK, and M. WIGLER.
1989. Genetic analysis of mammalian GAP expressed in yeast. Cell 59:681-686.
BLAISDELL, B. E., and S. KARLIN. 1988. Distinctive charge configurations in proteins of the
Epstein-Barr virus and possible functions. Proc. Natl. Acad. Sci. USA 85:6637-6641.
BRENDEL, V., and S. KARLIN. 1989a. Association of charge clusters with functional domains
of cellular transcription factors. Proc. Natl. Acad. Sci. USA 86:5698-5702.
BRENDEL, V., and S. KARLIN. 1989b. Too many leucine zippers? Nature 341:574-575.
BRENDEL, V., J. DOHLMAN, B. E. BLAISDELL, and S. KARLIN. 1991. Very long charge runs in
systemic lupus erythematosus-associated autoantigens. Proc. Natl. Acad. Sci. USA
88:1536-1540.
CHEE, M. S., A. T. BANKIER, S. BECK, R. BOHNI, C. M. BROWN, R. CERNY, T. HORSNELL,
C. A. HUTCHISON III, T. KOUZARIDES, J. A. MARTIGNETTI, E. PREDDIE, S. C. SATCHWELL,
P. TOMLINSON, K. M. WESTON, and B. G. BARRELL. 1990a. Analysis of the protein-coding
content of the sequence of human cytomegalovirus strain AD169. Curr. Top. Microbiol.
Immunol. 154:125-169.
CHEE, M. S., and B. G. BARRELL. 1990. Herpesviruses: a study of parts. Trends Genet.
6:86-91.
CHEE, M. S., S. C. SATCHWELL, E. PREDDIE, K. M. WESTON, and B. G. BARRELL. 1990b.
Human cytomegalovirus encodes three G protein-coupled receptor homologues. Nature
344:774-777.
DAVISON, A. J., and J. E. SCOTT. 1986. The complete DNA sequence of varicella-zoster virus.
J. Gen. Virol. 67:1759-1816.
DAYHOFF, M. O. 1978. Atlas of protein sequence and structure, vol. 5, suppl. 3. National
Biomedical Research Foundation, Washington, D.C.
DICKERSON, R. E., and I. GEIS. 1983. Hemoglobin: structure, function, evolution, and pathology.
Benjamin/Cummings, Menlo Park, Calif.
FELLER, W. 1971. An introduction to probability theory and its applications. Vol. 2, 2d ed.
Wiley, New York.
GILLESPIE, J. H. 1986. Rates of molecular evolution. Annu. Rev. Ecol. Syst. 17:637-665.
GOODMAN, M. 1981. Decoding the pattern of protein evolution. Prog. Biophys. Mol. Biol.
38:105-164.
HUGHES, A. L., T. OTA, and M. NEI. 1990. Positive Darwinian selection promotes charge
profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex
molecules. Mol. Biol. Evol. 7:515-524.
JACKSON, T. R. 1990. Cell surface receptors of nucleosides, nucleotides, amino acids and amino
neurotransmitters. Curr. Opinion Cell Biol. 2:167-173.
KALDERON, D., B. L. ROBERTS, W. D. RICHARDSON, and A. E. SMITH. 1984. A short amino
acid sequence able to specify nuclear location. Cell 39:499-509.
KARLIN, S. 1990. Distribution of clusters of charged amino acid in protein sequences.
Pp. 171-180 in R. H. SARMA and M. H. SARMA, eds. Proceedings of the Sixth Conversation in
Biomolecular Stereodynamics: Structure & Methods, vol. 2: DNA protein complexes and
proteins. Adenine, Albany, N.Y.
KARLIN, S., and S. F. ALTSCHUL. 1990. Methods for assessing the statistical significance of
molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA
87:2264-2268.
KARLIN, S., B. E. BLAISDELL, and V. BRENDEL. 1990a. Identification of significant sequence
patterns in proteins. Methods Enzymol. 183:388-402.
KARLIN, S., B. E. BLAISDELL, E. S. MOCARSKI, and V. BRENDEL. 1989. A method to identify
distinctive charge configurations in protein sequences, with an application to human
herpesvirus polypeptides. J. Mol. Biol. 205:165-177.
KARLIN, S., and V. BRENDEL. 1988. Charge configurations in viral proteins. Proc. Natl. Acad.
Sci. USA 85:9396-9400.
KARLIN, S., V. BRENDEL, P. BUCHER, and S. F. ALTSCHUL. 1991. Statistical methods and
insights for protein and DNA sequences. Annu. Rev. Biophys. Biophys. Chem. 20:175-203.
KARLIN, S., and G. GHANDOUR. 1985. Comparative statistics for DNA and protein sequences:
single sequence analysis. Proc. Natl. Acad. Sci. USA 82:5800-5804.
KARLIN, S., and C. MACKEN. 1991. Some statistical problems in the assessment of
inhomogeneities of DNA sequence data. J. Am. Stat. Assoc. 86:27-35.
KARLIN, S., M. MORRIS, G. GHANDOUR, and M.-Y. LEUNG. 1988. Algorithms for identifying
local molecular sequence features. Comput. Appl. Biosci. 4:41-51.
KARLIN, S., F. OST, and B. E. BLAISDELL. 1990b. Patterns in DNA and amino acid sequences
and their statistical significance. Pp. 133-157 in M. S. WATERMAN, ed. Mathematical methods
for DNA sequences. CRC, Boca Raton, Fla.
KIMURA, M. 1983. The neutral theory of molecular evolution. Cambridge University Press,
Cambridge, Mass.
LANDSCHULZ, W. H., P. F. JOHNSON, and S. L. MCKNIGHT. 1988. The leucine zipper: a
hypothetical structure common to a new class of DNA binding proteins. Science
240:1759-1764.
LESK, A. M., and C. CHOTHIA. 1980. How different amino acid sequences determine similar
protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol.
136:225-270.
LEUNG, M. L., B. E. BLAISDELL, C. BURG, and S. KARLIN. An efficient algorithm for identifying
matches with errors in multiple long sequences. J. Mol. Biol. (accepted).
MA, J., and M. PTASHNE. 1987. Deletion analysis of GAL4 defines two transcriptional activating
segments. Cell 48:847-853.
MCGEOCH, D. J., M. A. DALRYMPLE, A. J. DAVISON, A. DOLAN, M. C. FRAME, D. MCNAB,
L. J. PERRY, J. E. SCOTT, and P. TAYLOR. 1988. The complete nucleotide sequence of the
long unique region of the genome of herpes simplex virus type 1. J. Gen. Virol.
69:1531-1574.
MILLER, R. G. 1980. Simultaneous statistical inference. Springer, New York.
NIKAS, I., J. MCLAUCHLAN, A. J. DAVISON, W. R. TAYLOR, and J. B. CLEMENTS. 1986. Structural
features of ribonucleotide reductase. Protein Structure Funct. Genet. 1:376-384.
PARSON, J. T. 1990. Closing the GAP in a signal transduction pathway. Trends Genet.
6:169-171.
ROIZMAN, B. 1990. Herpesviridae: a brief introduction. Pp. 1287-1293 in
B. N. FIELDS and D. M. KNIPE, eds. Virology, 2d ed. Raven, New York.
ROSS, P. C., R. A. FIGLER, M. H. CORJAY, C. M. BARBER, N. ADAM, D. R. HARCUS, and
K. R. LYNCH. 1990. RTA, a candidate G protein-coupled receptor: cloning, sequencing,
and tissue distribution. Proc. Natl. Acad. Sci. USA 87:3052-3056.
SAKURAI, T., M. YANAGISAWA, Y. TAKUWA, H. MIYAZAKI, S. KIMURA, K. GOTO, and T.
MASAKI. 1990. Cloning of a cDNA encoding a non-isopeptide-selective subtype of the
endothelin receptor. Nature 348:732-735.
TANAKA, K., K. MATSUMOTO, and A. TOH-E. 1989. IRA1, an inhibitory regulator of the
RAS-cyclic AMP pathway in Saccharomyces cerevisiae. Mol. Cell. Biol. 9:757-768.
VON HEIJNE, G. 1986. Net N-C charge imbalance may be important for signal sequence function
in bacteria. J. Mol. Biol. 192:287-290.
WILSON, A. C., S. S. CARLSON, and T. J. WHITE. 1977. Biochemical evolution. Annu. Rev.
Biochem. 46:573-639.
XU, G., P. O'CONNELL, D. VISKOCHIL, R. CAWTHON, M. ROBERTSON, M. CULVER, D. DUNN,
J. STEVENS, R. GESTELAND, R. WHITE, and R. WEISS. 1990. The neurofibromatosis type 1
gene encodes a protein related to GAP. Cell 62:599-608.
ZUCKERKANDL, E., and L. PAULING. 1962. Molecular disease, evolution, and genic heterogeneity.
Pp. 189-225 in M. KASHA and B. PULLMAN, eds. Horizons in biochemistry. Academic Press,
New York.
MASATOSHI NEI, reviewing editor
Received February 25, 1991; revision received May 16, 1991
Accepted May 20, 1991