
Significant similarity and dissimilarity in homologous proteins

1992, Molecular biology and evolution

Common practice emphasizes significant sequence similarities between different members of protein families. These similarities presumably reflect on evolutionary conservation of structurally and functionally essential residues. The nonconserved regions, on the other hand, may be either selectively neutral or differentiated. We propose several distributional sequence statistics (e.g., clustering of charged residues, compositional biases, and repetitive patterns) as indicators of differentiation events. These ideas are illustrated with various examples, including comparisons among G protein-coupled receptors, herpesvirus proteins, and GTPase-activating proteins.

Samuel Karlin, Volker Brendel, and Philipp Bucher
Department of Mathematics, Stanford University

Introduction

Members of a protein family are typically composed of subregions of three types: (1) conserved regions that generally reflect a common function or structure maintained by functional constraints (negative selection), (2) freely evolved regions, generally nonfunctional, changed by random drift (selectively neutral), and (3) functionally differentiated regions adapted to new roles (positive selection). A common premise asserts that sequences important in the functioning of the proteins are those conserved for species of a broad evolutionary range. In free regions among homologous proteins, amino acid replacements are basically randomly generated, conforming to the neutral theory of molecular evolution. This course of variation can often be calibrated as a molecular clock, leading to phylogenetic reconstructions within and between species (Zuckerkandl and Pauling 1962; Wilson et al. 1977; Kimura 1983; Gillespie 1986). A prototype case of freely varying regions occurs with the extended globin family.
Key words: protein families, molecular evolution, sequence, statistics.
Address for correspondence and reprints: Samuel Karlin, Department of Mathematics, Stanford University, Stanford, California 94305.
Mol. Biol. Evol. 9(1):152-167. 1992. © 1992 by The University of Chicago. All rights reserved. 0737-4038/92/0901-0011$02.00

In other protein families, significant (nonrandom) sequence variations occur among members that putatively reflect functional differentiation in relation to organism, tissue, and cellular specificities, variant mechanisms, and degree of oligomerization. In particular, functional diversification through positive selection has been implicated for the variable regions of immunoproteins (e.g., see Hughes et al. 1990). The present paper elaborates the definition, detection, and analysis of functionally differentiated subregions by using protein sequence statistics. Sequence features to distinguish among similar proteins could be based on distribution of charged residues, hydropathy arrangements, amino acid usage, repetitive patterns, and spacings of amino acid types (see Methods; also Karlin et al. 1991). Sequence contrasts can be important in understanding the norm and variability of a protein function class. We will consider three examples underscoring statistically significant features which discriminate between members within various classes of similar protein sequences: (1) G protein-coupled receptors (GCRs), (2) homologous proteins from human herpesviruses, and (3) the recently identified human neurofibromatosis gene product NF1, compared with bovine GAP (GTPase activator protein) and yeast IRA1 protein (Xu et al. 1990). It is useful to briefly review relevant background information on the function classes to be considered: 1.
The family of GCR proteins is a large and diverse group including receptors for neurotransmitters, hormones, and photons. The projected common structure of these proteins consists of an extracellular N-terminus, seven membrane-spanning helices, and a cytoplasmic C-terminus. The N- and C-termini and the third cytoplasmic loop of GCRs vary greatly in length, whereas the transmembrane segments are the most conserved regions among different receptors. Various residues in the transmembrane regions are involved in ligand binding, and residues in the third cytoplasmic loop in some cases are responsible for selective G protein coupling (e.g., see Jackson 1990).
2. The herpesviruses are morphologically similar DNA viruses with large double-stranded linear genomes found widely distributed in vertebrate species. After initial infection they commonly persist in the host in a latent state from which they can reactivate. However, with respect to many other properties, such as cell tropism, route of infection, and host pathology, the herpesvirus family exhibits much diversity. On the basis of a variety of biological criteria, herpesviruses have been classified into α-, β-, and γ-subgroups (Roizman 1990). Members of the α-group, including herpes simplex virus 1 (HSV1) and varicella-zoster virus (VZV), typically infect epithelial cells and establish latency in neural tissue. The β-group prototype human cytomegalovirus (HCMV) appears to have a broad tissue distribution and a slow growth cycle. γ-Herpesviruses [e.g., Epstein-Barr virus (EBV)] are lymphotropic viruses that generally persist as circular episomes in dividing cells. Four have been completely sequenced [HSV1 (McGeoch et al. 1988), VZV (Davison and Scott 1986), HCMV (Chee et al. 1990a), EBV (Baer et al. 1984)]. The gross organization of these genomes is generally similar in that each consists of a long and a short unique region (UL and US, respectively) flanked and separated from each other by repeated segments.
However, there is substantial variation in both size (from ≈125 kb in VZV to ≈230 kb in HCMV) and base composition (from 46% G+C in VZV to 68% G+C in HSV1; for review, see Chee and Barrell 1990). Comparative genome analysis has shown that the UL region contains a core segment of ~40 genes that are conserved to various degrees across all herpesviruses (Chee et al. 1990a). Homology between genes has been established by significant sequence similarity, by positional correspondence (allowing for some rearrangements in gene order and orientation), or by genetic complementation.
3. The strongest subalignments comparing the neurofibromatosis gene NF1 to ~15,000 protein sequences were found with IRA1 and GAP, confined to the 360-residue putative catalytic domains (Xu et al. 1990). The IRA1 and GAP polypeptides are presumed to interact with p21 ras, a protein implicated in neoplasia and growth control (Parson 1990). It has been established that the IRA1 protein down-regulates yeast ras proteins (Tanaka et al. 1989). Mammalian GAP, found ubiquitously in cells, can curtail the wild-type mammalian H-ras when expressed in yeast (Ballester et al. 1989). On this basis, NF1 is projected as a negative oncogene.

For comparison, we end the present paper with a brief discussion of representative globin proteins from mammalian, avian, and amphibian species. This is the seminal protein family of common established function, structure, and origin. There is considerable literature dissecting conserved and variable regions of the globin genes (Lesk and Chothia 1980; Dickerson and Geis 1983). By our criteria, we find no functionally differentiated regions among the globin homologues.

Methods

We describe the following four classes of distinctive amino acid sequence configurations that may underlie functional differences among similar proteins: 1.
Charge Distribution

Systematic studies to identify and characterize (a) distributional features of charged residues in protein sequences and (b) their association with protein structure, function, and organism type were initiated by Karlin et al. (1989, 1990a, 1991). Three categories of charge configurations have been investigated. A charge cluster refers to a short (25-75 residues) protein segment with significantly high specific charge content relative to the amino acid composition of the whole protein. In particular, a positive (negative) charge cluster is a segment with high positive (negative) net charge, and a mixed charge cluster is a segment high in charged residues of both signs. An acidic, basic, or mixed charge run signifies a stretch (typically 7-12 residues long) of successive charged residues of the proper sign, allowing for at most one or two intermittent errors. A charge run is said to be statistically significant if the probability of observing the indicated run length is <0.01 in a corresponding random sequence of the same amino acid composition. A residue segment is called a hyper-charge run under the following two conditions: (1) the run is statistically significant relative to its own composition, and (2) the probability of observing a run of this length or longer would be $<10^{-5}$ for a random protein sequence of the same length but with charge frequencies of an average protein (i.e., ~11.5% acidic and ~11.5% basic). Generally, for a protein of 400 amino acids, a pure hyper-charge run would need to exceed eight charged residues. Periodic charge patterns are repetitive arrangements of charged and uncharged amino acids. Rigorous definitions, implementations, and statistical assessments of these charge configurations have been elaborated by Karlin et al. (1990a).
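The charge-run criterion above can be illustrated with a small sketch. It is not the procedure of Karlin et al. (1990a), whose assessments are analytic; here the probability is estimated by Monte Carlo shuffling of the sequence, which preserves its amino acid composition. The function names, the choice of D/E as acidic and K/R as basic (histidine omitted), the number of shuffles, and the toy sequence are all assumptions made for illustration.

```python
import random

ACIDIC = frozenset("DE")  # aspartate, glutamate
BASIC = frozenset("KR")   # lysine, arginine (histidine omitted: an assumption)


def longest_charge_run(seq, charged, max_errors=0):
    """Length of the longest stretch that begins and ends with a residue in
    `charged` and contains at most `max_errors` other residues."""
    best = 0
    left = 0
    errors = 0
    for right, residue in enumerate(seq):
        if residue not in charged:
            errors += 1
        # Shrink the window until it holds at most max_errors mismatches.
        while errors > max_errors:
            if seq[left] not in charged:
                errors -= 1
            left += 1
        if residue in charged:
            # Skip leading mismatches so the run starts on a charged residue.
            start = left
            while seq[start] not in charged:
                start += 1
            best = max(best, right - start + 1)
    return best


def charge_run_p_value(seq, charged, max_errors=0, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that a random sequence of the
    same composition contains a run at least as long as the observed one."""
    rng = random.Random(seed)
    observed = longest_charge_run(seq, charged, max_errors)
    residues = list(seq)
    hits = 0
    for _ in range(trials):
        rng.shuffle(residues)  # same composition, random order
        if longest_charge_run(residues, charged, max_errors) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (hits + 1) / (trials + 1)


# Hypothetical 200-residue sequence with one acidic run of length 10.
toy = "MKT" + "A" * 90 + "DEDEEDDEED" + "G" * 97
run_length, p = charge_run_p_value(toy, ACIDIC)
print(run_length, p)
```

Under this sketch a run would be called significant when the estimated probability falls below 0.01. The second hyper-charge condition (comparison against average charge frequencies of ~11.5%) is not implemented here.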
Our previous studies have revealed several associations between significant charge configurations and species/protein function. Thus, (1) charge clusters are frequent in certain classes of eukaryotic proteins but are scarce in prokaryotic proteins of all kinds (Karlin and Brendel 1988); (2) multiple charge clusters in conjunction with distinctive periodic charge patterns and long DNA repeats characterize in EBV the proteins expressed in the latent state (Blaisdell and Karlin 1988); (3) charge clusters are common in both eukaryotic nuclear transcription and replication factors (Brendel and Karlin 1989a) but are uncommon among cytoplasmic enzymes and housekeeping proteins (Karlin 1990); (4) Drosophila developmental control proteins often contain multiple charge clusters and special charge patterns; and (5) very long anionic and mixed charge runs are prominent among nuclear autoantigens (Brendel et al. 1991).

Differences in charge distribution may correspond to diverse protein folded structure and function. Electrostatic interactions are known to facilitate protein sorting, translocation, orientation, and binding to DNA and other proteins (e.g., see Kalderon et al. 1984; von Heijne 1986; Ma and Ptashne 1987). Charge clusters of opposite sign may contribute to the formation of multiprotein complexes. Charge clusters of like sign may help maintain separation between certain protein assemblages. Charge clusters should increase the solubility of the protein in aqueous media. Multiple charge clusters within one protein might contribute to intramolecular folding or cooperative protein-protein and protein-nucleic acid interactions.

2. Amino Acid Usage and Quantiles

The quantile distribution Q(x) for a residue type for a given set of proteins describes the percent of proteins in which that residue type occurs with frequency ≤ x. For most residue types, the medians (50% quantile) and interquartile range (25%-75%) are quite similar across prokaryote and eukaryote species (Karlin et al. 1991). The 1%, 5%, 95%, and 99% quantile distributional values of amino acids for mammalian species provide standards by which to affirm extremes of amino acid usage for any particular mammalian protein or protein family. For example, cysteine is used, on average, ~1% among Escherichia coli and yeast proteins with a narrow quantile distribution, but is used, on average, >2% among mammalian proteins with a broad quantile distribution.

3. Repetitive Structures

There are several levels and forms. One category of repeats pertains to the aggregate of multiplets comprising all homodipeptides XX (=X2), homotripeptides XXX (=X3), etc., where X denotes any amino acid. A statistical assessment of the counts and locations of these multiplets is attained by comparing the observed multiplet set to the multiplet distribution in a random reconstruction of the protein sequence. The count of multiplets, o, provides a measure of the homopeptide density of the protein sequence. When o is excessively large compared with expectation, a significant multiplet realization is inferred. The significance test is done as follows: Let $f_i$ be the frequency of the ith amino acid in the protein sequence. For a random sequence the probability of seeing at a given position the start of a reiteration of length 2, 3, or 4, etc. of amino acid type i is $f_i^2(1-f_i)^2 + f_i^3(1-f_i)^2 + f_i^4(1-f_i)^2 + \cdots = f_i^2(1-f_i)$. Therefore, the probability of encountering a multiplet (i.e., any homopeptide) at a given position is $f_0 = \sum_i f_i^2(1-f_i)$. When the observed o exceeds $N f_0 + 3\sqrt{N f_0(1-f_0)}$ (three standard deviations above expectation), where N is the sequence length, the protein is deemed to carry a significantly high multiplet count o. It is further possible to discern anomalous characteristics of the spacings distribution induced by multiplets (see next section). There is also interest in proteins containing one or more substantial homopeptides, i.e., a multiplet of the form $X_n$ for some n ≥ 5 (see our discussion of the HCMV proteins, below). It is widely known that many Drosophila developmental proteins contain long homopeptides, particularly of glutamine, asparagine, histidine, alanine, glycine, and proline, the functional role of which is unclear.

Multiple repeated peptides in a protein not restricted to homopolymeric form can also be a distinguishing characteristic among similar proteins. Prominent examples include the proteins of the EBV latent cycle, various eukaryotic extracellular cytoskeletal proteins (e.g., collagens and fibronectins), and the major cofactors of the coagulation cascade. Periodic repeats such as the leucine zipper and its generalizations to other amino acid types can further reflect on mechanisms of oligomerization and amphipathic structures (see Landschulz et al. 1988; Brendel and Karlin 1989b; Karlin et al. 1991).

4. Spacings Distribution

Consider a sequence of length N and a specified word type (amino acid or peptide) or general marker with n occurrences randomly distributed in the sequence. These induce n + 1 spacings $(U_0, U_1, \ldots, U_n)$, where $U_i$ is the distance (number of units) from the ith occurrence of the marker to the (i + 1)st occurrence. Our statistical analysis centers on the extremal spacings $m^* = \min\{U_0, U_1, \ldots, U_n\}$ and $M^* = \max\{U_0, U_1, \ldots, U_n\}$. To detect clustering of sites, we check whether the minimum spacing is significantly small for the postulated random distribution of sites. Similarly, to decide whether some adjacent marker locations are excessively far apart, we check whether the maximum $M^*$ is especially large (for applications of these ideas to the E. coli Kohara physical map data, see Karlin and Macken 1991). The distributional properties of $m^*$ and $M^*$ are classical (e.g., see Feller 1971). The criterion for an extreme minimum at the 0.01 significance level involves the determination of $a^*$ such that the probability that $m^* \ge a^*$ is 0.99. In a parallel way the criterion for a significantly large $M^*$ can be established.

5. Sequence Comparisons and Similarity Scores

We use two methods. The first is a multiple sequence-matching algorithm elaborated by Karlin et al. (1988). A repeat segment is an aggregate of exactly repeated words, each of length ≥ K and separated by error blocks, each of length ≤ e letters. K should be chosen such that $m^K \le n \le m^{K+1}$, where n is the average length of the sequences considered and m is the alphabet size. Then, in a random sequence of the same composition and length, each K word would be expected to occur, at most, m times (Leung et al., accepted). Identification and statistical significance of matching segments follow the analysis of Karlin et al. (1990b). The second method works via pairwise comparisons involving a general score matrix for amino acid matching (for details, see Altschul et al. 1990; Karlin and Altschul 1990). Both procedures take into account the compositional bias of the compared proteins.

Results and Discussion

I. GCR Proteins

Several new members of the GCR family have recently been proposed on the basis of sequence similarities. Thus, Chee et al. (1990b) suggested three open reading frames (ORFs) (US27, US28, and UL33) of HCMV as likely to encode GCRs.
Amino acid segment identities were proffered among US27, US28, and UL33 and with several known or putative GCRs, including human β2-adrenergic receptor (ADR), human serotonin receptor (STR), human MAS oncoprotein (MAS), bovine substance-K receptor (SKR), and bovine rhodopsin (RHOD). Ross et al. (1990) isolated a cDNA clone from a rat thoracic-aorta library encoding a 343-amino-acid protein (i.e., RTA) with sequence similarity to MAS. It is interesting that several experimental criteria applied by these authors argue that RTA is not an angiotensin receptor, a function associated with MAS. Two different endothelin receptors bear sequence similarity to GCR proteins. One of them is specific for endothelin-1 (Arai et al. 1990), whereas the other binds all three endothelins (Sakurai et al. 1990).

FIG. 1. Similarities and differences among representative mammalian G protein-coupled receptors. The sequences are drawn approximately to scale (lengths are indicated on the right). The locations of putative transmembrane segments (boxes 1-7) and of charge clusters (ovals indicating the charge sign: + for basic clusters, - for acidic clusters, and ± for mixed charge clusters; see Karlin et al. 1990a) are shown on the sequence lines. Patterned bars underneath the sequence lines delineate segments of significant similarity between any two of the proteins (different patterns are used for each pair of sequences). Our significance criterion required that the probability of observing an equal matching score (with the PAM-250 amino acid substitution matrix) be <0.01 when random sequences of the same respective lengths and compositions are compared (see Karlin and Altschul 1990).

Significant sequence differences among the GCRs may reflect determinants of ligand or target protein specificities, variations of regulation and transport, or oligomerization mechanisms. We shall focus on ADR, STR, SKR, RHOD, MAS, and RTA as representative GCR sequences. All these proteins are in the upper 10% quantile with respect to the frequency of hydrophobic residues (i.e., <10% of mammalian proteins in the data banks are equally high in the aggregate frequency of L, V, I, F, and M) and are in the lower 10% quantile in the aggregate of charged residues. Such a compositional bias can easily interfere with proper assessments of sequence similarities. Using the statistical methods for establishing pairwise and multiple sequence subalignments (see Methods), we found that all pairs of the representative GCRs share one or more similarity segments, with the exceptions of (a) MAS, which has no region of significant similarity to either STR or RHOD (also, the MAS-ADR comparison reveals only a short segment of similarity, and the MAS-SKR comparison aligns MAS transmembrane segment 6 with SKR transmembrane segment 5, out of order; see fig. 1), and (b) RTA, which only matches (very strongly) with MAS. Less than half of the comparisons involve two or more separate similarity segments. Most of the pairwise similarities and the significant multiple alignments are confined to transmembrane domains, while the N- and C-termini and the third cytoplasmic loop are generally distinct and often significantly distinct (see below). US27 and US28 show mutual similarity over the seven putative transmembrane segments, whereas UL33 only has a short N-terminal piece of similarity with US27 and no similarity to US28 (fig. 2).
Of the 18 comparisons between the three cytomegalovirus sequences and the six representative GCRs, only eight yielded significant similarity matches, and only one of these comparisons involved more than a single matching segment. In detail the similarities between the sequences are of quite varied extent and location (figs. 1 and 2).

FIG. 2. Similarities and differences among putative HCMV G protein-coupled receptors. Symbols are as in fig. 1. Labeled boxes indicate segments of significant similarity to the mammalian GCRs for which the corresponding segments occur in analogous positions with respect to the transmembrane domains (see fig. 1).

There are salient differences in the nonsimilar regions. This is particularly manifest in the distribution of charged residues within each sequence. We have ascertained charge clusters (see Methods) in US27, ADR, STR, MAS, and SKR but none in any of the other sequences (figs. 1 and 2). In US27, MAS, and SKR a mixed charge cluster occurs at the C-terminus, while ADR and STR feature a mixed charge cluster in the third cytoplasmic loop. In MAS the charge cluster involves 11 basic and four acidic residues in a stretch of 28 residues. RTA at the same location has only six basic and five acidic residues, a concentration which is statistically not significant. SKR has a second (basic) charge cluster in the third cytoplasmic loop. The four human opsins, like their sequenced bovine counterpart (i.e., RHOD), are devoid of distinguished charge configurations, while the Drosophila opsins, like ADR and STR, have a mixed charge cluster in the third cytoplasmic loop (data not shown). This charged region is the most conserved segment among the Drosophila opsins and is the site of interaction with cytoplasmic proteins (Applebury and Hargrave 1986).
The only other amino acid cluster in all these proteins occurs with ADR, a proline cluster (positions 269-292) of 14 prolines in a segment of 24 residues, including the iterated pattern (PX)9.

The considered GCR sequences are unusual in amino acid usage in several respects, relative to a set of some 550 distinct human protein sequences. The GCRs are beyond the 90% quantile point for hydrophobic residues (i.e., >90% of the control set displays a lower fraction of hydrophobic residues) and below the 10% quantile point for acidic residues and total charge. MAS, RHOD, US27, and US28 are particularly rich (above the 95% quantile points) in tyrosine; ADR, MAS, SKR, RHOD, and US28 in phenylalanine; and MAS, RHOD, US27, and US28 in valine. UL33 alone is high in threonine (in fact, above the 99% quantile point). MAS and US28 are in the low 5% quantile with respect to glutamine and glycine usage. Leucine, although generally the most abundant residue, is not at an extreme quantile for any of the sequences.

The two endothelin receptors are very similar to each other over most of their lengths (not including some 80 amino-terminal and about 40 carboxy-terminal residues) and share one or two segments of similarity with all the mammalian GCRs, with the exception of RTA. No anomalies are evident in their charge distributions. One of the receptors (Arai et al. 1990) contains a run of four consecutive cysteines in the C-terminal region, a pattern not found in any other protein in the databases.

In summary, the GCRs are mainly conserved in the transmembrane regions (the similarity is, to a great extent, consequent on the strong compositional bias toward hydrophobic residues), while the third cytoplasmic loop and C-terminus are generally discriminating in charge distribution (presumably in relation with target protein specificities).
The N-terminus and the second extracellular loop are mostly neither conserved nor imbued with statistically significant sequence differences.

II. Herpesvirus Homologues

A high proportion of the homologous genes shared by the α-, β-, and γ-herpesviruses show significantly different charge residue distributions. Specifically, among the 26 genes that exhibit detectable sequence similarities between HSV-1, HCMV, and EBV (Chee et al. 1990a) (HSV1 is chosen as representative of the α-subgroup), 14 triplets include at least one gene product that carries a significant charge cluster, but presence, type, and/or location of the charge clusters are in no case invariant over all three homologues. Below we compare charge configurations in three principal examples from different function classes: (1) the large subunit of the enzyme ribonucleotide reductase, (2) the HSV-1 transactivator protein UL54 (IE63) and its homologues, and (3) the putative gene products related to HSV-1 UL10, speculated to be membrane associated. These examples also represent different levels of sequence similarity in the conserved regions, from highly significant in ribonucleotide reductase to borderline in the transactivator proteins, with the UL10 group intermediate.

The large subunit of the ribonucleotide reductase (RNR1) of HSV-1 is ~300 amino acids longer than its counterpart in EBV (BORF2), from which it is further distinguished by a negative charge cluster near the N-terminus (fig. 3A). The corresponding protein in HCMV (UL45) is intermediate in size but resembles HSV-1 in also having an N-terminal negative charge cluster. Experimental studies have shown that the N-terminal domain of HSV-2 RNR1 (which is very similar to HSV-1 RNR1) has cell-transforming activity (Jones et al. 1986).
The HCMV homologue is poorly conserved if the similarity between the HSV-1 and EBV proteins is taken as a standard (the evolutionary distances between the three viruses are approximately the same). Moreover, two sequence motifs found among ribonucleotide reductases are corrupted in the HCMV homologue: GXGX2G, presumed to interact with the substrate (Nikas et al. 1986), and GX2NSX3AXMP, proposed as a diagnostic signature of this enzyme subunit (Bairoch 1990). These findings, together with the observation that the HCMV genome lacks the small subunit of ribonucleotide reductase, suggest that the UL45 gene product has lost its original function. The shared charge cluster with the HSV-1 and HSV-2 counterparts perhaps provides a different activity in virus multiplication and host interactions.

FIG. 3. Occurrence of charge clusters in three groups of homologous proteins from human herpesviruses HSV-1, EBV, and HCMV. Symbols are as in fig. 1. Also indicated are the locations of homopeptides and of repetitive patterns. The locations of segments of significant sequence similarity within each group (for definition, see legend to fig. 1) are shown beneath the sequence lines.
[Panel A: HCMV UL45, EBV BORF2, HSV-1 UL39. Panel B: HCMV UL69, EBV BMLF1, HSV-1 UL54. Panel C: EBV BBRF3, HSV-1 UL10, HCMV UL100.]

The HSV-1 UL54 protein and its homologues in the other herpesviruses display strong divergence in charge configurations (fig. 3B). HSV-1 UL54 carries a negative charge cluster in the N-terminal quartile. Its counterpart in EBV (BMLF1, the major transactivator protein of the lytic cycle) has two distinct charge clusters in the N-terminal quartile, one of negative sign followed closely by a mixed cluster.
This protein also features a striking periodic arginine pattern (RXZ)10, where Z is predominantly proline (eight times) and X is mostly alanine (six times). The proline residues make it likely that this segment is part of an open coil. This unusual periodic pattern may underlie a novel three-dimensional structure associated with a special regulatory function. The (RXZ)10 sequence is immediately followed by seven arginines at displacements of three or four (average 3.43), without intervening prolines, that could easily form an α-helix with a line of positive charge on one side. The corresponding HCMV protein is completely different, featuring a negative charge cluster at the other end in the C-terminal quartile. In addition, it has several significant homopeptides of specific uncharged residues, including iterations of five threonines and nine prolines. The UL54 polypeptide of HSV-1, like EBV BMLF1, is known to be a potent transcriptional activator, a class of regulatory proteins that is generally rich in charge configurations (Brendel and Karlin 1989a). The foregoing contrasts suggest that these homologues of the general herpesvirus protein inventory might contribute to some of the physiological differences between the three subgroups.

The third example consists of three homologous proteins of unknown function: HSV-1 UL10, EBV BBRF3, and HCMV UL100. The high content of hydrophobic residues and a potential signal peptide sequence in HSV UL10 gave rise to the speculation that the corresponding gene products are membrane associated (McGeoch et al. 1988). All three proteins contain one or more charge clusters in the C-terminal half of the sequence (fig. 3C).
The HSV-1 homologue has a positive charge cluster and, sequentially, two distinct negative charge clusters; the EBV homologue has a skewed mixed charge cluster, involving five acidic followed by 11 basic residues, over a length of 42; the HCMV homologue has a mixed charge cluster with an excess of acidic residues, including an exceptionally long run of length 10 at the C-terminus of the cluster. In all three viruses, the charge clusters are statistically highly significant.

Table 1
Global Statistics on Proteins with Distinctive Features in Human Herpesvirus Genomes

Feature                                                HCMV      HSV-1     VZV       EBV
No. of proteins (a)                                    142       70        67        84
No. of cysteine triplets (b)                           5 (5)     1 (1)     0 (0)     1 (1)
No. of cysteine doublets (b)                           53 (35)   14 (13)   15 (13)   13 (13)
No. of homopeptides (b, c)                             58 (35)   13 (9)    13 (5)    15 (14)
No. of proteins with significant multiplets            14        2         1         3
No. of hypercharge runs (b)                            8 (7)     1 (1)     0 (0)     0 (0)
No. of proteins with single/multiple charge clusters   32/8      16/3      7/1       11/7 (d)

(a) ORFs < 140 codons were excluded from the HCMV protein set.
(b) Numbers in parentheses are the number of proteins with at least one occurrence of the specified feature (e.g., HCMV contains 53 cysteine doublets in a total of 35 proteins).
(c) Runs of five or more identical amino acids.
(d) Six of seven EBV proteins with multiple charge clusters are expressed only in the latent state (Blaisdell and Karlin 1988).

A number of additional examples of contrasts in the charge configurations of homologous herpesvirus proteins can be added to this list. The DNA polymerase has in HSV-1 a negative charge cluster (conserved in VZV) missing in EBV and HCMV. The replication gene BSLF1 of EBV is lacking a negative charge cluster that occurs in the other two homologues of this well-conserved protein.
Different types of charge clusters occur in the huge virion proteins of HSV and EBV, while corresponding configurations are absent in the homologous protein of HCMV. The strongly conserved triplet UL25 (HSV-1), BVRF1 (EBV), UL77 (HCMV) exhibits a mixed charge-cluster polymorphism near the N-terminus, HSV-1 not carrying this feature. HCMV UL50 has a negative charge cluster in the C-terminal region, while its putative counterpart in EBV, BFRF1, has a positively biased mixed charge cluster at the corresponding location. As in the three main examples, these differences are potential indicators of functional divergence.

The combined analysis of similarities and dissimilarities can be expanded to deal with complete genomes of homologous organisms. Such studies may reveal global evolutionary trends associated with speciation and radiation within higher-order taxa. In the following analysis of four human herpesvirus genomes, we will show that HCMV is in many respects an outlier within the herpesvirus realm, a conclusion that could not have been reached by evaluating sequence similarity alone. The HCMV genome is among the largest of herpesviruses, with more than twice as many ORFs as compared with other herpesviruses. The HCMV proteins tend to be much richer in distinctive sequence features of the kind associated with differentiated functions, as suggested in the previous examples. Table 1 displays several comparative analyses for the aggregate protein sets of human herpesviruses. Note the abundance of cysteine doublets and triplets in HCMV (cysteine triplets are generally scarce, not only in herpesviruses of the α- and γ-subgroups but also in cellular organisms: there are three occurrences in 748 distinct human proteins, and there is one in 856 Escherichia coli proteins). The HCMV proteins also carry a plethora of homopeptides and
more than a dozen polypeptides containing significantly many multiplets (see table 1). Perhaps the most striking finding is the occurrence of eight hyper-charge runs in a total of 142 HCMV proteins, as compared with only one in 221 pooled proteins from the other three genomes. A similar contrast, though less extreme, is provided by the numbers of proteins having charge clusters.

What could be the biological significance of the high incidence of distinctive sequence features in HCMV? We propose that it reflects on a correspondingly high proportion of specialized proteins in the genome, accounting for the broad tissue- and cell-type range of this species, especially in the latent state. The discovery of gene families in HCMV further supports this notion, suggesting the existence of tissue-specific variants for certain protein functions as a coadaptation to a corresponding diversity in the host genome. It is also compatible with the narrow species range of cytomegaloviruses, because functionally differentiated protein sequences are generally subject to more rapid evolutionary change than are those carrying out housekeeping functions. In summary, the frequent occurrence of distinctive sequence configurations in HCMV reflects its survival and propagation strategy: maintenance of a high infection rate through diversified mechanisms of latency rather than through rapid lytic multiplication.

Homologous proteins among herpesviruses highlight both similar and dissimilar sequence features. Distinctions with respect to charge distribution are paramount in the terminal portions of the sequences. The striking pattern (RXZ)₁₀ (R = arginine, X = mostly alanine, and Z = predominantly proline), present among herpesviruses only in the potent transactivator gene product BMLF1 of EBV, may portend a novel functional three-dimensional structure.
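The periodic arginine pattern summarized above places arginine at every third position. A minimal sketch of how such a period-3 repeat of a single residue could be located in a one-letter sequence string follows; the function `longest_periodic_run`, its parameters, and the toy sequence are hypothetical illustrations. The sketch reports only the longest repeat and makes no statement about its statistical significance.

```python
def longest_periodic_run(sequence, residue="R", period=3):
    """Find the longest stretch in which `residue` recurs every `period`
    positions. Returns (start, repeats); start is 0-based, or -1 when the
    residue does not occur at all."""
    best_start, best_repeats = -1, 0
    for start in range(len(sequence)):
        if sequence[start] != residue:
            continue
        # Skip positions that lie inside a run that began earlier.
        if start >= period and sequence[start - period] == residue:
            continue
        repeats = 0
        pos = start
        while pos < len(sequence) and sequence[pos] == residue:
            repeats += 1
            pos += period
        if repeats > best_repeats:
            best_start, best_repeats = start, repeats
    return best_start, best_repeats

# Toy sequence (hypothetical): six RAP units followed by RGP and RAS.
toy = "MS" + "RAP" * 6 + "RGPRAS" + "LLK"

start, repeats = longest_periodic_run(toy)
print(start, repeats)
```

The intervening positions are deliberately left unconstrained, which mirrors the description of X and Z as only predominantly alanine and proline.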
A global analysis of herpesvirus genomes reveals an unusually high incidence of distinctive sequence features in HCMV, probably related to diversified mechanisms of latency.

III. NF1, GAP, and IRA1

It is somewhat surprising that the similarities between the human NF1 gene and the yeast IRA1 protein sequence are stronger and extend well beyond the region of similarities between NF1 and GAP. How do NF1, GAP, and IRA1 compare or differ in other sequence attributes? IRA1 and GAP both have several significant homopeptides and a significantly long stretch of uncharged residues in their N-terminal regions (fig. 4), whereas corresponding sequence structures are absent from NF1. However, the complete sequence of the N-terminal end of NF1 is not yet available. The homopeptides and uncharged regions emphasize different residue types in the two proteins. GAP contains substantial runs of alanine and proline, while the corresponding homopeptides of IRA1 consist of serine and asparagine. Protein-specific amino acid preferences are manifest in the charge-free segments where GAP carries an excess of the small nonpolar residues glycine and alanine, whereas IRA1 contains an abundance of serine and threonine. The latter also has a significant periodic charge pattern (KXX)s (the X components include two prolines) upstream from the conserved region. The GAP protein, on the other hand, exhibits a significant positive charge cluster in its N-terminal domain and contains over its entire sequence a statistically high number of amino acid doublets and triplets.

FIG. 4. Similarities and differences among yeast IRA1, human NF1, and bovine GAP. The regions of high (black) and weak (stippled) similarity are according to Xu et al. (1990). Also indicated are the locations of homopeptides, of a basic charge cluster in GAP, and of long uncharged regions in IRA1 (S/T-rich) and GAP (A/G-rich).

The sequence features described above fall into a region where no similarity has been documented by sequence-alignment methods and therefore might reflect functional or structural characteristics that require compositional properties rather than a precise arrangement of certain residues. The similarly positioned runs and uncharged regions involving different amino acids suggest two possible roles (not exclusive), depending on whether the common aspects or the differences are paramount. Concerning the homopeptides, for instance, one might propose either (1) that they assume functionally equivalent extended structures by virtue of being repetitions of identical elements or, alternatively, (2) that they mediate interactions with diverse subcellular targets, through the different physicochemical properties of their building blocks.

Closer to the catalytic domain but still on its N-terminal side, IRA1 carries a run of five isoleucines, a phenomenon which is extremely rare in nature (only four independent examples are in the current NBRF data base). This occurrence falls into the region where the sequence has been aligned with NF1 (but not with GAP). The strongly hydrophobic character of this short segment is not conserved in the human protein. It is of note that all the significant homopeptides reported (with the exception of A5 in GAP) derive from varied synonymous codon usage and thus do not reflect simple repeats on the DNA level.

A different picture emerges from a similar analysis of the C-terminal domains. The C-terminus of GAP falls immediately after the conserved catalytic domain, rendering this protein less than half the size of the other two (fig. 4). NF1 and IRA1 display similarity over an additional 800 amino acids covering most of their C-terminal regions. Only at the very ends do sequence-alignment procedures fail to detect similarity. The sequences carboxyl from the catalytic domain are amorphous both in the types of sequence features discussed above and in those seen in the previous examples. Over a cumulative length of ~2,300 amino acids, there is not a significant run, cluster, or repetitive pattern to report.

The human NF1, bovine GAP, and yeast IRA1 proteins are similar over the catalytic domain. However, GAP and IRA1 involve significant differences in the N-terminal third, emphasizing a surfeit of amino acid doublets and other repetitive structures.

IV. Globin Families

There is a vast literature on globin-protein primary sequences and crystal structures from a wide range of phylogeny (e.g., see Dickerson and Geis 1983). Moreover, complete α- and β-globin DNA sequences (exons, introns, and flanks) are available for many mammalian, avian, amphibian, and fish species. Several conserved structural and functional features of the underlying DNA and of the hemoglobin molecule are well established (e.g., see Lesk and Chothia 1980; Dickerson and Geis 1983). Goodman (1981) has studied, in the quaternary structure of α- and β-globins, the correlation between the degree of amino acid conservation and the functional type of amino acid site. He found that, for mammals, amino acids contacting the heme and those participating in the cooperativity that facilitates oxygen transportation, movable interchain [α-β (CJ) sites] contacts, and DPG (diphosphoglycerate) contacts are most strongly and approximately equally conserved. The fixed contacts α-β (x sites), the interhelix positions, and the otherwise unclassified sites are relatively and approximately equally variable.
The strongest similarity at the DNA level occurs about the exon-intron interface (see Karlin and Ghandour 1985). If the Dayhoff (1978, p. 230) alignments for the globins are adopted, degree of conservation across species and between multigenes [e.g., human (β, δ, γ, and ε) and (α and ζ)] can be classified as no changes (codons conserved), some silent substitutions, one replacement substitution, and multiple replacements. In this classification at the DNA level, the strength of conservation (the same for α- and β-globins), in decreasing order, is exon-intron interfaces > heme > movable β-α contact sites > buried sites > fixed α-β contact sites > interhelix contacts > others (data not shown). In the primary sequences the most variable region is around the N-terminus, and the next most variable is at the C-terminus; this is true even for such close phylogenetic species as human versus mouse.

By contrast to the GCRs, the herpesvirus ORF homologues, and the NF1-IRA pairing, there appear to be no functionally differentiated regions among the globins, in the sense of their having distinctive amino acid sequence features. In particular, with respect to the α- and β-type globins, across and within species, our sequence analysis revealed no significant charge configurations of any kind, no distinguished repetitive sequence structures, and no unusual spacings of any amino acid types (data not shown). The histidine frequencies are uniformly high, and arginine tends to be low in most globin proteins; however, it is difficult to reliably discriminate extremes in amino acid usages in relatively small proteins such as globins. Although there are no functionally differentiated sequences across the globin proteins, there are well-known differences in temporal expression (fetal vs. adult). But this appears to be under separate genetic and environmental controls on transcription and expression rather than due to differences within the protein sequences (Dickerson and Geis 1983).
Perspectives

Sequence similarity can shed light on evolutionary relationships and on common structure and function. Molecular biologists commonly search a large sequence data base for similarity to a query sequence. Important considerations concern the sensitivity of the search and the ability to assess statistical significance of observed matches. A formidable statistical problem arises because of the numerous comparisons inherent in such a search. With multiple comparisons, the probability of a chance match increases. The level of acceptance must accordingly be adjusted. Conservative statistical methods (e.g., the Bonferroni inequalities, maximal range procedures, and nonparametric methods; see Miller 1980) are available to deal with this problem. However, the difficulties of multiple comparisons grow concomitantly with the accelerating growth of the data bases.

While significant similarity points to evolutionarily conserved domains, significant dissimilarity may help to delineate differentiated regions that serve functions unique to a particular member of a family of homologous proteins. Thus, the GCRs discussed above are similar in the transmembrane domains but very different with respect to charge distribution in the third cytoplasmic loop and in the cytoplasmic C-terminus, a fact presumably reflecting their target specificities. For the NF1/IRA1/GAP comparison, similarity is confined to the catalytic domain, the C-terminal sequences of NF1 and IRA1 appear to be unconstrained, and the N-terminal sequences highlight several distinguishing sequence patterns. We suggest that varied statistical sequence features (such as charge configurations, segments of unusual composition, runs and periodic patterns of certain amino acids, etc.) locate attractive breakpoints for interdomain swapping experiments and generally should promote appreciation of idiosyncratic attributes of individual members of protein families.
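The adjustment of the acceptance level for multiple comparisons can be made concrete with the Bonferroni inequality mentioned above. The following is a minimal sketch, assuming a hypothetical search that compares one query against n data base sequences at a desired overall significance level alpha; the function names and the numbers are illustrative and are not taken from the analyses reported here.

```python
def bonferroni_threshold(alpha, n_comparisons):
    """Per-comparison significance level that keeps the overall chance of
    at least one false positive at or below alpha (Bonferroni inequality)."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons

def is_significant(p_value, alpha, n_comparisons):
    """True if a single match remains significant after the correction."""
    return p_value <= bonferroni_threshold(alpha, n_comparisons)

# Hypothetical search: one query against 20,000 data base sequences.
alpha = 0.01
n = 20_000

print(bonferroni_threshold(alpha, n))   # per-comparison threshold
print(is_significant(1e-8, alpha, n))   # a very strong match
print(is_significant(1e-5, alpha, n))   # nominally small, yet not sufficient
```

Because the threshold is alpha divided by the number of comparisons, it shrinks in direct proportion to the size of the data base, which is one way to see why the problem sharpens as data bases grow.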
Acknowledgments

We express our thanks to Drs. B. E. Blaisdell, A. Campbell, E. S. Mocarski, and L. Stryer for discussions and useful comments on the manuscript.

LITERATURE CITED

ALTSCHUL, S. F., W. GISH, W. MILLER, E. W. MYERS, and D. J. LIPMAN. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.
APPLEBURY, M. L., and P. A. HARGRAVE. 1986. Molecular biology of the visual pigments. Vision Res. 26:1881-1895.
ARAI, H., S. HORI, I. ARAMORI, H. OHKUBO, and S. NAKANISHI. 1990. Cloning and expression of a cDNA encoding an endothelin receptor. Nature 348:730-732.
BAER, R., A. T. BANKIER, M. D. BIGGIN, P. L. DEININGER, P. J. FARRELL, T. J. GIBSON, G. HATFULL, G. S. HUDSON, S. C. SATCHWELL, C. SÉGUIN, P. S. TUFFNELL, and B. G. BARRELL. 1984. DNA sequence and expression of the B95-8 Epstein-Barr virus genome. Nature 310:207-211.
BAIROCH, A. 1990. PROSITE: a dictionary of protein sites and patterns, 5th release. University of Geneva, Geneva.
BALLESTER, R. T., T. MICHAELI, K. FERGUSON, H.-P. XU, F. MCCORMICK, and M. WIGLER. 1989. Genetic analysis of mammalian GAP expressed in yeast. Cell 59:681-686.
BLAISDELL, B. E., and S. KARLIN. 1988. Distinctive charge configurations in proteins of the Epstein-Barr virus and possible functions. Proc. Natl. Acad. Sci. USA 85:6637-6641.
BRENDEL, V., and S. KARLIN. 1989a. Association of charge clusters with functional domains of cellular transcription factors. Proc. Natl. Acad. Sci. USA 86:5698-5702.
BRENDEL, V., and S. KARLIN. 1989b. Too many leucine zippers? Nature 341:574-575.
BRENDEL, V., J. DOHLMAN, B. E. BLAISDELL, and S. KARLIN. 1991. Very long charge runs in systemic lupus erythematosus-associated autoantigens. Proc. Natl. Acad. Sci. USA 88:1536-1540.
CHEE, M. S., A. T. BANKIER, S. BECK, R. BOHNI, C. M. BROWN, R. CERNY, T. HORSNELL, C. A. HUTCHISON III, T. KOUZARIDES, J. A. MARTIGNETTI, E. PREDDIE, S. C. SATCHWELL, P. TOMLINSON, K. M. WESTON, and B. G. BARRELL. 1990a. Analysis of the protein-coding content of the sequence of human cytomegalovirus strain AD169. Curr. Top. Microbiol. Immunol. 154:125-169.
CHEE, M. S., and B. G. BARRELL. 1990. Herpesviruses: a study of parts. Trends Genet. 6:86-91.
CHEE, M. S., S. C. SATCHWELL, E. PREDDIE, K. M. WESTON, and B. G. BARRELL. 1990b. Human cytomegalovirus encodes three G protein-coupled receptor homologues. Nature 344:774-777.
DAVISON, A. J., and J. E. SCOTT. 1986. The complete DNA sequence of varicella-zoster virus. J. Gen. Virol. 67:1759-1816.
DAYHOFF, M. O. 1978. Atlas of protein sequence and structure, vol. 5, suppl. 3. National Biomedical Research Foundation, Washington, D.C.
DICKERSON, R. E., and I. GEIS. 1983. Hemoglobin: structure, function, evolution, and pathology. Benjamin/Cummings, Menlo Park, Calif.
FELLER, W. 1971. An introduction to probability theory and its applications. Vol. 2, 2d ed. Wiley, New York.
GILLESPIE, J. H. 1986. Rates of molecular evolution. Annu. Rev. Ecol. Syst. 17:637-665.
GOODMAN, M. 1981. Decoding the pattern of protein evolution. Prog. Biophys. Mol. Biol. 38:105-164.
HUGHES, A. L., T. OTA, and M. NEI. 1990. Positive Darwinian selection promotes charge profile diversity in the antigen-binding cleft of class I major-histocompatibility-complex molecules. Mol. Biol. Evol. 7:515-524.
JACKSON, T. R. 1990. Cell surface receptors of nucleosides, nucleotides, amino acids and amino neurotransmitters. Curr. Opinion Cell Biol. 2:167-173.
KALDERON, D., B. L. ROBERTS, W. D. RICHARDSON, and A. E. SMITH. 1984. A short amino acid sequence able to specify nuclear location. Cell 39:499-509.
KARLIN, S. 1990. Distribution of clusters of charged amino acid in protein sequences. Pp. 171-180 in R. H. SARMA and M. H. SARMA, eds. Proceedings of the Sixth Conversation in Biomolecular Stereodynamics: Structure & Methods, vol. 2: DNA protein complexes and proteins. Adenine, Albany, N.Y.
KARLIN, S., and S. F. ALTSCHUL. 1990. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA 87:2264-2268.
KARLIN, S., B. E. BLAISDELL, and V. BRENDEL. 1990a. Identification of significant sequence patterns in proteins. Methods Enzymol. 183:388-402.
KARLIN, S., B. E. BLAISDELL, E. S. MOCARSKI, and V. BRENDEL. 1989. A method to identify distinctive charge configurations in protein sequences, with an application to human herpesvirus polypeptides. J. Mol. Biol. 205:165-177.
KARLIN, S., and V. BRENDEL. 1988. Charge configurations in viral proteins. Proc. Natl. Acad. Sci. USA 85:9396-9400.
KARLIN, S., V. BRENDEL, P. BUCHER, and S. F. ALTSCHUL. 1991. Statistical methods and insights for protein and DNA sequences. Annu. Rev. Biophys. Biophys. Chem. 20:175-203.
KARLIN, S., and G. GHANDOUR. 1985. Comparative statistics for DNA and protein sequences: single sequence analysis. Proc. Natl. Acad. Sci. USA 82:5800-5804.
KARLIN, S., and C. MACKEN. 1991. Some statistical problems in the assessment of inhomogeneities of DNA sequence data. J. Am. Stat. Assoc. 86:27-35.
KARLIN, S., M. MORRIS, G. GHANDOUR, and M.-Y. LEUNG. 1988. Algorithms for identifying local molecular sequence features. Comput. Appl. Biosci. 4:41-51.
KARLIN, S., F. OST, and B. E. BLAISDELL. 1990b. Patterns in DNA and amino acid sequences and their statistical significance. Pp. 133-157 in M. S. WATERMAN, ed. Mathematical methods for DNA sequences. CRC, Boca Raton, Fla.
KIMURA, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, Mass.
LANDSCHULZ, W. H., P. F. JOHNSON, and S. L. MCKNIGHT. 1988. The leucine zipper: a hypothetical structure common to a new class of DNA binding proteins. Science 240:1759-1764.
LESK, A. M., and C. CHOTHIA. 1980. How different amino acid sequences determine similar protein structures: the structure and evolutionary dynamics of the globins. J. Mol. Biol. 136:225-270.
LEUNG, M. L., B. E. BLAISDELL, C. BURG, and S. KARLIN. An efficient algorithm for identifying matches with errors in multiple long sequences. J. Mol. Biol. (accepted).
MA, J., and M. PTASHNE. 1987. Deletion analysis of GAL4 defines two transcriptional activating segments. Cell 48:847-853.
MCGEOCH, D. J., M. A. DALRYMPLE, A. J. DAVISON, A. DOLAN, M. C. FRAME, D. MCNAB, L. J. PERRY, J. E. SCOTT, and P. TAYLOR. 1988. The complete nucleotide sequence of the long unique region of the genome of herpes simplex virus type 1. J. Gen. Virol. 69:1531-1574.
MILLER, R. G. 1980. Simultaneous statistical inference. Springer, New York.
NIKAS, I., J. MCLAUCHLAN, A. J. DAVISON, W. R. TAYLOR, and J. B. CLEMENTS. 1986. Structural features of ribonucleotide reductase. Protein Structure Funct. Genet. 1:376-384.
PARSON, J. T. 1990. Closing the GAP in a signal transduction pathway. Trends Genet. 6:169-171.
ROIZMAN, B. 1990. Herpesviridae: a brief introduction. Pp. 1287-1293 in B. N. FIELDS and D. M. KNIPE, eds. Virology. 2d ed. Raven, New York.
ROSS, P. C., R. A. FIGLER, M. H. CORJAY, C. M. BARBER, N. ADAM, D. R. HARCUS, and K. R. LYNCH. 1990. RTA, a candidate G protein-coupled receptor: cloning, sequencing, and tissue distribution. Proc. Natl. Acad. Sci. USA 87:3052-3056.
SAKURAI, T., M. YANAGISAWA, Y. TAKUWA, H. MIYAZAKI, S. KIMURA, K. GOTO, and T. MASAKI. 1990. Cloning of a cDNA encoding a non-isopeptide-selective subtype of the endothelin receptor. Nature 348:732-735.
TANAKA, K., K. MATSUMOTO, and A. TOH-E. 1989. IRA1, an inhibitory regulator of the RAS-cyclic AMP pathway in Saccharomyces cerevisiae. Mol. Cell. Biol. 9:757-768.
VON HEIJNE, G. 1986. Net N-C charge imbalance may be important for signal sequence function in bacteria. J. Mol. Biol. 192:287-290.
WILSON, A. C., S. S. CARLSON, and T. J. WHITE. 1977. Biochemical evolution. Annu. Rev. Biochem. 46:573-639.
XU, G., P. O'CONNELL, D. VISKOCHIL, R. CAWTHON, M. ROBERTSON, M. CULVER, D. DUNN, J. STEVENS, R. GESTELAND, R. WHITE, and R. WEISS. 1990. The neurofibromatosis type 1 gene encodes a protein related to GAP. Cell 62:599-608.
ZUCKERKANDL, E., and L. PAULING. 1962. Molecular disease, evolution, and genic heterogeneity. Pp. 189-225 in M. KASHA and B. PULLMAN, eds. Horizons in biochemistry. Academic Press, New York.

MASATOSHI NEI, reviewing editor
Received February 25, 1991; revision received May 16, 1991
Accepted May 20, 1991