Papers by William Pearson
Nucleic Acids Research, 2011
MACiE (which stands for Mechanism, Annotation and Classification in Enzymes) is a database of enzyme reaction mechanisms, and can be accessed from http://www.ebi.ac.uk/thornton-srv/databases/MACiE/. This article presents the release of Version 3 of MACiE, which not only extends the dataset to 335 entries, covering 182 of the EC sub-subclasses with a crystal structure available (90%), but also incorporates greater chemical and structural detail. This version of MACiE represents a shift in emphasis for new entries, from nonhomologous representatives covering EC reaction space to enzymes with mechanisms of interest to our users and collaborators with a view to exploring the chemical diversity of life. We present new tools for exploring the data in MACiE and comparing entries as well as new analyses of the data and new searches, many of which can now be accessed via dedicated Perl scripts.
Nature, 2004
incubated with 20 ml Sepharose-immobilized monoclonal anti-HA antibodies (Covance). Beads were washed with buffer T without BSA before elution. DSP-induced crosslinks in eluted proteins were thiol-cleaved before separation by 8-16% SDS-PAGE and detection by Sypro Ruby (Bio-Rad). Quinone analyses Lipid extractions and quinone detection were performed as described 19. Hydrogenosomes (5.5 mg protein) were extracted and resuspended in 150 ml 9:1 methanol/ethanol, of which 50 ml was injected onto an HPLC system linked to an ECD. Sequence analyses Accession numbers for sequences used to reconstruct NuoF and NuoE phylogenies are listed in Supplementary Tables 2 and 3. NuoF sequences were aligned with CLUSTALX. NuoE sequences were aligned with Wisconsin Package Version 10.2 programs (Genetics Computer Group). A profile hidden Markov model (HMM) was built from Escherichia coli, Neurospora crassa, Bos taurus, Paracoccus denitrificans and Thermus thermophilus sequences with HMMBUILD. Additional sequences were aligned to the profile with HMMALIGN. Both alignments were edited to remove C- and N-terminal extensions. Analyses of NuoF and NuoE evolution were performed with MRBAYES 30 with the JTT amino-acid substitution model and with two Markov chains Monte Carlo. Chains were run for 100,000 generations, with sampling every 50 generations. The first 5,000 generations were discarded as burn-in. Consensus trees satisfying the more than 50% majority rule were drawn with Treeview, and probabilities of branch partitions were calculated.
Nature, 2004
A single wavelength anomalous diffraction experiment was carried out at the Cu Kα wavelength on crystals cryoprotected in mother liquor plus 25% 2-methyl-2,4-pentanediol; clear diffraction was observed to at least 1.8 Å resolution. We indexed, integrated and scaled the data using D*TREK 25, identified heavy atom sites with SOLVE 26, and carried out refinement with CNS 27 to obtain a model with final Rxtal and Rfree values of 17.8% and 22.8%, respectively. The model contains all RNA atoms except A82, for which electron density was not observed, along with the hypoxanthine ligand. For additional experimental details and references, see Supplementary Information.
Trends in pharmacological …, 1991
Three 'alpha 1-adrenoceptors' and three 'alpha 2-adrenoceptors' have now been cloned. How closely do these receptors match the native receptors that have been identified pharmacologically? What are the properties of these receptors, and how do they relate to other members of the cationic amine receptor family? Kevin Lynch and his colleagues discuss these questions in this review.
Journal of Molecular Biology, 1999
The relationship between sequence similarity and structural similarity has been examined in 36 protein families with five or more diverse members whose structures are known. The structural similarity within a family (as determined with the DALI structure comparison program) is linearly related to sequence similarity (as determined by a Smith-Waterman search of the protein sequences in the structure database). The correlation between structural similarity and sequence similarity is very high; 18 of the 36 families had linear correlation coefficients r ≥ 0.878, and only nine had correlation coefficients r ≤ 0.815. Inclusion of higher-order terms in the structure/sequence relationship improved the fit by less than 7% in 27 of the 36 families. Differences in sequence/structure correlations are distributed evenly among the four protein structural classes, α, β, α/β, and α+β. While most protein families show high correlations between sequence similarity and structural similarity, the amount of structural change per sequence change, i.e. the structural mutation sensitivity, varies almost fourfold. Protein families with high and low structural mutation sensitivity are distributed evenly among protein structure classes. In addition, we did not detect strong correlations between structural mutation sensitivity and either protein family mutation rates or protein size. Our results are more consistent with models of protein structure that encode a protein family's fold throughout the protein sequence, and not just in a few critical residues.
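As a rough illustration of the kind of per-family analysis this abstract describes, the sketch below fits a straight line to paired similarity scores and reports the slope together with the linear correlation coefficient r. The function name `linear_fit` and the example numbers are hypothetical, and treating the slope as a stand-in for "structural mutation sensitivity" is an assumption made here for illustration, not the paper's stated procedure.

```python
import math

def linear_fit(seq_sim, struct_sim):
    """Least-squares line struct = slope * seq + intercept, plus Pearson r.

    seq_sim and struct_sim are equal-length lists of similarity scores
    for pairs of proteins within one family (hypothetical inputs).
    """
    n = len(seq_sim)
    mean_x = sum(seq_sim) / n
    mean_y = sum(struct_sim) / n
    sxx = sum((x - mean_x) ** 2 for x in seq_sim)
    syy = sum((y - mean_y) ** 2 for y in struct_sim)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(seq_sim, struct_sim))
    slope = sxy / sxx                      # structural change per unit sequence change
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)         # linear correlation coefficient
    return slope, intercept, r

# Made-up scores for one family, purely to show the call.
slope, intercept, r = linear_fit([50.0, 120.0, 200.0, 310.0], [6.1, 11.8, 19.5, 28.0])
```

Comparing slopes across families would then be one simple way to look at how much structural change accompanies a given amount of sequence change.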
Genome Biology, 2015
Background: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2). Results: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts. Conclusions: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein's gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins. Background The discovery of evolutionarily mobile protein domains in the early 1980s, shortly after the recognition of eukaryotic splicing, revolutionized our understanding of protein structure.
Before the discovery of the exon-shuffled domains in the EGF receptor [1,2], most proteins (globins, cytochrome c, serine proteases, etc.) were understood to be globally similar single-domain proteins. While proteins like calmodulin were known to contain repeated domains, the structural implications of modular proteins were not fully appreciated until clearly homologous domains were seen in different sequence contexts. Today, domains are central to our understanding of the structure, evolution, and functional roles of proteins and protein families. Protein domain assignments using Pfam [3], InterPro [4], and other domain annotation resources are widely used to infer protein evolutionary relationships,
Nucleic Acids Research, 2003
The CRP (Cleavage of Radiolabeled Phosphoproteins) program guides the design and interpretation of experiments to identify protein phosphorylation sites by Edman sequencing of unseparated peptides. Traditionally, phosphorylation sites are determined by cleaving the phosphoprotein and separating the peptides for Edman 32P-phosphate release sequencing. CRP analysis of a phosphoprotein's sequence accelerates this process by omitting the separation step: given a protein sequence of interest, the CRP program performs an in silico proteolytic cleavage of the sequence and reports the predicted Edman cycles in which radioactivity would be observed if a given serine, threonine or tyrosine were phosphorylated. Experimentally observed cycles containing 32P can be compared with CRP predictions to confirm candidate sites and/or explore the ability of additional cleavage experiments to resolve remaining ambiguities. To reduce ambiguity, the phosphorylated residue (P-Tyr, P-Ser or P-Thr) can be determined experimentally, and CRP will ignore sites with alternative residues. CRP also provides simple predictions of likely phosphorylation sites using known kinase recognition motifs. The CRP interface is available at http://fasta.bioch.virginia.edu/crp.
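The idea of in silico cleavage followed by Edman cycle prediction can be sketched in a few lines of Python. This is not the CRP program's code: the function names are hypothetical, and the cleavage rule (cut after every K or R, with no exceptions) is a simplifying assumption chosen only to keep the example short.

```python
from collections import defaultdict

def cleave(seq, cut_after="KR"):
    """Split seq after each residue in cut_after (assumed, simplified rule).

    Returns a list of (start_index, peptide) pairs, start_index 0-based.
    """
    peptides, start = [], 0
    for i, aa in enumerate(seq):
        if aa in cut_after:
            peptides.append((start, seq[start:i + 1]))
            start = i + 1
    if start < len(seq):
        peptides.append((start, seq[start:]))
    return peptides

def predicted_cycles(seq, cut_after="KR", targets="STY"):
    """Map each Edman cycle number to the candidate sites released in it.

    All peptides are sequenced together, so a site's cycle is simply its
    1-based position within its own peptide. Sites are labelled by residue
    and 1-based position in the full sequence, e.g. "S2".
    """
    cycles = defaultdict(list)
    for start, peptide in cleave(seq, cut_after):
        for offset, aa in enumerate(peptide):
            if aa in targets:
                cycles[offset + 1].append(f"{aa}{start + offset + 1}")
    return dict(cycles)

# Made-up sequence; restricting targets mimics knowing the phospho-residue type.
all_sites = predicted_cycles("MSAKTYRGS")
tyr_only = predicted_cycles("MSAKTYRGS", targets="Y")
```

Several sites can share a cycle, which is the ambiguity that a second cleavage with a different specificity, or knowledge of the phosphorylated residue type, helps to resolve.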
Nucleic Acids Research, 1977
A computer program is described for the rapid calculation of least squares solutions for data fitted to different functions normally used in reassociation and hybridization kinetic measurements. The equations for the fraction not reacted as a function of $C_0t$ follow: first order, $\exp(-kC_0t)$; second order, $(1+kC_0t)^{-1}$; variable order, $(1+kC_0t)^{-n}$; approximate fraction of DNA sequence remaining single stranded, $(1+kC_0t)^{-0.44}$; and a function describing the pairing of tracer when the rate constant for the tracer ($k$) is distinct from the driver rate constant ($k_d$): $\exp\{k[1-(1+k_dC_0t)^{1-n}]/[k_d(1-n)]\}$. Several components may be used for most of these functional forms. The standard deviations of the individual parameters at the solutions are calculated.
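The functional forms named in this abstract are easy to write down as Python functions, which may help when reading the formulas. The sketch below is an illustration only, not the 1977 program: the function names are hypothetical, no least-squares fitting is performed, and the tracer form uses the exponent 1 − n, which is what integrating dT/d(Cot) = −k·T·(1 + kd·Cot)^(−n) gives and is therefore an assumption about the intended formula.

```python
import math

def first_order(cot, k):
    """Fraction not reacted for first-order kinetics: exp(-k * Cot)."""
    return math.exp(-k * cot)

def second_order(cot, k):
    """Fraction not reacted for ideal second-order kinetics: 1 / (1 + k * Cot)."""
    return 1.0 / (1.0 + k * cot)

def variable_order(cot, k, n):
    """Variable-order form: (1 + k * Cot) ** -n."""
    return (1.0 + k * cot) ** (-n)

def tracer_with_driver(cot, k, kd, n):
    """Tracer fraction not reacted when tracer rate k differs from driver rate kd.

    Assumed form (requires n != 1):
    exp(k * (1 - (1 + kd * Cot) ** (1 - n)) / (kd * (1 - n))).
    """
    return math.exp(k * (1.0 - (1.0 + kd * cot) ** (1.0 - n)) / (kd * (1.0 - n)))

def multi_component(cot, components, form=second_order):
    """Sum of several kinetic components, each a (fraction, rate constant) pair."""
    return sum(fraction * form(cot, k) for fraction, k in components)

# Made-up two-component curve evaluated at a single Cot value.
value = multi_component(1.0, [(0.3, 1.0), (0.7, 10.0)])
```

A fitting routine would wrap functions like these in a least-squares minimizer and estimate parameter standard deviations at the solution; that step is left out here.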
Chromosoma, 1976
Abstract. A sensitive search has been made in Drosophila melanogaster DNA for short repetitive sequences interspersed with single copy sequences. Five kinds of measurements all yield the conclusion that there are few short repetitive sequences in this genome: 1) Comparison of ...
Genome Research, 1999
We have developed a rapid visual method for identifying novel members of gene families. Starting with an evolutionary tree, 20-50 protein query sequences for a gene family are selected from different branches of the tree. These query sequences are used to search the GenBank and expressed sequence tag (EST) DNA databases and their nightly updates using the tfastx3 or tfasty3 programs. The results of all 20-50 searches are collated and resorted to highlight EST or genomic sequences that share significant similarity with the query sequences. The statistical significance of each DNA/protein alignment is plotted, highlighting the portion of the query sequence that is present in the database sequence and the percent identity in the aligned region. The collated results for database sequences are linked using the WWW to the underlying scores and alignments; these links can also be used to perform additional searches to characterize the novel sequence further. With traditional "deep"...
Current Protocols in Bioinformatics, 2002
The FASTA programs provide a comprehensive set of rapid similarity searching tools (fasta36, fastx36, tfastx36, fasty36, tfasty36), similar to those provided by the BLAST package, as well as programs for slower, optimal, local and global similarity searches (ssearch36, ggsearch36) and for searching with short peptides and oligonucleotides (fasts36, fastm36). The FASTA programs use an empirical strategy for estimating statistical significance that accommodates a range of similarity scoring matrices and gap penalties, improving alignment boundary accuracy and search sensitivity (Unit 3.5). The FASTA programs can produce "BLAST-like" alignment and tabular output, for ease of integration into existing analysis pipelines, and can search small, representative databases, and then report results for a larger set of sequences, using links from the smaller dataset. The FASTA programs work with a wide variety of database formats, including mySQL and postgreSQL databases (Unit 9.4). The programs also provide a strategy for integrating domain and active site annotations into alignments and highlighting the mutational state of functionally critical residues. These protocols describe how to use the FASTA programs to characterize protein and DNA sequences, using protein:protein, protein:DNA, and DNA:DNA comparisons.
Background: While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate. Results: We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alig...
Molecular Biology and Evolution
Accurate determination of the evolutionary relationships between genes is a foundational challenge in biology. Homology (evolutionary relatedness) is in many cases readily determined based on sequence similarity analysis. By contrast, whether or not two genes directly descended from a common ancestor by a speciation event (orthologs) or duplication event (paralogs) is more challenging, yet provides critical information on the history of a gene. Since 2009, this task has been the focus of the Quest for Orthologs (QFO) Consortium. The sixth QFO meeting took place in Okazaki, Japan in conjunction with the 67th National Institute for Basic Biology conference. Here, we report recent advances, applications, and oncoming challenges that were discussed during the conference. Steady progress has been made toward standardization and scalability of new and existing tools. A feature of the conference was the presentation of a panel of accessible tools for phylogenetic profiling and several develo...
Bioinformatics is becoming increasingly central to research in the life sciences. However, despite its importance, bioinformatics skills and knowledge are not well integrated in undergraduate biology education. This curricular gap prevents biology students from harnessing the full potential of their education, limiting their career opportunities and slowing genomic research innovation. To advance the integration of bioinformatics into life sciences education, a framework of core bioinformatics competencies is needed. To that end, we here report the results of a survey of life sciences faculty in the United States about teaching bioinformatics to undergraduate life scientists. Responses were received from 1,260 faculty representing institutions in all fifty states with a combined capacity to educate hundreds of thousands of students every year. Results indicate strong, widespread agreement that bioinformatics knowledge and skills are critical for undergraduate life scientists, as wel...
Nucleic Acids Research, Jan 6, 2016
Iterative similarity search programs, like psiblast, jackhmmer, and psisearch, are much more sensitive than pairwise similarity search methods like blast and ssearch because they build a position specific scoring model (a PSSM or HMM) that captures the pattern of sequence conservation characteristic to a protein family. But models are subject to contamination; once an unrelated sequence has been added to the model, homologs of the unrelated sequence will also produce high scores, and the model can diverge from the original protein family. Examination of alignment errors during psiblast PSSM contamination suggested a simple strategy for dramatically reducing PSSM contamination. psiblast PSSMs are built from the query-based multiple sequence alignment (MSA) implied by the pairwise alignments between the query model (PSSM, HMM) and the subject sequences in the library. When the original query sequence residues are inserted into gapped positions in the aligned subject sequence, the resu...