Protein design

This article refers to rational protein design. For the broader engineering of proteins see protein engineering.

Protein design is the rational design of new protein molecules to fold to a target protein structure, with the ultimate goal of designing novel function and/or behavior. Proteins can be designed from scratch (de novo design) or by making calculated variations on a known protein structure and its sequence (known as protein redesign). Rational protein design approaches make protein-sequence predictions that will fold to specific structures. These predicted sequences can then be validated experimentally through methods such as peptide synthesis, site-directed mutagenesis, or artificial gene synthesis.

Rational protein design dates back to the mid-1970s, although initial protein design approaches were based mostly on sequence composition and did not account for specific interactions between side-chains at the atomic level.^[1] Recently, however, improvements in molecular force fields, protein design algorithms, and structural bioinformatics, such as libraries of amino acid conformations, have enabled the development of advanced computational protein design tools. These computational tools can make complex calculations on protein energetics and flexibility, and perform searches over enormous configuration spaces, which would be unfeasible to perform manually. Thanks to the development of computational protein design programs and important successes in the field (e.g., see examples below), rational protein design has become one of the most important tools in protein engineering.

Overview and history

The goal in rational protein design is to predict amino acid sequences that will fold to a specific protein structure. Although the number of possible protein sequences is enormous, growing exponentially with the size of the protein chain, only a subset of them will fold reliably and quickly to a single native state. Protein design involves identifying novel sequences within this subset. The native state of a protein is the conformational free energy minimum for the chain. Therefore, protein design is the search for sequences that have the chosen structure as a free energy minimum. In a sense, it is the reverse of structure prediction: In design, a tertiary structure is specified, and a sequence that will fold to it is identified. Hence, it is also referred to as inverse folding. Protein design is then an optimization problem: using some scoring criteria, an optimized sequence that will fold to the desired structure is chosen.

When the first proteins were rationally designed during the 1970s and 1980s, the sequence for these was optimized manually based on analyses of other known proteins, the sequence composition, amino acid charges, and the geometry of the desired structure.^[1] The first designed proteins are attributed to Bernd Gutte, who designed a reduced version of a known catalyst, bovine ribonuclease, as well as tertiary structures consisting of beta-sheets and alpha-helices, including a binder of DDT. Urry and colleagues later designed elastin-like fibrous peptides based on rules on sequence composition. Richardson and co-workers designed a 79-residue protein with no sequence homology to any known protein.^[1] In the 1990s, the advent of powerful computers, libraries of amino acid conformations, and force fields developed primarily for molecular dynamics simulations enabled the development of structure-based computational protein design tools. Following the development of these computational tools, enormous success has been achieved over the last 30 years in protein design. The first protein successfully designed completely de novo was done by Stephen Mayo and co-workers in 1997,^[2] and, shortly after, in 1999 Peter S. Kim and co-workers designed dimers, trimers, and tetramers of unnatural right-handed coiled coils.^[3]^[4] In 2003, David Baker's laboratory designed a full protein to a fold never seen before in nature.^[5] Later, in 2008, Baker's group computationally designed enzymes for two different reactions.^[6] In 2010, one of the most powerful broadly neutralizing antibodies was isolated from patient serum using a computationally designed protein probe.^[7] Thanks to these and other successes (e.g., see examples below), protein design has become one of the most important tools available for protein engineering. There is great hope that the design of new proteins, small and large, will have applications in medicine and bioengineering.

Underlying models of protein structure and function

Protein design programs use computer models of the molecular forces that drive proteins in in vivo environments. In order to make the problem tractable, these forces are simplified by protein design models. Although protein design programs vary greatly, they have to address four main modeling questions: What is the target structure of the design, what flexibility is allowed on the target structure, which sequences are included in the search, and which force field will be used to score sequences and structures.

Target structure

The Top7 protein was one of the first proteins designed for a fold that had never been seen before in nature ^[5]

Protein function is heavily dependent on protein structure, and rational protein design uses this relationship to design function by designing proteins that have a target structure or fold. Thus, by definition, in rational protein design the target structure or ensemble of structures must be known beforehand. This contrasts with other forms of protein engineering, such as directed evolution, where a variety of methods are used to find proteins that achieve a specific function, and with protein structure prediction where the sequence is known, but the structure is unknown.

Most frequently, the target structure is based on a known structure of another protein. However, novel folds not seen in nature have been made increasingly possible. Peter S. Kim and co-workers designed trimers and tetramers of unnatural coiled coils, which had not been seen before in nature.^[3]^[4] The protein Top7, developed in David Baker's lab, was designed completely using protein design algorithms, to a completely novel fold.^[5] More recently, Baker and co-workers developed a series of principles to design ideal globular-protein structures based on protein folding funnels that bridge between secondary structure prediction and tertiary structures. These principles, which build on both protein structure prediction and protein design, were used to design five different novel protein topologies.^[8]

Sequence space

FSD-1 (shown in blue, PDB id: 1FSV) was the first de novo computational design of a full protein.^[2] The target fold was that of the zinc finger in residues 33-60 of the structure of protein Zif268 (shown in red, PDB id: 1ZAA). The designed sequence had very little sequence identity with any known protein sequence.

In rational protein design proteins can be redesigned from the sequence and structure of a known protein, or completely from scratch in de novo protein design. In protein redesign, most of the residues in the sequence are maintained as their wild-type amino-acid while a few are allowed to mutate. In de novo design the entire sequence is designed anew, based on no previous sequence.

Both de novo designs and protein redesigns can establish rules on the sequence space: the specific amino acids that are allowed at each mutable residue position. For example, the composition of the surface of the RSC3 probe to select HIV-broadly neutralizing antibodies was restricted based on evolutionary data and charge balancing. In fact, many of the earliest attempts on protein design were heavily based on empirical "rules" on the sequence space.^[1] Moreover, the design of fibrous proteins, usually follows strict rules on the sequence space. Collagen-based designed proteins, for example, are often composed of Gly-Pro-X repeating patterns.^[1] With the advent of computational techniques, however, the design of proteins with no human intervention in sequence selection has become possible.^[2]

Structural flexibility

Common protein design programs use rotamer libraries to simplify the conformational space of protein side chains. This animation loops through all the rotamers of the isoleucine amino acid based on the Penultimate Rotamer Library.^[9]

In protein design, the target structure (or structures) of the protein are known. However, a rational protein design approach must model some flexibility on the target structure in order to increase the number of sequences that can be designed for that structure and to minimize the chance of a sequence folding to a different structure. For example, in a protein redesign of a single, small amino acid (such as alanine) in the tightly packed core of a protein, very few mutants would be predicted by a rational design approach to fold to the target structure, if the surrounding side-chains are not allowed to be repacked.

Thus, an essential parameter of any design process is the amount of flexibility that is allowed for both the side-chains and the backbone. In the simplest models, the protein backbone is kept rigid while some of the protein side-chains are allowed to change conformations. Protein side-chains, however, can have many degrees of freedom in their bond lengths, bond angles, and χ dihedral angles. To simplify this space, protein design methods use rotamer libraries that assume ideal values for bond lengths and bond angles, while restricting χ dihedral angles to a few frequently observed low-energy conformations known as rotamers.

Rotamer libraries describe rotamers based on an analysis of many protein structures. Backbone-independent rotamer libraries describe all rotamers.^[9] Backbone-dependent rotamer libraries, in contrast, describe the rotamers as how likely they are to appear depending on the protein backbone arrangement around the side chain.^[10] The rotamers described by rotamer libraries are usually regions in space. Most protein design programs use a single conformation (e.g., the modal value for rotamer dihedrals in space) or several points in the region described by the rotamer; the OSPREY protein design program, in contrast, models the entire continuous region.^[11]

Although rational protein design must preserve the general backbone fold a protein, allowing some backbone flexibility can significantly increase the number of sequences that fold to the structure while maintaining the general fold of the protein.^[12] Backbone flexibility is especially important in protein redesign because sequence mutations often result in small changes to the backbone structure. Moreover, backbone flexibility can be essential for more advanced applications of protein design, such as binding prediction and enzyme design. Some models of protein design backbone flexibility include small and continuous global backbone movements, discrete backbone samples around the target fold, backrub motions, and protein loop flexibility.^[12]^[13]

Energy function

Comparison of various potential energy functions. The most accurate energy are those that use quantum mechanical calculations, but these are too slow for protein design. On the other extreme, heuristic energy functions, are based on statistical terms and are very fast. In the middle are molecular mechanics energy functions that are physically-based but are not as computationally expensive as quantum mechanical simulations.^[14]

Rational protein design techniques must be able to discriminate sequences that will be stable under the target fold from those that would prefer other low-energy competing states. Thus, protein design requires accurate energy functions that can rank and score sequences by how well they fold to the target structure. At the same time, however, these energy functions must consider the computational challenges behind protein design. One of the most challenging requirements for successful design is an energy function that is both accurate and simple for computational calculations.

The most accurate energy functions are those based on quantum mechanical simulations. These simulations, however, are too slow and typically impractical for protein design. Many protein design algorithms, instead, use either physics-based energy functions adapted from molecular mechanics simulation programs, knowledge based energy-functions, or a hybrid combination of both. However, the trend has been toward using more physics-based potential energy functions.^[14]

Physics-based energy functions, such as AMBER and CHARMM, are typically derived from quantum mechanical simulations, as well as thermodynamic, crystallographic, and spectroscopic data from experimental data.^[15] These energy functions typically simplify physical energy function and make them pairwise decomposable, meaning that the total energy of a protein conformation can be calculated by adding the pairwise energy between each atom pair, which makes them attractive for optimization algorithms. Physics-based energy functions typically model an attractive-repulsive Lennard-Jones term between atoms and a pairwise electrostatics coulombic term^[16] between non-bonded atoms.

Water-mediated hydrogen bonds play a key role in protein-protein binding. One such interaction is shown between residues D457, S365 in the heavy chain of the HIV-broadly-neutralizing antibody VRC01 (green) and residues N58 and Y59 in the HIV envelope protein GP120 (purple).^[17]

Statistical potentials, in contrast to physics-based potentials, have the advantage of being fast to compute, of accounting implicitly of complex effects and being less sensitive to small changes in the protein structure.^[18] These energy functions are based on deriving energy values from frequency of appearance on a structural database.

Protein design, however, has requirements that can sometimes be limited in molecular mechanics force-fields. Molecular mechanics force-fields, which have been used mostly in molecular dynamics simulations, are optimized for the simulation of single sequences, but protein design searches through many conformations of many sequences. Thus, molecular mechanics force-fields must be tailored for protein design. In practice, protein design energy functions frequently incorporate both statistical terms and physics-based terms. For example, the Rosetta energy function, one of the most-used energy functions, incorporates physics-based energy terms originating in the CHARMM energy function, as well as statistical energy terms, such as rotamer probability and knowledge-based electrostatics. Typically, energy functions are highly customized between laboratories, and specifically tailored for every design.^[15]

Challenges for effective protein design energy functions

Water makes up a majority of the molecules surrounding proteins and is the main driver of protein structure. Thus, modeling of the interaction between water and protein is a necessity in protein design. The number of water molecules that interact with a protein at any given time, however, is enormous and each one has a large number of degrees of freedom and interaction partners. Protein design programs, instead, model the majority of water molecules as a continuum, modeling both the hydrophobic effect and solvation polarization.^[15]

Individual water molecules, however, can sometimes have a crucial structural role in the core of proteins, as well as in protein-protein or protein-ligand interactions. Failing to model such waters can result in mispredictions of the optimal sequence of a protein-protein interface. As an alternative, water molecules can be added to rotamers.^[15]

Protein design as an optimization problem

This animation illustrates the complexity of a protein design search. A typical protein design search compares all the rotamer-conformations from all possible mutations at all residues. In this example, the residues Phe36 and His 106 are allowed to mutate to, respectively, the amino acids Tyr and Asn. Phe and Tyr have 4 rotamers each in the rotamer library, while Asn and His have 7 and 8 rotamers, respectively, in the rotamer library (from the Richardson's penultimate rotamer library^[9]) . The animation loops through all (4 + 4) x (7 + 8) = 120 possibilities. The structure shown is that of myoglobin, PDB id: 1mbn.

The goal of protein design is to find a protein sequence that will fold to a target structure. A protein design algorithm must, therefore, search all the conformations of each sequence, with respect to the target fold, and rank sequences according to the lowest-energy conformation of each one, as determined by the protein design energy function. Thus, a typical input to the protein design algorithm is the target fold, the sequence space, the structural flexibility, and the energy function, while the output is one or more sequences that are predicted to fold stably to the target structure.

The number of candidate protein sequences, however, grows exponentially with the number of protein residues; for example, there are 20¹⁰⁰ protein sequences of length 100. Furthermore, even if amino acid side-chain conformations are limited to a few rotamers (see Structural flexibility), this results in an exponential number of conformations for each sequence. Thus, in our 100 residue protein, and assuming that each amino acid has exactly 10 rotamers, a search algorithm that searches this space will have to search over 200¹⁰⁰ protein conformations.

The most common energy functions can be decomposed into pairwise terms between rotamers and amino acid types, which casts the problem as a combinatorial one, and powerful optimization algorithms can be used to solve it. In those cases, the total energy of each conformation belonging to each sequence can be formulated as a sum of individual and pairwise terms between residue positions. If a designer is interested only in the best sequence, the protein design algorithm only requires the lowest-energy conformation of the lowest-energy sequence. In these cases, the amino acid identity of each rotamer can be ignored and all rotamers belonging to different amino acids can be treated the same. Let r_i be a rotamer at residue position i in the protein chain, and E(r_i) the potential energy between the internal atoms of the rotamer. Let E(r_i, r_j) be the potential energy between r_i and rotamer r_j at residue position j. Then, we define the optimization problem as one of finding the conformation of minimum energy (E_T):

$\min E_{T} = \sum_{i}\Big[ E_i(r_i) + \sum_{i\ne j} E_{ij}(r_i, r_j)\Big] \,$

(1)

The problem of minimizing E_T is an NP-hard problem.^[13]^[19]^[20] Even though the class of problems is NP-hard, in practice many instances of protein design can be solved exactly or optimized satisfactorily through heuristic methods.

Algorithms

Several algorithms have been developed specifically for the protein design problem. These algorithms can be divided into two broad classes: exact algorithms, such as Dead-end elimination, that do not have runtime guarantees but guarantee the quality of the solution; and heuristic algorithms, such as Monte Carlo, that are faster than exact algorithms but have no guarantees on the optimality of the results. Exact algorithms guarantee that the optimization process produced the optimal according to the protein design model. Thus, if the predictions of exact algorithms fail when these are experimentally validated, then the source of error can be attributed to the energy function, the allowed flexibility, the sequence space or the target structure (e.g., if it cannot be designed for) ^[21]

Some protein design algorithms are listed below. Although these algorithms address only the most basic formulation of the protein design problem, Equation (1), when the optimization goal changes because designers introduce improvements and extensions to the protein design model, such as improvements to the structural flexibility allowed (e.g., protein backbone flexibility) or the incorporation of sophisticated energy terms, many of the extensions on protein design that improve modeling are built on top of these algorithms. For example, Rosetta Design incorporates sophisticated energy terms, as well as backbone flexibility using Monte Carlo as the underlying optimization algorithm. OSPREY's algorithms build upon the Dead-End Elimination algorithm and A* to incorporate continuous backbone and side-chain movements. Therefore, these algorithms provide a good perspective on the different kinds of algorithms available for protein design.

Algorithms with mathematical guarantees

Dead-end elimination (DEE)

Main article: Dead-end elimination

The dead-end elimination algorithm reduces the search space of the problem iteratively by removing rotamers that that can be provably shown to not be part of the global lowest energy conformation (GMEC). On each iteration, the dead-end elimination algorithm compares all possible pairs of rotamers at each residue position, and removes each rotamer r′_i that can be shown to always be of higher energy than another rotamer r_i and is therefore not part of the GMEC:

$E(r^\prime_i) + \sum_{j\ne i} \min_{r_j} E(r^\prime_i,r_j) > E(r_i) + \sum_{j\ne i} \max_{r_j} E(r_i,r_j)$

Other powerful extensions to the Dead-End Elimination algorithm include the Pairs elimination criterion, and the Generalized dead-end elimination criterion. This algorithm has also been extended to handle continuous rotamers with provable guarantees.

Although the Dead-end elimination algorithm runs in polynomial time on each iteration, it cannot guarantee convergence. If, after a certain number of iterations, the dead-end elimination algorithm does not prune any more rotamers, then either rotamers have to be merged or another search algorithm must be used to search the remaining search space. In such cases, the dead-end elimination acts as a pre-filtering algorithm to reduce the search space, while other algorithms, such as A*, Monte Carlo, Linear Programming, or FASTER are used to search the remaining search space.^[13]

Branch-and-bound algorithms

The protein design conformational space can be represented as a tree, where the protein residues are ordered in an arbitrary way, and the tree branches at each of the rotamers in a residue. Branch and bound algorithms use this representation to efficiently explore the conformation tree: At each branching, branch and bound algorithms bound the conformation space and explore only the promising branches.^[13]^[22]^[23]

A popular search algorithm for protein design is the A* search algorithm.^[13]^[23] A* computes a lower-bound score on each partial tree path that lower bounds (with guarantees) the energy of each of the expanded rotamers. Each partial conformation is added to a priority queue and at each iteration the partial path with the lowest lower bound is popped from the queue and expanded. The algorithm stops once a full conformation has been enumerated and guarantees that the conformation is the optimal.

The A* score f in protein design consists of two parts, f=g+h. g is the exact energy of the rotamers that have already been assigned in the partial conformation. h is a lower bound on the energy of the rotamers that have not yet been assigned. Each is designed as follows, where d is the index of the last assigned residue in the partial conformation.

$g=\sum_{i=1}^d (E(r_i ) + \sum_{j=i+1}^d E(r_i,r_j) )$

$h = \sum_{j=d+1}^n [\min_{r_j}(E(r_j) + \sum_{i=1}^d E(r_i,r_j) + \sum_{k=j+1}^n \min_{r_k} E(r_j,r_k))]$

Optimization as an integer linear program

The problem of optimizing E_T (Equation (1)) can be easily formulated as an Integer programming (ILP).^[24] One of the most powerful formulations uses binary variables to represent the presence of a rotamer and edges in the final solution, and constraints the solution to have exactly one rotamer for each residue and one pairwise interaction for each pair of residues:

$\ \min \sum_{i}\sum_{r_i} E_i(r_i)q_{i}(r_i) + \sum_{j\ne i}\sum_{r_j} E_{ij}(r_i, r_j)q_{ij}(r_i, r_j) \,$

s.t.

$\sum_{r_i} q_{i}(r_i) = 1, \ \forall i$

$\sum_{r_j} q_{ij}(r_i,r_j) = q_{i}(r_i), \forall i, r_i, j$

$q_i, q_{ij} \in \{0,1\}$

ILP solvers, such as CPLEX, can compute the exact optimal solution for large instances of protein design problems. These solvers use a linear programming relaxation of the problem, where q_i and q_ij are allowed to take continuous values, in combination with a branch and cut algorithm to search only a small portion of the conformation space for the optimal solution. ILP solvers have been shown to solve many instances of the side-chain placement problem.^[24]

Message-passing based approximations to the linear programming dual

ILP solvers depend on linear programming (LP) algorithms, such as the Simplex or barrier-based methods to perform the LP relaxation at each branch. These LP algorithms were developed as general-purpose optimization methods and are not optimized for the protein design problem (Equation (1)). In consequence, the LP relaxation becomes the bottleneck of ILP solvers when the problem size is large.^[25] Recently, several alternatives based on message-passing algorithms have been designed specifically for the optimization of the LP relaxation of the protein design problem. These algorithms can approximate both the dual or the primal instances of the integer programming, but in order to maintain guarantees on optimality, they are most useful when used to approximate the dual of the protein design problem, because approximating the dual guarantees that no solutions are missed. Message-passing based approximations include the tree reweighted max-product message passing algorithm,^[26]^[27] and the message passing linear programming algorithm.^[28]

Optimization algorithms without guarantees

Monte Carlo and simulated annealing

Monte Carlo is one of the most widely used algorithms for protein design. In its simplest form, a Monte Carlo algorithm selects a residue at random, and in that residue a randomly chosen rotamer (of any amino acid) is evaluated.^[20] The new energy of the protein, E_new is compared against the old energy E_old and the new rotamer is accepted with a probability of:

$p=e^{-\beta(E_{\text{new}}-E_{\text{old}}))},$

where β is the Boltzmann constant and the temperature T can be chosen such that in the initial rounds it is high and it is slowly annealed to overcome local minima.^[11]

FASTER

The FASTER algorithm uses a combination of deterministic and stochastic criteria to optimize amino acid sequences. FASTER first uses DEE to eliminate rotamers that are not part of the optimal solution. Then, a series of iterative steps optimize the rotamer assignment.^[29]^[30]

Belief propagation for protein design

In belief propagation for protein design, the algorithm exchanges messages that describe the belief that each residue has about the probability of each rotamer in neighboring residues. The algorithm updates messages on every iteration and iterates until convergence or until a fixed number of iterations. Convergence is not guaranteed in protein design. The message m_{i→ j}(r_j that a residue i sends to every rotamer (r_j at neighboring residue j is defined as:

$m_{i\to j}(r_j) = \max_{r_i} \Big(e^{\frac{-E_i(r_i)-E_{ij}(r_i,r_j)}{T}}\Big) \prod_{k \in N(i)\backslash j} m_{k\to i (r_i)}$

Both max-product and sum-product belief propagation have been used to optimize protein design.

Applications and examples of designed proteins

Enzyme design

The design of new enzymes is an application of protein design with enormous bioengineering and biomedical applications. In general, designing a protein structure can be different from designing an enzyme, because the design of enzymes must consider many states involved in the catalytic mechanism. Protein design, however, is a prerequisite of de novo enzyme design because, at the very least, the design of catalysts requires a scaffold in which the catalytic mechanism can be inserted.^[31]

Enormous progress in de novo enzyme design and enzyme redesign was made during the first decade of the 21st century. In three major studies, David Baker and co-workers de novo designed enzymes for the retro-Aldol reaction,^[32] a Kemp-elimination reaction,^[33] and for the Diels-Alder reaction.^[34] Furthermore, Stephen Mayo and co-workers developed an iterative method to design the most efficient known enzyme for the Kemp-elimination reaction.^[35] In addition, in the laboratory of Bruce Donald, computational protein design was used to switch the specificity of one of the protein domains of the nonribosomal peptide synthetase that produces Gramicidin S, from its natural substrate phenylalanine to other noncognate substrates including charged amino acids; the redesigned enzymes had activities close to those of the wild-type.^[36]

Design for affinity

Protein–protein interactions are involved in most biological processes. Many of the most difficult-to-treat diseases, such as Alzheimer's, many forms of cancer and HIV infection involve protein-protein interactions. To treat these diseases, it is, therefore, desirable to design protein or protein-like therapeutics that bind one of the partners of the interaction and, therefore, disrupt the disease-causing interaction. This requires the design of protein-therapeutics for affinity toward its partner.

Protein-protein interactions can be designed using protein design algorithms because the principles that rule protein stability also rule protein-protein binding. Protein-protein interaction design, however, presents challenges not commonly present in protein design. One of the most important challenges is that, in general, the interfaces between proteins are more polar than protein cores, and binding involves a tradeoff between desolvation and hydrogen bond formation.^[37] To overcome this challenge, Bruce Tidor and co-workers developed a method to improve the affinity of antibodies by focusing on electrostatic contributions. They found that, for the antibodies designed in the study, reducing the desolvation costs of the residues in the interface increased the affinity of the binding pair.^[37]^[38]^[39]

The design of protein-protein interfaces for affinity has become a productive area of protein design research. Brian Kuhlman and co-workers redesigned a native monomeric protein into a homodimeric protein by designing beta-sheets between the two proteins.^[40]^[41] In 2010 Chris Floudas and co-workers computationally designed peptide inhibitors of HIV entry and successfully validated their inhibitory capacity in cell cultures.^[42] The Donald laboratory designed peptide inhibitors of a protein-protein interaction involved in cystic fibrosis, with potential therapeutic applications; these inhibitors formed a beta-sheet with the pdz-binding domain of the inhibited protein.^[43] Recently, Amit Jaiswal and others have developed 30 designer peptides based on the affinity of amino-acids for each other, in order to inhibit telomerase recruitment towards telomeres.^[44]

Scoring binding predictions

Protein design energy functions must be adapted to score binding predictions because binding involves a trade-off between the lowest-energy conformations of the free proteins (E_P and E_L) and the lowest-energy conformation of the bound complex (E_PL):

$\Delta_G = E_{PL} - E_P - E_L$ .

The K* algorithm approximates the binding constant of the algorithm by including conformational entropy into the free energy calculation. The K* algorithm considers only the lowest-energy conformations of the free and bound complexes (denoted by the sets P, L, and PL) to approximate the partition functions of each complex:^[13]

$K^* = \frac{\sum\limits_{x\in PL} e^{-E(x)/RT}}{\sum\limits_{x\in P} e^{-E(x)/RT}\sum\limits_{x\in L} e^{-E(x)/RT}}$

Design for specificity

The design of protein-protein interactions must be highly specific because proteins can interact with a large number of proteins; successful design requires selective binders. Thus, protein design algorithms must be able to distinguish between on-target (or positive design) and off-target binding (or negative design).^[1]^[37] One of the most prominent examples of design for specificity is the design of specific bZIP-binding peptides by Amy Keating and co-workers for 19 out of the 20 bZIP families; 8 of these peptides were specific for their intended partner over competing peptides.^[37]^[45]^[46] In addition, positive and negative design was also used by Anderson and co-workers to predict mutations in the active site of a drug target that conferred resistance to a new drug; positive design was used to maintain wild-type activity, while negative design was used to disrupt binding of the drug.^[47] Recent computational redesign by Costas Maranas and co-workers was also capable of experimentally switching the cofactor specificity of Candida boidinii xylose reductase from NADPH to NADH.^[48]

Protein resurfacing

Protein resurfacing consists of designing a protein's surface while preserving the overall fold, core, and boundary regions of the protein intact. Protein resurfacing is especially useful to alter the binding of a protein to other proteins. One of the most important applications of protein resurfacing was the design of the RSC3 probe to select broadly neutralizing HIV antibodies at the NIH Vaccine Research Center. First, residues outside of the binding interface between the gp120 HIV envelope protein and the previously discovered b12-antibody were selected to be designed. Then, the sequence spaced was selected based on evolutionary information, solubility, similarity with the wild-type, and other considerations. Then the RosettaDesign software was used to find optimal sequences in the selected sequence space. RSC3 was later used to discover the broadly neutralizing antibody VRC01 in the serum of a long-term HIV-infected non-progressor individual.^[49]

Design of globular proteins

Globular proteins are proteins that contain a hydrophobic core and a hydrophilic surface. Globular proteins frequently assume a stable structure, unlike fibrous proteins, which have multiple conformations. The three-dimensional structure of globular proteins is typically easier to determine through X-ray crystallography and nuclear magnetic resonance than both fibrous proteins and membrane proteins, which makes globular proteins more attractive for protein design than the other types of proteins. In fact, most successful protein designs have involved globular proteins. Both RSD-1, and Top7 were de novo designs of globular proteins. Five more protein structures were designed, synthesized, and verified in 2012 by the Baker group. These new proteins serve no biological function, but the structures are intended to act as building-blocks that can be expanded to incorporate functional active sites. The structures were found computationally by using new heuristics based on analyzing the connecting loops between parts of the sequence that specify secondary structures.^[50]

Design of transmembrane proteins

Membrane proteins are inherently hard to design, in large part because it is hard to validate the designs experimentally for several reasons. Membrane proteins are hard to purify. Their structure is hard to characterize because they adopt their native conformation only in the presence of a membrane. Crystallization is inherently hard, and NMR studies of membrane proteins can fail because of their size.

Design of fibrous proteins

Fibrous proteins, such as elastin or collagen, typically have no single structure. It is believed, however, that the fibrous proteins do not have random folds, and instead they have folds defined within an ensemble of structures. This ensemble defines their behavior. Therefore, in theory it is possible to rationally design fibrous proteins by selecting a sequence that will populate a specific ensemble.

Other applications

One of the most desirable applications for protein design is for biosensors, proteins that will sense the presence of specific compounds. Some attempts in the design of biosensors include sensors for unnatural molecules including TNT.^[51] More recently, Kuhlman and co-workers designed a biosensor of the PAK1.^[52]

Role of other protein engineering tools in protein design

Protein design is one of the tools available for protein engineering. When proteins are designed using rational protein design, it is frequently the case that other protein engineering tools are used as part of the process. For example, when David Baker and co-workers designed de novo enzymes for the Kemp-elimination catalysis, these enzymes were then optimized using directed evolution to optimize the catalysts.^[33] In fact, directed evolution is considered by Baker's group to be a necessary part of enzyme design, in order to identify sequence features missed by the protein design algorithms.^[31]

Software

EGAD: A Genetic Algorithm for protein Design.^[53] A free, open-source software package for protein design and prediction of mutation effects on protein-folding stabilities and binding affinities. EGAD can also consider multiple structures simultaneously for designing specific binding proteins or locking proteins into specific conformational states. In addition to natural protein residues, EGAD can also consider free-moving ligands with or without rotatable bonds. EGAD can be used with single or multiple processors.

Iterative Protein Redesign and Optimization. IPRO redesigns proteins to increase or give specificity to native or novel substrates and cofactors. This is done by repeatedly randomly perturbing the backbones of the proteins around specified design positions, identifying the lowest-energy combination of rotamers, and determining whether the new design has a lower binding energy than previous ones. The iterative nature of this process allows IPRO to make additive mutations to the protein sequence that collectively improve the specificity toward the desired substrates and/or cofactors. Experimental testing of predictions by IPRO successfully switched the cofactor preference of Candida boidinii xylose reductase from NADPH to NADH.^[48]

OSPREY A free, open-source, actively developed protein design program with an emphasis on continuous protein flexibility for the side-chains and backbone, modeling of proteins as thermodynamic ensembles and algorithms with mathematical guarantees on the input. OSPREY has been used in several prospective applications with biomedical relevance, including: enzymes-redesign toward non-cognate substrates,^[36] prospective prediction of resistance mutations against novel drugs,^[47] drug design to treat leukemia,^[54] peptide-drug design to treat cystic fibrosis,^[43] and the design of probes of broadly neutralizing HIV antibodies.^[55]

PROTDES software for protein design based on CHARMM molecular mechanics package.

RosettaDesign. A software package, under active development and free for academic use, that has seen extensive successful use.^[5]^[32]^[33]^[56]^[57]^[58]^[59] RosettaDesign is accessible via a web server.^[60]

SHARPEN. A permissive open-source library for protein design and structure prediction. SHARPEN offers a variety of combinatorial optimization methods (e.g., Monte Carlo, Simulated Annealing, FASTER^[61]) and can score proteins using the successful Rosetta all-atom force field or molecular mechanics force fields (OPLSaa). In addition to the protein modeling library, SHARPEN includes tools for scalable distributed computing.

WHAT IF software for protein modelling, design, validation, and visualisation.

Protein WISDOM. Protein WISDOM ^[62] is a workbench for In silico De novo Design of BioMolecules. It is an optimization-driven approach to design new sequences for improved stability or binding affinity. The sequence selection stage designs novel protein and peptide sequences based upon a rigid or flexible design template uploaded by the user. Once sequences are generated, they can be validated by either fold specificity or approximate binding affinity calculations.

Abalone program for protein modelling, molecular dynamics folding and visualisation.^[63]

PoPMuSiC. Very popular among academics, with more than 400 registered users, PoPMuSiC is a fast and accurate program to design proteins with modified stability.

References

↑ ^1.0 ^1.1 ^1.2 ^1.3 ^1.4 ^1.5 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^2.0 ^2.1 ^2.2 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^3.0 ^3.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^4.0 ^4.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^5.0 ^5.1 ^5.2 ^5.3 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^9.0 ^9.1 ^9.2 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^11.0 ^11.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^12.0 ^12.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^13.0 ^13.1 ^13.2 ^13.3 ^13.4 ^13.5 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^14.0 ^14.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^15.0 ^15.1 ^15.2 ^15.3 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^20.0 ^20.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^23.0 ^23.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^24.0 ^24.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^31.0 ^31.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^32.0 ^32.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^33.0 ^33.1 ^33.2 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^36.0 ^36.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^37.0 ^37.1 ^37.2 ^37.3 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^43.0 ^43.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ ^47.0 ^47.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ ^48.0 ^48.1 Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ http://rosettadesign.med.unc.edu/
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Lua error in package.lua at line 80: module 'strict' not found.
↑ Example of protein folding by molecular dynamics

v t e Biomolecular structure
Protein structure	Primary Secondary Tertiary Quaternary Determination Prediction Design Thermodynamics
Nucleic acid structure	Primary Secondary Tertiary Quaternary Determination Prediction Design Thermodynamics
See also	Protein Protein domain Protein engineering Nucleic acid DNA RNA Nucleic acid double helix