Graph Based Signature
The second set of methods is based on potential energy functions. Environment-specific substitution tables, which describe the propensities of residues to mutate in a certain protein-structural environment during evolutionary time, have been used to derive a statistical potential energy function used by the method SDM (Topham et al., 1997; Worth et al., 2011). In the PoPMuSiC method (Dehouck et al., 2009), the estimated stability change on mutation is expressed as a linear combination of 26 different energy functions, whose parameters were trained using an artificial neural network. Empirical energy functions have also been used in a method that performed Monte Carlo optimization (Bordner and Abagyan, 2004), which has also been used to study the role of

geometric centre. A pairwise distance calculation between the atoms of the environment generates an atom distance matrix, which accounts for a wide spectrum of distances, from short to long range. From this matrix, distance patterns are then extracted and summarized as a feature vector. To account for the atom changes induced by the mutation, we introduce a ‘pharmacophore count’ vector. Each one of the 20 amino acid residues is represented by a different vector, where each position denotes the frequency of a certain pharmacophore in that residue. The difference vector between the wild-type and mutant pharmacophore vectors is then appended to the signature.
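The step from an atom distance matrix to a feature vector can be illustrated with a minimal Python sketch. It assumes that the distance patterns are cumulative counts of atom pairs within increasing cutoffs, split by pair of atom classes, which is one reading of the cutoff scanning described for Algorithm 1. The function name, its arguments and the toy coordinates are hypothetical and are not taken from the mCSM implementation.

```python
import numpy as np

def cumulative_distance_signature(coords, labels, classes, d_min, d_max, d_step):
    """Count atom pairs within each cutoff, split by unordered pair of atom classes.

    Hypothetical sketch: `coords` is an (N, 3) array of atom positions in the
    residue environment and `labels` gives one class name per atom.
    """
    coords = np.asarray(coords, dtype=float)
    classes = sorted(classes)

    # Pairwise Euclidean distance matrix between all atoms of the environment.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Visit each unordered atom pair once (upper triangle, no diagonal).
    i_idx, j_idx = np.triu_indices(len(coords), k=1)
    pair_dist = dist[i_idx, j_idx]
    pair_class = [tuple(sorted((labels[i], labels[j]))) for i, j in zip(i_idx, j_idx)]

    # All unordered class pairs, in a fixed order that defines the vector layout.
    class_pairs = [(a, b) for k, a in enumerate(classes) for b in classes[k:]]

    # Cutoffs d_min, d_min + d_step, ..., up to and including d_max.
    cutoffs = np.arange(d_min, d_max + d_step / 2.0, d_step)

    signature = []
    for cutoff in cutoffs:
        for cp in class_pairs:
            # Cumulative count: every pair of this class pair closer than the cutoff.
            count = sum(1 for d, pc in zip(pair_dist, pair_class) if pc == cp and d <= cutoff)
            signature.append(count)
    return np.array(signature)
```

Because the counts are cumulative, each entry can only grow or stay equal as the cutoff increases, so short-range and long-range contacts both leave a trace in the same vector.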
mCSM
only the residue environment in the wild-type protein. In this way, our method is applicable even when no mutant structures are available. Furthermore, it does not require the generation of homology models.

Pharmacophore changes: To take into account the changes in atom types due to the mutation, a ‘pharmacophore count’ vector is introduced. Wild-type and mutant residues are represented as pharmacophore frequency vectors as shown in the lower part of Figure 1a and calculated as follows. Let $L$ be a set of pharmacophore types and $f$ a labelling function that assigns a pharmacophore $l \in L$ to a given atom $x$: $f(x) = l$. The frequency of each type of pharmacophore in a residue is then summarized in a vector $p$. The difference $p_{change}$ between the pharmacophore counts for the mutant ($p_{mt}$) and wild-type ($p_{wt}$) residues is calculated ($p_{change} = p_{mt} - p_{wt}$) and appended to the signature. The atom pharmacophores are characteristics described by PMapper and belong to eight possible classes: hydrophobic, positive, negative, hydrogen acceptor, hydrogen donor, aromatic, sulphur and neutral.

Experimental conditions: The experimental conditions under which the thermodynamic data were collected, such as pH and temperature, are also appended to the signatures when available. Relative solvent accessibility of the residue is also included.

Figure 1b summarizes the mCSM prediction workflow as follows: preprocessing the thermodynamic and structural data, extracting the residue environments, signature calculation and noise reduction, supervised learning and mutation impact prediction and validation. Algorithm 1 shows the function that calculates the proposed mCSM signature, which requires the following input parameters: a set of
D.E.V. Pires et al.
mutations, wild-type structure and the atomic categories (or pharmacophores) to be considered, a cutoff range (DMIN and DMAX) and a cutoff step (DSTEP) by which the range is discretized. For each mutation, the residue environment is calculated based on a cutoff distance, selecting all interacting atoms in the residue vicinity. The pairwise distances between all pairs of atoms of the residue environment are then calculated and stored in a distance matrix. The distance matrix is scanned over the cutoff range in increments of the cutoff step, generating a cumulative distribution of the distances by atomic category. Finally, the pharmacophoric changes between wild-type and mutant residue are appended to the signature, as well as the experimental conditions (pH and temperature) and the residue relative solvent accessibility. The generated signatures are then used as evidence to train predictive classification and regression models.

of protein–protein and protein–DNA affinity change on mutation. We also describe in the Supplementary Material a data set used to assess the ability of mCSM in predicting disease-related mutations. Table 1 summarizes the conducted experiments, the data sets and validation procedures used.

2.2.1 Protein stability change
To assess the applicability of mCSM signatures in predicting the impact of mutations in protein stability, several data sets derived from the ProTherm (Kumar et al., 2006) database were considered. ProTherm is a collection of experimental thermodynamic parameters for wild-type and mutant proteins, including the change in Gibbs free energy ($\Delta G$). Only single-point mutations were considered. The data sets were used in comparative experiments with other methods, in regression and classification tasks, which consist of
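The last step described for Algorithm 1, appending the pharmacophore changes and the experimental conditions to the distance-based part of the signature, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: each atom carries exactly one label, as in the labelling function $f(x) = l$, the per-atom labels in the usage example are made up and are not PMapper output, and all function names are hypothetical.

```python
from collections import Counter

# The eight pharmacophore classes named in the text, in a fixed order.
PHARMACOPHORE_CLASSES = [
    "hydrophobic", "positive", "negative", "hydrogen acceptor",
    "hydrogen donor", "aromatic", "sulphur", "neutral",
]

def pharmacophore_counts(atom_labels):
    """Return the count vector p: one entry per pharmacophore class."""
    counts = Counter(atom_labels)
    return [counts.get(c, 0) for c in PHARMACOPHORE_CLASSES]

def pharmacophore_change(wt_labels, mt_labels):
    """Return p_change = p_mt - p_wt, element by element."""
    p_wt = pharmacophore_counts(wt_labels)
    p_mt = pharmacophore_counts(mt_labels)
    return [m - w for m, w in zip(p_mt, p_wt)]

def assemble_signature(distance_features, p_change, ph, temperature, rsa):
    """Append pharmacophore changes, pH, temperature and relative solvent
    accessibility to the distance-based features (hypothetical layout)."""
    return list(distance_features) + list(p_change) + [ph, temperature, rsa]

# Usage with made-up atom labels for a wild-type and a mutant residue.
wt = ["positive", "hydrogen donor", "neutral"]
mt = ["aromatic", "hydrophobic", "hydrophobic", "neutral"]
change = pharmacophore_change(wt, mt)
signature = assemble_signature([1, 0], change, ph=7.0, temperature=25.0, rsa=0.3)
```

Keeping the class order fixed in one constant means that every mutation produces a vector with the same layout, which is what a supervised learner needs.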
Table 1. Summary of data sets used, the experiments performed and validation process used

Category                       Data set       Experiment                     Validation                Reference
Protein stability change       S2648          Regression                     5-fold cross-validation   (Dehouck et al., 2009)
Protein stability change       S1925          Regression and classification  20-fold cross-validation  (Masso and Vaisman, 2008)
Protein stability change       S350/S309/S87  Regression                     Train (S2298)             (Worth et al., 2011)
Protein–nucleic acid affinity  ProNIT         Regression and classification  10-fold cross-validation  (Ahmad et al., 2008)
Protein–protein affinity       SKEMPI         Regression and classification  10-fold cross-validation  (Moal and Fernandez-Recio, 2012)
Protein–protein affinity       BeAtMuSiC      Regression                     10-fold cross-validation  (Dehouck et al., 2013)
Disease-related mutations      KIN            Classification                 20-fold cross-validation  (Capriotti and Altman, 2011)
where $R = 8.314\ \mathrm{J\,K^{-1}\,mol^{-1}}$ is the ideal gas constant, $T$ is the temperature (in Kelvin) and $K_D$ is the affinity of the protein–protein complex. The affinity change between wild-type and mutant forms ($\Delta\Delta G$) is calculated as follows:

$$\Delta\Delta G = \Delta G_{wild} - \Delta G_{mutant}$$

SKEMPI: The data set used is derived from the SKEMPI database (Moal and Fernandez-Recio, 2012). SKEMPI is a curated database that compiles changes in thermodynamic and kinetic parameters on mutation for protein–protein complexes for which a structure is available in the Protein Data Bank. From this database, which includes data for >3000 mutations, we retained only single-point mutations with available experimental affinities for both wild-type and mutant forms. This resulted in a data set that comprises 2317 mutations in 150 different proteins

In the first set of experiments, we predict the impact of single-point mutations on protein stability via regression and classification tasks. We then assess the adequacy of the signatures in predicting affinity changes on mutation at protein–protein and protein–DNA interfaces. We also perform experiments that aim to predict disease-related mutations. Finally, as a case study, we apply our methodology to predict stability changes of 42 mutations occurring in the tumour suppressor protein p53.

3.1 Predicting protein stability change on mutation
As summarized in Table 1, three different stability data sets were
Fig. 2. Regression results for the mCSM signature predictive model trained using Gaussian process regression on different tasks. From left to right: stability change prediction (S1925 data set), protein–protein affinity change (SKEMPI data set) and protein–DNA affinity change (ProNIT data set). For each data set, Pearson’s correlation coefficient and the standard error are also shown in the top-left part of each graph.
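For the protein–protein affinity data, the quantity being predicted can be derived from measured affinities. The sketch below assumes the standard relation $\Delta G = RT \ln K_D$, which matches the symbols defined in the text but is not itself stated in this section, and it uses the difference $\Delta G_{wild} - \Delta G_{mutant}$ as the affinity change. The function names and the example $K_D$ values are hypothetical.

```python
import math

R = 8.314                 # ideal gas constant, J K^-1 mol^-1 (value given in the text)
JOULES_PER_KCAL = 4184.0  # conversion factor to the kcal/mol used in the tables

def binding_free_energy(kd, temperature):
    """Assumed relation: Delta G = R * T * ln(K_D), returned in J/mol."""
    return R * temperature * math.log(kd)

def affinity_change(kd_wild, kd_mutant, temperature):
    """Delta Delta G = Delta G_wild - Delta G_mutant, in J/mol."""
    return binding_free_energy(kd_wild, temperature) - binding_free_energy(kd_mutant, temperature)

# Made-up example: the mutation weakens binding tenfold at 298.15 K.
ddg_joules = affinity_change(kd_wild=1e-9, kd_mutant=1e-8, temperature=298.15)
ddg_kcal = ddg_joules / JOULES_PER_KCAL
```

Under this sign convention, a mutation that raises $K_D$ (weaker binding) gives a negative affinity change, and one that lowers $K_D$ gives a positive one.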
Table 2. Comparative regression experiments using the S350 data set

Method         Number of predictions  Pearson’s coefficient (a)  Standard error (kcal/mol) (a)
Automute       315                    0.46/0.45/0.45             1.43/1.46/1.99
Cupsat         346                    0.37/0.35/0.50             1.91/1.96/2.14
Dmutant        350                    0.48/0.47/0.57             1.81/1.87/2.31
Eris           334                    0.35/0.34/0.49             4.12/4.28/3.91
I-Mutant-2.0   346                    0.29/0.27/0.27             1.65/1.69/2.39
PoPMuSiC-1.0   350                    0.62/0.63/0.70             1.24/1.25/1.66
PoPMuSiC-2.0   350                    0.67/0.67/0.71             1.16/1.19/1.67
SDM            350                    0.52/0.53/0.63             1.80/1.81/2.11

3.3 Case study: predicting stability changes for p53 mutants
Cancer is a complex disease that arises from a combination of genetic and epigenetic changes accumulated over many years. Although there is large variability in the genes implicated in tumorigenesis, >50% of human cancers carry loss of function mutations in the transcription factor p53 (Beroud and Soussi, 2003; Olivier et al., 2002). In response to DNA damage, p53 transactivates a range of genes to induce cell cycle arrest, DNA repair, senescence and apoptosis, depending on the extent and types of DNA damage (Sionov et al., 1999; Vousden et al., 2002). p53 is composed of three main domains: an N-terminal trans-
In general there was good agreement between the algorithms in predicting the direction of change when compared with the experimental data. One interesting deviation, however, was the mutation of arginine 282 to tryptophan, a p53 mutation commonly observed clinically. Arginine 282, shown in Figure 1a and Supplementary Figure S2, is involved in a network of interactions underpinning the loop-sheet-helix major groove DNA binding motif. Mutation to tryptophan causes large structural perturbations, leaving p53 largely unfolded, and hence inactive, under physiological conditions (Bullock et al., 2000). mCSM was able to predict accurately the effect of this mutation on the overall stability of p53. Interestingly, both SDM and PoPMuSiC predicted that the R282W mutation would be stabi-

shows that the majority of the variability of the signatures can be explained with atom frequencies for long-range distances. In this way, mCSM perceives residue environment density and depth implicitly, without relying on direct calculations or thresholds. A similar analysis was done for the other data sets (Supplementary Figs. S5 and S6).

The mCSM approach shows that a good description of the effect of mutations on the wild-type structure can be achieved using the pharmacophore count. This contrasts with use of the immediate environment of the residue to define a pharmacophore, which is de-emphasized in mCSM. Although in mCSM the pharmacophore is not described in terms of a defined struc-