Graph Based Signature


Vol. 30 no. 3 2014, pages 335–342


BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btt691

Structural bioinformatics Advance Access publication November 26, 2013

mCSM: predicting the effects of mutations in proteins using graph-based signatures
Douglas E. V. Pires1,*, David B. Ascher1,2 and Tom L. Blundell1,*
1Department of Biochemistry, University of Cambridge, Cambridge CB2 1GA, UK and 2ACRF Rational Drug Discovery Centre and Biota Structural Biology Laboratory, St Vincent's Institute of Medical Research, Fitzroy, VIC, 3065, Australia
Associate Editor: Alfonso Valencia

Downloaded from https://academic.oup.com/bioinformatics/article/30/3/335/228906 by guest on 21 June 2022


ABSTRACT
Motivation: Mutations play fundamental roles in evolution by introducing diversity into genomes. Missense mutations in structural genes may become either selectively advantageous or disadvantageous to the organism by affecting protein stability and/or interfering with interactions between partners. Thus, the ability to predict the impact of mutations on protein stability and interactions is of significant value, particularly in understanding the effects of Mendelian and somatic mutations on the progression of disease. Here, we propose a novel approach to the study of missense mutations, called mCSM, which relies on graph-based signatures. These encode distance patterns between atoms and are used to represent the protein residue environment and to train predictive models. To understand the roles of mutations in disease, we have evaluated their impacts not only on protein stability but also on protein–protein and protein–nucleic acid interactions.
Results: We show that mCSM performs as well as or better than other methods that are used widely. The mCSM signatures were successfully used in different tasks demonstrating that the impact of a mutation can be correlated with the atomic-distance patterns surrounding an amino acid residue. We showed that mCSM can predict stability changes of a wide range of mutations occurring in the tumour suppressor protein p53, demonstrating the applicability of the proposed method in a challenging disease scenario.
Availability and implementation: A web server is available at http://structure.bioc.cam.ac.uk/mcsm.
Contact: dpires@dcc.ufmg.br; tom@cryst.bioc.cam.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
Received on April 18, 2013; revised on November 12, 2013; accepted on November 21, 2013

1 INTRODUCTION

1.1 Background
Mutations play fundamental roles in evolution by introducing diversity into genomes, most often through single nucleotide polymorphisms (SNPs). Non-synonymous single nucleotide substitutions (nsSNPs) are of particular interest, as they can disrupt function by interfering with protein stability and/or interactions with partners. Such mutations can be selectively advantageous in evolution or they may cause a change in stability often leading to malfunction and resulting in disease. Thus, predicting the impacts of mutations in proteins is of major importance to understanding function, not only of molecules and cells but also of the whole organism.

Mutagenesis studies that experimentally determine free energy differences between wild-type and mutant proteins (Fersht, 1987) produce accurate results but are usually costly and time-consuming. However, the advent of databases with experimental thermodynamic parameters for both wild-type and mutant proteins such as ProTherm and ProNIT (protein-nucleic acid) (Kumar et al., 2006) and more recently the SKEMPI (Moal and Fernandez-Recio, 2012), which describes protein–protein complexes, has been helpful to the study of mutations on a larger scale. These provide an experimental basis for novel in silico paradigms, models and algorithms to study more extensively missense mutations and their impacts on protein stability and function.

The several different approaches used to study the impacts of mutations on protein structure and function can be broadly classified into those that seek to understand the effects of mutations from the amino acid sequence of a protein alone, and those that exploit the extensive structural information now available for many proteins. The first group includes well-established and widely used sequence-based methods such as SIFT (Ng and Henikoff, 2003) and PolyPhen (Adzhubei et al., 2010). Here, we focus on the second approach that takes advantage of the protein structural information that has been accumulated on the impact of mutations within the 3D space of a natively folded protein.

Structure-based approaches, which may be categorized as machine learning methods and potential energy functions, typically attempt to predict either the direction of change in protein stability on mutation (as a classification task) or the actual free energy value ($\Delta\Delta G$) as a regression task. Machine learning-based methods have been combined with structure-based computational mutagenesis as a four-body statistical contact potential in Masso and Vaisman (2008). Support vector machines have been used to predict changes in stability from either protein sequence or structure descriptors (Capriotti et al., 2005a, b; Cheng et al., 2005) and more recently to predict disease-related mutations (Capriotti and Altman, 2011). There have also been recent attempts to predict the stability changes on multisite mutations (Tian et al., 2010). Machine learning methods have proven to be powerful predictive tools, even when data on which to train the methods have not been extensively available.

*To whom correspondence should be addressed.

© The Author 2013. Published by Oxford University Press.


This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

The second set of methods is based on potential energy functions. Environment-specific substitution tables, which describe the propensities of residues to mutate in a certain protein-structural environment during evolutionary time, have been used to derive a statistical potential energy function used by the method SDM (Topham et al., 1997; Worth et al., 2011). In the PoPMuSiC method (Dehouck et al., 2009), the estimated stability change on mutation is expressed as a linear combination of 26 different energy functions, whose parameters were trained using an artificial neural network. Empirical energy functions have also been used in a method that performed Monte Carlo optimization (Bordner and Abagyan, 2004), which has also been used to study the role of conformational sampling as a way to assess the impact of single-point mutations in protein structures (Kellogg et al., 2011).

Although there have been attempts to predict the affinities of particular protein–protein complexes (Moal et al., 2011; Yan et al., 2013), there has been much less attention to the challenge of predicting the impact of mutations on affinity in large sets of protein–protein and protein–DNA complexes. A significant exception has been the report of Guerois et al. (2002) on predicting the effects of mutations in a set of 82 protein–protein complexes. Another important study refers to the identification of binding energy hot spots in protein complexes by predicting the impact of mutations to alanine in >200 mutations (Kortemme and Baker, 2002). More recently, a method derived from PoPMuSiC, called BeAtMuSiC (Dehouck et al., 2013), has been developed to predict the impacts of mutations on protein–protein affinities for a set of 81 proteins.

An alternative approach to study mutations is to represent residue environments as graphs where nodes are the atoms and the edges are the physicochemical interactions established among them. For instance, the method Bongo (Cheng et al., 2008) attempts to predict structural effects of nsSNPs by evaluating graph theoretic metrics and identifying key residues using a vertex cover algorithm. From these graphs, distance patterns can also be extracted and summarized in a structural signature, which may then be used as evidence to train predictive models. Da Silveira et al. (2009) first reported the use of inter-residue distance patterns or signatures to define protein contacts, demonstrating that they are conserved across protein folds. The Cutoff Scanning Matrix (CSM) is a protein structural signature (Pires et al., 2011) successfully used in large-scale protein function prediction and structural classification tasks. Pires et al. (2013) extended the inter-residue signature to an atomic level (aCSM) and successfully applied it in large-scale receptor-based protein ligand prediction.

Here, we use the concept of graph-based structural signatures to study and predict the impact of single-point mutations on protein stability and protein–protein and protein–nucleic acid affinity. The approach, called mutation Cutoff Scanning Matrix (henceforth called mCSM), encodes distance patterns between atoms to represent protein residue environments.

1.2 Method outline
The mCSM signatures are calculated in two steps:

• For a given mutation site, we define the wild-type residue environment by the atoms within a distance r from its geometric centre. A pairwise distance calculation between the atoms of the environment generates an atom distance matrix, which accounts for a wide spectrum of distances, from short to long range. From this matrix, distance patterns are then extracted and summarized as a feature vector.
• To account for the atom changes induced by the mutation, we introduce a 'pharmacophore count' vector. Each one of the 20 amino acid residues is represented by a different vector, where each position denotes the frequency of a certain pharmacophore in that residue. The difference vector between the wild-type and mutant pharmacophore vectors is then appended to the signature.

Thus, each mutation is represented as a signature vector that is used to train and test predictive machine learning methods in regression and classification tasks. In Section 2, we describe in detail how the signatures are calculated and the data sets used in this study.

1.3 Summary of results
We show that the mCSM signatures can be used successfully to tackle different tasks related to the prediction of the impacts of mutations in proteins. We have conducted a series of comparative experiments that indicate that mCSM performs as well as or better than several other widely used methods. mCSM is able to predict not only the direction of the change in stability of proteins and affinity of protein–protein and protein–DNA complexes but also the actual numerical experimental value, with correlation coefficients up to 0.824 for a large data set of mutations. We have also applied our methodology to predict changes of stability resulting from mutations occurring in the tumour suppressor protein p53. mCSM outperforms other methods, demonstrating its applicability to understanding mutations that lead to disease.

2 MATERIALS AND METHODS

2.1 mCSM: graph-based signatures
Here, we extend the concept of the inter-atomic distance patterns to a residue environment called mCSM. mCSM signatures can be divided into three major components:

• Graph-based atom distance patterns: The major components of the signatures are distance patterns in the vicinity of the wild-type residue encoded as a cumulative distribution. The wild-type residue environment, which here is defined as the set of atoms within a distance r from its geometric centre, can be modelled as a contact graph, where the atoms are the nodes and the edges are defined by a cutoff distance. In this way, the signatures encode distance patterns by varying the edge-defining cutoff distance, used in computing the number of edges of the resulting atomic contact graph. A cumulative distribution is obtained from an atom distance matrix, which is a pairwise distance calculation between atoms of the environment. Here, as in a previous study (Pires et al., 2013), we use three types of atom classification to segment the cumulative distribution: one class (no distinction between atoms), a binary classification (atoms labelled as polar or hydrophobic) and the PMapper pharmacophoric classification, which classifies atoms into eight possible categories: hydrophobic, positive, negative, hydrogen acceptor, hydrogen donor, aromatic, sulphur and neutral. mCSM considers only the residue environment in the wild-type protein. In this way, our method is applicable even when no mutant structures are available. Furthermore, it does not require the generation of homology models.
• Pharmacophore changes: To take into account the changes in atom types due to the mutation, a 'pharmacophore count' vector is introduced. Wild-type and mutant residues are represented as pharmacophore frequency vectors as shown in the lower part of Figure 1a and calculated as follows. Let $L$ be a set of pharmacophore types and $f$ a labelling function that assigns a pharmacophore $l \in L$ to a given atom $x$: $f(x) = l$. The frequency of each type of pharmacophore in a residue is then summarized in a vector $p$. The difference $p_{change}$ between pharmacophore count for mutant ($p_{mt}$) and wild-type ($p_{wt}$) residue is calculated ($p_{change} = p_{mt} - p_{wt}$) and appended to the signature. The atom pharmacophores are characteristics described by PMapper and belong to eight possible classes: hydrophobic, positive, negative, hydrogen acceptor, hydrogen donor, aromatic, sulphur and neutral.
• Experimental conditions: The experimental conditions under which the thermodynamic data are collected, such as pH and temperature, are also appended to the signatures when available. Relative solvent accessibility of the residue is also included.

Fig. 1. Predicting the impact of mutations with mCSM. (a) Highlights important steps in the methodology and how the main components of the signatures are computed. Here, we use as an example the published crystal structure of p53 (PDB ID: 2OCJ), considering the mutation site R282W, further discussed in Section 3.3. Given a mutation site in a wild-type protein, its structural environment is extracted and the distance patterns among the atoms summarized in the mCSM signature. To take into account the change in atom types due to the mutation, a pharmacophore count is performed for the wild-type and mutant residue. The changes in pharmacophore count are then appended into the signature, which is used to train/test predictive models. Eight pharmacophore types are considered: hydrophobic (green), positive (blue), negative (red), hydrogen acceptor (red), hydrogen donor (blue), aromatic (green), sulphur (yellow) and neutral (white). (b) Summarizes the mCSM predictive workflow that can be divided into the following steps: gathering and preprocessing the thermodynamic and structural data, extracting the residue environments, signature calculation and noise reduction, supervised learning and mutation impact prediction and validation.

Figure 1b summarizes the mCSM prediction workflow as follows: preprocessing the thermodynamic and structural data, extracting the residue environments, signature calculation and noise reduction, supervised learning and mutation impact prediction and validation.

Algorithm 1 shows the function that calculates the proposed mCSM signature, which requires the following input parameters: a set of
mutations, wild-type structure and the atomic categories (or pharmacophore) to be considered, a cutoff range (D_MIN and D_MAX) and a cutoff step (D_STEP) in which each cutoff is discretized. For each mutation the residue environment is calculated based on a cutoff distance, selecting all interacting atoms in the residue vicinity. The pairwise distances between all pairs of atoms of the residue environment are then calculated and stored in a distance matrix. The distance matrix is scanned considering the cutoff range and cutoff step, generating a cumulative distribution of the distances by atomic category. Finally, the pharmacophoric changes between wild-type and mutant residue are appended to the signature as well as the experimental conditions (pH and temperature) and residue relative solvent accessibility. The generated signatures are then used as evidence to train predictive classification and regression models.

Algorithm 1: Mutation Cutoff Scanning Matrix Calculation.
    function mCSM(MutationSet, AtomClass, D_MIN, D_MAX, D_STEP)
        for all mutation i ∈ (MutationSet) do
            residue_environment = extractResidueEnvironment(mutation)
            j = 0
            distMatrix ← calculateAtomicPairwiseDist(residue_environment)
            for dist ← D_MIN to D_MAX step D_STEP do
                for all class ∈ (AtomClass) do
                    mCSM[i][j] ← getFrequency(distMatrix, dist, class)
                    j++
            add_pH_RSA_Temperature(mCSM[i])
            add_pharmacophores_changes(mCSM[i])
        return mCSM

The main goal of the mCSM signatures is to encode and concisely summarize atomic-distance patterns in the vicinity of a residue that can be correlated to the impact of a mutation. Even though interference with short-distance interactions (e.g. the creation or disruption of hydrogen bonds) is the most direct effect of mutations, mCSM signatures also take into account long-range distance patterns, in contrast with the majority of other approaches described in the literature.

It is important to note that mCSM does not define any explicit penalties, for instance when burying hydrophobic or exposing polar residues. The pharmacophore vector is used to reflect the changes in residue character and size due to the mutation, and its impact is learned without the definition of an explicit threshold. On the other hand, the perception of residue accessibility or residue depth is implicitly obtained by mCSM atomic distance patterns.

A detailed description of the evaluation methodology, supervised learning algorithms used in classification and regression tasks as well as the quality metrics used to evaluate the performance of mCSM are available as Supplementary Material.

2.2 Data sets
The data sets used in this work can be divided into four groups by predictive task: prediction of protein stability change on mutation, prediction of protein–protein and protein–DNA affinity change on mutation. We also describe in the Supplementary Material a data set used to assess the ability of mCSM in predicting disease-related mutations. Table 1 summarizes the conducted experiments, the data sets and validation procedures used.

Table 1. Summary of data sets used, the experiments performed and validation process used

Experiment                      Data set        Task                            Validation                 References
Protein stability change        S2648           Regression                      5-fold cross-validation    (Dehouck et al., 2009)
Protein stability change        S1925           Regression and classification   20-fold cross-validation   (Masso and Vaisman, 2008)
Protein stability change        S350/S309/S87   Regression                      Train (S2298)              (Worth et al., 2011)
Protein–nucleic acid affinity   ProNIT          Regression and classification   10-fold cross-validation   (Ahmad et al., 2008)
Protein–protein affinity        SKEMPI          Regression and classification   10-fold cross-validation   (Moal and Fernandez-Recio, 2012)
Protein–protein affinity        BeAtMuSiC       Regression                      10-fold cross-validation   (Dehouck et al., 2013)
Disease-related mutations       KIN             Classification                  20-fold cross-validation   (Capriotti and Altman, 2011)

2.2.1 Protein stability change
To assess the applicability of mCSM signatures in predicting the impact of mutations in protein stability, several data sets derived from the ProTherm (Kumar et al., 2006) database were considered. ProTherm is a collection of experimental thermodynamic parameters for wild-type and mutant proteins, including the change in Gibbs free energy ($\Delta\Delta G$). Only single-point mutations were considered. The data sets were used in comparative experiments with other methods, in regression and classification tasks, which consist of predicting the numerical value and the direction of change in $\Delta\Delta G$, respectively.

S2648: The first data set, S2648, was used in comparative regression tasks where the aim is to predict the change in Gibbs free energy ($\Delta\Delta G$) between wild-type and mutant protein. The data set comprises 2648 single-point mutations in 131 different globular proteins. For experiments with these data, we used 5-fold cross-validation, the same validation procedure used by the authors of the PoPMuSiC (Dehouck et al., 2009) algorithm.

S350: The second data set, S350, comprised 350 mutations in 67 different proteins. It is a randomly selected subset of the S2648 data set, also used in comparative regression experiments. In this case, the remaining 2298 mutations from the S2648 data set were used to train the predictive model, whereas the S350 data set was used as a test set. This data set is widely used in the literature to compare the performance of different methods.

S1925: The data set S1925 was used in both regression and classification experiments. It comprises 1925 mutations in 55 proteins, which are uniformly distributed across the four major SCOP classes (Murzin et al., 1995). A twenty-fold cross-validation protocol was used, the same protocol used by the AUTOMUTE method (Masso and Vaisman, 2008).

p53: Finally, as a case study, we assembled a data set of 42 mutations within the DNA binding domain of the tumour suppressor protein p53, whose thermodynamic effects have previously been experimentally characterized (Ang et al., 2006; Bullock et al., 2000; Joerger et al., 2006; Nikolova et al., 1998, 2000). The full data set description is available as Supplementary Material.

2.2.2 Protein–protein affinity change
The second set of experiments aims to assess the performance of mCSM signatures in predicting the impact of mutations on affinity of protein–protein complexes, in both regression and classification tasks (i.e. prediction of the numerical change or its direction). Affinities of protein–protein complexes were converted from molar (M) to kcal/mol using the formulation of the Gibbs free energy ($\Delta G$):

$$\Delta G = RT \ln(K_D)$$
where $R = 8.314\ \mathrm{J\,K^{-1}\,mol^{-1}}$ is the ideal gas constant, $T$ is the temperature (in Kelvin) and $K_D$ is the affinity of the protein–protein complex. The affinity change between wild-type and mutant forms ($\Delta\Delta G$) is calculated as follows:

$$\Delta\Delta G = \Delta G_{wild} - \Delta G_{mutant}$$

SKEMPI: The data set used is derived from the SKEMPI database (Moal and Fernandez-Recio, 2012). SKEMPI is a curated database that compiles changes in thermodynamic and kinetic parameters on mutation for protein–protein complexes for which a structure is available in the Protein Data Bank. From this database, which includes data for >3000 mutations, we filter only single-point mutations with available experimental affinities for both wild-type and mutant forms. This resulted in a data set that comprises 2317 mutations in 150 different proteins with available PDB structures. The SKEMPI data set used is available via Supplementary Material. Ten-fold cross-validation was used for experiments on this data set.

BeAtMuSiC: The data set comprises 2007 mutations used in a recent study (Dehouck et al., 2013). It corresponds to a subset of the SKEMPI database, comprising mutations in 81 PDB structures. This data set was used in comparative experiments in 10-fold cross-validation as well as in a blind test, as described in Section 4 in Supplementary Material.

2.2.3 Protein–nucleic acid affinity change
The third set of experiments was designed to demonstrate the capability of mCSM to predict changes in affinity in protein–DNA complexes, for both regression and classification tasks. For this purpose, we used a data set that was derived from the ProNIT database (Kumar et al., 2006). ProNIT comprises experimentally determined thermodynamic interaction data between proteins and nucleic acids. In this case, we considered the change in Gibbs free energy ($\Delta\Delta G$). Only single-point mutations were taken into account.

ProNIT: The data set comprises 511 single mutations in 21 different proteins for which structures are available, and which were used in a protein–DNA interaction study that aimed to explore the relationship between free energy, sequence conservation and structural cooperativity (Ahmad et al., 2008). We used 10-fold cross-validation for all experiments carried out on this data set.

3 RESULTS
To assess the ability of the signatures to encode the impact of mutations on protein structures, we designed an extensive series of comparative experiments with other state-of-the-art methods. In the first set of experiments, we predict the impact of single-point mutations on protein stability via regression and classification tasks. We then assess the adequacy of the signatures in predicting affinity changes on mutation at protein–protein and protein–DNA interfaces. We also perform experiments that aim to predict disease-related mutations. Finally, as a case study, we apply our methodology to predict stability changes of 42 mutations occurring in the tumour suppressor protein p53.

3.1 Predicting protein stability change on mutation
As summarized in Table 1, three different stability data sets were used. The left graph of Figure 2 presents the regression results for the S1925 data set. The mCSM signatures were used to train a Gaussian process regression model that achieved a correlation of $\rho = 0.824$ with a standard error of $\sigma = 1.026$ kcal/mol. For this data set, compared with the AUTOMUTE method (Masso and Vaisman, 2008), mCSM presented a performance equivalent or even better for both regression and classification tasks, as shown in Supplementary Tables S3 and S7.

On the largest available data set of mutants, S2648, used by the method PoPMuSiC (Dehouck et al., 2009), mCSM presented a better performance, achieving a correlation coefficient of $\rho = 0.69$ with standard error of $\sigma = 1.06$ kcal/mol, compared with a correlation of $\rho = 0.63$ with $\sigma = 1.15$ kcal/mol reported by the original authors. Even after 10% outlier removal, mCSM maintains its efficacy ($\rho = 0.79$ with $\sigma = 0.78$ kcal/mol for mCSM and $\rho = 0.79$; $\sigma = 0.86$ kcal/mol for PoPMuSiC).

The data set S350 is a subset of S2648 and was used as test set for predictive models trained with the remaining 2298 mutations. Table 2 summarizes the results obtained and shows that mCSM outperforms other approaches, some by a large margin. Because some other methods were not able to predict the stability changes for all 350 mutations, the table also shows the results for the 309 mutations for which all methods were capable of estimating a $\Delta\Delta G$ value. The performance is also shown when only mutations with $\Delta\Delta G$ >2 kcal/mol are considered. For all cases, mCSM was the best performing method.
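The comparisons in this section rest on two numbers per experiment, a Pearson correlation and a standard error, sometimes recomputed after 10% outlier removal. The following Python sketch shows one way such figures could be computed. It is an illustration only: the exact metric definitions are given in the Supplementary Material, so the use of root-mean-square error as the standard error and the rule of dropping the points with the largest absolute residuals are both assumptions, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rmse(predicted, experimental):
    """Root-mean-square error (ASSUMED stand-in for the reported standard error)."""
    n = len(predicted)
    return math.sqrt(sum((p - e) ** 2 for p, e in zip(predicted, experimental)) / n)


def trim_outliers(predicted, experimental, fraction=0.10):
    """Drop the given fraction of points with the largest absolute residuals.

    ASSUMPTION: the outlier rule actually used in the study may differ.
    """
    pairs = sorted(zip(predicted, experimental), key=lambda pe: abs(pe[0] - pe[1]))
    keep = len(pairs) - int(round(fraction * len(pairs)))
    kept = pairs[:keep]
    return [p for p, _ in kept], [e for _, e in kept]
```

Given lists of predicted and experimental free energy changes in kcal/mol, `pearson` and `rmse` would be evaluated once on the full set and once on the output of `trim_outliers` to obtain the two pairs of values.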

Fig. 2. Regression results for mCSM signature predictive model trained using Gaussian processes regression for different tasks. From left to right: stability change prediction (S1925 data set), protein–protein affinity change (SKEMPI data set) and protein–DNA affinity change (ProNIT data set). For each data set the Pearson's correlation coefficient ($\rho$) and standard error ($\sigma$) are also shown in the top-left part of each graph.
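The models behind these plots take mCSM signatures as input. As a reminder of what that input contains, the calculation outlined in Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the atom dictionaries, the function names and the rule that a pair is counted for a class only when both atoms carry that label are assumptions, and real per-atom labels would come from a pharmacophore classifier such as PMapper.

```python
import math
from itertools import combinations

# The eight pharmacophore classes named in Section 2.1.
PHARMACOPHORES = ["hydrophobic", "positive", "negative", "hydrogen acceptor",
                  "hydrogen donor", "aromatic", "sulphur", "neutral"]


def residue_environment(atoms, centre, r):
    """Atoms lying within distance r of the residue's geometric centre."""
    return [a for a in atoms if math.dist(a["xyz"], centre) <= r]


def cumulative_counts(env, labels, d_min, d_max, d_step):
    """Scan the cutoff range; per cutoff and label, count atom pairs within the cutoff.

    ASSUMPTION: a pair counts for a label only if both atoms carry it.
    """
    pairs = [(math.dist(a["xyz"], b["xyz"]), a["label"], b["label"])
             for a, b in combinations(env, 2)]
    n_steps = int(round((d_max - d_min) / d_step))
    signature = []
    for k in range(n_steps + 1):
        cutoff = d_min + k * d_step
        for label in labels:
            signature.append(sum(1 for d, la, lb in pairs
                                 if d <= cutoff and la == label and lb == label))
    return signature


def pharmacophore_change(p_wt, p_mt):
    """p_change = p_mt - p_wt, one entry per pharmacophore class."""
    return [p_mt.get(k, 0) - p_wt.get(k, 0) for k in PHARMACOPHORES]


def mcsm_signature(atoms, centre, r, labels, d_min, d_max, d_step,
                   p_wt, p_mt, conditions):
    """Distance patterns, then experimental conditions, then pharmacophore changes."""
    env = residue_environment(atoms, centre, r)
    return (cumulative_counts(env, labels, d_min, d_max, d_step)
            + list(conditions)
            + pharmacophore_change(p_wt, p_mt))
```

Here `conditions` would hold pH, temperature and relative solvent accessibility when available, and `p_wt` and `p_mt` are dictionaries of pharmacophore counts for the wild-type and mutant residues. The vector grows with the number of cutoffs times the number of atom classes, which is why the choice of classification (one class, binary or eight pharmacophores) directly sets the signature length.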

Table 2. Comparative regression experiments using the S350 data set

Method          Number of predictions   Pearson's coefficient (a)   Standard error (kcal/mol) (a)
Automute        315                     0.46/0.45/0.45              1.43/1.46/1.99
Cupsat          346                     0.37/0.35/0.50              1.91/1.96/2.14
Dmutant         350                     0.48/0.47/0.57              1.81/1.87/2.31
Eris            334                     0.35/0.34/0.49              4.12/4.28/3.91
I-Mutant-2.0    346                     0.29/0.27/0.27              1.65/1.69/2.39
PoPMuSiC-1.0    350                     0.62/0.63/0.70              1.24/1.25/1.66
PoPMuSiC-2.0    350                     0.67/0.67/0.71              1.16/1.19/1.67
SDM             350                     0.52/0.53/0.63              1.80/1.81/2.11
mCSM            350                     0.73/0.74/0.82              1.08/1.10/1.48

Note: Results directly obtained from Worth et al. (2011). Bold values highlight the best performing metrics.
(a) The three values given per column correspond, respectively, to the whole validation set of 350 mutants, the 309 mutants for which a prediction was available for all predictors, and the 87 mutants, a subset of the 309 mutants, for which the experimental $\Delta\Delta G$ is >2 kcal/mol.

3.2 Predicting affinity change in protein–protein and protein–DNA complexes on mutation
The central and right graphs of Figure 2 present the regression results for the SKEMPI protein–protein data set and for the ProNIT protein–DNA data set. For the SKEMPI data set, mCSM was able to achieve a correlation of $\rho = 0.801$ with $\sigma = 1.251$ kcal/mol, whereas for the ProNIT data set the results were $\rho = 0.673$ with $\sigma = 1.042$ kcal/mol. In classification tasks, for both data sets the predictive models trained with the mCSM signatures were able to achieve accuracies of >82% and Area Under ROC Curves (AUCs) of 0.826 and 0.853, respectively, as described in Supplementary Table S1.

Supplementary Table S4 shows the mCSM performance in comparison with the program BeAtMuSiC. mCSM achieves a correlation of $\rho = 0.58$ with $\sigma = 1.55$ kcal/mol, in comparison with $\rho = 0.40$ with $\sigma = 1.80$ kcal/mol achieved by the BeAtMuSiC method. mCSM also achieves a correlation of $\rho = 0.56$ with $\sigma = 1.38$ kcal/mol in a blind test as described in Supplementary Section S4.

We also evaluate our approach by using it to identify disease-related mutations, comparing it with well-established sequence-based methods. Supplementary Table S2 summarizes the obtained results. mCSM achieves a comparable level of accuracy, while presenting much better Matthews Correlation Coefficient (MCC) and AUC values.

In addition, we performed experiments on low-redundancy data sets where all mutations in a protein (or position) are either in the test or training set exclusively, as described in Supplementary Section S4. As shown in Supplementary Tables S8 and S9, at the protein level the performance tended to be slightly inferior. The distribution of mutations per protein in the data sets is unequal, meaning that information about hundreds of mutations may be available for a single protein, which

3.3 Case study: predicting stability changes for p53 mutants
Cancer is a complex disease that arises from a combination of genetic and epigenetic changes accumulated over many years. Although there is large variability in the genes implicated in tumorigenesis, >50% of human cancers carry loss of function mutations in the transcription factor p53 (Beroud and Soussi, 2003; Olivier et al., 2002). In response to DNA damage, p53 transactivates a range of genes to induce cell cycle arrest, DNA repair, senescence and apoptosis, depending on the extent and types of DNA damage (Sionov et al., 1999; Vousden et al., 2002). p53 is composed of three main domains: an N-terminal transactivation domain (amino acid residues 1–45), a DNA binding domain (residues 102–292) and a C-terminal oligomerization domain (residues 319–359) (Sionov et al., 1999; Vousden et al., 2002). Unlike most tumour suppressors that are inactivated by deletion or truncation mutations, mutations in p53 most often result in a protein with a single nucleotide substitution. The majority of these mutations (95%) are located in the DNA binding domain; they either directly interfere with residues involved in DNA binding or disrupt the wild-type conformation and stability of p53 (Olivier et al., 2002). Both of these classes of mutations prevent the transcriptional activation of p53 target genes in a dominant-negative fashion (Vousden et al., 2002).

A new approach in cancer therapy is to find drugs that can rescue the activity of mutant p53. This has primarily focused on trying to enhance the stability of p53, potentially reducing the effect of destabilizing mutations and restoring wild-type activity. Several approaches have identified stabilizing molecules that also show a stimulatory effect on p53 DNA binding. These have included antibodies (Hupp et al., 1995), peptides (Selivanova et al., 1999) and small molecules identified via structure-guided design (Boeckler et al., 2008) and screening approaches (Bykov et al., 2002). The most advanced of these is the small molecule PRIMA-1MET (APR-246) that has successfully completed Phase I/II clinical trials, where it was observed that it could induce p53-dependent biological effects in tumour cells in vivo (Lehmann et al., 2012).

We have used mCSM to predict the effect of mutations on the stability of p53. We used the published crystal structure of p53 (PDB ID: 2OCJ), and predicted the change in stability of 42 single mutations within the DNA binding domain of p53 whose thermodynamic effects have previously been experimentally characterized. None of these mutations was present in the training set. In addition to mCSM, we also used SDM and PoPMuSiC to predict the stability changes of these mutations. These predictions were compared directly with the experimentally determined thermodynamic effects (Supplementary Table S6).

mCSM predicted stability changes correlated strongly with the experimentally observed thermodynamic effects ($\rho = 0.68$), as shown in Supplementary Table S5. In addition, mCSM was a much better predictor of stability changes in p53 than either SDM ($\rho = 0.29$) or PoPMuSiC ($\rho = 0.56$), consistent with our larger analysis. In Supplementary Figure S1, we can see that compared with the experimental observations, mCSM predictions did not have a large variation. Significantly, the interquar-
may be a significant source of bias when defining the folds in tile range and 95% confidence interval from mCSM predictions
cross-validation. were tighter than either SDM or PoPMuSiC.
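The comparisons above rest on three summary quantities: a correlation coefficient between predicted and experimental values, a standard error, and the spread of the predictions. The following is a minimal sketch of how such quantities could be computed, not the evaluation code used in this work. It assumes that ρ denotes Pearson's correlation and that σ can be approximated by the root-mean-square error of the predictions; the function names and the toy values are hypothetical and are not taken from any data set discussed here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, observed):
    """Root-mean-square error, used here as a stand-in for sigma (assumption)."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


def iqr(values):
    """Interquartile range using linear interpolation between order statistics."""
    ordered = sorted(values)

    def quantile(q):
        pos = q * (len(ordered) - 1)
        low = math.floor(pos)
        high = math.ceil(pos)
        return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)

    return quantile(0.75) - quantile(0.25)


# Illustrative values only (kcal/mol); they do not come from the paper.
experimental = [-1.2, -0.4, -2.9, 0.3, -1.8]
predicted = [-1.0, -0.7, -2.1, -0.1, -1.5]

print("rho   =", round(pearson(predicted, experimental), 3))
print("sigma =", round(rmse(predicted, experimental), 3))
print("IQR   =", round(iqr(predicted), 3))
```

Reporting the three numbers together is a deliberate choice: a method may correlate well with experiment while still having a large error or an unrealistically narrow or wide spread of predicted values.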


In general there was good agreement between the algorithms in predicting the direction of change when compared with the experimental data. One interesting deviation, however, was the mutation of arginine 282 to tryptophan, a clinically common p53 mutation. Arginine 282, shown in Figure 1a and Supplementary Figure S2, is involved in a network of interactions underpinning the loop-sheet-helix major groove DNA binding motif. Mutation to tryptophan results in large structural perturbations, resulting in p53 being largely unfolded, and hence inactive, under physiological conditions (Bullock et al., 2000). mCSM was able to predict accurately the effect of this mutation on the overall stability of p53. Interestingly, both SDM and PoPMuSiC predicted that the R282W mutation would be stabilizing, with SDM predicting that it would actually be stabilizing (Supplementary Table S1). In the case of SDM, the version used in the comparison considers only side chain H-bonds to main chain residues (Worth et al., 2011), which tend to be less critical; however, in this case the arginine side chain makes three hydrogen bonds to other buried or partially buried side chains. This highlights the power of mCSM using the local environment to predict the effects of mutations.

In summary, we have shown that mCSM can predict the effects of mutations on the stability of p53, and can identify disease-associated destabilizing mutations. mCSM provides a reliable way to quickly assess the impact of mutations within the p53 gene, and hence the likelihood of success of stabilizing molecules, which is important within a clinical setting.

4 DISCUSSION AND CONCLUSIONS

We present a new approach, mCSM, for studying the impact of missense mutations in proteins. mCSM, which relies on graph-based signatures, was successfully applied and evaluated in different predictive tasks and was shown to outperform earlier methods. We have successfully applied this methodology to predict stability changes of mutations occurring in p53, demonstrating the applicability of mCSM in a challenging disease scenario.

The results achieved by mCSM support the idea that the impact of a mutation can be correlated with the atomic distance patterns surrounding an amino acid residue. The distance patterns describe the nature of the environment of the residue in the wild-type protein. They contrast with amino acid substitution patterns as used in SDM, which define the environment as a function of the residues immediately surrounding the residue. Thus, a solvent inaccessible residue in SDM will have the same defined environment whether it is in the core of the protein or close to the surface. On the other hand, the distance matrix of mCSM will differ according to the depth of the residue and the curvature of the protein surface that lies within a large radius (in this work, up to 10 Å). This is reminiscent of the focus on ‘depth’ championed by Chakravarty and Varadarajan (1999). The mCSM distance matrix will also be sensitive to the nature of the electrostatic environment, which appears to be less well described in PoPMuSiC. These advantages are evidenced in the feature selection analysis (Supplementary Fig. S3), which shows that long-range distances are the most discriminative attributes of the signatures, usually polar–polar and hydrophobic–hydrophobic atom frequencies for distances beyond 6 Å for the SKEMPI data set. A latent semantic analysis (Supplementary Fig. S4) shows that the majority of the variability of the signatures can be explained with atom frequencies for long-range distances. In this way, mCSM perceives residue environment density and depth implicitly, without relying on direct calculations or thresholds. A similar analysis was done for the other data sets (Supplementary Figs. S5 and S6).

The mCSM approach shows that a good description of the effect of mutations on the wild-type structure can be achieved using the pharmacophore count. This contrasts with use of the immediate environment of the residue to define a pharmacophore, which is de-emphasized in mCSM. Although in mCSM the pharmacophore is not described in terms of a defined structure of the mutant protein, there may still be further gains in the quality of the prediction if this could be incorporated into mCSM. A further difference from many previous methods is the omission of an assessment of the effect of the mutation on the unfolded state. This is explicitly described in perturbation methods and in programs like SDM. It appears that either this is less important to the estimation of the change of stability that arises from the majority of mutations, or the unfolded state is insufficiently well described in methods such as SDM.

mCSM does not depend on the observation of mutations that have occurred in the evolution of proteins. These have generally been selected over long evolutionary times for minor positive or negative changes on stability that were selectively advantageous to the organism. Mutations that occur in cancer, such as those in p53 described here, are often deleterious to the function of the protein and may not be sampled in a statistically significant manner in evolution when information on residue structural environment is required in the calculation. Therefore, they may not be properly accounted for in SDM environment-specific substitution tables. This is almost certainly the case for the mutation of arginine 282 to tryptophan in p53, which has significant effects on the stability of the protein. This underlines the importance of understanding theoretically the effects of various parameters that influence stability, a major focus of both mCSM and PoPMuSiC.

In future work we intend to apply our methodology for predicting affinity changes to protein-ligand complexes, to understand the effects of mutations that occur in drug resistance in cancer and other diseases. We believe our machine learning approach is complementary to those based on potential energy functions like SDM and PoPMuSiC. We therefore intend to combine them into a hybrid method.

ACKNOWLEDGEMENT

The authors thank Harry Jubb and Bernardo Ochoa for helpful discussions and feedback.

Funding: Brazilian agency Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Brazil (to D.E.V.P.). Victoria Fellowship from the Victorian Government and the Leslie (Les) J. Fleming Churchill Fellowship from The Winston Churchill Memorial Trust (to D.B.A.). University of Cambridge and The Wellcome Trust for facilities and support (to T.L.B.).

Conflict of Interest: none declared.


REFERENCES

Adzhubei,I.A. et al. (2010) A method and server for predicting damaging missense mutations. Nat. Methods, 7, 248–249.
Ahmad,S. et al. (2008) Protein–DNA interactions: structural, thermodynamic and clustering patterns of conserved residues in DNA-binding proteins. Nucleic Acids Res., 36, 5922–5932.
Ang,H.C. et al. (2006) Effects of common cancer mutations on stability and DNA binding of full-length p53 compared with isolated core domains. J. Biol. Chem., 281, 21934–21941.
Beroud,C. and Soussi,T. (2003) The UMD-p53 database: new mutations and analysis tools. Hum. Mutat., 21, 176–181.
Boeckler,F.M. et al. (2008) Targeted rescue of a destabilized mutant of p53 by an in silico screened drug. Proc. Natl Acad. Sci. USA, 105, 10360–10365.
Bordner,A. and Abagyan,R. (2004) Large-scale prediction of protein geometry and stability changes for arbitrary single point mutations. Proteins, 57, 400–413.
Bullock,A.N. et al. (2000) Quantitative analysis of residual folding and DNA binding in mutant p53 core domain: definition of mutant states for rescue in cancer therapy. Oncogene, 19, 1245–1256.
Bykov,V.J. et al. (2002) Restoration of the tumor suppressor function to mutant p53 by a low-molecular-weight compound. Nat. Med., 8, 282–288.
Capriotti,E. and Altman,R.B. (2011) Improving the prediction of disease-related variants using protein three-dimensional structure. BMC Bioinformatics, 12 (Suppl. 4), S3.
Capriotti,E. et al. (2005a) I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic Acids Res., 33 (Suppl. 2), W306–W310.
Capriotti,E. et al. (2005b) Predicting protein stability changes from sequences using support vector machines. Bioinformatics, 21 (Suppl. 2), ii54–ii58.
Chakravarty,S. and Varadarajan,R. (1999) Residue depth: a novel parameter for the analysis of protein structure and stability. Structure, 7, 723–732.
Cheng,J. et al. (2005) Prediction of protein stability changes for single-site mutations using support vector machines. Proteins, 62, 1125–1132.
Cheng,T.M. et al. (2008) Prediction by graph theoretic measures of structural effects in proteins arising from non-synonymous single nucleotide polymorphisms. PLoS Comput. Biol., 4, e1000135.
da Silveira,C.H. et al. (2009) Protein cutoff scanning: a comparative analysis of cutoff dependent and cutoff free methods for prospecting contacts in proteins. Proteins, 74, 727–743.
Dehouck,Y. et al. (2009) Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics, 25, 2537–2543.
Dehouck,Y. et al. (2013) BeAtMuSiC: prediction of changes in protein-protein binding affinity on mutations. Nucleic Acids Res., 41, W333–W339.
Fersht,A.R. (1987) Dissection of the structure and activity of the tyrosyl-tRNA synthetase by site-directed mutagenesis. Biochemistry, 26, 8031–8037.
Guerois,R. et al. (2002) Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. J. Mol. Biol., 320, 369–387.
Hupp,T.R. et al. (1995) Small peptides activate the latent sequence-specific DNA binding function of p53. Cell, 83, 237–245.
Joerger,A.C. et al. (2006) Structural basis for understanding oncogenic p53 mutations and designing rescue drugs. Proc. Natl Acad. Sci. USA, 103, 15056–15061.
Kellogg,E.H. et al. (2011) Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins, 79, 830–838.
Kortemme,T. and Baker,D. (2002) A simple physical model for binding energy hot spots in protein–protein complexes. Proc. Natl Acad. Sci. USA, 99, 14116–14121.
Kumar,M.S. et al. (2006) ProTherm and ProNIT: thermodynamic databases for proteins and protein–nucleic acid interactions. Nucleic Acids Res., 34 (Suppl. 1), D204–D206.
Lehmann,S. et al. (2012) Targeting p53 in vivo: a first-in-human study with p53-targeting compound APR-246 in refractory hematologic malignancies and prostate cancer. J. Clin. Oncol., 30, 3633–3639.
Masso,M. and Vaisman,I.I. (2008) Accurate prediction of stability changes in protein mutants by combining machine learning with structure based computational mutagenesis. Bioinformatics, 24, 2002–2009.
Moal,I.H. and Fernandez-Recio,J. (2012) SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics, 28, 2600–2607.
Moal,I.H. et al. (2011) Protein–protein binding affinity prediction on a diverse set of structures. Bioinformatics, 27, 3002–3009.
Murzin,A.G. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Ng,P.C. and Henikoff,S. (2003) SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res., 31, 3812–3814.
Nikolova,P.V. et al. (1998) Semirational design of active tumor suppressor p53 DNA binding domain with enhanced stability. Proc. Natl Acad. Sci. USA, 95, 14675–14680.
Nikolova,P.V. et al. (2000) Mechanism of rescue of common p53 cancer mutations by second-site suppressor mutations. EMBO J., 19, 370–378.
Olivier,M. et al. (2002) The IARC TP53 database: new online mutation analysis and recommendations to users. Hum. Mutat., 19, 607–614.
Pires,D.E.V. et al. (2011) Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns. BMC Genomics, 12 (Suppl. 4), S12.
Pires,D.E.V. et al. (2013) aCSM: noise-free graph-based signatures to large-scale receptor-based ligand prediction. Bioinformatics, 29, 855–861.
Selivanova,G. et al. (1999) Reactivation of mutant p53 through interaction of a C-terminal peptide with the core domain. Mol. Cell. Biol., 19, 3395–3402.
Sionov,R.V. and Haupt,Y. (1999) The cellular response to p53: the decision between life and death. Oncogene, 18, 6145.
Tian,J. et al. (2010) Predicting changes in protein thermostability brought about by single- or multi-site mutations. BMC Bioinformatics, 11, 370.
Topham,C.M. et al. (1997) Prediction of the stability of protein mutants based on structural environment-dependent amino acid substitution and propensity tables. Protein Eng., 10, 7–21.
Vousden,K.H. and Lu,X. (2002) Live or let die: the cell’s response to p53. Nat. Rev. Cancer, 2, 594–604.
Worth,C.L. et al. (2011) SDM – a server for predicting effects of mutations on protein stability and malfunction. Nucleic Acids Res., 39 (Suppl. 2), W215–W222.
Yan,Z. et al. (2013) Specificity and affinity quantification of protein-protein interactions. Bioinformatics, 29, 1127–1133.
