Database
Received September 23, 2016; Revised October 21, 2016; Editorial Decision October 24, 2016; Accepted October 30, 2016
* To whom correspondence should be addressed. Tel: +44 1223 494333; Fax: +44 1223 494468; Email: arl@ebi.ac.uk
Present addresses:
Louisa J. Bellis, Department of Genetics, University of Cambridge, Downing Street, Cambridge CB2 3EH, UK.
John P. Overington, Mark Davies and Anneli Karlsson, BenevolentAI, 40 Churchway, London NW1 1LW, UK.
Nathan Dedman, Local Measure, 87 Leonard St, London EC2A 4QS, UK.
George Papadatos, GlaxoSmithKline, Medicines Research Centre, Gunnels Wood Road, Stevenage, Herts SG1 2NY, UK.
© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
D946 Nucleic Acids Research, 2017, Vol. 45, Database issue

ABSTRACT
ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 and 2014 Nucleic Acids Research Database Issues. Since then, alongside the continued extraction of data from the medicinal chemistry literature, new sources of bioactivity data have also been added to the database. These include: deposited data sets from neglected disease screening; crop protection data; drug metabolism and disposition data and bioactivity data from patents. A number of improvements and new features have also been incorporated. These include the annotation of assays and targets using ontologies, the inclusion of targets and indications for clinical candidates, addition of metabolic pathways for drugs and calculation of structural alerts. The ChEMBL data can be accessed via a web-interface, RDF distribution, data downloads and RESTful web-services.

INTRODUCTION
Since its inception a major component of ChEMBL’s content has been bioactivity data regularly extracted from the medicinal chemistry literature (1,2). Among many other applications such data enables researchers to identify tool compounds for potential therapeutic targets, to probe the available SAR data for a target, investigate phenotypic data associated with similar compounds and to identify potential off-target effects of specific chemotypes. In order to provide a more complete perspective, based in part on user feedback, the scope and data types within ChEMBL have gradually expanded, with some major new areas included in recent releases: compounds in clinical development, data from patents, direct depositions for neglected diseases and agrochemical data.

Drug discovery remains a costly process with a high failure rate (3–6). To provide a more complete picture across the drug discovery and development process, and to help researchers better understand what makes a successful medicine, we have extended the ChEMBL data model to include, for the first time, data typically generated in the pre-clinical and clinical phases of drug discovery, specifically drug metabolism and disposition data. Another common approach to understanding pharmaceutical attrition is to learn from successful drugs and failed drug candidates (7–9). We have therefore extended our set of drug-target annotations to include those for clinical candidates and have also mapped these chemical entities to their therapeutic indications.

At the end of 2013, EMBL-EBI took over the operation, development and support of the SureChem patent system (now called SureChEMBL (10)) from Digital Science Ltd. Access to this resource has highlighted the potential value to scientists of bioactivity data not yet published in the scientific literature. However, the current SureChEMBL system only extracts compound structures from the patents and not associated bioactivity data. As a first step to address this opportunity we have worked with BindingDB to incorporate the BindingDB patent data into ChEMBL (11).

Neglected disease research continues to be a field of drug discovery conducted largely (though not exclusively) by not-for-profit organisations that aim to expedite research by encouraging sharing of experimental data with the community (12–15). Depositions of this type of data into ChEMBL have continued to increase since the original malaria depositions in 2010.

Whilst the pharmaceutical and drug discovery community continues to be the major user and consumer of ChEMBL data, other life sciences communities also work with similar types of data. The agrochemical industry is one such community where specific efforts have been made to widen the coverage of data relevant to the discovery and development of herbicides, pesticides and fungicides (16).

In the next sections, we describe the new data types now integrated into the ChEMBL database, the annotations we have undertaken to enable structured organisation and ac-

Beyond the area of neglected diseases, the University of Vienna and Roche have deposited supplementary data associated with publications already in ChEMBL (17,18). This is complementary to the similar sets already deposited by GlaxoSmithKline and we encourage similar depositions from other authors. AstraZeneca have taken a different approach to direct data deposition. They identified compounds already in ChEMBL and then provided data on these compounds from a variety of in vitro ADME and physicochemical screens including protein binding, microsome and hepatocyte clearance, solubility, pKa and lipophilicity. It is important to note that for all such deposited data sets, ChEMBL provides a
ready carefully manually curated in BindingDB, this information is retained in ChEMBL and simply mapped to the equivalent ChEMBL target. The data is taken from 1,015 granted US patents published between 2013 and 2015 and currently comprises 99 061 bioactivities on 68 149 distinct compounds binding to around 600 distinct targets. Of particular interest to drug discovery scientists is the fact that data often appears in patents earlier than in the traditional medicinal chemistry literature. This patent set contains data on 50 targets for which there was previously no data in ChEMBL and which may therefore represent novel targets of therapeutic interest.

NEW FUNCTIONALITY

Richer assay and target annotation
Typical entry points to ChEMBL have predominantly been compound-based or target-based searches. However, more than half of the activity data points in ChEMBL come from functional or phenotypic assays that cannot be assigned a molecular target. Since phenotypic screening is once more becoming commonplace in drug discovery (19), making this wealth of data more accessible is a priority. To this end, we have applied a number of ontologies to the ChEMBL assay and activity data, allowing them to be searched and filtered by cell-line, tissue or assay format, for example.

The BioAssay Ontology (BAO) (20,21) was chosen as a means of annotating ChEMBL assays for a number of reasons: this ontology has been developed specifically for small molecule screening data and so provides good coverage of the ChEMBL data; it has also been adopted by a number of other bioassay data providers and members of the drug discovery community, allowing for good data interoperability. ChEMBL standard activity types were manually mapped to corresponding BAO result terms (stored in the ACTIVITIES table as BAO_ENDPOINT). The resulting mappings cover 91% of the activity data points in ChEMBL. The remainder are mainly diverse phenotypic endpoints that are not covered by BAO (e.g. ‘Tissue Severity Score’, ‘Anticonvulsant activity’, ‘Paw swelling’, ‘Relative uterus weight’) or imprecise terms (e.g. ‘Ratio’, ‘Selectivity’, ‘Response’) that require further resolution. Similarly, ChEMBL standard activity units were mapped to Units Ontology (UO) terms (22) (which are also a component of BAO) and stored in the ACTIVITIES table as UO_UNITS. Where available, units were also mapped to terms from the Quantities, Units, Dimensions and Types ontology (http://www.qudt.org). These are stored in the ACTIVITIES table as QUDT_UNITS. The current mapping to Units Ontology covers 87% of ChEMBL activity data points (the remainder largely being complex units e.g. ‘ng.h.ml-1’, ‘ml.min-1.g-1’ that are not covered by UO).

ChEMBL assays have also been annotated with BAO assay format terms, allowing users to distinguish biochemical, cell-based, tissue-based or organism-based assays. An automated, rule-based approach was used to classify the assay format for historical assays, based on information in assay descriptions and target assignments. In order to minimise false assignments, any assays where the format could not be determined unambiguously
Finally, we have also added information regarding previously approved drugs that have been withdrawn for toxicity or efficacy reasons. Information regarding withdrawn drugs was collated from several sources: the FDA (http://www.fda.gov) and EMA (http://www.ema.europa.eu/ema/), the WITHDRAWN database (35), the US Electronic Code of Federal Regulations (http://www.ecfr.gov/cgi-bin/retrieveECFR?gp=2&SID=915cc9ab8176f1d1a2a355acf064ffe3&h=L&mc=true&n=sp21.4.216.b&r=SUBPART&ty=HTML#se21.4.216_124), Federal Register (https://www.gpo.gov/fdsys/pkg/FR-2014-07-02/pdf/2014-15371.pdf) and several review articles (36–38). Where available, the year of withdrawal,

in a data model that maintains the relationships between the various molecular entities (e.g. metabolite A may be formed directly from drug D, whereas metabolite B may result from the degradation of metabolite A). Where metabolising enzymes, species and tissues are available in the original publication this information is recorded in ChEMBL. In instances where the metabolite structure is known it is recorded as a chemical structure. If the exact structure is unknown the reaction is still recorded but with an undefined structure. The metabolism data is recorded in two new tables METABOLISM and METABOLISM_REFS. The metabolite pathway is shown on the ChEMBL interface as an interactive image with links to the data on

Figure 2. Compound Report Card for Troglitazone showing mechanism of action, indication and withdrawal information (https://www.ebi.ac.uk/chembl/compound/inspect/CHEMBL408).
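
The precursor/product relationships described for the metabolism data (drug D giving metabolite A, metabolite A giving metabolite B) amount to a directed graph of reactions. The sketch below is one hypothetical way to hold such records in memory and walk a pathway. The `Reaction` class, its field names and the enzyme label are invented for illustration and are not the layout of the METABOLISM table.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Reaction:
    """One metabolic step (hypothetical record, not the ChEMBL schema)."""
    substrate: str
    metabolite: str
    enzyme: Optional[str] = None  # None when the publication does not report it


def downstream(reactions: List[Reaction], start: str) -> List[str]:
    """Breadth-first walk: every entity reachable from `start`, in discovery order."""
    children = defaultdict(list)
    for r in reactions:
        children[r.substrate].append(r.metabolite)

    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen:  # guards against cycles and duplicates
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


# The example from the text: A is formed from D, B is formed from A.
reactions = [
    Reaction("drug D", "metabolite A", enzyme="hypothetical enzyme X"),
    Reaction("metabolite A", "metabolite B"),
]

if __name__ == "__main__":
    print(downstream(reactions, "drug D"))
```

Keeping each step as a separate substrate/metabolite pair, rather than a flat list of metabolites per drug, is what preserves the distinction between direct and secondary metabolites.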
substructure and similarity searching) and targets as well as keyword searching across assay, cell line and tissue information. Users can also retrieve and filter bioactivity information and browse drug and clinical candidate information (including targets and indications). More details of the user interface and its functionality can be found in previous publications (1,2).

Downloads and web-services
While the ChEMBL interface provides the functionality necessary for many simple use-cases, some users may prefer to download the database and query it locally (e.g. for

amount of data that can be retrieved for a particular target/compound, or to provide relevant pharmacology and drug-target data in the context of other data types (e.g. pathway, expression or disease information). ChEMBL data is incorporated into a wide range of other resources including PubChem BioAssay (52), BindingDB (11), CanSAR (53), Open PHACTS (54), Open Targets (55) and the Target Central Resource Database/PHAROS (http://juniper.health.unm.edu/tcrd/, 56), so can also be accessed via these routes. However, since these other resources are different in scope, they do not all incorporate ChEMBL in full (e.g. BindingDB focuses only on binding measurements, while Open Targets incorporates data on drug–target and
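
For programmatic access through the RESTful web-services, a client mainly needs to assemble resource URLs. The sketch below builds such URLs without making any network call. The base path, the resource names (`molecule`, `activity`) and the `molecule_chembl_id` filter are assumptions about the service layout and are not taken from the text, so they should be checked against the current web-services documentation. Only the identifier CHEMBL408 (Troglitazone) comes from the article.

```python
from urllib.parse import urlencode

# Assumed base path for the web-services; verify against the official docs.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def build_url(resource, chembl_id=None, fmt="json", **filters):
    """Assemble a request URL for one record or a filtered list of records."""
    url = f"{BASE_URL}/{resource}"
    if chembl_id is not None:
        url += f"/{chembl_id}"
    url += f".{fmt}"
    if filters:
        # Sort the filters so the same arguments always give the same URL.
        url += "?" + urlencode(sorted(filters.items()))
    return url


if __name__ == "__main__":
    # Single compound record for Troglitazone (identifier from Figure 2).
    print(build_url("molecule", "CHEMBL408"))
    # Hypothetical filtered query: first five activities for that compound.
    print(build_url("activity", molecule_chembl_id="CHEMBL408", limit=5))
```

The resulting string can be passed to any HTTP client. Separating URL construction from the request itself keeps the example testable offline.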
6. Kola,I. and Landis,J. (2004) Can the pharmaceutical industry reduce attrition rates? Nat. Rev. Drug Discov., 3, 711–715.
7. Cook,D., Brown,D., Alexander,R., March,R., Morgan,P., Satterthwaite,G. and Pangalos,M.N. (2014) Lessons learned from the fate of AstraZeneca’s drug pipeline: a five-dimensional framework. Nat. Rev. Drug Discov., 13, 419–431.
8. Waring,M.J., Arrowsmith,J., Leach,A.R., Leeson,P.D., Mandrell,S., Owen,R.M., Pairaudeau,G., Pennie,W.D., Pickett,S.D., Wang,J. et al. (2015) An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat. Rev. Drug Discov., 14, 475–486.
9. Morgan,P., Van Der Graaf,P.H., Arrowsmith,J., Feltner,D.E., Drummond,K.S., Wegner,C.D. and Street,S.D. (2012) Can the flow of medicines be improved? Fundamental pharmacokinetic and pharmacological principles toward improving Phase II survival. Drug Discov. Today, 17, 419–424.
data from the library of integrated network-based cellular signatures (LINCS). J. Biomol. Screen., 19, 803–816.
26. Mungall,C.J., Torniai,C., Gkoutos,G.V., Lewis,S.E. and Haendel,M.A. (2012) Uberon, an integrative multi-species anatomy ontology. Genome Biol., 13, R5.
27. Gremse,M., Chang,A., Schomburg,I., Grote,A., Scheer,M., Ebeling,C. and Schomburg,D. (2011) The BRENDA Tissue Ontology (BTO): the first all-integrating ontology of all organisms for enzyme sources. Nucleic Acids Res., 39, D507–D513.
28. Southan,C., Sharman,J.L., Benson,H.E., Faccenda,E., Pawson,A.J., Alexander,S.P., Buneman,O.P., Davenport,A.P., McGrath,J.C., Peters,J.A. et al. (2016) The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands. Nucleic Acids Res., 44, D1054–D1068.
screening libraries and for their exclusion in bioassays. J. Med. Chem., 53, 2719–2740.
47. Ochoa,R., Davies,M., Papadatos,G., Atkinson,F. and Overington,J.P. (2014) myChEMBL: a virtual machine implementation of open data and cheminformatics tools. Bioinformatics, 30, 298–300.
48. Davies,M., Nowotka,M., Papadatos,G., Atkinson,F., van Westen,G.J., Dedman,N., Ochoa,R. and Overington,J.P. (2014) MyChEMBL: a virtual platform for distributing cheminformatics tools and open data. Challenges, 5, 334–337.
49. Filippov,I.V. and Nicklaus,M.C. (2009) Optical structure recognition software to recover chemical information: OSRA, an open source solution. J. Chem. Inf. Model., 49, 740–743.
50. O’Boyle,N.M., Banck,M., James,C.A., Morley,C., Vandermeersch,T. and Hutchison,G.R. (2011) Open babel: an open chemical toolbox. J. Cheminform., 3, 33.
research and drug discovery knowledgebase. Nucleic Acids Res., 44, D938–D943.
54. Williams,A.J., Harland,L., Groth,P., Pettifer,S., Chichester,C., Willighagen,E.L., Evelo,C.T., Blomberg,N., Ecker,G., Goble,C. et al. (2012) Open PHACTS: semantic interoperability for drug discovery. Drug Discov. Today, 17, 1188–1198.
55. Koscielny,G., An,P., Carvalho-Silva,D., Cham,J.A., Munoz-Pomer Fuentes,A., Fumis,L., Gasparyan,R., Hasan,S., Karamanis,N., Maguire,M. et al. (2016) Open targets: a platform for therapeutic target identification and validation. Nucleic Acids Res., doi:10.1093/nar/gkw1055.
56. Nguyen,D.T., Mathias,D., Bologa,C., Brunak,S., Fernandez,N., Gaulton,A., Hersey,A., Holmes,J., Jensen,L., Karlsson,A. et al. (2016) Pharos: collating protein information to shed light on the druggable genome. Nucleic Acids Res., doi:10.1093/nar/gkw1072.