
Role of open chemical data in aiding drug discovery and design

    Anna Gaulton

    † Author for correspondence

    EMBL – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.

    and
    John P Overington

    EMBL – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK

    Published Online: https://doi.org/10.4155/fmc.10.191

    Drug-discovery data

    Researchers in large pharmaceutical companies typically draw on a wide range of data resources and tools to enable decisions regarding target selection, lead identification, optimization and candidate selection. Much of this information is either generated internally or licensed from commercial vendors. For example, access to large sets of screening results, patent databases and databases of clinical candidates can be used to identify chemical tools or leads for a target of interest or to assess competitive position. Additional data classes add incrementally to this view. For example, internally generated crystal structures, complexed with drug-like ligands, provide valuable information for structure-based drug design and lead optimization. Large numbers of absorption, distribution, metabolism and excretion (ADME) and toxicity measurements also allow the building of predictive models to prioritize compounds, select the best candidates for further development and attempt to minimize the risks of potential adverse effects. By contrast, academic researchers have typically had to rely on a far smaller number of available public-domain resources, together with information scattered across the literature. Access to large chemical and pharmacological datasets has previously been limited, in part due to concerns about the potential loss of intellectual property associated with disclosing compound structures.

    Public data & precompetitive initiatives

    In recent years, however, there has been a significant increase in the availability of large-scale open data for drug discovery. In particular, the number and size of screening databases have both grown substantially. Initiatives such as the NIH Molecular Libraries Program [1] and the Broad Institute’s Chemical Biology Platform are making high-throughput screening (HTS) capabilities and the subsequent primary data more widely available to academic groups. These data are typically fed into public databases such as PubChem [2] and ChemBank [3]. Initial plans are also underway for a similar European infrastructure project (EU-openscreen), whose aim will be to connect a network of screening centres across Europe and provide access to the results via a common European chemical biology database [101]. In addition, the recent transfer of the ChEMBL database from the private sector into the public domain [102] will supplement existing activity databases such as BindingDB [4], IUPHAR-DB [5] and PDSP Ki [103]. Publishers could also play a part in this data-accessibility process by setting policies for deposition of screening data into public repositories (as is currently the case for sequence and protein structure data) and helping to standardize the way such data are reported. Nature Chemical Biology, for example, has already produced guidelines for the submission of screening data [6]. In the area of toxicity, several public screening initiatives are also underway, including the EPA ToxCast [7,8] and Tox21 [104] projects. While these efforts are primarily focused on environmental chemicals, the resulting data may still be informative in a drug-discovery context.

    In addition to screening and bioactivity information, there is also now an increasing number of large chemical structure repositories, providing access to tens of millions of compounds for applications such as virtual screening [9] (e.g., PubChem, ZINC [10] and GDB-13 [11]). Several other public domain databases containing drug discovery-relevant information are also being developed – for example, DrugBank [12] and DailyMed [105] provide information regarding approved drugs, ClinicalTrials.gov [106] provides data on clinical-stage experimental drugs and DSSTox [13,14] and TOXNET [15] collate toxicity information from a wide range of public sources.

    The increasing availability of public data coincides with initiatives in the pharmaceutical industry aimed at reducing costs, for example via increased outsourcing and engaging in precompetitive activities. The establishment of the Pistoia Alliance (a not-for-profit consortium of pharmaceutical companies, institutes and technology vendors, established for the purpose of brokering common precompetitive needs [16]) and the European Innovative Medicines Initiative [17] are both helping to provide a driving force towards further development and integration of tools and databases within the public domain. Public–private partnerships, such as the Structural Genomics Consortium-led chemical probes initiative [18], are becoming increasingly common and, further to this, pharmaceutical companies are starting to release some of their own formerly proprietary data. GlaxoSmithKline, for instance, has recently announced that it will make a large dataset of 13,500 compounds with antimalarial activity publicly available [107]. It is expected that other companies will follow this lead.

    The impact of open data

    The availability of public large-scale datasets is likely to have a significant impact on academic, not-for-profit and industrial drug discovery. First, groups will gain access to the data they need for individual projects, for example to rapidly identify high-quality tool compounds that help validate targets or profile disease models. Second, and perhaps more important, the datasets will encourage the development of new tools and predictive algorithms within the public domain, benefiting the widest possible community. A parallel to this can perhaps be seen in the vast array of bioinformatics tools and methods developed for functional annotation of proteins following the exponential growth in deposition of sequence and structure data since the early 1990s. A similar explosion and investment of funding in chemoinformatics and computational chemical biology research may help address many of the unmet needs in drug discovery and design. For example, databases of launched drugs and medicinal chemistry compounds could be mined to discover key properties and rules related to successful drugs or to identify possible lead-optimization strategies and tactics. Large bioactivity datasets can be used to derive panels of quantitative structure–activity relationship or classification models, allowing prediction of compound activity from structure. Such predictions can contribute to the elucidation of the molecular targets of phenotypic assays, prediction or explanation of drug side effects and identification of potential drug repurposing opportunities through optimization of alternative activities. Identification of new leads may also be accomplished through virtual screening, using structure-based methods such as docking or ligand-based methods built on pharmacophores or molecular similarity.
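
    The molecular similarity-based screening mentioned above can be illustrated with a minimal sketch. It assumes each compound has already been reduced to a binary fingerprint, modelled here as the set of "on" bit positions, and ranks library compounds by their Tanimoto coefficient to a query. The compound names, fingerprints and the 0.5 threshold are hypothetical; in practice the fingerprints would come from a cheminformatics toolkit rather than being written by hand.

```python
from typing import Dict, FrozenSet, List, Tuple

# A fingerprint is modelled as the set of "on" bit positions (an assumption
# made for illustration; real fingerprints come from a toolkit).
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) coefficient between two bit-set fingerprints."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(
    query: Fingerprint,
    library: Dict[str, Fingerprint],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (name, score) pairs scoring at or above threshold, best first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    kept = [(name, score) for name, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical query and library, for illustration only.
query_fp = frozenset({1, 4, 7, 9})
library = {
    "cmpd_A": frozenset({1, 4, 7, 9}),   # identical bits to the query
    "cmpd_B": frozenset({1, 4, 7, 12}),  # 3 shared bits, 5 bits in the union
    "cmpd_C": frozenset({2, 3, 5}),      # no bits in common
}

ranked = rank_by_similarity(query_fp, library, threshold=0.5)
for name, score in ranked:
    print(f"{name}\t{score:.2f}")
```

    A similarity cut-off of this kind is only a heuristic: a suitable threshold depends on the fingerprint type and the target class, so it would normally be calibrated against known actives and inactives.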

    However, as with all predictive methods, the quality and relevance of the training data are paramount in determining the accuracy and applicability domain of resulting models. HTS results are often uncurated and typically have a relatively high false-positive rate, for example. Dose–response studies in published literature do not always adequately report negative results. Chemical structures may often be depicted or named incorrectly. As datasets become more readily available, we will see the emphasis move towards quality, in addition to indexing and organization of data, rather than raw quantity. Indeed, many analyses are already being published that assess the quality of public screening libraries and identify promiscuous or reactive compounds that could be responsible for many of the false-positive results [19,20] or investigate the accuracy of compound structures in various repositories [21]. Progress within just this one area will have a profound impact on improving the discovery rate of genuinely useful chemical probes as a starting point for the development of novel and safe therapeutics. With the increasingly rapid growth of these public-domain sources, ensuring quality and interoperability is going to pose significant challenges.
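
    As a rough illustration of the kind of quality check discussed above, the following sketch flags possible promiscuous compounds ("frequent hitters") in a set of screening results. It is not the method used in the cited analyses; it simply assumes that results are available as (compound, assay, active) records and that a compound active in more than half of at least three distinct assays deserves a closer look. The record layout, identifiers and both thresholds are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# One screening outcome: (compound_id, assay_id, scored_active).
Result = Tuple[str, str, bool]


def assay_counts(results: Iterable[Result]) -> Dict[str, Tuple[int, int]]:
    """Return {compound: (assays tested, assays active)} over distinct assays."""
    tested = defaultdict(set)
    active = defaultdict(set)
    for compound, assay, is_active in results:
        tested[compound].add(assay)
        if is_active:
            active[compound].add(assay)
    return {c: (len(tested[c]), len(active[c])) for c in tested}


def flag_frequent_hitters(
    results: Iterable[Result],
    min_assays: int = 3,
    max_hit_rate: float = 0.5,
) -> Set[str]:
    """Flag compounds tested in enough assays whose hit rate exceeds the limit."""
    flagged = set()
    for compound, (n_tested, n_active) in assay_counts(results).items():
        if n_tested >= min_assays and n_active / n_tested > max_hit_rate:
            flagged.add(compound)
    return flagged


# Hypothetical screening records, for illustration only.
results = [
    ("cmpd_X", "assay_1", True),
    ("cmpd_X", "assay_2", True),
    ("cmpd_X", "assay_3", True),
    ("cmpd_X", "assay_4", False),
    ("cmpd_Y", "assay_1", True),
    ("cmpd_Y", "assay_2", False),
    ("cmpd_Y", "assay_3", False),
    ("cmpd_Z", "assay_1", True),  # tested only once, so never flagged
]

suspects = flag_frequent_hitters(results)
print(sorted(suspects))
```

    A high hit rate does not prove that a compound is an artefact, particularly when the assays involved are related, so flagged compounds are candidates for follow-up rather than automatic exclusion.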

    Accompanying the growth of open data and associated research activities, we are also starting to see growth in the availability of open-source tools for chemical data processing and analysis. For example, toolkits and workflow tools such as CDK-Taverna [22], Bioclipse [23], RDKit [108], KNIME [24] and Open Babel [109] are gaining in popularity, allowing scientists to tap into the increasing number of available resources and facilitating data-mining efforts, without needing investment in expensive commercial software – this mirrors projects such as BioPerl for the bioinformatics research community. Similarly, efforts are underway to better integrate disparate chemical and drug-discovery data sources [25,26] and improve interoperability through the development of standards (e.g., the use of the InChI representation for chemical structures [27]). Further emphasis in this area will be essential to promote maximal utility of the data.
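
    To show why a shared structure identifier helps with integration, the sketch below merges compound records from two sources by using the InChI string as the join key, so that the same structure registered under different names collapses into a single entry. The source names and record layout are hypothetical, and the sketch assumes each source already supplies a standard InChI generated in a consistent way; it does not generate InChIs from structures.

```python
from typing import Dict, List, Tuple

# A record is a plain dict with "name" and "inchi" keys (hypothetical layout).
Record = Dict[str, str]

ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def merge_by_inchi(*sources: Tuple[str, List[Record]]) -> Dict[str, dict]:
    """Group records from several named sources under their InChI string."""
    merged: Dict[str, dict] = {}
    for source_name, records in sources:
        for record in records:
            entry = merged.setdefault(
                record["inchi"], {"names": set(), "sources": set()}
            )
            entry["names"].add(record["name"])
            entry["sources"].add(source_name)
    return merged


# Two hypothetical sources that name the same compound differently.
source_a = [{"name": "aspirin", "inchi": ASPIRIN}]
source_b = [
    {"name": "acetylsalicylic acid", "inchi": ASPIRIN},
    {"name": "ethanol", "inchi": ETHANOL},
]

merged = merge_by_inchi(("source_a", source_a), ("source_b", source_b))
for inchi, entry in merged.items():
    print(sorted(entry["names"]), sorted(entry["sources"]))
```

    Exact string matching only works when every source normalizes structures in the same way; differences in salt forms, stereochemistry or tautomer handling would still need to be reconciled before records can be joined like this.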

    The changing face of drug discovery

    Perhaps a logical consequence of many of the developments discussed above is that they act as a catalyst for collaboration between different groups and organizations on the actual process of drug discovery. While in most areas this raises questions about retention of intellectual property, several collaborative efforts are already underway in the area of neglected disease research. Not-for-profit organizations such as the Medicines for Malaria Venture and the Drugs for Neglected Diseases initiative have already been established for this purpose and a growing number of public collaborative drug-discovery resources are being established (e.g., the TDR Targets database [28] and The Synaptic Leap [110]).

    In order for collaborative and academic drug-discovery efforts to really succeed, however, researchers will need access to the full range of tools and data available to those in industry. While this is becoming increasingly possible, datasets in some areas are still lacking. There is still only a limited amount of public information regarding the ADME properties of compounds, for example [29]. Without such data and the development of good-quality ADME models, potential lead compounds may lack the properties required for good bioavailability in vivo and may subsequently fail in early development. The pharmaceutical industry has also invested much time and money into identifying and eliminating causes of toxicity but, again, much of this information is not publicly available, meaning mistakes of the past risk being repeated. Finally, a large body of chemical structure, synthesis and pharmacology information is contained only within patent documents. Though these documents are readily available online, they are not in a suitably structured form for large-scale searching and analysis. Some efforts are underway to facilitate indexing of these documents. For example, OSRA is an open-source tool for conversion of graphical representations of compounds in documents into computer-readable formats, allowing images in patents to be extracted and searched by structure [30]. However, the extraction of other valuable data from patent texts remains a nontrivial task. Arguably, tackling this data-accessibility gap within the public domain could result in huge benefits in productivity and efficiency.

    Future perspective

    Formerly, the billions of dollars spent annually on research within the pharmaceutical industry provided industrial researchers with unparalleled access to critical tools and resources that were largely beyond the reach of academics, not-for-profits and SMEs. However, it is now becoming clear that this business model of drug-discovery research and development is not sustainable or cost effective [31], and we are seeing the drug-discovery industry, together with data publishers and funding agencies, adopt new business models based on increased outsourcing, collaborative skills transfer and precompetitive activities [32,33]. Ultimately, as the volume and quality of open data increase, we are likely to see a growth in enabled academic and collaborative drug discovery. There is also likely to be an increase in the number of small biotechnology/pharmaceutical companies, accompanied by a decrease in the amount of research carried out within the closed walls of large pharmaceutical companies; this trend will depend crucially on facile access to enabling data. Hopefully, a benefit of this change in model will be greater levels of innovation and a boost to the dwindling productivity of the drug-discovery industry as a whole.

    Acknowledgements

    The authors wish to thank the Wellcome Trust for a Strategic Award and the EMBL-EBI for additional support. We are grateful to the referees of this paper for their suggestions and improvements.

    Financial & competing interests disclosure

    The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

    No writing assistance was utilized in the production of this manuscript.

    Papers of special note have been highlighted as: ▪ of interest; ▪▪ of considerable interest

    Bibliography

    • 1  Austin CP, Brady LS, Insel TR, Collins FS. NIH Molecular Libraries Initiative. Science 306(5699), 1138–1139 (2004).
    • 2  Wang Y, Bolton E, Dracheva S et al. An overview of the PubChem BioAssay resource. Nucleic Acids Res. 38(Database issue), D255–266 (2010).
    • 3  Seiler KP, George GA, Happ MP et al. ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic Acids Res. 36(Database issue), D351–D359 (2008).
    • 4  Liu T, Lin Y, Wen X, Jorissen RN, Gilson MK. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Res. 35(Database issue), D198–201 (2007).
    • 5  Harmar AJ, Hills RA, Rosser EM et al. IUPHAR-DB: the IUPHAR database of G protein-coupled receptors and ion channels. Nucleic Acids Res. 37(Database issue), D680–685 (2009).
    • 6  Inglese J, Shamu C, Gu R. Reporting data from high-throughput screening of small-molecule libraries. Nat. Chem. Biol. 3(8), 438–441 (2007). ▪▪ Important article calling for journals to enforce standards for the reporting of screening data.
    • 7  Judson RS, Houck KA, Kavlock RJ et al. In vitro screening of environmental chemicals for targeted testing prioritization: the ToxCast project. Environ. Health Perspect. 118(4), 485–492 (2010).
    • 8  Collins FS, Gray GM, Bucher JR. Toxicology. Transforming environmental health protection. Science 319(5865), 906–907 (2008).
    • 9  Villoutreix BO, Renault N, Lagorce D, Sperandio O, Montes M, Miteva MA. Free resources to assist structure-based virtual ligand screening experiments. Curr. Protein Pept. Sci. 8(4), 381–411 (2007).
    • 10  Irwin JJ, Shoichet BK. ZINC – a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45(1), 177–182 (2005).
    • 11  Blum LC, Reymond JL. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. J. Am. Chem. Soc. 131(25), 8732–8733 (2009).
    • 12  Wishart DS, Knox C, Guo AC et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 36(Database issue), D901–D906 (2008).
    • 13  Richard AM. DSSTox website launch: improving public access to databases for building structure-toxicity prediction models. Preclinica 2, 103–108 (2004).
    • 14  Richard AM, Williams CR. Distributed structure-searchable toxicity (DSSTox) public database network: a proposal. Mutat. Res. 499(1), 27–52 (2002).
    • 15  Hochstein C, Arnesen S, Goshorn J. Environmental health and toxicology resources of the United States National Library of Medicine. Med. Ref. Serv. Q. 26(3), 21–45 (2007).
    • 16  Barnes MR, Harland L, Foord SM et al. Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nat. Rev. Drug Discov. 8(9), 701–708 (2009). ▪ Important paper describing the aims of pharmaceutical companies in setting up precompetitive initiatives.
    • 17  Hunter AJ. The Innovative Medicines Initiative: a pre-competitive initiative to enhance the biomedical science base of Europe to expedite the development of new medicines for patients. Drug Discov. Today 13(9–10), 371–373 (2008).
    • 18  Edwards AM, Bountra C, Kerr DJ, Willson TM. Open access chemical and clinical probes to support drug discovery. Nat. Chem. Biol. 5(7), 436–440 (2009). ▪ Details an important public–private partnership to develop freely available chemical probes for key targets.
    • 19  Feng BY, Simeonov A, Jadhav A et al. A high-throughput screen for aggregation-based inhibition in a large compound library. J. Med. Chem. 50(10), 2385–2390 (2007).
    • 20  Soares KM, Blackmon N, Shun TY et al. Profiling the NIH Small Molecule Repository for compounds that generate H2O2 by redox cycling in reducing environments. Assay Drug Dev. Technol. (2010) in press.
    • 21  Young D, Martin T, Venkatapathy R, Harten P. Are the chemical structures in your QSAR correct? QSAR Comb. Sci. 27(11–12), 1337–1345 (2008). ▪ Informative article highlighting issues with data quality when building quantitative structure–activity relationship models.
    • 22  Kuhn T, Willighagen EL, Zielesny A, Steinbeck C. CDK-Taverna: an open workflow environment for cheminformatics. BMC Bioinformatics 11(1), 159 (2010).
    • 23  Spjuth O, Helmus T, Willighagen EL et al. Bioclipse: an open source workbench for chemo- and bioinformatics. BMC Bioinformatics 8, 59 (2007).
    • 24  Berthold MR, Cebron N, Dill F et al. KNIME: The Konstanz Information Miner. In: Data Analysis, Machine Learning and Applications. Preisach C, Schmidt-Thieme L (Eds). Springer-Verlag, Berlin, 319–326 (2008).
    • 25  Jentzsch A, Hassanzadeh O, Bizer C, Andersson B, Stephens S. Enabling tailored therapeutics with linked data. Presented at: The 2nd Workshop about Linked Data on the Web. Madrid, Spain, 20 April 2009.
    • 26  Belleau F, Nolin MA, Tourigny N, Rigault P, Morissette J. Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inform. 41(5), 706–716 (2008).
    • 27  Heller SR, McNaught AD. The IUPAC international chemical identifier (InChI). Chem. Int. 31(1), 7 (2009).
    • 28  Agüero F, Al-Lazikani B, Aslett M et al. Genomic-scale prioritization of drug targets: the TDR targets database. Nat. Rev. Drug Discov. 7(11), 900–907 (2008).
    • 29  Ekins S, Williams AJ. Precompetitive preclinical ADME/Tox data: set it free on the web to facilitate computational model building and assist drug development. Lab Chip 10(1), 13–22 (2010). ▪ Thorough discussion of issues with the availability of absorption, distribution, metabolism, excretion and toxicity data in the public domain and the potential advantages of releasing such data.
    • 30  Filippov IV, Nicklaus MC. Optical structure recognition software to recover chemical information: OSRA, an open source solution. J. Chem. Inf. Model. 49(3), 740–743 (2009).
    • 31  Munos B. Lessons from 60 years of pharmaceutical innovation. Nat. Rev. Drug Discov. 8(12), 959–968 (2009). ▪▪ Interesting and detailed analysis of trends in the productivity of the pharmaceutical industry throughout its history.
    • 32  Melese T, Lin SM, Chang JL, Cohen NH. Open innovation networks between academia and industry: an imperative for breakthrough therapies. Nat. Med. 15(5), 502–507 (2009).
    • 33  Munos BH, Chin WW. A call for sharing: adapting pharmaceutical research to new realities. Sci. Transl. Med. 1(9), 9 (2009).
    • 101  EU OpenScreen. www.eu-openscreen.de
    • 102  Wellcome Trust press release www.wellcome.ac.uk/News/Media-office/Press-releases/2010/WTX058219.htm
    • 103  PDSP Database http://pdsp.med.unc.edu/pdsp.php
    • 104  Tox21: Putting a lens on the vision of toxicity testing in the 21st Century www.alttox.org/ttrc/overarching-challenges/way-forward/austin-kavlock-tice
    • 105  DailyMed http://dailymed.nlm.nih.gov/dailymed/about.cfm
    • 106  Clinical Trials homepage www.clinicaltrials.gov
    • 107  GSK announces ‘open innovation’ strategy to help deliver new and better medicines for people living in the world’s poorest countries – press release www.gsk.com/media/pressreleases/2010/2010_pressrelease_10009.htm
    • 108  RDKit: cheminformatics and machine learning software www.rdkit.org/
    • 109  Open Babel: the open source toolbox http://openbabel.org
    • 110  The Synaptic Leap Homepage www.thesynapticleap.org