
AutoQSAR: an automated machine learning tool for best-practice quantitative structure–activity relationship modeling

    Steven L Dixon
    Schrödinger, Inc., 120 West 45th Street, New York, NY 10036, USA

    Jianxin Duan
    Schrödinger GmbH, Dynamostrasse 13, 68165 Mannheim, Baden-Württemberg, Germany

    Ethan Smith
    Schrödinger, Inc., 101 SW Main Street, Portland, OR 97204, USA

    Christopher D Von Bargen
    Schrödinger, Inc., 120 West 45th Street, New York, NY 10036, USA

    Woody Sherman
    Schrödinger, Inc., 120 West 45th Street, New York, NY 10036, USA

    Matthew P Repasky*
    Schrödinger, Inc., 101 SW Main Street, Portland, OR 97204, USA

    *Author for correspondence: matt.repasky@schrodinger.com

    Published Online: https://doi.org/10.4155/fmc-2016-0093

    Aim: We introduce AutoQSAR, an automated machine-learning application to build, validate and deploy quantitative structure–activity relationship (QSAR) models. Methodology/results: AutoQSAR automates descriptor generation, feature selection and the creation of a large number of QSAR models in a single workflow. The models are built using a variety of machine-learning methods, and each model is scored using a novel approach. The effectiveness of the method is demonstrated through comparison with literature QSAR models using identical datasets for six endpoints: protein–ligand binding affinity, solubility, blood–brain barrier permeability, carcinogenicity, mutagenicity and bioaccumulation in fish. Conclusion: For four of the six endpoints, AutoQSAR achieves predictive performance similar to or better than the published results, while requiring minimal human time and expertise.
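
    The general shape of such a workflow (select descriptors, build many models over different training/test splits, score each model on held-out data and rank them) can be illustrated with a minimal sketch. This is not AutoQSAR's implementation: the abstract does not specify its descriptors, learning methods or scoring function, so the sketch assumes synthetic descriptors, one-descriptor least-squares models and test-set R² as a stand-in score. All function names (`pearson`, `fit_line`, `r_squared`, `build_models`) are hypothetical.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x on a single descriptor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return my - b * mx, b


def r_squared(y_true, y_pred):
    """Coefficient of determination on a held-out set."""
    mean_y = statistics.fmean(y_true)
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0


def build_models(X, y, n_splits=5, top_k=2, test_fraction=0.25, seed=0):
    """Build one model per (split, selected descriptor); rank by test R^2.

    Assumption: test-set R^2 stands in for the paper's scoring function,
    which the abstract does not describe.
    """
    rng = random.Random(seed)
    n, n_desc = len(y), len(X[0])
    n_test = max(2, int(n * test_fraction))
    models = []
    for split in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        y_train = [y[i] for i in train]
        y_test = [y[i] for i in test]
        # Feature selection uses training rows only, so the test set
        # never influences which descriptors are chosen.
        ranked = sorted(
            range(n_desc),
            key=lambda j: -abs(pearson([X[i][j] for i in train], y_train)),
        )
        for j in ranked[:top_k]:
            a, b = fit_line([X[i][j] for i in train], y_train)
            preds = [a + b * X[i][j] for i in test]
            models.append({
                "split": split,
                "descriptor": j,
                "intercept": a,
                "slope": b,
                "test_r2": r_squared(y_test, preds),
            })
    return sorted(models, key=lambda m: m["test_r2"], reverse=True)


# Synthetic example: 40 "compounds", 4 descriptors; only descriptor 1
# actually drives the activity.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 1) for _ in range(4)] for _ in range(40)]
y = [3.0 * row[1] + data_rng.gauss(0, 0.05) for row in X]

ranked_models = build_models(X, y)
best = ranked_models[0]
print("best descriptor:", best["descriptor"], "test R^2:", round(best["test_r2"], 3))
```

    Selecting descriptors from the training rows alone keeps the held-out score an honest estimate, which is the kind of validation practice that automated model building is meant to enforce.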
