AutoQSAR: an automated machine learning tool for best-practice quantitative structure–activity relationship modeling
Abstract
Aim: We introduce AutoQSAR, an automated machine-learning application to build, validate and deploy quantitative structure–activity relationship (QSAR) models. Methodology/results: AutoQSAR automates descriptor generation, feature selection and the creation of a large number of QSAR models in a single workflow. The models are built using a variety of machine-learning methods, and each model is scored using a novel approach. The effectiveness of the method is demonstrated through comparison with literature QSAR models using identical datasets for six endpoints: protein–ligand binding affinity, solubility, blood–brain barrier permeability, carcinogenicity, mutagenicity and bioaccumulation in fish. Conclusion: For four of the six endpoints, AutoQSAR achieves predictive performance similar to or better than published results, while requiring minimal human time and expertise.

