Preliminary Communication | Open Access (CC BY-NC-ND)

BioPrint meets the AI age: development of artificial intelligence-based ADMET models for the drug-discovery platform SAFIRE

    Sarah E Biehn, Luis Miguel Goncalves, Juerg Lehmann, Jessica D Marty, Christoph Mueller, Samuel A Ramirez, Fabien Tillier & Carleton R Sage*

    Eurofins DiscoveryAI, Eurofins Panlabs, Inc., Saint Charles, MO 63304, USA

    *Author for correspondence: carleton.sage@discovery.eurofinsus.com

    Published Online: https://doi.org/10.4155/fmc-2024-0007

    Abstract

    Background: To prioritize compounds with a higher likelihood of success, artificial intelligence models can be used to predict absorption, distribution, metabolism, excretion and toxicity (ADMET) properties of molecules quickly and efficiently. Methods: Models were trained with BioPrint database proprietary data along with public datasets to predict various ADMET end points for the SAFIRE platform. Results: SAFIRE models achieved at least 75% accuracy and a Matthews correlation coefficient of at least 0.4 on validation sets. Training with both proprietary and public data improved model performance and expanded the chemical space on which the models were trained. The platform features scoring functionality to guide user decision-making. Conclusion: High-quality datasets along with chemical space considerations yielded ADMET models that performed favorably and have utility in the drug discovery process.

    Tweetable abstract

    BioPrint meets the artificial intelligence age: researchers trained absorption, distribution, metabolism, excretion and toxicity machine learning models with the BioPrint database for the new SAFIRE platform.

    The drug discovery process is time consuming and expensive. Advanced artificial intelligence (AI)-based computational tools have the potential to enhance the likelihood of success in the early stages of small molecule drug discovery. Machine learning (ML), a branch of AI, can be harnessed to train models with drug discovery data to provide researchers with key insights into the favorability of lead molecules. In the drug discovery process, compounds of interest must be evaluated not only for activity and selectivity for the biological target but also for absorption, distribution, metabolism, excretion and toxicity (ADMET) properties. Target selectivity end points are typically project and disease area specific, but ADMET properties must be optimized for most efforts. Experimental measurements are both resource intensive and ‘make or break’ for small molecule series, as these properties are critical for success in the drug-discovery pipeline. Computational prediction of ADMET properties can lead to the reduction of resources exerted, with unsuccessful compounds quickly and cost-effectively deprioritized. Various tools for ADMET property prediction have been developed, with a variety of efforts pursued by large pharmaceutical companies with large proprietary datasets, academic groups using publicly available data and industrial efforts with a myriad of dataset sources [1–8].

    Successful computational predictions rely on diverse, high-quality experimental datasets to effectively train ML algorithms. The proprietary BioPrint database provides a unique opportunity for ML development, as the collection contains standardized assay conditions resulting in high-quality data [9]. The BioPrint database, containing approximately 2500 compounds, features robust data for multiple ADMET assays and has been used to develop computational predictive tools [9]. As many advances in both predictive methods and interpretation of chemical space have occurred since the early 2000s [10–18], the authors anticipated that training AI methods with the high-quality BioPrint database would create a robust predictive tool to guide ADMET decision-making during small molecule drug discovery.

    Using the BioPrint dataset along with additional public data, the authors sought to develop computational models employing ML algorithms to predict ADMET properties for small molecules from chemical structure alone. In this work, they developed their Suite of ADMET Predictions For In Silico Refinement and Evaluation, or SAFIRE, a collection of models that predict six ADMET and eight cytochrome P450 isoform (CYP) inhibition parameters. ADMET models included human plasma protein binding, Caco-2 permeability, efflux ratio (ER), aqueous solubility, human microsomal stability and Human ether-a-go-go related gene (hERG) channel inhibition. CYP inhibition models were created for isoforms 1A2, 2D6, 2C9, 2C19, 3A4/3A5, 2B6 and 2C8. Models were developed using BioPrint data when available along with public datasets. Using accuracy and Matthews correlation coefficient (MCC) as performance indicators, the authors found that models were highly accurate and performed according to thresholds recommended by other industry pursuits [1]. When applicable, they examined model training efforts with BioPrint in concert with public data, which further illuminated the value of the BioPrint database along with the importance of dataset balancing. They included two scoring functions with their predictive platform to guide user decision-making, with each scoring function providing different levels of flexibility. In this work, the authors used high-quality training data to build accessible models that users can leverage via the SAFIRE platform, which offers a free version of SAFIRE models. Overall, these efforts demonstrated that the BioPrint database remains a competitive advantage for training accurate models that perform at industry standards.

    Methods

    Experimental datasets

    The models were trained using molecular descriptors derived from simplified molecular-input line-entry system (SMILES) structural inputs and associated assay results from the BioPrint database to extrapolate predictions from molecular structures of approximately 2500 compounds. The permeability model was trained using Caco-2 apical to basolateral apparent permeability data at pH 7.4. ER data were determined based on the ratio of Caco-2 basolateral to apical and apical to basolateral permeability from the BioPrint database; additional compounds from ChEMBL [19] were included for training. The BioPrint solubility dataset featured kinetic aqueous solubility data in phosphate-buffered saline obtained at pH 7.4. It was supplemented with additional literature data of aqueous solubility assessments around pH 7.0 in various physical and salt forms [20]. Metabolic stability data in human liver microsomes from the BioPrint database were expressed as percentage remaining after a 60 min incubation. The BioPrint database does not currently contain experimental datasets for plasma protein binding or hERG. Human plasma protein binding training data (percent fraction bound) [21,22] and hERG IC50 training data [23] were extracted from the literature. BioPrint data for CYP isoforms consisted of IC50 data for compounds with greater than 30% inhibition in a first round of testing performed at 10 μM. Compounds below the activity threshold of 30% inhibition at 10 μM were still included in the datasets and labeled as ‘inactive.’ Training datasets for five CYP isoforms (3A4, 1A2, 2D6, 2C9 and 2C19) were supplemented with additional publicly available data to obtain better dataset class balances [24]. CYP isoform 2C8 training data were also supplemented with data from ChEMBL [19]. Training dataset sizes and class ratios are denoted in Table 1.

    Table 1. Model classification cutoffs and training set information.
    | Experimental assay model (all species: human) | Training dataset source (n compounds in training set) | Class assignment | Experimental cutoff | Class ratio in training set (%) | Awarded score |
    | --- | --- | --- | --- | --- | --- |
    | Plasma protein binding | Public literature (5547 compounds) | Low binding | ≤95% fraction bound | 61 | 1 |
    | | | High binding | >95% fraction bound | 39 | 0 |
    | Aqueous solubility pH 7.4 | BioPrint (18%) and public literature (12,054 compounds) | Low | <5 μM | 12 | 0 |
    | | | Medium | 5–100 μM | 16 | 0.5 |
    | | | High | >100 μM | 72 | 1 |
    | Metabolic stability | BioPrint (1720 compounds) | Low | <30% remaining after 1 h | 27 | 0 |
    | | | High | ≥30% remaining after 1 h | 73 | 1 |
    | Caco-2 permeability (apical to basolateral) | BioPrint (1591 compounds) | Low | ≤1 × 10⁻⁶ cm/s | 24 | 0 |
    | | | High | >1 × 10⁻⁶ cm/s | 76 | 1 |
    | Efflux ratio (basolateral to apical) / (apical to basolateral) | BioPrint (81%) and public (1867 compounds) | Low efflux potential | <2 | 70 | 1 |
    | | | High efflux potential | ≥2 | 30 | 0 |
    | hERG | Public literature (27,722 compounds) | Inactive | IC50 >10 μM | 51 | 1 |
    | | | Active | IC50 ≤10 μM | 49 | 0 |
    | CYP isoforms | BioPrint (15–100% depending on isoform) and public (521–16,054 compounds depending on isoform) | Inactive | IC50 >10 μM | 70–86 (depending on isoform) | 1 |
    | | | Active | IC50 ≤10 μM | 14–30 (depending on isoform) | 0 |
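
    The cutoffs in Table 1 can be read as simple labeling rules applied to each measured value. The sketch below is illustrative only: the function names are hypothetical and not part of SAFIRE, and only the numeric cutoffs and class names are taken from the table.

```python
# Illustrative labeling rules; function names are hypothetical,
# cutoffs and class names are copied from Table 1.

def classify_ppb(percent_fraction_bound: float) -> str:
    """Plasma protein binding: high binding above 95% fraction bound."""
    return "Low binding" if percent_fraction_bound <= 95 else "High binding"


def classify_solubility(solubility_um: float) -> str:
    """Aqueous solubility at pH 7.4, in micromolar (three classes)."""
    if solubility_um < 5:
        return "Low"
    if solubility_um <= 100:
        return "Medium"
    return "High"


def classify_metabolic_stability(percent_remaining_1h: float) -> str:
    """Human microsomal stability: percent remaining after 1 h."""
    return "Low" if percent_remaining_1h < 30 else "High"


def classify_permeability(papp_cm_per_s: float) -> str:
    """Caco-2 apical-to-basolateral apparent permeability in cm/s."""
    return "Low" if papp_cm_per_s <= 1e-6 else "High"


def classify_efflux(papp_b_to_a: float, papp_a_to_b: float) -> str:
    """Efflux ratio = (basolateral to apical) / (apical to basolateral)."""
    ratio = papp_b_to_a / papp_a_to_b
    return "Low efflux potential" if ratio < 2 else "High efflux potential"


def classify_ic50(ic50_um: float) -> str:
    """hERG and CYP inhibition: active at IC50 of 10 micromolar or below."""
    return "Active" if ic50_um <= 10 else "Inactive"
```

    For instance, `classify_solubility(50)` returns "Medium", which corresponds to the inconclusive score of 0.5 in the ‘Awarded score’ column.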

    Model development & validation

    Models were developed and implemented in Python 3.10.5. Model development began with compound curation in which conventional quality inspections were applied to eliminate duplicates, verify the structure and stereochemistry of SMILES and ensure compound standardization. Classes were assigned to compounds based on the cutoffs identified in Table 1. To achieve balanced datasets, various efforts were explored, including majority-to-minority ratio-based class weights, synthetic minority oversampling techniques, random undersampling and ensemble learning [25–27]. RDKit descriptors [28], extended connectivity fingerprints [29] and a combination of both descriptors and fingerprints were inspected for use in model training before ultimately pursuing solely descriptors due to performance improvements. Feature refinement was conducted by removing low variance and highly correlated descriptors and implementing a random forest pruning workflow to ensure the most informative features were used to train the models.
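
    The first two feature refinement steps (removing low-variance and highly correlated descriptors) can be sketched in plain Python. This is a minimal illustration, not the authors' workflow: the thresholds `min_variance` and `max_abs_corr` are assumed values that the paper does not report, the greedy keep-first strategy is one of several reasonable choices, and the random forest pruning step is omitted.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length, non-constant columns."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_descriptors(
    columns: dict[str, list[float]],
    min_variance: float = 1e-4,   # assumed threshold, not from the paper
    max_abs_corr: float = 0.95,   # assumed threshold, not from the paper
) -> list[str]:
    """Return descriptor names that survive both filters."""
    # Step 1: drop descriptors that barely vary across the training set.
    varying = [name for name, values in columns.items()
               if pvariance(values) > min_variance]

    # Step 2: greedily keep a descriptor only if it is not highly
    # correlated with any descriptor that has already been kept.
    selected: list[str] = []
    for name in varying:
        if all(abs(pearson(columns[name], columns[kept])) < max_abs_corr
               for kept in selected):
            selected.append(name)
    return selected


# Toy descriptor table (made-up values, four compounds).
toy = {
    "MolWt": [100.0, 200.0, 300.0, 400.0],
    "HeavyAtomCount": [7.0, 14.0, 21.0, 28.0],  # tracks MolWt exactly
    "Constant": [1.0, 1.0, 1.0, 1.0],           # zero variance
    "TPSA": [20.0, 90.0, 40.0, 60.0],
}
kept_descriptors = filter_descriptors(toy)
```

    In this toy table the constant column is dropped for low variance and `HeavyAtomCount` is dropped because it is perfectly correlated with `MolWt`, which was kept first.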

    Different ML methodologies such as random forest [30], support vector machine [31], Naïve Bayes [32] and gradient boosting algorithms [33] were explored using simple automated techniques to evaluate various algorithms and optimize respective hyperparameters in KNIME [34] or Python. Model algorithms were achieved using the Python package scikit-learn [27]. Algorithms were ranked by accuracy, and the model with the highest accuracy was pursued.

    Performance was reported based on repeated stratified sampling of five folds with three repeats per fold. Metrics assessed included accuracy (Equation 1), defined as the number of true positives (TPs) and true negatives (TNs) divided by the sum of TPs, TNs, false positives (FPs) and false negatives (FNs), and MCC (Equation 2) [35].

    $$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad \text{(Equation 1)}$$

    $$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad \text{(Equation 2)}$$
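
    Both metrics follow directly from the four confusion-matrix counts. The sketch below uses the standard binary definitions of accuracy and MCC; the function names are illustrative, and returning 0 when the MCC denominator is zero is a common convention assumed here rather than something stated in the paper.

```python
from math import sqrt


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct (Equation 1)."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier (Equation 2)."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: undefined MCC (an empty row or column
        # in the confusion matrix) is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts for a 100-compound test set.
example_accuracy = accuracy(tp=40, tn=45, fp=5, fn=10)
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)
```

    With these made-up counts the accuracy is 0.85 and the MCC is roughly 0.70, so both would clear the 75% and 0.4 performance goals.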

    The applicability domain of each model was determined based on the range of descriptor values used during model training, known as the bounding box method [36].
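
    A bounding box check amounts to recording the minimum and maximum of every descriptor over the training set and testing whether a query compound falls inside all of those ranges. The following is a minimal sketch with hypothetical function names and made-up descriptor values; the production implementation may differ.

```python
def fit_bounding_box(training_rows: list[list[float]]) -> list[tuple[float, float]]:
    """Per-descriptor (min, max) ranges observed in the training set."""
    columns = zip(*training_rows)  # transpose rows into descriptor columns
    return [(min(column), max(column)) for column in columns]


def in_domain(box: list[tuple[float, float]], row: list[float]) -> bool:
    """True if every descriptor value lies within its training range."""
    return all(low <= value <= high for (low, high), value in zip(box, row))


# Two made-up descriptors for three training compounds.
training = [[250.0, 1.2], [410.0, 3.8], [330.0, 2.5]]
box = fit_bounding_box(training)

inside = in_domain(box, [300.0, 2.0])    # within both ranges
outside = in_domain(box, [300.0, 5.1])   # second descriptor too high
```

    A compound flagged as outside the box is one for which at least one descriptor value was never seen during training, so its prediction should be treated with more caution.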

    Visual assessment of chemical space

    Training set compounds were converted from SMILES to 3D, geometry-optimized structures that were used to determine principal moments of inertia using RDKit functionality within KNIME [28,34]. Relevant chemical space was graphed using Python (v3.10.6) packages Matplotlib (v3.6.1) and Scipy (v1.9.3). Training set space was compared with a combination of US FDA-approved drugs and drugs that were tested or are currently being tested in clinical trials to compare the chemical space context of model training sets with drugs in the market.
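
    Once the three principal moments of inertia of a conformer are known, the coordinates used in a PMI triangle plot are simple ratios. The sketch below assumes the conventional normalization, in which the two smaller moments are divided by the largest one, and the conventional vertex positions for rod, disc and sphere; the function names and the example moments are illustrative only.

```python
# Conventional vertices of the PMI triangle in (NPR1, NPR2) coordinates.
SHAPE_VERTICES = {
    "rod": (0.0, 1.0),
    "disc": (0.5, 0.5),
    "sphere": (1.0, 1.0),
}


def normalized_pmi(moments: tuple[float, float, float]) -> tuple[float, float]:
    """Return (I1/I3, I2/I3) with the moments sorted so that I1 <= I2 <= I3."""
    i1, i2, i3 = sorted(moments)
    if i3 <= 0:
        raise ValueError("largest principal moment must be positive")
    return i1 / i3, i2 / i3


def nearest_shape(npr1: float, npr2: float) -> str:
    """Name of the triangle vertex closest to the given point."""
    return min(
        SHAPE_VERTICES,
        key=lambda shape: (npr1 - SHAPE_VERTICES[shape][0]) ** 2
        + (npr2 - SHAPE_VERTICES[shape][1]) ** 2,
    )


# Made-up moments for an elongated molecule.
npr1, npr2 = normalized_pmi((10.0, 200.0, 205.0))
shape = nearest_shape(npr1, npr2)
```

    With these coordinates the rod vertex sits at the upper left of the triangle, the disc vertex at the bottom and the sphere vertex at the upper right.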

    Comparison of proprietary & public datasets for model training

    For models in which public data were included in training, a comparison exercise was performed to address the value of combining public and proprietary data. Models were developed with only BioPrint, only public and both BioPrint and public data; then accuracy and MCC were calculated for respective 20% hold-out sets. The solubility model was further analyzed with an external test set obtained from the literature [37] and an additional proprietary project dataset consisting of compounds tested for multiple internal drug discovery projects.
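
    A stratified 20% hold-out keeps the class ratio of the test set close to that of the full dataset, which matters for the imbalanced end points. The sketch below is a plain-Python illustration with a hypothetical function name and an assumed fixed random seed; it is not the authors' partitioning code.

```python
import random
from collections import defaultdict


def stratified_holdout(labels: list[str], test_fraction: float = 0.2,
                       seed: int = 0) -> tuple[list[int], list[int]]:
    """Split indices into train/test so each class keeps its proportion."""
    rng = random.Random(seed)  # assumed seed, for reproducibility only
    indices_by_class: dict[str, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train: list[int] = []
    test: list[int] = []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Made-up imbalanced dataset: 80 inactive and 20 active compounds.
labels = ["Inactive"] * 80 + ["Active"] * 20
train_idx, test_idx = stratified_holdout(labels)
n_active_in_test = sum(labels[i] == "Active" for i in test_idx)
```

    Here the hold-out set contains 20 compounds, 4 of which are active, mirroring the 20% active ratio of the full set.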

    Assessment of compounds based on predicted classes using scalarization

    Predictions were scored based on favorability, with favorable predictions awarded a score of 1, unfavorable 0 and inconclusive 0.5, as represented in the ‘Awarded score’ column of Table 1. Individual predictions were used to calculate the sum score, or the average of all prediction scores. The geometric mean score was calculated using the nth root of the product of all model scores, where the value of n was 13 to represent the 13 models within the SAFIRE platform.
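
    The two aggregate scores described above can be written in a few lines. This is a sketch based only on the description in this section (the sum score as the average of the per-model scores, the geometric mean as the nth root of their product); the function names and the example score lists are illustrative.

```python
from math import prod


def sum_score(scores: list[float]) -> float:
    """Average of all per-model scores (called the sum score in the text)."""
    return sum(scores) / len(scores)


def geometric_mean_score(scores: list[float]) -> float:
    """nth root of the product of the n per-model scores."""
    return prod(scores) ** (1 / len(scores))


# 13 models: twelve favorable predictions and one inconclusive (0.5).
mostly_favorable = [1.0] * 12 + [0.5]
# 13 models: twelve favorable predictions and one unfavorable (0).
one_liability = [1.0] * 12 + [0.0]

avg_a = sum_score(mostly_favorable)             # about 0.96
geo_a = geometric_mean_score(mostly_favorable)  # about 0.95
avg_b = sum_score(one_liability)                # about 0.92
geo_b = geometric_mean_score(one_liability)     # exactly 0.0
```

    The second example shows why the geometric mean is the stricter of the two: a single score of 0 drives the product, and therefore the overall score, to 0, whereas the sum score drops only slightly.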

    Results & discussion

    Models demonstrated favorable performance meeting industry thresholds

    At its inception, the BioPrint database [9] sought to understand what makes certain drug molecules better than unsuccessful candidates. It contains approximately 2500 compounds and consists of 60% marketed drugs, 5% drugs tested in the clinic but not marketed, 1% drugs in clinical trials, 2% withdrawn drugs and 20% standard pharmacology references. The remaining 12% includes an assortment of compounds with varied purposes. Overall, the BioPrint database contains over 1 million experimental datapoints, including more than 25,000 ADMET data points, making it ideal for training ML models.

    When developing these models, the authors used molecular descriptors and AI methods to deliver the most robust and accurate models possible with the hope of achieving performance similar to that of large pharmaceutical companies that have access to tens of thousands of compounds and associated assay data [1,2]. For performance, the authors valued accuracy because it is a straightforward metric for contextualizing the proportion of correctly predicted observations within the data. However, accuracy is not an ideal metric for determining performance on imbalanced datasets. A skewed distribution of the dominating class can lead to misleading accuracy values. To ensure the performance metrics were not misleading, the authors emphasized the need for models to perform with a favorable MCC in addition to reasonable accuracy. MCC is considered a more versatile and balanced measure of a model's performance [35]. Rather than examining only the correct number of predictions, MCC is based on all components of the confusion matrix, providing a more complete representation of model performance at crucial stages of development. The authors set their performance goals at an accuracy of at least 75% and an MCC of 0.4 and evaluated each of the models using the respective hold-out sets.
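
    The gap between the two metrics can be seen in a small worked example. The counts below are hypothetical, chosen only to mimic a 14% active and 86% inactive class ratio, and are not results from this work. With actives as the positive class, suppose a model evaluated on 100 compounds yields TP = 2, FN = 12, FP = 1 and TN = 85:

    $$\mathrm{Accuracy} = \frac{2 + 85}{2 + 85 + 1 + 12} = 0.87$$

    $$\mathrm{MCC} = \frac{2 \times 85 - 1 \times 12}{\sqrt{(2 + 1)(2 + 12)(85 + 1)(85 + 12)}} = \frac{158}{\sqrt{350364}} \approx 0.27$$

    Such a model would clear the 75% accuracy goal while falling well short of an MCC of 0.4, because it recovers only 2 of the 14 active compounds.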

    Overall, SAFIRE models performed with high accuracy and favorable MCC. As evident in Figure 1, SAFIRE models had an average accuracy of 83%, with all 13 models exceeding the accuracy threshold of 75%. Additionally, all models demonstrated an MCC greater than or equal to 0.4, exceeding the threshold for predictive quality used in other industry pursuits [1]. Due to the comparable performance to similar efforts in the industry, it was concluded that the models were reliable enough for predicting how test compounds might be classified for an experimental assay.

    Figure 1. Accuracy of ADMET models (A), CYP models (B) and Matthews correlation coefficient of absorption, distribution, metabolism, excretion and toxicity models (C) and CYP models (D) with 20% stratified test sets partitioned from the training set but unused in machine learning.

    Grey lines on each graph represent performance thresholds of 75% accuracy and 0.4 Matthews correlation coefficient. Exact values are shown above each bar, and bars are colored by experimental assay type.

    ADMET: Absorption, distribution, metabolism, excretion and toxicity; CYP: Cytochrome P450 isoform; MCC: Matthews correlation coefficient; PPB: Plasma protein binding.

    In addition to accuracy and MCC, the authors examined the strengths and weaknesses of the models by examining the confusion matrices for each model, as shown in Figure 2. Confusion matrices demonstrate the percentage of TP, TN, FP and FN predictions. The confusion matrices in Figure 2 are heatmaps, and darker shading along the left-to-right diagonal indicates a higher percentage of accurate predictions. The authors targeted at least 0.65 TP/TN value for each class. Lower TP/TN values were observed for metabolic stability, CYP2C8 inhibition and CYP2B6 inhibition. Metabolic stability is known to be quite challenging to predict, as overall metabolism can be a function of multiple mechanisms, so the authors were pleased to achieve a favorable TP/TN value, given the technical challenges. While CYP2C8 and CYP2B6 play important roles in drug metabolism and interactions, they are often not prioritized as highly as other CYP isoforms with more detrimental metabolic effects. On the whole, the majority of SAFIRE models achieved the 0.65 TP/TN value threshold.

    Figure 2. Absorption, distribution, metabolism, excretion and toxicity and CYP models per-class performance.

    Confusion matrices illustrating the ratio of compounds in each class assigned to particular classes for 20% stratified test set partitioned from the training set but unused in machine learning.

    CYP: Cytochrome P450 isoform.

    While the BioPrint database used to train most of the SAFIRE models is smaller than typical proprietary datasets available to large pharmaceutical companies, the accuracy and MCC performance achieved by SAFIRE models was comparable to previously described standards of performance [1], indicating that the BioPrint dataset effectively supported model development. The promising model performance was attributed to the quality and diversity of the BioPrint dataset. The BioPrint database has enhanced quality and control offered by experimental consistency via specific defined assay protocols, which is not necessarily inherent to or evident in every publicly available dataset. Many AI models exist in the field. While some ADMET AI models are freely available and trained with public data, others are developed by large pharmaceutical companies using internal proprietary datasets and are inaccessible to the everyday user. To further demonstrate the efficacy of the BioPrint data, the authors compared SAFIRE models to one commercially available and four academically available ADMET model packages and observed that, overall, SAFIRE models performed as well if not better than similar platforms. The novelty of this work stems from the partnering of diverse, high-quality BioPrint data with present-day AI methods. The authors paired high-quality training data with accessible models that users can leverage for valuable insights that could streamline decision-making.

    Training models with combination of proprietary & public data led to improved performance

    During the development of SAFIRE models, the authors began by training solely with BioPrint data when available. In some cases such as solubility, ER and certain CYP isoforms, dataset imbalance led to challenges in predictive quality and accuracy. The authors sought to determine whether the inclusion of public data in addition to proprietary data led to higher quality models. They trained models using isolated datasets from either BioPrint or the public domain. They then compared accuracy, MCC and confusion matrices with SAFIRE models developed with both proprietary and public data. Figure 3 demonstrates the performance of models developed with different training sets.

    Figure 3. Examination of model performance when training with proprietary, public, or both proprietary and public datasets.

    Comparison of accuracy (A) and Matthews correlation coefficient (B) of solubility models trained with only BioPrint data (light blue), only publicly available data (navy blue) and both proprietary and public data (medium blue). Metrics were calculated using respective 20% stratified hold-out datasets partitioned from the initial training data that were never used during model training, a literature dataset and a proprietary project dataset from Eurofins Beacon Discovery. Confusion matrix heatmaps for 20% stratified hold-out datasets partitioned from training data (C) and for a proprietary dataset (D). Darker shading indicates a higher percentage of compounds respective to a particular category. Comparison of accuracy (E) and Matthews correlation coefficient (F) of CYP1A2, 2D6, 2C9, 2C19 and 2C8 and efflux ratio models trained with only BioPrint data (light blue), only publicly available data (navy blue) and both proprietary and public data (medium blue) calculated with respective 20% stratified hold-out sets.

    CYP: Cytochrome P450 isoform; MCC: Matthews correlation coefficient.

    The accuracy (Figure 3A) and MCC (Figure 3B) of training with BioPrint, public and both BioPrint and public datasets are outlined in Figure 3. The authors' solubility model trained with both BioPrint and public data tended to perform better than models trained with either proprietary or public data alone, as evidenced by consistently better MCC values and moderately better accuracy values. The BioPrint-trained model struggled to obtain quality predictions, as indicated with the lower MCC values in Figure 3B, which was likely due to the dataset imbalance. This was further evidenced by the confusion matrix heatmaps in Figure 3C & D, as an ideal output has darker shading along the left-to-right diagonal rather than the edges. The model trained with only BioPrint data tended to assign compounds erroneously to the high class. The model trained with only public data also demonstrated a tendency to assign low-solubility compounds to the high class. Training with both public and proprietary data led to improvements in the confusion matrices, most notably for medium-solubility compound classification. A similar exercise was executed for additional models with comparable outcomes, as shown in Figure 3E & F, which illustrate the performance of CYP isoform and ER models developed with proprietary, public and combined data. Overall, this exercise demonstrated the value of high-quality proprietary data while maintaining the utility of publicly available data.

    Model training sets had sizable chemical space coverage

    In an attempt to understand the coverage of chemical space upon which these models were trained, the authors examined the chemical space associated with training set molecules. Chemical space is difficult to express and depends on the specific ways in which the molecules have been described. A historically used method incorporated principal moment of inertia (PMI) plots for visualization, with the signature triangular area representative of different molecule shapes such as rods, discs and spheres in a dataset [10]. One previous effort, Reymond's Chemical Space Project [10,38], used chemical space triangle visualizations to compare datasets such as PubChem and ChEMBL with a database with all generated compound possibilities for up to 17 heavy atoms. While these visualizations do not quantitatively define chemical space, such efforts can be used to quickly conceptualize the qualitative overlap between datasets. With this in mind, the authors compared SAFIRE model training sets to current market drugs as an effort to demonstrate coverage of chemical space. For the assessment, training set molecules were converted from SMILES to optimized 3D conformers that were used to assess the normalized first (smallest) PMI and normalized second PMI. Figure 4 demonstrates the comparison of chemical space sampled for current market drugs (Figure 4A), BioPrint data for the metabolic stability training set (Figure 4B), BioPrint plus public data for the solubility training set (Figure 4C) and BioPrint plus public data for CYP2D6 training (Figure 4D).

    Figure 4. Assessment of chemical space coverage of (A) current market drugs (∼3600 compounds), (B) the metabolic stability model training set (∼1700 compounds) from the BioPrint database, (C) the solubility model training set (∼10,000 compounds) containing both the BioPrint database and publicly available data and (D) the CYP2D6 model training set (∼10,000 compounds) with both BioPrint and publicly available data.

    Conformers were generated for each training molecule; then the principal moment of inertia values were determined and plotted. The triangle represents the entire chemical space region, with the different molecular shapes (rod, sphere and disc) labeled on each vertex. Compounds were plotted using colors based on a density heatmap, with red/orange indicating a denser region and blue/purple indicating a sparsely populated region.

    PMI1: First principal moment of inertia; PMI2: Second principal moment of inertia.

    As evident from the chemical space plots, the BioPrint dataset occupied a similar region of chemical space as current market drugs; however, the inclusion of additional literature data led to increased occupancy and more overlapping coverage with current market drugs. All models had high occupancy of the upper left vertex, or rod-shaped molecules. The BioPrint database provided an excellent foundation for model training, and supplementing training sets with publicly available data allowed for the expansion of chemical space on which models were trained and for the increase of the applicability domain of the models.

    Molecule ranking using a simple scoring function to aggregate model outputs

    To guide unbiased decision-making, the authors strove to provide researchers with data-driven means to objectively prioritize candidate molecules. While each SAFIRE model could be used to predict potential liabilities for individual pharmacological factors, additional scoring was included to provide a comprehensive analysis across all SAFIRE model predictions. To achieve this, two scoring functions were implemented to quantitatively sort molecules of interest by overall predicted favorability.

    Figure 5 illustrates the predictive scheme of the SAFIRE models. After standardizing input compounds and assigning predicted ADMET classes, the authors scored the compounds based on the favorability of predictions. For example, compounds with high solubility received a score of 1, those with medium solubility a score of 0.5 and those with low solubility a score of 0. The first and most flexible scoring feature was the sum score, which represented the average of all prediction scores. The second scoring feature was the geometric mean score, which was calculated by taking the 13th root of the product of all 13 model scores. The geometric mean is generally known to be less impacted by variability in a dataset and thus potentially more representative of the molecular series as a whole. However, it was also a much stricter method of scoring: if even one model returned an unfavorable prediction (a score of 0), the product was 0, so the molecule received an overall score of 0 and failed, which may not be ideal in all scenarios.

    Figure 5. Predictive scheme for drug candidates using SAFIRE models.

    Molecules were imported into the workflow, then standardized prior to descriptor generation. The SAFIRE models determined predicted classes for each experimental property; then predictions were assigned a score based on favorability. Per-model scores were incorporated into a sum score and geometric mean score to guide decision-making for candidates. Score graphs are an example snapshot not representative of actual data.

    FB: Fraction bound; PPB: Plasma protein binding.

    To provide guidance on the applicability of each scoring approach in the drug discovery life cycle, the authors identified three common scenarios. First, hit expansion projects typically contain many compounds (1000–10,000 compounds) with the goal of narrowing the scope to one to five compound series of interest. The sizable number of compounds at this stage can be difficult to condense into promising candidates. The geometric mean would simplify the series identification process such that more promising series could be prioritized quickly based on the objectives of the project. The sum score would be a more appropriate scoring function for lead optimization, as it provides additional flexibility to consider midrange molecules still worth pursuing at this stage of drug discovery. Finally, users can pursue a personalized scoring function if they desire a different prioritization of ADMET properties. Specific experimental property predictions could be emphasized, minimized or eliminated depending on project goals. Overall, the scoring system provided additional usability to SAFIRE models and value to the small molecule drug discovery process.

    To enable enhanced user decision-making, the authors' ADMET models and scoring functions are available within the SAFIRE application online. Users may input their compounds as SMILES via the input box or as an input file (comma-separated values file, Microsoft Excel file or structured data file). The platform additionally includes calculated physicochemical properties and druglikeness filters to provide researchers with additional parameters for characterizing and prioritizing molecules most likely to result in successful drug development. The authors envision users interacting with the SAFIRE platform as a tool to quickly inform them of potential ADMET liabilities. The models are intended to serve as a starting point for prioritizing compounds to be sent for in vitro evaluation, where computationally proposed liabilities can be verified or disproved experimentally. SAFIRE was trained with BioPrint and high-quality literature data for small molecules and is intended for use in small molecule drug discovery projects. As different experimental properties may be critical for different drug discovery goals, the authors anticipate that users will combine the context of their objectives with the knowledge that SAFIRE predictions and scoring systems provide. Projects might require distinct ranking systems based on desired properties. The scoring system based on prediction favorability offers a foundation that users can then tailor to their project-specific needs.

    Conclusion

    Quickly and accurately classifying ADMET performance of drug candidates is crucial for reducing resources required during small molecule drug discovery. The SAFIRE models, developed with high-quality data from the BioPrint database, demonstrated high accuracy and MCC comparable to other efforts in the field. All models met or exceeded the 75% accuracy and 0.4 MCC thresholds. Most SAFIRE models met or exceeded the 0.65 TP/TN threshold value, suggesting strong performance on a per-class basis as well as an overall model basis. The inclusion of the BioPrint database in training the models led to high performance, suggesting that data quality supersedes data quantity, as we met the performance thresholds reported for models built on large proprietary datasets without having an extremely large proprietary dataset ourselves. As the first iteration of AI models trained with BioPrint data, the SAFIRE platform will be useful for medicinal chemists and drug discovery scientists throughout the field. We plan to continue improving our models in both performance and design. Additional refinements of ADMET models will be pursued through the acquisition and incorporation of high-quality proprietary data based on updates to the drug market landscape.

    Training models with proprietary and, in some cases, public data led to models with strong performance metrics. Incorporating public data into training sets provided additional chemical space coverage while improving on the performance achieved by training with the BioPrint data alone. This can be attributed to improved training dataset balance and an expanded applicability domain, which make SAFIRE models relevant to a larger region of chemical space. Chemical space, however, remains vast. While we qualitatively assessed the chemical space on which our models were trained, we recognize that a significant region of that space has yet to be explored. We speculate that strategically and rationally expanding the BioPrint database with compounds from diverse regions of chemical space will further improve the performance and usability of SAFIRE models. Upcoming efforts will focus on continuing to explore chemical space to improve AI model development and better capture the future of small molecule drug space. Additional efforts will prioritize the development of local models based on project-specific datasets, tailoring AI models to chemical series that occupy different regions of chemical space.
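    One common quantitative complement to a qualitative chemical space assessment is a nearest-neighbor similarity check against the training set. The sketch below illustrates that general idea and is not the method used in this work. Fingerprints are represented as hypothetical sets of on-bit indices (for example, as might be derived from a circular fingerprint), and the 0.3 similarity cutoff is an arbitrary assumption.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_training_similarity(query, training):
    """Highest Tanimoto similarity between a query and any training compound."""
    return max(tanimoto(query, fp) for fp in training)


def in_domain(query, training, cutoff=0.3):
    """Flag a query as inside the applicability domain (cutoff is an assumption)."""
    return nearest_training_similarity(query, training) >= cutoff


# Hypothetical fingerprints as sets of on-bit indices.
training_fps = [{1, 4, 9, 16}, {2, 4, 8, 16}, {3, 9, 27}]
query_inside = {1, 4, 9, 25}   # shares three bits with the first training compound
query_outside = {5, 7, 11}     # shares no bits with any training compound

print(nearest_training_similarity(query_inside, training_fps), in_domain(query_inside, training_fps))
print(nearest_training_similarity(query_outside, training_fps), in_domain(query_outside, training_fps))
```

    Predictions for compounds that fall below the cutoff could be treated with more caution, since they lie further from the chemistry the model was trained on.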

    The SAFIRE platform's predictive scheme also includes scoring functionality to help guide candidate selection based on prediction favorability. The sum score, which captures the average of the model scores, supports a flexible assessment in which a few unfavorable predictions do not lead to deprioritization; this is particularly valuable at the lead optimization stage of small molecule drug discovery. The geometric mean provides a stricter scoring system that can be used to quickly prioritize large collections of compounds, such as those explored during hit expansion. These scoring systems can aid users at each stage of the drug discovery lifecycle by streamlining decision-making, saving time and reducing costs. Future work will explore other scoring methodologies to continue offering robust functionality to users.
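    The difference between an averaged score and a geometric mean can be seen in a small worked example. This is a minimal sketch under stated assumptions, not the platform's implementation: per-model favorability scores are assumed to lie between 0 and 1, and the function names and score values are hypothetical.

```python
import math


def mean_score(scores):
    """Arithmetic mean of per-model favorability scores (lenient)."""
    return sum(scores) / len(scores)


def geometric_mean_score(scores):
    """Geometric mean of favorability scores (strict); 0 if any score is 0 or less."""
    if any(s <= 0 for s in scores):
        return 0.0
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Hypothetical compounds: one with a single poor prediction, one uniformly moderate.
one_liability = [0.9, 0.9, 0.9, 0.1]
uniform = [0.7, 0.7, 0.7, 0.7]

for name, scores in [("one_liability", one_liability), ("uniform", uniform)]:
    print(name, round(mean_score(scores), 3), round(geometric_mean_score(scores), 3))
```

    Both compounds have the same arithmetic mean (0.7), but the geometric mean of the compound with one poor prediction drops to about 0.52. This is the sense in which the averaged score tolerates a few unfavorable predictions while the geometric mean penalizes them.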

    Reflecting on the future of the field, we anticipate that it will continue to evolve around the further exploration of chemical space. As computational resources become more accessible, chemical space can be better surveyed and understood such that small molecules outside of traditionally pursued chemical space can be considered in drug discovery. AI models can continue to be trained on the additional chemical space as it is explored. Further efforts in the field will pursue a balance between automated machine learning endeavors and maintaining explainable, understandable AI models that can be further utilized in various stages of drug discovery.

    Overall, our models demonstrated the quality of the BioPrint data and the utility of ‘gold-standard’ data in the development of AI methodologies. Interested users can explore the SAFIRE platform for their own small molecule drug discovery efforts by visiting www.eurofinsdiscovery.com/solution/safire.

    Summary points
    • The BioPrint database is a high-quality proprietary dataset consisting of ~2500 compounds with experimental datapoints for a variety of absorption, distribution, metabolism, excretion and toxicity (ADMET) and target-based assays.

    • The SAFIRE platform contains artificial intelligence models that were trained with the BioPrint database for the following ADMET properties: plasma protein binding, aqueous solubility, human microsomal stability, Caco-2 permeability, Caco-2 efflux ratio, hERG inhibition and CYP inhibition for isoforms 1A2, 2D6, 2C9, 2C19, 3A4/3A5, 2C8 and 2B6.

    • ADMET models within the SAFIRE platform demonstrated performance that met or exceeded performance thresholds of 75% accuracy and 0.4 Matthews correlation coefficient with validation sets.

    • Model training efforts with both BioPrint and publicly available datasets demonstrated that dataset balance is a key component of model performance.

    • The SAFIRE platform employs two simple scoring functions to guide user decision-making based on ADMET predictions. It also offers calculated metrics and is available to potential users via an online application.

    • The chemical space occupied by the BioPrint dataset was similar to the chemical space occupied by current market drugs.

    • The BioPrint database, while being a smaller dataset, demonstrated that high-quality datasets are more valuable for achieving high-performing models than large-quantity datasets, as it achieved performance results comparable to those of other pharmaceutical endeavors.

    Author contributions

    CR Sage and LM Goncalves were responsible for the conception and design of the work. SE Biehn and CR Sage wrote the manuscript, and SE Biehn generated all figures for the manuscript. SE Biehn, S Ramirez and C Mueller were responsible for dataset curation and retrieval and model development. SE Biehn created all models for the SAFIRE platform in Python. J Lehmann provided analysis and conceptual research along with dataset curation and advised on the analysis and interpretation of data for the work. SE Biehn and J Lehmann assessed publicly available models. JD Marty provided project management structure and advised on the analysis and interpretation of data for the work. C Mueller created the web service for platform model usage. F Tillier retrieved proprietary data and consulted on the usage, interpretation and analysis of proprietary data. All authors approved the draft.

    Acknowledgments

    The authors would like to thank all members of the Eurofins DiscoveryAI group for their feedback and insights along with various members of Eurofins Discovery for their feedback and guidance on the platform and manuscript.

    Financial disclosure

    The authors declare the following financial interests/relationships which might be viewed as potential competing interests: all authors are employees of Eurofins Panlabs, Inc.

    Competing interests disclosure

    The authors have no competing interests or relevant affiliations with any organization or entity with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending or royalties.

    Writing disclosure

    No writing assistance was utilized in the production of this manuscript.

    Open access

    This work is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

    Papers of special note have been highlighted as: • of interest; •• of considerable interest

    References

    • 1. Göller AH, Kuhnke L, Montanari F et al. Bayer's in silico ADMET platform: a journey of machine learning over the past two decades. Drug Discov. Today 25(9), 1702–1709 (2020). •• An overview of Bayer's absorption, distribution, metabolism, excretion and toxicity machine learning efforts, including commentary on performance requirements, particularly Matthews correlation coefficient ≤0.4 as a model quality threshold.
    • 2. Kumar K, Chupakhin V, Vos A et al. Development and implementation of an enterprise-wide predictive model for early absorption, distribution, metabolism and excretion properties. Future Med. Chem. 13(19), 1639–1654 (2021). • Janssen's graph convolutional neural network-based platform for predicting 18 absorption, distribution, metabolism, excretion and toxicity end points, trained on large proprietary datasets, demonstrated favorable performance and successful deployment across the company's drug discovery projects.
    • 3. Daina A, Michielin O, Zoete V. SwissADME: a free web tool to evaluate pharmacokinetics, drug-likeness and medicinal chemistry friendliness of small molecules. Sci. Rep. 7(1), 42717 (2017).
    • 4. Xiong G, Wu Z, Yi J et al. ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties. Nucleic Acids Res. 49(W1), W5–W14 (2021).
    • 5. Yang H, Lou C, Sun L et al. admetSAR 2.0: web-service for prediction and optimization of chemical ADMET properties. Bioinformatics 35(6), 1067–1069 (2019).
    • 6. Pires DEV, Blundell TL, Ascher DB. pkCSM: predicting small-molecule pharmacokinetic and toxicity properties using graph-based signatures. J. Med. Chem. 58(9), 4066–4072 (2015).
    • 7. Banerjee P, Dunkel M, Kemmler E, Preissner R. SuperCYPsPred – a web server for the prediction of cytochrome activity. Nucleic Acids Res. 48(W1), W580–W585 (2020).
    • 8. Optibrium. StarDrop: small molecule drug discovery & data visualisation software. https://optibrium.com/stardrop/
    • 9. Krejsa CM, Horvath D, Rogalski SL et al. Predicting ADME properties and side effects: the BioPrint approach. Curr. Opin. Drug Discov. Devel. 6(4), 470–480 (2003). •• The BioPrint database, a collection of compound structures and respective chemical properties and experimental values for a multitude of assays, was used to develop quantitative structure–activity relationship models.
    • 10. Reymond J-L. The Chemical Space Project. Acc. Chem. Res. 48(3), 722–730 (2015). •• The project sought to enumerate chemical space by generating all potential compounds with up to 17 heavy atoms. Chemical space graphs similar to what is shown in this work are shown in the first figure of the reference.
    • 11. Volkamer A, Riniker S, Nittinger E et al. Machine learning for small molecule drug discovery in academia and industry. Artif. Intell. Life Sci. 3, 100056 (2023).
    • 12. Lombardo F, Desai PV, Arimoto R et al. In silico absorption, distribution, metabolism, excretion, and pharmacokinetics (ADME-PK): utility and best practices. An industry perspective from the International Consortium for Innovation through Quality in Pharmaceutical Development. J. Med. Chem. 60(22), 9097–9113 (2017).
    • 13. Rácz A, Bajusz D, Miranda-Quintana RA, Héberger K. Machine learning models for classification tasks related to drug safety. Mol. Divers. 25(3), 1409–1424 (2021).
    • 14. Fromer JC, Coley CW. Computer-aided multi-objective optimization in small molecule discovery. Patterns 4(2), 100678 (2023).
    • 15. Tran TTV, Tayara H, Chong KT. Artificial intelligence in drug metabolism and excretion prediction: recent advances, challenges, and future perspectives. Pharmaceutics 15(4), 1260 (2023).
    • 16. Maltarollo VG, Gertrudes JC, Oliveira PR, Honorio KM. Applying machine learning techniques for ADME-Tox prediction: a review. Expert Opin. Drug Metab. Toxicol. 11(2), 259–271 (2015).
    • 17. Dara S, Dhamercherla S, Jadav SS, Babu CM, Ahsan MJ. Machine learning in drug discovery: a review. Artif. Intell. Rev. 55(3), 1947–1999 (2022).
    • 18. Jiménez-Luna J, Grisoni F, Schneider G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2(10), 573–584 (2020).
    • 19. Mendez D, Gaulton A, Bento AP et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47(D1), D930–D940 (2019).
    • 20. Cui Q, Lu S, Ni B et al. Improved prediction of aqueous solubility of novel compounds by going deeper with deep learning. Front. Oncol. 10, (2020).
    • 21. Lou C, Yang H, Wang J et al. IDL-PPBopt: a strategy for prediction and optimization of human plasma protein binding of compounds via an interpretable deep learning method. J. Chem. Inf. Model. 62(11), 2788–2799 (2022).
    • 22. Ingle BL, Veber BC, Nichols JW, Tornero-Velez R. Informing the human plasma protein binding of environmental chemicals by machine learning in the pharmaceutical space: applicability domain and limits of predictability. J. Chem. Inf. Model. 56(11), 2243–2252 (2016).
    • 23. Du F, Yu H, Zou B, Babcock J, Long S, Li M. hERGCentral: a large database to store, retrieve, and analyze compound-human Ether-à-go-go related gene channel interactions to facilitate cardiotoxicity assessment in drug development. Assay Drug Dev. Technol. 9(6), 580–588 (2011).
    • 24. National Center for Biotechnology Information. PubChem bioassay record for AID 1851, cytochrome panel assay with activity outcomes (2009). https://pubchem.ncbi.nlm.nih.gov/bioassay/1851
    • 25. XGBoost parameters – xgboost 2.0.3 documentation (2022). https://xgboost.readthedocs.io/en/stable/parameter.html
    • 26. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
    • 27. Pedregosa F, Varoquaux G, Gramfort A et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12(85), 2825–2830 (2011).
    • 28. Landrum G. RDKit: open-source cheminformatics software (2023). https://rdkit.org/
    • 29. Rogers D, Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50(5), 742–754 (2010).
    • 30. Breiman L. Random forests. Mach. Learn. 45(1), 5–32 (2001).
    • 31. Cortes C, Vapnik V. Support-vector networks. Mach. Learn. 20(3), 273–297 (1995).
    • 32. Zhang H. The optimality of Naive Bayes (2024).
    • 33. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery, NY, USA, 785–794 (2016). https://dl.acm.org/doi/10.1145/2939672.2939785
    • 34. Berthold MR, Cebron N, Dill F et al. KNIME: the Konstanz information miner. In: Data Analysis, Machine Learning and Applications. Preisach CBurkhardt HSchmidt-Thieme LDecker R (Eds). Springer, Berlin, Germany, 319–326 (2008).
    • 35. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21(1), 6 (2020). • Using six hypothetical and one real-world examples, the authors demonstrate the reliability of the Matthews correlation coefficient to better capture binary classifier performance over other metrics.
    • 36. Jaworska J, Nikolova-Jeliazkova N, Aldenberg T. QSAR applicability domain estimation by projection of the training set in descriptor space: a review. Altern. Lab. Anim. 33(5), 445–459 (2005).
    • 37. Delaney JS. ESOL: estimating aqueous solubility directly from molecular structure. J. Chem. Inf. Comput. Sci. 44(3), 1000–1005 (2004).
    • 38. Ruddigkeit L, van Deursen R, Blum LC, Reymond J-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52(11), 2864–2875 (2012).