Series of screening compounds with high hit rates for the exploration of multi-target activities and assay interference

Aim: Generation of a database of analog series (ASs) with high assay hit rates for the exploration of assay interference and multi-target activities of compounds. Methodology: ASs were computationally extracted from extensively tested screening compounds with high hit rates. Data: A total of 6941 ASs were assembled comprising 14,646 unique compounds that were tested in a total of 1241 different assays covering 426 specified targets. These ASs were organized and prioritized on the basis of different activity and assay frequency criteria. All ASs and associated information are made available in an open access deposition. Next steps: The large set of ASs will be further analyzed computationally and from a chemical perspective to identify assay interference compounds and candidates for exploring target promiscuity.

From publicly available screening compounds, analog series with high assay hit rates were isolated. The figure shows three analogs (black: conserved core, red: substituents) tested in 357-361 assays and reports their hit rates. Analog series (AS): Series of closely related compounds sharing the same core structure and having different substituents (R-groups) at one or more sites (substitution sites). In this study, only ASs with a single substitution site were considered.
Hit rate (HR): Defined here as the proportion of assays in which a screening compound was active.
Matching molecular series (MMS): A series of two or more compounds that are only distinguished by a chemical modification at a single site. The MMS represents a computational data structure for the systematic extraction of ASs with single substitution sites from large compound sets.
Multi-target activities of small molecules experience increasing interest in pharmaceutical research, especially in the context of polypharmacology [1][2][3]. For example, small subsets of highly promiscuous kinase inhibitors have become a paradigm for polypharmacological drug action in cancer treatment [4]. However, true multi-target activities of compounds must be clearly distinguished from nonspecific interactions and other assay artifacts [5,6], which represent a general problem for biological screening and medicinal chemistry. Compounds that are likely to be reactive or autofluorescent under assay conditions have high potential to cause false-positive activity signals. Such assay interference compounds have been investigated in specific assay formats and instructive case studies, for example [6][7][8], and have also been explored computationally by compound data mining, for example, [9]. However, the study of assay interference is far from being a trivial task since compounds with interference potential typically occur as substructures in larger molecules and their potential reactivities tend to depend on the structural context in which they are presented [9]. Recently, we have reported a large-scale statistical analysis of publicly available screening data to identify extensively tested analog series (ASs) with high hit rates (HRs) across assays [10]. The major goal of this study was the identification of the most frequently active ASs. Collecting such series was thought to provide a general basis for further exploring assay interference of screening compounds versus multi-target activities. Our analysis identified thousands of ASs with unusually high HRs considering the global HR distribution over all screening assays [10]. In this data note, we report an open access deposition of all qualifying ASs and associated assay frequency and activity data. The ASs were organized and prioritized according to different parameters, as discussed in the following.

Methodology
In the following, a summary of the methodological framework of our ASs and HR analysis is provided, yielding the collection of most frequently active ASs that is made freely available. For further methodological details, the interested reader is referred to the original publication [10].

Definition of analog series
To generate ASs, matching molecular series (MMSs) [11] were systematically extracted from selected screening compounds. MMS is an extension of the matched molecular pair formalism [12] and defined as a series of compounds that share the same core structure and are distinguished at a single substitution site [11]. MMSs are algorithmically extracted from compound sources in an efficient manner [11]. For their generation, retrosynthetic rules and size restrictions for substituents were applied [10] such that MMSs represented typically observed ASs with a single substitution site.
So defined ASs with single-site modifications provided HR controls for closely related analogs. For example, if closely related compounds comprising a series are active in a wide variety of assays, interference potential is very likely. However, if they are consistently active against the same set of targets, promiscuity should be further explored. Furthermore, if some analogs are frequently active but others are not, questions concerning experimental data and confidence levels might be raised. Given this important control opportunity provided by comparing analogs, ASs instead of individual compounds were identified to represent the basic structural unit for our study.

Statistical analysis
From PubChem Bioassays [13], the major public repository for biological screening data, a set of 437,257 compounds was preselected that were extensively tested in primary and confirmatory assays, with a median value of 347 assays per compound. The frequency distribution for the much larger number of primary assays was analyzed and a subset of 327,532 compounds selected that were each tested in more than 257 primary assays, with a median of 426 assays per compound. For this subset, the HR distribution over all primary assays was analyzed, yielding a median HR of 0.4% and a third quartile HR of 1.0%. On the basis of this distribution, 80,495 compounds with HR >1.0% were taken and their HR distribution was separately analyzed, yielding a median HR of 1.8%. From this subset, 41,609 compounds with HR >1.8% were then selected as the subset having highest HRs in primary screening assays, with a median HR of 2.9%. More than 93% of these compounds were also active in confirmatory assays. From these 41,609 screening compounds with overall highest HRs, MMSs were systematically extracted.

Series parameters
Three parameters were introduced to characterize MMSs/ASs and calculated as follows: • Cumulative series hit rate The number of unique assays in which one or more analogs comprising a series were active divided by the number of unique assays in which all analogs were tested; • Assay overlap The proportion of assays in which all analogs of a series were tested (shared assays) relative to the union of unique assays;

• Assays with inconsistent activity
The proportion of shared assays in which different analogs of a series were active and inactive.

Data
Analog series with high hit rates From the 41,609 extensively tested compounds with HR >1.8%, a total of 6941 ASs of varying size (two to 17 analogs) were extracted, which contained a total of 17,599 compounds (14,646 unique compounds; 2953 participated in multiple ASs). The 6941 ASs were tested in a total of 1241 assay covering 426 specified targets. Distributions of assay-and target-based HRs for ASs were similar, yielding median values of 5.8% and 4.9%, respectively. Filtering for assay interference candidates [6,14] revealed that 5065 ASs did not contain any known interference compounds. Hence, these ASs should provide a substantial resource for follow-up analysis. ASs statistics and a data summary are given in Table 1. ASs with high priority for follow-up analysis of multi-target activities and/or assay interference should best display high cumulative HR, high assay overlap and low proportion of assays with inconsistent activity. In other words, analogs comprising a prioritized AS should be tested in as many shared assays as possible, be frequently active, and exhibit consistent assay activity profiles. Figure 1 shows an exemplary AS meeting these criteria. The ASs were separately ranked according to criteria 1-3 described above and a rank fusion was calculated to globally organize ASs. Following rank fusion, series were ordered and prioritized according to the smallest sum of ranks (consensus rank).

Open access deposition
All 6941 ASs and the associated assay and activity information have been made freely available as a deposition on the ZENODO open access platform [15]. The deposition contains two text files with ASs and supplementary compound and target information: We note that the current Zenodo deposition [15] contains two versions 1 and 2. The original version 1 was updated to include b.10 (version 2).

Summary & next steps
The AS organization of screening compounds with high HRs was intended to provide structural context information for a further evaluation of assay interference potential and multi-target activities. The selection process was exclusively based on HRs and structural relationships, without preconceived notion of assay interference or target promiscuity. Since compounds forming the ASs were selected to have statistically highest hit rates across primary assays, it is anticipated that series might contain more previously unrecognized interference compounds than candidates for polypharmacology. This hypothesis will be evaluated by an in-depth analysis of the organized ASs from a chemical perspective. Making the ASs and associated information publicly available provides an unbiased basis for computational and knowledge-based follow-up analyses in different laboratories. This is expected to enable further progress in rationalizing undesired assay interference characteristics and differentiating them from true multi-target activities of small molecules. No writing assistance was utilized in the production of this manuscript.

Open access
This

Executive summary
• Compounds with multi-target activity provide the basis for polypharmacology.
• Assay artifacts caused by interference compounds represent substantial problems for biological screening and medicinal chemistry. Methodology • A statistical analysis of assay hit rates (HRs) of extensively tested screening compounds was carried out.
• A subset of compounds with overall highest HRs was selected.
• Analog series (ASs) were systematically extracted from these compounds to provide structural context information for activity analysis. Data • More than 6000 ASs formed by compounds with high HRs were identified.
• The ASs were organized and prioritized on the basis of different parameters.
• An open access deposition of all qualifying ASs and associated compound, assay frequency and activity information was prepared and reported herein. Next steps • The collection of ASs will be further analyzed to identify compounds and series with assay interference potential and multi-target activities. • Open access to the data is expected to enable further progress in understanding assay interference and differentiating it from multi-target activities of small molecules.