LANaPD: Towards a Unified Latin America Natural Products Database

Around the world, the number of compound databases of natural products in the public domain is rising. This is in line with the increasingly synergistic combination of natural product research and chemoinformatics. As part of this global endeavor, countries in Latin America are assembling, curating, and analyzing the contents and diversity of the natural products available in their geographical regions. In this manuscript we collect and analyze the efforts that countries in Latin America have made so far to build natural product databases. We further encourage the scientific community, in particular in Latin America, to continue building high-quality natural product databases and, whenever possible, to make them publicly accessible. It is proposed that all compound collections be assembled into a unified resource called LANaPD: Latin America Natural Products Database. Opportunities and challenges in building, distributing, and maintaining LANaPD are also discussed.


Introduction
For centuries, natural products (NPs) have been the basis for the prevention and treatment of diseases. To date, NPs continue to have a profound impact on drug discovery [1,2]. Many compounds from natural sources, or derivatives of natural products, are now drugs approved for clinical use. Amid the current COVID-19 pandemic, a notable example is chloroquine phosphate, an analogue of the alkaloid quinine, which was originally extracted from the bark of cinchona trees [3] (Figure 1). In addition, NPs have largely contributed starting compounds that are later optimized in terms of potency or of pharmacokinetic or pharmacodynamic properties, or that serve as a source of inspiration for synthesizing organic compounds. It is well known that the health-related benefits of NPs are associated, at least in part, with their structural uniqueness, diversity, and complexity compared with drugs from other sources [4,5]. Chemical informatics (also termed chem(i)oinformatics in the literature) is increasingly contributing to drug discovery at different levels [6][7][8]. For instance, a key contribution of this research discipline is designing, building, and curating compound databases. Indeed, compound databases play a significant role in drug discovery [9], and different collections, in particular those in the public domain, have been reviewed elsewhere [10]. Additional major contributions of chemoinformatics to drug discovery projects are assisting compound screening (for example, filtering compound libraries in silico to select compounds for experimental testing) and analyzing the outcome of experimental screening assays in small, medium, and/or high-throughput format [6,7,11].
Based on the experimental results, chemoinformatics helps to generate and/or refine hypotheses about the mechanism of action of bioactive molecules at the molecular level, and/or to build models that predict the outcome for untested compounds, e.g., as part of a new cycle of in silico screening. All these and several other chemoinformatics tools have been successfully applied to organic compounds, including NPs and food chemicals [12,13]. Recently, the systematic application of chemoinformatics resources has been proposed to further advance the field of Medicinal Organometallic Chemistry [14].
Contributions of informatics to NP research in general, and to NP-based drug discovery in particular, are increasing [15,16]. One such key contribution has been the organization and analysis of chemical information on NPs, with or without biological activity, in compound databases. Over the past five years, several reviews of NP databases have been published [17][18][19][20][21]. Some of these reviews include chemoinformatic analyses of the contents, diversity, and chemical space coverage of the compounds.
Of note, there has been a rapid increase in the number of publicly accessible NP databases. One of the first reviews, published in 2012 [18], included five NP datasets (commercial and non-commercial, with chemical structures available on the web). Recently, the COlleCtion of Open NatUral producTs (COCONUT) database was published; it compiles over 120 databases comprising more than 400,000 non-redundant NPs and is freely accessible [22]. As part of the global efforts, different countries around the world are analyzing the information on NPs from their own territories.
Examples are AfroDB, from Africa [23], and VIETHERB, from Vietnam [24]. Likewise, different Latin American countries are building their own compound databases using chemoinformatics resources [25].
The primary goal of this manuscript is to discuss the recent progress of countries in Latin America in assembling, curating, and analyzing compound databases of NP molecules from their geographical regions. Indeed, Latin American countries are rich in unique biodiversity, and herbal medicine has a strong tradition of use in the region. Herein we also propose to join efforts and assemble a unified Latin America Natural Products Database (LANaPD).

Contents
The first release of NuBBEDB contained approximately 640 compounds collected from publications of the NuBBE research group [26]. Four years later, the same group published an update expanding the number of compounds to more than 2000, thus increasing representation of the large biodiversity in Brazil. The update also had significant enhancements to the web-site interface [27]. Compounds in NuBBEDB are secondary metabolites of plants, fungi, insects, marine organisms, and bacteria.
Compounds in NuBBEDB are annotated with chemical, biological, pharmacological, and spectroscopic data. Chemical information includes IUPAC name, chemical structure, drug-like physicochemical properties, and metabolic class. The biological information comprises species, geographical location, and biological activities. The spectroscopic data include molar mass and nuclear magnetic resonance (NMR) data.

Accessibility and searching capabilities
NuBBEDB is accessible and searchable at the web-site interface (link in Table 1). It is also available at ChemSpider and ZINC 15 [32], where it can be found, for instance, as a natural product catalog. NuBBEDB has recently been included in the COCONUT database [22].
The user can download the entire database or perform on-line searches. The interface offers a broad range of built-in search and filter criteria (Figure 2). For instance, it is possible to search by species, geographical region in Brazil, source, biological properties, chemical structure, drug-like chemical descriptors, spectroscopic data (specifically, NMR information), and bibliographic information.
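As a rough illustration of how multi-criteria searching of this kind works on annotated records, the sketch below filters a small in-memory list. The `CompoundRecord` class, its field names, the `search` function, and the toy entries are all hypothetical and do not reflect the actual NuBBEDB schema or interface.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """Hypothetical annotated record; field names are illustrative only."""
    name: str
    species: str
    region: str
    activities: list = field(default_factory=list)


def search(records, species=None, region=None, activity=None):
    """Return the records that match every criterion that was supplied."""
    hits = []
    for rec in records:
        if species is not None and rec.species.lower() != species.lower():
            continue
        if region is not None and rec.region.lower() != region.lower():
            continue
        if activity is not None and activity.lower() not in [
            a.lower() for a in rec.activities
        ]:
            continue
        hits.append(rec)
    return hits


# Toy entries with made-up annotations, for illustration only.
records = [
    CompoundRecord("compound A", "species one", "region X", ["antioxidant"]),
    CompoundRecord("compound B", "species two", "region X", ["anticancer"]),
    CompoundRecord("compound C", "species one", "region Y", ["antioxidant"]),
]

by_region = search(records, region="region X")
by_both = search(records, species="species one", activity="antioxidant")
```

Criteria left as `None` are ignored, so the same function covers single-field and combined queries. Structure-based searching would additionally require a chemoinformatics toolkit and is outside the scope of this sketch.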

Diversity analysis and other applications
The most recently published version of NuBBEDB was analyzed in terms of the structural diversity and complexity of its chemical structures. To this end, several chemoinformatic tools were employed. As part of the study, the contents and diversity profile of NuBBEDB were compared with those of other commercial and non-commercial NP collections whose chemical structures are freely available. The reference collections included the Universal Natural Product Database, with more than 197,000 molecules [33], and ChEMBL [34]. It was concluded that compounds in NuBBEDB are diverse in terms of molecular fingerprints, chemical scaffolds, and drug-like properties. Using established chemoinformatic tools, the study supported that several compounds in NuBBEDB are promising candidates for drug discovery and medicinal chemistry [35]. Interestingly, the study also revealed that 12% of the chemical scaffolds in NuBBEDB are not present in ChEMBL. Also, an in silico

Developers
Over the past few years, the Center for Pharmacognostic Research on Panamanian

Contents
The first disclosure of CIFPMA contained 354 compounds [28], and the database was recently updated to 454 molecules [29]. CIFPMA contains compounds that have been tested in more than 25 in vitro and in vivo bioassays. Examples of target therapeutic indications are anti-HIV, antioxidant, and anticancer activity. A web-site is under construction. Currently, the chemical structures are available upon request.

Diversity analysis and other applications
The contents and diversity analysis, as well as systematic structure-activity relationship studies, of compounds in CIFPMA have been reported [28,29].

Contents
This database is intended to collect the part of the large biodiversity of Mexico that has been published by the Natural Products Department of the Institute of Chemistry, UNAM.
Compounds in UNIIQUIM are NPs isolated from plants, fungi, marine organisms, and insects found in Mexico. The total number of compounds is not clear from the website, which is available only in Spanish (Table 1).
Compounds in UNIIQUIM are annotated with chemical and biological data, when available. Chemical information includes molecular formula, IUPAC name, CAS number, and chemical structure. Each compound record is linked to its biological activity, if reported in the source publication.

Accessibility and searching capabilities
The UNIIQUIM database is accessible at the web-site interface (Table 1), which is currently available only in Spanish (an English version will be released). It is not possible to download the entire database. The user can browse the contents by displaying either of two lookup tables: a list of chemical compounds and a list of organisms (Figure 3). The user can select the desired chemical compound or organism for specific information. It is also possible to search the database by bibliographic information.

Diversity analysis and other applications
To the best of our knowledge, there are no published reports of applications of UNIIQUIM.
The contents were first reviewed in [37]. It is anticipated that the database will be cited in the near future.

Developers
For the past two years, the Computer-Aided Drug Design at the School of Chemistry group (DIFACQUIM, for its acronym in Spanish) at UNAM has been building and curating an NP database containing compounds isolated in Mexico. The final goal is to capture as much of the Mexican biodiversity as possible.

Contents
The first version of BIOFACQUIM was released in 2019 and contained 423 molecules gathered from publications of the School of Chemistry over a 10-year period [30]. The same year, the database was updated with 148 structures to reach 553 compounds, including molecules isolated not only in that institution but also by research groups in other Mexican institutions. Like other NP databases discussed herein, BIOFACQUIM continues to be updated. Most of the compounds in BIOFACQUIM were isolated from plants, bacteria, and Mexican propolis.
Molecules in BIOFACQUIM are annotated with the chemical name and structure, bibliographic information, kingdom, genus, and species of the NP, and geographical location of the collection. If the biological information is included in the original publication, the activity data is included in the compound record.

Accessibility and searching capabilities
The first version of BIOFACQUIM is accessible and searchable at the "BIOFACQUIM Explorer" web-site (link in Table 1) (Figure 4). It is also available at ZINC 15 and is part

Diversity analysis and other applications
A comprehensive diversity analysis of the first release of BIOFACQUIM was published recently, along with the disclosure of the database itself [30]. It was concluded that compounds in this database have a broad coverage of the chemical space, overlapping with drug-like space as compared to approved drugs. Furthermore, the analysis also

Towards LANaPD
Herein we propose building a unified database of NPs that represents the biodiversity of Latin America. The tasks involved are challenging but can be overcome, some with more difficulty than others; they are discussed below. Guidelines for assembling NP databases have recently been published, in particular for databases intended for use in virtual screening [21].

Collection and standardization
The first step towards creating LANaPD is to bring together all the NP databases and to process and curate them using standard protocols. Although this step is not straightforward, it is feasible. It would be advisable for one research group to take charge of this endeavor, using publicly accessible tools, scripts, or workflows available in public repositories [39,40]. The COCONUT database (vide supra) is an example of a large-scale database assembled and curated from several different sources around the world [22].
However, as discussed above, COCONUT is not focused on specific geographical regions and it does not contain all public databases from Latin America.
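The merging step can be sketched as follows, assuming that each source collection has already been standardized and that every record carries a precomputed structure key such as an InChIKey. The function name, the record layout, the source labels, and the placeholder keys are hypothetical; a real workflow would compute the keys from curated structures with a chemoinformatics toolkit.

```python
def merge_databases(sources):
    """Merge {source_name: [record, ...]} into one non-redundant list.

    Each record is a dict with an 'inchikey' (assumed to be precomputed
    from a standardized structure) and a 'name'. Duplicates across
    sources collapse into one entry that keeps track of its provenance.
    """
    merged = {}
    for source_name, records in sources.items():
        for rec in records:
            key = rec["inchikey"].strip().upper()
            entry = merged.setdefault(
                key, {"inchikey": key, "names": [], "sources": []}
            )
            if rec["name"] not in entry["names"]:
                entry["names"].append(rec["name"])
            if source_name not in entry["sources"]:
                entry["sources"].append(source_name)
    # Dicts preserve insertion order, so entries appear as first seen.
    return list(merged.values())


# Toy input: placeholder keys and names, not real compounds.
toy_sources = {
    "db_a": [
        {"inchikey": "AAAAAAAAAAAAAA-AAAAAAAAAA-N", "name": "compound 1"},
        {"inchikey": "BBBBBBBBBBBBBB-BBBBBBBBBB-N", "name": "compound 2"},
    ],
    "db_b": [
        {"inchikey": "aaaaaaaaaaaaaa-aaaaaaaaaa-n ", "name": "compound one"},
        {"inchikey": "CCCCCCCCCCCCCC-CCCCCCCCCC-N", "name": "compound 3"},
    ],
}

unified = merge_databases(toy_sources)
```

Keeping the list of contributing sources for each entry preserves provenance, so that every compound in the unified collection can still be traced back to its original database.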

Accessibility
Ideally, LANaPD should be made accessible to the public. This can be done by setting up a web server dedicated to the database that follows the Findable, Accessible, Interoperable, and Reusable (FAIR) principles [41]. Another option is to deploy the database in a public repository such as Figshare (https://figshare.com/) or Zenodo (https://zenodo.org/), where uploads are assigned a Digital Object Identifier (DOI) that makes them easily and uniquely citable. LANaPD could also be made accessible through other widely used databases, such as ZINC 15 [32]. NP databases such as BIOFACQUIM and AfroDB, for example, are accessible through ZINC 15.

Maintenance
Updating and maintaining compound databases is of critical importance for the sustained and timely use of the information. This is also a challenging step, in particular for public databases, because of the difficulties in securing sustained funding that virtually all research groups and consortia experience. For instance, it is well known that several web servers in the public domain are discontinued after a certain time [42]. In the NP area, an example is the Universal Natural Products Database [33], which at the time of its release was the largest non-commercial and openly available database and contained 197,201 NPs from plants, animals, and microorganisms. The web-site hosting the database is no longer accessible. One workaround to address this problem is to make use of repositories such as ZINC 15 [32]. Other examples of public databases with sustained financial support are PubChem [43], ChEMBL [34], and DrugBank [44].

Conclusions
In line with the continued significance of NP to drug discovery and the accessibility of