A new standardized data collection system for brain stereotactic external radiotherapy: the PRE.M.I.S.E project

Background: In recent years, novel radiation therapy techniques have moved clinical practice toward tailored medicine. An essential role is played by decision support systems, which require a standardization of data collection. The aim of the Prediction Models In Stereotactic External radiotherapy (PRE.M.I.S.E.) project is the implementation of systems that analyze heterogeneous datasets. This article presents the project design, focusing on brain stereotactic radiotherapy (SRT). Materials & methods: First, a raw ontology was defined by exploiting semiformal languages (block and entity relationship diagrams) and natural language; then, it was transposed into a Case Report Form, creating a storage system. Results: More than 130 brain SRT variables were selected. The dedicated software Beyond Ontology Awareness (BOA-Web) was set up and data collection is ongoing. Conclusion: The PRE.M.I.S.E. project provides standardized data collection for a specific radiation therapy technique, namely SRT. Future aims are: including other centers and validating an extracranial SRT ontology.

historically been guided by international guidelines, based on randomized clinical trial evidence that provides patient selection criteria beforehand. Population-based observational studies are recently emerging as a complementary form of research, often named 'rapid learning healthcare' (RLHC), which is essential to ensure that clinical trial results can be translated into tangible benefits for the general population [3]. Data collection quality in the RLHC approach can be low, as data are frequently collected using different procedures; thus, pooled multicenter research is difficult to perform. Standardized data collection improves the quality of this process, defining variables and the way they should be shared without ambiguity [4].
Data collection standardization methods benefit from the use of a common ontology system. Ontologies are commonly defined as an 'explicit specification of a conceptualization' [5]: in our specific context, this is equivalent to a classification system in which uniform and unambiguous definitions represent each variable and all their relationships. A large heterogeneous database is required to store all the information without knowing beforehand what the research topic will be. From the hypothesis, it is determined which features should be included in the learning effort in order to obtain a predictive model that represents the distribution of those features and their relationships inside the dataset. Predictive models are the basis of predictive tool implementation, whether as the more appealing interactive websites or as graphical calculating devices such as nomograms [6,7].
In the oncological literature, several experiences have been published regarding decision support system (DSS) implementations in different anatomical sites [1,4,8,9], but a DSS for a specific radiation technique is not available yet. The PREdiction Models In Stereotactic External radiotherapy (PRE.M.I.S.E.) project is one of the research projects involved in the 'umbrella protocol' [9], which works to facilitate RLHC. The aim of the PRE.M.I.S.E. project is to create a consistent dataset to support the future development of DSSs for stereotactic radiotherapy (SRT), moving toward a 'shared decision making' approach. Doctors, together with patients, will be able to evaluate the pros and cons of different treatment strategies. Clinicians, in turn, will be able to actively discuss and decide on the best therapeutic intervention, once all the features needed to optimize a stereotactic treatment plan have been assessed.

Materials & methods
A multidisciplinary team was created with members from the first two centers involved. Physicians, physicists, nurses and therapists took part in the team. The group planned a set of phases and scheduled periodic meetings to assess the development of each single task: this iterative approach (design, implementation, validation, back to design) helped us exploit the synergy of the multiple disciplines in our team.
The local ethics committee approved the protocol before patients' accrual according to the legislation of the country.
Our general workflow was divided into different phases and can be summed up as follows:
• Ontology definition;
• Setting of the storage system;
• Data analysis.

Ontology definition
To reduce ambiguity in collecting and analyzing data, the first step was the definition of a clear ontology. Although an ontology can intuitively be represented using natural language, this is commonly discouraged: while it is the simplest solution for human-to-human communication, it cannot easily be translated into a formal language to be processed by a computer. The opposite approach, i.e., the immediate development of an ontology in one of the many available formal languages, such as those provided by the World Wide Web Consortium, requires specific training and a multidisciplinary task force, and the final result cannot easily be made understandable to non-specialist physicians for review or external validation. This also makes validating the ontology a complex task, as many participants in multicentric studies lack the required training.
The radiation oncology ontology [10] is an example of an ontology written in the Web Ontology Language for radiotherapy, but the complexity of the language and its implications (in particular for automatic reasoning) are an important barrier to extended use in a real-world scenario. For the aforementioned reasons, we adopted the following strategy: a raw ontology was defined, exploiting some semi-formal languages (block diagrams and entity relationship diagrams) and natural language; it was then transposed into a case report form (CRF). A CRF is a format that can be loaded, parsed and executed by a computer, in which each single clinical variable is described in terms of type, admitted values and relations with other variables. Here, each variable is framed as an attribute of more general entities (or classes) such as patients, treatments, visits and toxicities. The relations among the entities are also provided, with the specification of the cardinalities. Once the ontology has been extensively validated and consolidated by practice, we will consider an implementation with World Wide Web Consortium technology, in order to exploit automatic inference for some minor tasks (e.g., descriptive statistics on the cohort).
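As an illustration, the kind of CRF variable description given above (type, admitted values, owning entity) can be modeled as a small data structure. The following is a hypothetical Python sketch, not the actual BOA format; all names and fields are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CRFVariable:
    """Illustrative description of one clinical variable in a CRF."""
    name: str
    var_type: str                           # e.g., 'date', 'single-select', 'number'
    entity: str                             # owning entity/class, e.g., 'patient'
    admitted_values: Optional[list] = None  # None means free input

    def is_valid(self, value) -> bool:
        """Check a candidate value against the admitted-value constraint."""
        if self.admitted_values is None:
            return True
        return value in self.admitted_values

# Example inspired by the 'phase' variable of the registry level (Table 1)
phase = CRFVariable(
    name="phase",
    var_type="single-select",
    entity="patient",
    admitted_values=[0, 1, 2, 3],  # diagnosis / follow-up / progression / other
)
print(phase.is_valid(1))   # True
print(phase.is_valid(7))   # False
```

Describing each variable this way lets a data-entry tool reject disallowed values automatically, which is the main practical benefit of a machine-readable CRF.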
In building the ontology, the complexity of the knowledge domain was separated into three distinct layers. The Registry level is the most general tier and includes the baseline patient and tumor characteristics (age, gender, ethnicity, etc.), which are considered relevant for epidemiological analysis only. The Procedure level comprises treatment information and related toxicities, and the evaluation of outcome in terms of disease-free survival and acute and late toxicities. The final level is the Research level, which includes clinical and imaging information used for in-depth, advanced research projects only.
In order to implement and use the ontology and guide the work of the designated data managers, the team created the CRF in the format compatible with Beyond Ontology Awareness (BOA), a research electronic data capture software.
Setting of the storage system
BOA utilizes a relational database model as the base of the data layer. SQLite was chosen as the designated database in order to guarantee a degree of portability, by allowing installation of the complete service on a wide variety of devices. A part of the implemented ontology structured in the SQLite database is shown in Figure 1. Specifically, a single archive was created for the study and successively populated with patients, each of whom could have one or more pathologies; each of these pathologies could have one or more treatments. The specific CRFs designed for this study were then imported into BOA and converted to the required structure: each CRF held a multitude of related questions of various types (e.g., dates, single-select lists, multiselect lists or other input types), with specific constraints for allowable inputs defined during the definition of the ontology. During the data entry phase, CRF links were automatically generated, linking each recorded answer to a specific question and finally linking each completed CRF to a specific phase in the patient history (e.g., first contact with the patient). This architecture not only guarantees the integrity of the ontology, but also greatly eases any subsequent data extraction and analysis effort. BOA itself is structured as a portable Django webservice, which allows data managers to quickly access the required interfaces and automatically handles all required data validation aspects. An example screenshot of a CRF is shown in Figure 2. BOA can store data in two different ways, depending on the needs of the center and the wishes of the participants: BOA.Cloud and BOA.Local.
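As a minimal sketch of the one-to-many structure described above (a single archive; patients with one or more pathologies, each with one or more treatments), the following illustrative SQLite schema can be built with Python's standard sqlite3 module. It is an assumption-based simplification, not the actual BOA data layer.

```python
import sqlite3

# Illustrative one-to-many chain: patient -> pathology -> treatment.
# Table and column names are hypothetical, not the real BOA schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient   (id INTEGER PRIMARY KEY);
CREATE TABLE pathology (id INTEGER PRIMARY KEY,
                        patient_id INTEGER NOT NULL REFERENCES patient(id));
CREATE TABLE treatment (id INTEGER PRIMARY KEY,
                        pathology_id INTEGER NOT NULL REFERENCES pathology(id),
                        technique TEXT);
""")

# One patient with one pathology treated twice
conn.execute("INSERT INTO patient (id) VALUES (1)")
conn.execute("INSERT INTO pathology (id, patient_id) VALUES (10, 1)")
conn.execute("INSERT INTO treatment (pathology_id, technique) VALUES (10, 'SRT')")
conn.execute("INSERT INTO treatment (pathology_id, technique) VALUES (10, 'SRT')")

# Count all treatments belonging to patient 1 through the pathology link
n = conn.execute("""SELECT COUNT(*) FROM treatment t
                    JOIN pathology p ON t.pathology_id = p.id
                    WHERE p.patient_id = 1""").fetchone()[0]
print(n)  # 2
```

An in-memory database is used here only for the sketch; the portability argument in the text comes from SQLite storing the whole database in a single file.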

BOA.Cloud
The collected data are automatically anonymized and transferred to a large cloud-based database. After the transfer, it is not possible to reconstruct the history of the transferred data or the pertaining patient files, due to the complete anonymization algorithm, which does not allow identifying information, including unique IDs, to be conveyed.
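A hedged sketch of what irreversible anonymization before a cloud transfer could look like: identifying fields are dropped rather than hashed, and the record receives a freshly generated ID with no link to the original one, so the history cannot be reconstructed. Field names and logic here are hypothetical, not the actual BOA algorithm.

```python
import uuid

# Hypothetical list of identifying fields to strip before upload
IDENTIFYING_FIELDS = {"name", "date_of_birth", "patient_id", "address"}

def anonymize(record: dict) -> dict:
    """Drop identifying fields and assign a new, unlinkable record ID."""
    clean = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    # A random UUID carries no information about the original patient ID,
    # so the mapping back to the source record is not recoverable.
    clean["record_id"] = str(uuid.uuid4())
    return clean

rec = {"patient_id": "P-042", "name": "...", "age": 63, "technique": "SRT"}
anon = anonymize(rec)
print(sorted(anon))  # ['age', 'record_id', 'technique']
```

Dropping (rather than encrypting or hashing) the identifiers is what makes the process one-way, matching the "complete anonymization" requirement stated above.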

BOA.Local
The data are stored in a local database, in a secure area that prohibits any data exchange between the local client and other computers in the local area network or on the internet.
The two distinct pathways and their optional convergence toward a final database are highlighted in Figure 3, which depicts a general overview of how the BOA service is laid out. In this particular example, the institution marked in blue works through a BOA.Local installation that can (if desired) upload data to the purple BOA.Cloud Master server on demand, while the institutions marked in green and yellow connect directly to the aforementioned BOA.Cloud Master server, without the need to store data locally.

Data analysis
One of the goals of the PRE.M.I.S.E. project is also to be able to support multicentric clinical studies. To face this challenge, the project cannot rely solely on a collection of data stemming from a local repository or a centralized database, as these options present remarkable problems concerning patient privacy. Techniques such as anonymization or de-identification are risky because part of the information is shared; data encryption or homomorphic encryption are suboptimal as the data can potentially be decrypted. To ensure patient privacy and guarantee data ownership, PRE.M.I.S.E. exploits distributed learning to generate statistical models across multiple separated databases in the various BOA installation sites (both BOA.Local and BOA.Cloud infrastructures): through this paradigm, only aggregated data are shared or transferred; the data never leave the databases [11] and the metaphorical walls of the institutions in which they are stored.
In more detail, the distributed learning architecture is composed of one central master node and many client nodes (called local learning nodes) distributed across all database end points. The master node has the primary task of coordinating and overseeing the learning protocols between the single hospitals and, as such, never has direct access to clinical data; it only processes aggregate data, as necessary for the algorithms intended to be run. The second part of the architecture is composed of the many local learning nodes, which are installed at each hospital. They have access to the local data and perform learning tasks as instructed by the master node. Patient data are not shared with the outside world.
The complete algorithm can be summed up by the following steps:
• The local application learns a model from local data;
• This local model is sent to the master, where it is processed and compared with the models sent by the other hospitals;
• A consensus model is generated and sent back to each hospital for refinement;
• After preset convergence criteria are met, a final consensus model is generated.
The information exchanged between master and local nodes is limited to aggregate values (e.g., parameter weights, general statistics, coefficients) and contains no patient data. All traffic between master and local nodes is managed, monitored and audited by the infrastructure. An entire learning run is an iterative process that usually requires many cycles until the master determines that the learning process has been completed.
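The iterative master/local cycle described above can be sketched as a federated-averaging-style scheme on synthetic data; the actual PRE.M.I.S.E. algorithms may differ, and everything below is illustrative. Note that only the fitted coefficients (aggregate values) leave each simulated "hospital".

```python
import numpy as np

rng = np.random.default_rng(0)

def local_fit(X, y, start, lr=0.1, steps=50):
    """Local learning: plain gradient descent on squared error,
    starting from the consensus model sent by the master."""
    w = start.copy()
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

# Three hospitals with private data drawn from the same true model w* = [2, -1]
hospitals = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)
    hospitals.append((X, y))

consensus = np.zeros(2)
for _ in range(10):                                  # master-coordinated cycles
    local_models = [local_fit(X, y, consensus) for X, y in hospitals]
    consensus = np.mean(local_models, axis=0)        # only coefficients shared

print(np.round(consensus, 1))  # close to [2, -1]
```

Each cycle mirrors the four steps above: local learning, transmission of the local model, consensus by averaging at the master, and redistribution for refinement; raw `(X, y)` pairs never leave the loop body of their own hospital.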
In the distributed learning mode, distributed research nodes do not move data around at all: they only apply iterative algorithms that the distributed research master will use to build a consensus model and estimate the model's parameters. Distributed learning can support many algorithms for data analysis. It has been widely used as an inferential regression analysis tool, mainly based on the relationship between outcomes (binary, continuous or multinomial) and covariates, or elements in the dataset. It establishes a data-to-outcome one-way link, investigated using traditional statistical tools such as linear models, generalized linear models, survival models and support vector machines [11-13], among others. The final model can then be presented to the end-user in a variety of ways, such as nomograms, or via interactive websites. In order to become a reliable tool to be used in clinical contexts, each model must undergo a strict evaluation process, mainly based on internal and external validation [14]. Discrimination will be assessed using the c-statistic or the area under the receiver-operating characteristic curve. The c-statistic is equivalent to the area under the curve for dichotomous outcomes but can also be used for Cox regression analyses. Plotting the expected versus the observed outcomes will provide a graphical assessment of the calibration; in addition, calibration will be formally assessed with the Hosmer-Lemeshow goodness-of-fit test. The future development of the sharing platform is to involve other radiotherapy centers to combine multiple datasets.

Table 1. Extract from brain stereotactic radiotherapy ontology registry level.

Variable    Definition                                                         Measurement
The phase   The phase of oncologic history in which the patient is evaluated   0: at diagnosis; 1: at follow-up; 2: at progression or recurrence; 3: others; missing data
Intent
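To make the discrimination metric concrete, the following sketch computes the c-statistic (equal to the area under the receiver-operating characteristic curve for a binary outcome) from first principles on synthetic data. It is an illustrative implementation, not the project's evaluation code.

```python
import numpy as np

def c_statistic(y_true, y_prob):
    """Probability that a randomly chosen event case receives a higher
    predicted probability than a randomly chosen non-event case;
    ties count as 0.5. Equals the ROC AUC for binary outcomes."""
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Synthetic predictions for six patients (1 = event observed)
y_true = np.array([0, 0, 0, 1, 1, 1])
y_prob = np.array([0.1, 0.3, 0.4, 0.35, 0.8, 0.9])

print(c_statistic(y_true, y_prob))   # 8/9 = 0.888...

# Calibration-in-the-large: mean predicted vs. observed event rate
print(y_prob.mean(), y_true.mean())
```

A value of 0.5 indicates no discrimination and 1.0 perfect discrimination; the expected-versus-observed comparison in the last line is the simplest numerical counterpart of the calibration plot described above.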

Results
The 'umbrella protocol' has been utilized in order to standardize both the data and the procedures; this led to the creation of a consistent dataset reporting a 'personalized treatment registry', which is paving the way to obtain a trustworthy analysis for the DSS. A well-defined data collection model, able to collect, standardize and organize features, called 'ontology', was then created.
The team identified more than 130 variables related to brain SRT. All features were collected and organized into three levels: the 'Registry level', containing epidemiological information; the 'Procedural level', which includes elements about treatment, toxicities and outcome evaluation; and the 'Research level', where dimensional data, such as imaging information, are collected (Table 1). When identifying variables for a specific technique, attention must be paid to treatment characteristics in every single phase, from the simulation CT scan to the delivery. We decided to start by grouping treatment variables into three separate categories: 'contouring', 'planning' and 'delivery'. We then tackled features related to the patient's set up and subsequently organized different aspects such as the contouring guidelines to be followed, varying imaging characteristics, clinical target volume, planning target volume margins and lesion(s) localization. In particular, regarding the planning phase, we considered the isodose line prescription, the conformity index, the calculation algorithms, the resolution grid, multi-lesion treatments with a single isocenter, the beam energy, the gradient index and the normalization method. For the delivery phase, we identified other variables describing image-guided radiation therapy techniques (Table 2). While the ontology was being put into writing, the group realized that one of the most relevant aspects regarding brain metastases is that they are often multiple, and the possibility to treat them together with a single isocenter and a single plan depends on their location in the brain. For this reason, in order to easily calculate distances among lesions, we decided to equate every lesion to an equivalent sphere (Figure 4). The BOA web service platform was completely set up and configured for both the BOA.Local and BOA.Cloud pathways, and the ontology has been successfully uploaded.
A first institution is currently collecting data using the local server, allowing it to store all data without complete anonymization in accordance with the previously reported principles, while a second institution is collecting data through the cloud server. Both centers are using ontology-driven CRFs [15-17], and all collected data are now available on an on-demand basis, ready to be further processed.

Discussion
Stereotactic radiosurgery is a radiation therapy technique in which multiple focused radiation beams intersect over a target to deliver conformal, high-dose radiation with minimal radiation to the surrounding normal tissues, thanks to the steep dose gradient. It is usually delivered in a single fraction but can sometimes be delivered over multiple once-daily fractions, usually to a maximum of five [18].
To our knowledge, no standardized data collection system or predictive model focusing on a treatment technique is available in the literature. Several ontologies focusing on pathologies and different anatomical sites (e.g., rectum, thyroid and prostate) [1,8,15] can be found in the literature; however, to date, none of these focuses on a specific radiotherapy technique (Table 3). Only the brachytherapy ontology can be considered a technique-specific tool, but it covers head and neck cancers only [4,19]. The PRE.M.I.S.E. project aims to focus on SRT in every anatomic site, which will lead us to strongly emphasize both the technical and dosimetric aspects of stereotactic treatments in our ontology. In fact, when approaching SRT from an ontology perspective, a large number of variables have to be taken into consideration. Gantry-based LINAC (lin[ear] ac[celerator]) systems use either fixed circular collimators or multileaf collimators. Treatment planning imaging is based on CT scans, but other modalities, including magnetic resonance imaging and positron emission tomography, can be fused to the treatment CT. Once again, different on-board imaging can be used to assure patient alignment. The treatment can be delivered as either multiple arcs or as one continuous arc. The isocenter is generally in the middle of the target lesion, but newer systems with volumetric modulated arc therapy allow for treatment of multiple lesions in a single arc. Dose prescription varies: treatments can be prescribed to the 60-80% isodose line or to 95-98% of the planning target volume, and the dose distribution can be inhomogeneous or homogeneous [27]. In order to face the complexity of such a sophisticated technique more easily, we decided to start writing the SRT ontology focusing on brain stereotactic treatment.
The choice to build an ontology for each anatomic site was driven both by the need to reduce the risk of dealing with too many variables, and thus omitting relevant ones, and by the necessity to reduce the bias of target motion in other anatomic sites (e.g., the lungs and liver).
The team identified more than 130 variables related to brain SRT (isodose line prescription, resolution grid, etc.) and organized them into three levels (registry, procedural and research) in order to classify all the information so that queries can be addressed easily depending on the request. In trying to create an SRT common language, we faced many difficulties (Table 4). First, the lack of a unique definition for SRT in terms of dose, fractions and dose homogeneity and, second, differences in treatment and planning modalities among different centers led our research toward collecting a greater number of variables to be included in our ontology, in order to make it suitable for any center. Another important aspect we faced is represented by multiple-lesion treatments. In these cases, lesions can either be treated as a group with a single isocenter or individually, thus using multiple isocenters. When defining variables for the lesions' position, we realized that no standard exists in the literature for defining the tridimensional distance between lesions. We decided to model each lesion as a sphere and to calculate both the distance between the equivalent spheres and the distance between their longitudinal axes. The latter parameter appears to be important in clinical practice when deciding whether to treat different lesions with single or multiple radiotherapy plans. Collecting the distance between lesions could also be useful because the predictive model could be able to suggest how to treat multiple lesions (with one or multiple isocenters).
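The equivalent-sphere idea described above can be sketched numerically: each lesion is reduced to a sphere of the same volume, from which a center-to-center and a surface-to-surface distance follow. The exact definitions used in PRE.M.I.S.E. (e.g., the longitudinal-axis distance) may differ; this is an illustrative sketch with hypothetical coordinates and volumes.

```python
import math

def equivalent_radius(volume_cc: float) -> float:
    """Radius (cm) of a sphere with the same volume as the lesion:
    V = (4/3) * pi * r^3  =>  r = (3V / 4pi)^(1/3)."""
    return (3.0 * volume_cc / (4.0 * math.pi)) ** (1.0 / 3.0)

def lesion_distances(center_a, center_b, vol_a, vol_b):
    """Return (center-to-center, surface-to-surface) distances in cm,
    treating each lesion as its equivalent sphere."""
    d_centers = math.dist(center_a, center_b)
    d_edges = d_centers - equivalent_radius(vol_a) - equivalent_radius(vol_b)
    return d_centers, max(d_edges, 0.0)   # overlapping spheres -> 0

# Two hypothetical brain metastases (coordinates in cm, volumes in cc)
d_c, d_e = lesion_distances((0, 0, 0), (3, 4, 0), 4.19, 0.52)
print(round(d_c, 2))  # 5.0
print(round(d_e, 2))  # ~3.5
```

A 4.19 cc lesion has an equivalent radius of about 1 cm, so the edge distance is the center distance minus roughly 1.5 cm of combined radii; such a derived quantity is what could feed the single- versus multiple-isocenter decision discussed above.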
Brain SRT is usually employed when treating brain metastases. This implies the need to include the primary tumor and its stage in the ontology, without excluding complementary treatments. We considered variables regarding new therapies such as immunotherapy and targeted therapy, for which an internationally recognized standard timing for concomitant radiotherapy is not yet available.
PRE.M.I.S.E. perspectives reside in the need to develop a system allowing the clinical decision-making process to be shared between physicians and patients in order to choose the best tailored treatment.
This project could lead to the development of predictive models based on individual patient features, complementing the existing consensus or guidelines. The large amount of clinical data can then be further processed either through more classical statistical approaches or through modern machine learning tools, which can be further refined into reliable clinical decision-making support tools in order to guarantee a personalized approach to medicine. Clinical evidence is difficult to generate rapidly and reliably, and the analysis of retrospective case series can present data collection biases due to known outcomes.

Conclusion
This project represents the first example of a standardized data collection system created for a particular radiation therapy technique, specifically SRT. The next step of this initiative is patient enrollment. The setup of a DSS based on a validated prediction model represents the long-term aim of the project and could be helpful in personalizing treatment choices, both in terms of efficacy and toxicity, and in identifying the most suitable patients to be included in future randomized clinical trials [8,28].

Future perspective
We intend to substantially expand the number of institutions engaged with the project and the data collection efforts, initiating a parallel effort to incorporate an ontology for stereotactic body radiation therapy into the workflow. Moreover, we aim to provide a DSS capable of individualizing the SRT treatment: developing, validating and improving prediction models for overall survival, local control and disease-free survival, as well as for the acute and late radiation-induced side effects relevant for patients who undergo a stereotactic treatment.
These prediction models could be very useful to better inform patients on the risks (acute and late toxicity) and benefits of the treatment.

Financial & competing interests disclosure
The authors have no relevant affiliations or financial involvement with any organization or entity with a financial interest in, or financial conflict with, the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.
No writing assistance was utilized in the production of this manuscript.

Ethical conduct of research
The authors state that they have obtained appropriate institutional review board approval or have followed the principles outlined in the Declaration of Helsinki for all human or animal experimental investigations. In addition, for investigations involving human subjects, informed consent has been obtained from the participants involved.

Summary points
• 'Personalized medicine' is defined by the National Cancer Institute (MD, USA) as a "form of medicine that uses information about a person's genes, proteins and environment to prevent, diagnose and treat disease. In cancer, personalized medicine uses specific information about a person's tumor to help diagnose, plan treatment, find out how well treatment is working or make a prognosis".
• The tendency toward individualized medicine and the increasing amount and complexity of data make it extremely difficult to identify which clinical decisions are best for specific patients.
• In daily clinical practice, decision support systems could help to personalize clinical choices.

Ontology levels
• The ontology is a system to collect heterogeneous data in a standardized way in order to create large databases.
• The creation of an ontology increases the power of description by moving from local data dictionaries to a global data vocabulary.

Storage system level
• The storage system architecture is based on the use of a specific software called Beyond Ontology Awareness, which proposes two distinct data consolidation approaches and two data processing strategies.

Distributed learning
• The complete algorithm can be summed up by the following steps:
• The local application learns a model from local data;
• This local model is sent to the master, where it is processed and compared with the models sent by the other hospitals;
• A consensus model is generated and sent back to each hospital for refinement;
• After preset convergence criteria are met, a final consensus model is generated.
• The PRE.M.I.S.E. project innovation resides mainly in having created an ontology for a particular radiation therapy technique, instead of a model that only concerns a specific pathology.

Future perspective
• To provide a decision support system capable of individualizing the treatment:
• Development, validation and improvement of prediction models for overall survival, local control and disease-free survival for patients who undergo a stereotactic treatment;
• Development, validation and improvement of prediction models for acute and late radiation-induced side effects relevant for patients who undergo a stereotactic treatment;
• Use of prediction models to better inform patients on the risks (acute and late toxicity) and benefits of the treatment.