Abstract

The large diversity and volume of extracellular RNA (exRNA) data that will form the basis of the exRNA Atlas generated by the Extracellular RNA Communication Consortium pose a substantial data integration challenge. Here we present the strategy being implemented by the exRNA Data Management and Resource Repository, which employs metadata, biomedical ontologies and Linked Data technologies, such as the Resource Description Framework, to integrate a diverse set of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We focus on three specific data integration tasks: (a) selection of samples from a virtual biorepository for exRNA profiling and for inclusion in the exRNA Atlas; (b) retrieval of a data slice from the exRNA Atlas for integrative analysis; and (c) interpretation of exRNA analysis results in the context of pathways and networks. As exRNA profiling gains wide adoption in the research community, we anticipate that the strategies discussed here will increasingly be required to enable data reuse and to facilitate integrative analysis of exRNA data.

Keywords: ERC Consortium, DMRR, exRNA, exRNA Atlas, exRNA Portal

To catalyze the understanding of extracellular RNA (exRNA) in human health and disease, the Extracellular RNA Communication Consortium (ERC Consortium) aims to generate a large volume of highly diverse exRNA expression profiles, assimilate them into a publicly accessible exRNA Atlas and enable their integrative analysis using exRNA analysis tools accessible online. The exRNA Atlas profiles will originate from biofluid samples provided by multiple Consortium participants, will be generated using diverse experimental methods and will be made publicly accessible according to a public data release policy developed by the ERC Consortium and available at www.exrna.org.
The exRNA Atlas profiles will be analysed in the context of source genomes (human and non-human), subtypes of RNA species within these genomes, and specific biological pathways and networks within cell types of origin and target cells. The informatics infrastructure developed for the exRNA Atlas will be implemented as free open-source code and will also be made available to the broad scientific community as a web-hosted service, to enable integrative analysis of data beyond that produced by the ERC Consortium members. The initial exRNA Atlas profiles will be generated by the ERC Consortium and in the future may be expanded to include data from the literature, although the ERC Consortium currently does not focus on systematic compilation of data from the literature.

We describe strategies that will be employed by the Data Management and Resource Repository (DMRR), a component of the Consortium, to process and analyse exRNA profiles generated by Consortium members and to support integrative analysis of exRNA profiling data through the exRNA Atlas. Towards this goal, the DMRR has organized Consortium Working Groups, including the Metadata and Data Analysis Standards and Ontology Working Groups. During the past year, the Metadata Working Group has been actively developing the data and metadata standards for submission of exRNA profiling data for inclusion in the exRNA Atlas. A process has now been established to submit sequence data to the DMRR along with metadata in standard formats. The standards cover metadata about donors, biosamples, experiments, studies and analysis steps. The metadata enable efficient selection of samples of interest (e.g. by specific health condition of the donor, biofluid or cell/tissue type, library preparation method and sequencing assay) for integrative analyses.
The metadata will help organize the data in the exRNA Atlas for efficient interactive access via the exRNA Portal as well as for programmatic access via REST Application Programming Interfaces (APIs) and Linked Data technologies. Biological ontologies provide controlled vocabularies for metadata fields, thus promoting integration both within the exRNA Atlas and with important non-ERC Consortium data sets, such as ENCODE. Our metadata standard now includes biomedical ontologies available via resources including BioPortal (1), developed by the National Center for Biomedical Ontology (NCBO), the Open Biological and Biomedical Ontology (OBO) Foundry (2), Ontobee (3) and the Ontology Lookup Service (4). In addition, ontological relationships between concepts pave the way for knowledge-based data discovery, integration and analysis. Specifically, transitive relations such as “is-a” and “part-of” can be traversed in order to group samples and experiments into broader categories for the purpose of retrieval and integrative analyses. Also, non-hierarchical relationships (e.g. inhibit, interact and regulate) can be used to implement expressive semantic data queries.

Both metadata and ontologies fall within the broad category of approaches to data integration that also includes Linked Data technologies such as RDF (Resource Description Framework; www.w3.org/RDF/). The Consortium aims to develop an RDF knowledge base about pathways and network modules of relevance for exRNA biology that will inform interpretation of exRNA profiling data. In the following, we review a strategy to employ metadata, ontology-based reasoning and RDF to integrate and analyse exRNA profiling data, focusing on the three tasks highlighted in Fig. 1a.

Fig. 1. Data slicing and pathway enrichment analysis. This illustration is based on a hypothetical example of sequencing-based exRNA profiling of cerebrospinal fluid (CSF) from a brain tumour patient.
Based on metadata about the selected samples, a “data slice” is extracted for further downstream analysis using pathway/network modules to detect activation of a metastatic brain tumour pathway. Panel b details selection of samples for profiling and inclusion in the exRNA Atlas using sample (CSF) and disease (CNS neoplasm) ontology traversals. Panel c details the sequencing assay selection process using assay and experiment ontology traversals. The highlighted ontology terms “CNS neoplasm” and “sequencing assay” are examples of terms that occur within an “ontology slim.” Ontology traversal in panel d identifies RNA species of interest. A “data slice” defined by selections (b–d) is analysed to obtain a set of exRNA genes that show a pattern of coordinated changes. The metastatic brain tumour pathway (www.wikipathways.org/index.php/Pathway:WP2249) in panel e shows enrichment for the exRNA genes overexpressed in this hypothetical case.

Selection of samples from a virtual biorepository

As part of the overall exRNA profiling project illustrated in Fig. 1a, the Resource Sharing Working Group within the Consortium is developing a virtual biorepository that will provide access to exRNA profiles derived from clinically relevant biofluid samples. Clinical information about donors, and appropriately curated information about the biosamples and their derivatives (e.g. exosomes, RNA, DNA or protein extracts), is critical for large translational science projects such as the ERC Consortium. It is important to record information about the methods used to obtain, process and store biosamples and also to link this information to both clinical and laboratory information. Ideally, a virtual biorepository will allow secure, de-identified storage of key biological sample data in a way that can be easily linked to assay results.
The biorepository will grant different levels of data entry, editing and superuser/supervisor privileges and will be able to interface with data entry applications running on iOS, Android and Windows Phone platforms. Sample curation will allow recording of sample location, temperature histories and use/depletion.

The first use case focuses on selecting biorepository samples for the purpose of exRNA profiling. Figure 1b describes a use case based on exRNA profiling done on cerebrospinal fluid (CSF) and serum samples from patients with brain tumours (5). The virtual biorepository would potentially store information on donors, biospecimens, sample collection procedures, disease at the time of sample collection, available quantity of biospecimen, sample requests and other relevant sample metadata, thereby facilitating the sample selection process.

To link our metadata model to ontologies, we started with the ontologies adopted by the ENCODE Data Coordination Center or DCC (6,7), as both the ERC Consortium and ENCODE include RNA-seq data. These ontologies include the Uber Anatomy Ontology (UBERON) (8) for tissues, the Foundation Model of Anatomy (FMA) (9) for biofluids (including controlled values for exRNA sources such as serum, plasma, CSF, urine, saliva and other body fluids), the Cell Ontology (CL) (10) for primary cell types and the Experimental Factor Ontology (EFO) (11) for immortalized cell lines. While this initial set of ontologies serves as a good starting point, additional ontologies will need to be included since the ERC Consortium is more clinically focused than ENCODE. Because many exRNA experiments will include samples from subjects affected by disease (e.g. cancer or Alzheimer's disease), additional ontologies such as the Disease Ontology (DOID) (12) will be required to capture the disease terms of interest.
Ideally, the metadata describe samples and disease conditions in highly specific terms; such terms are the most informative but are not well suited to retrieval. A set of general terms that covers all samples, referred to as an “ontology slim,” is generally useful to group samples at the top level. For example, the general term “central nervous system neoplasm” may provide a useful grouping of samples, including those that are annotated by highly specific terms such as “glioblastoma” or “metastatic CNS neoplasm.” These general terms are inferred by traversing ontologies from more specific terms up the hierarchy until the “slim” terms are encountered, as illustrated in Fig. 1b. We note that the general terms are useful for grouping and retrieval, while the most specific terms remain available for drilling down and sub-selection.

Integrative analysis of exRNA profiling data using the exRNA toolset in the Genboree Workbench and selection of “data slices” from the exRNA Atlas

As illustrated in Fig. 1a, the selected samples are profiled for their exRNA content. The profiling includes both experimental assays and computational steps. The initial focus of the ERC Consortium is on RNA sequencing and qPCR, the two most commonly used assays for profiling exRNAs. Currently, both the small and long RNA-seq pipelines accept data in FASTQ format. Data submission is accompanied by relevant metadata in JSON (JavaScript Object Notation; www.json.org/) or predefined tab-separated value formats. Experimental metadata fields utilize the Ontology for Biomedical Investigations (OBI) (13) for experimental assays and Chemical Entities of Biological Interest (ChEBI) (14) for chemical treatments. The metadata are validated against ontologies dynamically using the BioPortal (15) web service. An exRNA Metadata Tracking System provides a user interface for browsing, managing, querying, viewing, uploading and downloading exRNA metadata documents.
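As a rough illustration of the validation step described above, the following minimal sketch checks that ontology-typed fields in a JSON metadata document carry identifiers from the expected ontology. The field names, the document layout and the term identifiers are assumptions made for this example and do not reflect the Consortium's actual metadata schema; the small in-memory term table merely stands in for a lookup against a service such as BioPortal.

```python
import json

# Ontology prefix expected for each metadata field.
# Field names are hypothetical; the ontology assignments follow the text.
FIELD_ONTOLOGY = {
    "biofluid": "FMA",
    "disease": "DOID",
    "assay": "OBI",
}

# Placeholder term table standing in for a remote ontology lookup.
# The identifiers are made up for this example.
KNOWN_TERMS = {
    "FMA:EX1": "cerebrospinal fluid",
    "DOID:EX1": "glioblastoma",
    "OBI:EX1": "RNA-seq assay",
}


def validate_metadata(doc):
    """Return a list of problems found in one metadata document."""
    problems = []
    for field, prefix in FIELD_ONTOLOGY.items():
        term_id = doc.get(field)
        if term_id is None:
            problems.append(f"{field}: missing")
        elif not term_id.startswith(prefix + ":"):
            problems.append(f"{field}: expected prefix {prefix}, got {term_id}")
        elif term_id not in KNOWN_TERMS:
            problems.append(f"{field}: unknown term {term_id}")
    return problems


# A submission whose assay field uses a term from the wrong ontology.
submission = json.loads(
    '{"biofluid": "FMA:EX1", "disease": "DOID:EX1", "assay": "CHEBI:EX9"}'
)
print(validate_metadata(submission))
```

In practice the membership test against the local table would be replaced by a query to an ontology service, and an empty list of problems would indicate that the document can be accepted.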
The exceRpt small RNA-seq pipeline, accessible via the exRNA toolset in the Genboree Workbench (www.genboree.org/), profiles sequencing data using various small RNA databases, including miRNAs from miRBase (16), tRNAs from gtRNAdb (17), piRNAs from RNAdb (18) and annotations from Gencode (19). Abundance estimates for each of the genes within these RNA species are computed, as are a variety of quality control metrics such as read-length distributions, summaries of reads mapped to each library and detailed mapping information for each read mapped to each library. The RNA species and genes are annotated using the Sequence Ontology (SO) (20), thus facilitating retrieval of abundance data for genes within different RNA species. Genomic coordinates of RNA genes within each species will define a “data slice.” Such a “data slice” uniquely identifies data based on specific sample and disease ontologies, sequencing assay and experiment ontologies and the RNA species of interest. Abundance estimates will be pre-computed in the exRNA Atlas, thus making it possible to deliver the “data slices” quickly.

While the standardized processing will make the profiles maximally comparable, incompatibilities will naturally exist between different technologies such as qPCR and RNA-seq. This issue will be addressed in part by providing experimental metadata that are sufficient to identify comparable profiles, together with selection tools in the exRNA Atlas for integrative analysis. While the immediate focus will be on integrating exRNA profiles from human biofluids, the longer-term goal includes enabling cross-species analyses.

The data within the exRNA Atlas may be conceptually organized along the following three dimensions illustrated in Fig. 1d: (a) donors and biosamples; (b) assays and experiments; and (c) genomic coordinates of RNA genes. A query of the exRNA Atlas should provide a “data slice” in this three-dimensional space that is relevant for downstream analysis.
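The idea of a “data slice” can be sketched in a few lines of code. The example below is a minimal illustration, not the exRNA Atlas implementation: the toy “is-a” hierarchy, the choice of “slim” terms, the record layout and the gene names are all assumptions, and the third dimension is represented by RNA species rather than by genomic coordinates to keep the example short. Each annotation is rolled up to its “slim” term by walking the hierarchy, and a record belongs to the slice when all three rolled-up terms match the query.

```python
# Toy "is-a" hierarchy (child -> parent); labels are used instead of
# ontology identifiers for readability. All entries are illustrative.
IS_A = {
    "glioblastoma": "central nervous system neoplasm",
    "metastatic CNS neoplasm": "central nervous system neoplasm",
    "central nervous system neoplasm": "disease",
    "small RNA-seq": "sequencing assay",
    "sequencing assay": "assay",
    "miRNA": "ncRNA",
    "tRNA": "ncRNA",
}

# Hypothetical "slim": one general term per dimension.
SLIM = {"central nervous system neoplasm", "sequencing assay", "ncRNA"}


def to_slim(term):
    """Walk up the hierarchy until a slim term is reached; None if there is none."""
    while term is not None and term not in SLIM:
        term = IS_A.get(term)
    return term


# Hypothetical pre-computed abundance records.
RECORDS = [
    {"disease": "glioblastoma", "assay": "small RNA-seq",
     "rna_type": "miRNA", "gene": "gene_A", "abundance": 120.0},
    {"disease": "metastatic CNS neoplasm", "assay": "small RNA-seq",
     "rna_type": "tRNA", "gene": "gene_B", "abundance": 15.5},
    {"disease": "glioblastoma", "assay": "qPCR",
     "rna_type": "miRNA", "gene": "gene_A", "abundance": 3.2},
]


def data_slice(records, disease, assay, rna_type):
    """Keep records whose three annotations roll up to the requested slim terms."""
    return [
        r for r in records
        if to_slim(r["disease"]) == disease
        and to_slim(r["assay"]) == assay
        and to_slim(r["rna_type"]) == rna_type
    ]


selected = data_slice(
    RECORDS, "central nervous system neoplasm", "sequencing assay", "ncRNA"
)
print([r["gene"] for r in selected])
```

Here the qPCR record is excluded because its assay term does not roll up to “sequencing assay,” while the two sequencing records are retained even though they were annotated with different, more specific disease terms.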
As discussed above, each of the three dimensions is covered by ontologies. Similar to the ontology “slims” for the biosample dimension (discussed in the previous section), appropriate “slims” will be defined for the experimental and genomic dimensions. Ontology traversal will infer general terms along all three dimensions (illustrated in Fig. 1b–d), thus facilitating retrieval of “data slices” for downstream analysis.

Contextual interpretation of the results of integrative analyses

As illustrated in Fig. 1a, “data slices” are used for downstream integrative analysis. Such analyses produce sets of exRNA genes with relevant profiles. For example, comparative analysis may identify exRNA species that are highly abundant in plasma samples from patients who have a particular type of disease. As another example, unsupervised clustering may identify a group of exRNA species that show highly correlated abundance levels across a variety of samples. In both examples, the result of the analysis is a set of genes, possibly ranked by a significance score. After identifying a list of candidate exRNAs, the next step often involves relating these candidates to existing knowledge of mechanism and function. The ERC Consortium is pursuing two avenues towards this goal. First, knowledge of exRNA functions is often scattered in unstructured or semi-structured form across many online databases. These databases describe key information such as expression patterns, vesicle and body fluid localization, and literature references. We will