Abstract

   The large diversity and volume of extracellular RNA (exRNA) data that
   will form the basis of the exRNA Atlas generated by the Extracellular
   RNA Communication Consortium pose a substantial data integration
   challenge. Here we present the strategy that is being implemented by
   the exRNA Data Management and Resource Repository, which employs
   metadata, biomedical ontologies and Linked Data technologies, such as
   the Resource Description Framework (RDF), to integrate a diverse set
   of exRNA profiles into an exRNA Atlas and enable integrative exRNA analysis. We
   focus on the following three specific data integration tasks: (a)
   selection of samples from a virtual biorepository for exRNA profiling
   and for inclusion in the exRNA Atlas; (b) retrieval of a data slice
   from the exRNA Atlas for integrative analysis and (c) interpretation of
   exRNA analysis results in the context of pathways and networks. As
   exRNA profiling gains wide adoption in the research community, we
   anticipate that the strategies discussed here will increasingly be
   required to enable data reuse and to facilitate integrative analysis of
   exRNA data.

   Keywords: ERC Consortium, DMRR, exRNA, exRNA Atlas, exRNA Portal
     __________________________________________________________________

   To catalyze the understanding of extracellular RNA (exRNA) in human
   health and disease, the Extracellular RNA Communication Consortium (ERC
   Consortium) aims to generate a large volume of highly diverse exRNA
   expression profiles, assimilate them into a publicly accessible exRNA
   Atlas and enable their integrative analysis using exRNA analysis tools
   that are accessible online. The exRNA Atlas profiles will originate from
   biofluid samples provided by multiple Consortium participants, will be
   generated using diverse experimental methods and will be made publicly
   accessible according to a public data release policy developed by the
   ERC Consortium and made accessible at [48]www.exrna.org. The exRNA
   Atlas profiles will be analysed in the context of source genomes (human
   and non-human), subtypes of RNA species within these genomes, and
   specific biological pathways and networks within cell types of origin
   and target cells.

   The informatics infrastructure developed for the exRNA Atlas will be
   implemented as free open-source code and will also be made available
   for use by the broad scientific community as a web-hosted service to
   enable integrative analysis of data beyond that produced by the ERC
   Consortium members. The initial exRNA Atlas profiles will be generated
   by the ERC Consortium and in the future may be expanded to include data
   from literature, although the ERC Consortium currently does not focus
   on systematic compilation of data from literature.

   We describe strategies that will be employed by the Data Management and
   Resource Repository (DMRR), a component of the Consortium, to process
   and analyse exRNA profiles generated by Consortium members and to
   support integrative analysis of exRNA profiling data through the exRNA
   Atlas. Towards this goal, the DMRR has organized Consortium Working
   Groups, including the Metadata and Data Analysis Standards and Ontology
   Working Groups.

   During the past year, the Metadata Working Group has been actively
   developing the data and metadata standards for submission of exRNA
   profiling data for inclusion in the exRNA Atlas. A process has now been
   established to submit sequence data to the DMRR along with metadata in
   standard formats. The standards cover metadata about donors,
   biosamples, experiments, studies and analysis steps. The metadata
   enable efficient selection of samples of interest (e.g. specific health
   condition of the donor, biofluid or cell/tissue type, library
   preparation method and sequencing assay) for integrative analyses. The
   metadata will help organize the data in the exRNA Atlas for efficient
   interactive access via the exRNA Portal as well as for programmatic
   access via REST Application Programming Interfaces (APIs) and Linked
   Data technologies.

   Biological ontologies provide controlled vocabulary for metadata
   fields, thus promoting integration both within the exRNA Atlas and with
   important non-ERC Consortium data sets, such as ENCODE. Our metadata
   standard now includes biomedical ontologies available via resources
   such as the BioPortal ([49]1) developed by the National Center for
   Biomedical Ontology (NCBO), Open Biological and Biomedical Ontology
   (OBO) Foundry ([50]2), Ontobee ([51]3) and Ontology Lookup Service
   ([52]4).

   In addition, ontological relationships between concepts pave the way
   for knowledge-based data discovery, integration and analysis.
   Specifically, transitive relations such as “is-a” and “part-of” can be
   traversed in order to group samples and experiments into broader
   categories for the purpose of retrieval and integrative analyses. Also,
   non-hierarchical relationships (e.g. inhibit, interact and regulate)
   can be used to implement expressive semantic data queries.

   Both metadata and ontologies fall within the broad category of
   approaches to data integration that also includes Linked Data
   technologies such as RDF (Resource Description Framework;
   [53]www.w3.org/RDF/). The Consortium aims to develop an RDF knowledge
   base about pathways and network modules of relevance for exRNA biology
   that will inform interpretation of exRNA profiling data. In the
   following, we review a strategy to employ metadata, ontology-based
   reasoning and RDF to integrate and analyse exRNA profiling data,
   focusing on the three tasks highlighted in [54]Fig. 1a.
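
   As a minimal sketch of how RDF-style statements could support this kind
   of interpretation, the Python fragment below stores
   subject-predicate-object triples in a plain list and answers a simple
   pattern query. Every identifier in it (the "ex:" names, the predicates
   "ex:regulates" and "ex:memberOf", and the helpers match and
   pathways_for_rna) is hypothetical and is not drawn from the Consortium's
   knowledge base; a production system would presumably use an RDF store
   and a query language such as SPARQL rather than list comprehensions.

```python
# Hypothetical triples; the names are illustrative only and do not come
# from the exRNA Atlas or any real pathway resource.
TRIPLES = [
    ("ex:miR-A", "ex:regulates", "ex:GeneX"),
    ("ex:miR-B", "ex:regulates", "ex:GeneY"),
    ("ex:GeneX", "ex:memberOf", "ex:PathwayP"),
    ("ex:GeneY", "ex:memberOf", "ex:PathwayQ"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


def pathways_for_rna(triples, rna):
    """Follow regulates -> memberOf links to find pathways tied to an RNA."""
    pathways = set()
    for _, _, gene in match(triples, s=rna, p="ex:regulates"):
        for _, _, pathway in match(triples, s=gene, p="ex:memberOf"):
            pathways.add(pathway)
    return pathways
```

   Chaining two pattern matches in this way mirrors, in miniature, how a
   non-hierarchical relationship can link an exRNA to a pathway.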

Fig. 1.

   Data slicing and pathway enrichment analysis. This illustration is
   based on a hypothetical example of sequencing-based exRNA profiling of
   cerebrospinal fluid (CSF) from a brain tumour patient. Based on
   metadata about the selected samples, a “data slice” (panel a) is
   extracted for further downstream analysis using pathway/network modules
   to detect activation of a metastatic brain tumour pathway. Panel b details
   selection of samples for profiling and inclusion in the exRNA Atlas
   using sample (CSF) and disease (CNS neoplasm) ontology traversals.
   Panel c details sequencing assay selection process using assay and
   experiment ontology traversals. The highlighted ontology terms “CNS
   neoplasm” and “sequencing assay” are examples of terms that occur
   within an “ontology slim.” Ontology traversal in panel d identifies RNA
   species of interest. The “data slice” (panel a) defined by the
   selections in panels b–d is analysed to obtain a set of exRNA genes
   that show a pattern of coordinated changes. The metastatic brain tumour pathway
   ([57]www.wikipathways.org/index.php/Pathway:WP2249) in panel e shows
   enrichment for the exRNA genes overexpressed in this hypothetical case.

Selection of samples from a virtual biorepository

   As part of an overall exRNA profiling project illustrated in [58]Fig.
   1a, the Resource Sharing Working Group within the Consortium is
   developing a virtual biorepository that will provide access to exRNA
   profiles derived from clinically relevant biofluid samples. Clinical
   information about donors, and appropriately curated information about
   the biosamples and their derivatives (e.g. exosomes, RNA, DNA or
   protein extracts), is critical for large translational science projects
   such as the ERC Consortium. It is important to record the information
   about the methods used to obtain, process and store biosamples and also
   to link this information to both clinical and laboratory information.
   Ideally, a virtual biorepository will allow secure, de-identified
   storage of key biological sample data in a way that can be easily
   linked to assay results. The biorepository will grant different levels
   of data entry, editing and superuser/supervisor privileges and have the
   ability to interface with data entry applications running on iOS,
   Android and Windows Phone platforms. Sample curation will allow
   recording of sample location, temperature histories and use/depletion.

   The first use case focuses on selecting biorepository samples for the
   purpose of exRNA profiling. [59]Figure 1b describes a use case based on
   exRNA profiling done on cerebrospinal fluid (CSF) and serum samples
   from patients with brain tumours ([60]5). The virtual biorepository
   would potentially store information on donors, biospecimens, sample
   collection procedures, disease at the time of sample collection,
   available quantity of biospecimen, sample request and other relevant
   sample metadata, thereby facilitating the sample selection process.

   To link our metadata model to ontologies, we started with the
   ontologies adopted by the ENCODE Data Coordination Center or DCC
   ([61]6,[62]7) as both the ERC Consortium and ENCODE include RNA-seq
   data. These ontologies include Uber Anatomy Ontology (UBERON) ([63]8)
   for tissues and Foundation Model of Anatomy (FMA) ([64]9) for biofluids
   (including controlled values for exRNA sources such as serum, plasma,
   CSF, urine, saliva and other body fluids), Cell Ontology (CL) ([65]10)
   for primary cell types and Experimental Factor Ontology (EFO) ([66]11)
   for immortalized cell lines. While this initial set of ontologies
   serves as a good starting point, additional ontologies will need to be
   included since the ERC Consortium is more clinically focused than
   ENCODE. Because many exRNA experiments will include samples from
   subjects affected by disease (e.g. cancer, Alzheimer's disease, etc.),
   additional ontologies such as the Disease Ontology (DOID) ([67]12) will
   be required to capture the disease terms of interest.

   Ideally, the metadata describe samples and disease conditions in highly
   specific terms. Such terms are the most informative but are not well
   suited to retrieval. A set of general terms that covers all samples – referred to
   as an “ontology slim” – is generally useful to group samples at the top
   level. For example, the general term “central nervous system neoplasm”
   may provide a useful grouping of samples, including those that are
   annotated by highly specific terms such as “glioblastoma” or
   “metastatic CNS neoplasm.” These general terms are inferred by
   traversing ontologies from more specific terms up the hierarchy until
   the “slim” terms are encountered, as illustrated in [68]Fig. 1b. We
   note that the general terms are useful for grouping and retrieval,
   while the most specific terms are still available for drilling-down and
   sub-selection.
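
   The upward traversal described above can be sketched in a few lines of
   Python. This is an illustration under stated assumptions, not the DMRR
   implementation: the "is-a" edges are a hand-written toy hierarchy (only
   the two specific terms and the general term come from the example in
   the text; the edges above "central nervous system neoplasm" are
   invented), and the names IS_A, SLIM, slim_terms and group_by_slim are
   hypothetical.

```python
# Toy is-a hierarchy (child -> parents). Hand-written for illustration;
# it is not extracted from the Disease Ontology.
IS_A = {
    "glioblastoma": ["central nervous system neoplasm"],
    "metastatic CNS neoplasm": ["central nervous system neoplasm"],
    "central nervous system neoplasm": ["neoplasm"],
    "neoplasm": ["disease"],
}

# The general terms chosen for top-level grouping (the "ontology slim").
SLIM = {"central nervous system neoplasm"}


def slim_terms(term, is_a, slim):
    """Walk up from `term`, stopping at the first slim term on each path."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)  # stop climbing once a slim term is reached
            continue
        stack.extend(is_a.get(current, []))
    return found


def group_by_slim(samples, is_a, slim):
    """Group sample IDs under the slim terms inferred from their annotations."""
    groups = {}
    for sample_id, term in samples.items():
        for general in slim_terms(term, is_a, slim):
            groups.setdefault(general, []).append(sample_id)
    return groups
```

   Because the specific annotation of each sample is left untouched, the
   grouping can be used for retrieval while the original terms remain
   available for drilling down.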

Integrative analysis of exRNA profiling data using the exRNA toolset in the
Genboree Workbench and selection of “data slices” from the exRNA Atlas

   As illustrated in [69]Fig. 1a, the selected samples are profiled for
   their exRNA content. The profiling includes both experimental assays
   and computational steps. The initial focus of the ERC Consortium is on
   RNA sequencing and qPCR, the two most commonly used assays for
   profiling exRNAs. Currently, both the small and long RNA-seq pipelines
   accept data in FASTQ format. Data submission is accompanied by relevant
   metadata in JSON (JavaScript Object Notation; [70]www.json.org/) or
   predefined tab-delimited formats. Experimental metadata fields utilize
   the Ontology for Biomedical Investigations (OBI) ([71]13) for
   experimental assays and Chemical Entities of Biological Interest
   (ChEBI) ([72]14) for chemical treatments. The metadata are validated
   against ontologies dynamically using the BioPortal ([73]15) web
   service. An exRNA Metadata Tracking System provides a user interface
   for browsing, managing, querying, viewing, uploading and downloading
   exRNA metadata documents.
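
   To make the validation step concrete, the following Python sketch checks
   a JSON metadata document against a small set of controlled values. It
   rests on several assumptions: the field names ("sample_id", "biofluid",
   "assay") are hypothetical and do not reproduce the Consortium's metadata
   standard, and the local KNOWN_TERMS table merely stands in for an
   ontology lookup. The validation described above queries the BioPortal
   web service, which this sketch does not call.

```python
import json

# Local stand-in for an ontology lookup service. The values are labels
# mentioned in the text; real validation would resolve ontology term IDs.
KNOWN_TERMS = {
    "biofluid": {"cerebrospinal fluid", "serum", "plasma", "urine", "saliva"},
    "assay": {"RNA-seq", "qPCR"},
}


def validate_metadata(doc_text, known_terms):
    """Return (field, value) pairs that are missing or not recognised."""
    doc = json.loads(doc_text)
    problems = []
    for field, allowed in known_terms.items():
        value = doc.get(field)  # None when the field is absent
        if value not in allowed:
            problems.append((field, value))
    return problems


# Hypothetical metadata document accompanying a FASTQ submission.
EXAMPLE = '{"sample_id": "S1", "biofluid": "cerebrospinal fluid", "assay": "RNA-seq"}'
```

   An empty result from validate_metadata(EXAMPLE, KNOWN_TERMS) would
   indicate that every controlled field carries a recognised value; any
   returned pair identifies a field to correct before submission.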

   The exceRpt small RNA-seq pipeline, accessible via the exRNA toolset in
   the Genboree Workbench ([74]www.genboree.org/), profiles sequencing
   data using various small RNA databases including miRNAs from miRBase
   ([75]16), tRNAs from gtRNAdb ([76]17), piRNAs from RNAdb ([77]18) and
   annotations from Gencode ([78]19). Abundance estimates for each of the
   genes within these RNA species are computed, as are a variety of
   quality control metrics such as read-length distributions, summaries of
   reads mapped to each library and detailed mapping information for each
   read mapped to each library.

   The RNA species and genes are annotated using Sequence Ontology (SO)
   ([79]20), thus facilitating retrieval of abundance data for genes
   within different RNA species. Genomic coordinates of RNA genes within
   each species will define a “data slice.” Such a “data slice” uniquely
   identifies data based on specific sample and disease ontologies,
   sequencing assay and experiment ontologies and the RNA species of
   interest. Abundance estimates will be pre-computed in the exRNA Atlas,
   thus making it possible to deliver the “data slices” quickly. While
   the standardized processing will make the profiles maximally
   comparable, incompatibilities will naturally exist between different
   technologies such as qPCR and RNA-seq. This issue will be addressed in
   part by providing experimental metadata sufficient to identify
   comparable profiles, together with selection tools in the exRNA Atlas
   for integrative analysis. While the immediate focus will be on integrating
   exRNA profiles from human biofluids, the longer term goal includes
   enabling cross-species analyses.

   The data within the exRNA Atlas may be conceptually organized along the
   following three dimensions illustrated in [80]Fig. 1d: (a) donors and
   biosamples; (b) assays and experiments and (c) genomic coordinates of
   RNA genes. A query of the exRNA Atlas should provide a “data slice” in
   this three-dimensional space that is relevant for downstream analysis.
   As discussed above, each of the three dimensions is covered by
   ontologies. Similar to ontology “slims” for the biosample dimension
   (discussed in the previous section), appropriate “slims” will be
   defined for experimental and genomic dimensions. Ontology traversal
   will infer general terms along all three dimensions (illustrated in
   [81]Fig. 1b–d), thus facilitating retrieval of “data slices” for
   downstream analysis.
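
   A “data slice” query over these three dimensions might look like the
   Python sketch below. It is illustrative only: the Abundance record, its
   field names, the data_slice function and all values in RECORDS
   (including the "healthy control" group and the gene labels) are
   hypothetical, and the general terms are assumed to have been inferred
   already by ontology traversal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Abundance:
    sample_term: str   # general term on the donor/biosample dimension
    assay_term: str    # general term on the assay/experiment dimension
    rna_species: str   # RNA species on the genomic dimension
    gene: str
    value: float       # pre-computed abundance estimate


# Invented records for illustration; not real exRNA Atlas data.
RECORDS = [
    Abundance("central nervous system neoplasm", "sequencing assay", "miRNA", "gene-1", 120.0),
    Abundance("central nervous system neoplasm", "sequencing assay", "tRNA", "gene-2", 35.5),
    Abundance("central nervous system neoplasm", "qPCR", "miRNA", "gene-1", 7.2),
    Abundance("healthy control", "sequencing assay", "miRNA", "gene-1", 15.0),
]


def data_slice(records, sample_term=None, assay_term=None, rna_species=None):
    """Select records along the three dimensions; None leaves one unconstrained."""
    def keep(r):
        return (
            (sample_term is None or r.sample_term == sample_term)
            and (assay_term is None or r.assay_term == assay_term)
            and (rna_species is None or r.rna_species == rna_species)
        )
    return [r for r in records if keep(r)]
```

   Constraining all three arguments yields a narrow slice, whereas leaving
   an argument unset returns every value along that dimension, which is one
   simple way to support both targeted and exploratory retrieval.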

Contextual interpretation of the results of integrative analyses

   As illustrated in [82]Fig. 1a, “data slices” are used for downstream
   integrative analysis. Such analyses produce sets of exRNA genes with
   relevant profiles. For example, comparative analysis may identify exRNA
   species that are highly abundant in plasma samples from patients who
   have a particular type of disease. As another example, an unsupervised
   clustering may identify a group of exRNA species that show highly
   correlated abundance levels across a variety of samples. In both
   examples, the result of analysis is a set of genes, possibly ranked by
   a significance score.

   After identifying a list of candidate exRNAs, the next step often
   involves relating these candidates to existing knowledge of mechanism
   and function. The ERC Consortium is pursuing two avenues towards this
   goal. First, knowledge of exRNA functions is often scattered in
   unstructured or semi-structured form across many online databases.
   These databases describe key information like expression patterns,
   vesicle and body fluid localization, and literature references. We will