Abstract

Background

   As the number of RNA-seq datasets that become available to explore
   transcriptome diversity increases, so does the need for easy-to-use
   comprehensive computational workflows. Many available tools facilitate
   analyses of one of the two major mechanisms of transcriptome diversity,
   namely, differential expression of isoforms due to alternative
   splicing, while the second major mechanism—RNA editing due to
   post-transcriptional changes of individual nucleotides—remains
   under-appreciated. Both mechanisms play an essential role in
   physiological and disease processes, including cancer and neurological
   disorders. However, elucidation of RNA editing events at the
   transcriptome-wide level requires increasingly complex computational
   tools, which creates a steep entry barrier for labs that are
   interested in large-scale, high-throughput variant calling
   applications but lack the manpower and/or computational expertise.

Results

   Here we present an easy-to-use, fully automated, computational pipeline
   (Automated Isoform Diversity Detector, AIDD) that contains open source
   tools for various tasks needed to map transcriptome diversity,
   including RNA editing events. To facilitate reproducibility and avoid
   system dependencies, the pipeline is contained within a pre-configured
   VirtualBox environment. The analytical tasks and format conversions are
   accomplished via a set of automated scripts that enable the user to go
   from a set of raw data, such as fastq files, to publication-ready
   results and figures in one step. A publicly available dataset of Zika
   virus-infected neural progenitor cells is used to illustrate AIDD’s
   capabilities.

Conclusions

   The AIDD pipeline offers a user-friendly interface for comprehensive and
   reproducible RNA-seq analyses. Among the unique features of AIDD are its
   ability to infer RNA editing patterns, including ADAR editing, and its
   inclusion of Guttman scale patterns for time series analysis of such
   editing landscapes. AIDD-based results show the importance of the
   diversity of ADAR isoforms, key RNA editing enzymes linked with the
   innate immune system and viral infections. These findings offer insights
   into the potential role of ADAR editing dysregulation in disease
   mechanisms, including those of congenital Zika syndrome. Because of its
   automated all-inclusive features, the AIDD pipeline enables even a novice
   user to easily explore common mechanisms of transcriptome diversity,
   including RNA editing landscapes.

   Keywords: High-throughput sequencing, Analysis of RNA-seq,
   Transcriptome, Editome, RNA editing, Isoform, Differential expression,
   Sequencing variants, Adenosine deaminases acting on RNA (ADAR)

Background

   Transcriptome complexity and diversity, including patterns of
   differential isoform expression, non-canonical transcripts, diversity
   of non-coding RNAs, and regulation of RNA editing, including editing by
   adenosine deaminases acting on RNA (ADAR) enzymes resulting in A to I
   substitutions, play fundamental roles in both normal physiological
   function and disease mechanisms [1–4]. Due to advances in deep
   sequencing technologies, RNA-seq experiments have become a more
   affordable and therefore popular tool for studying intricacies of
   molecular processes [5–8]. In fact, currently RNA-seq can be
   considered almost routine if not for the still substantial costs of
   experiments and subsequent in-silico analyses [9], including those
   associated with data storage and handling [10]. This, along with
   explosive increases in available volumes of data generated in
   large-scale RNA-seq experiments, contributes to an ongoing demand for
   universal, easy-to-use computational tools capable of user-specific
   customization.

   One of the widely used workflows available for high-throughput RNA-seq
   analyses is Galaxy, which is a reproducible and collaborative analytic
   platform that offers developers a framework for integrating and sharing
   their tools and workflows [11, 12]. Yet, although Galaxy is
   designed to be relatively easy to use, even for a beginner, performing
   more in-depth analysis with multi-step workflows often requires that a
   user possesses and/or has access to specialized bioinformatics
   expertise. Other challenges are related to sharing potentially
   large-scale analyses on a public webserver, which can become
   time-consuming, e.g., with time to completion increasing during peak
   usage hours. Further, while there are hundreds of workflows currently
   accessible on Galaxy, many of these are quite complex, have a
   substantial learning curve, and/or require user knowledge of reference
   genomes and file formats. This limits the types of datasets that can be
   analysed without deploying a custom Galaxy instance, which in turn
   requires specialized skills. Likewise, for tasks beyond the basic
   transcriptome discovery analysis the user would need to know how to
   install and utilize additional tools in the Galaxy instance, which
   somewhat hampers its usability for potential users with only basic
   computing skills. We note that the Galaxy Training Network
   (https://training.galaxyproject.org/, accessed 12 August 2020)
   already provides a variety of excellent tutorials to help inexperienced
   Galaxy users perform complex analyses [13]. These tutorials
   nonetheless require a substantial investment of time and effort from
   users, which may exclude small labs lacking the necessary manpower or
   somewhat limit Galaxy’s usability in classrooms. In the past few years
   several toolboxes have been released in an effort to address such
   challenges with using Galaxy [14–19]. Yet, these toolkits are
   often designed to analyse only one specific dimension of transcriptome
   diversity, and/or are not fully automated and require some prior
   knowledge of R command-line scripting [20].

Implementation:

AIDD features overview

   To help overcome some of these limitations, our pipeline, the Automated
   Isoform Diversity Detector (AIDD), has been designed with a novice
   user in mind and can thus be used, for example, as an educational
   tool for RNA-seq-based laboratory exercises in a classroom setting
   with minimal prior user training. Because the pipeline is packaged in
   a VirtualBox environment, it is easy to install on essentially any
   operating system (Windows, Linux, macOS) and on a broad range of
   hardware capable of handling a VirtualBox installation, without
   concerns for compatibility. Yet despite its simple installation, the
   AIDD pipeline is powerful enough to handle a broad range of RNA-seq
   analyses, spanning differential gene and isoform expression, variant
   calling, and RNA editing analysis using dimension reduction and
   machine learning approaches, including Guttman scale patterns [21]
   for time series analysis of ADAR editing landscapes. Unlike comparable
   tools, AIDD offers a fully automated data analysis pipeline with a
   simple setup and one-click execution, while still allowing for easily
   customizable options to account for a wide range of experimental
   conditions that users may wish to include. AIDD incorporates the GATK
   HaplotypeCaller [22], which is currently not available from Galaxy, as
   a variant caller for RNA editing prediction, together with customizable
   R and bash scripts for detailed statistical analyses of the
   transcriptome, including RNA editing patterns as well as
   transcriptome-level differential expression combined with gene
   enrichment and pathway analysis. SnpEff [23] is used to add depth to
   the complete transcriptome analysis by predicting the impact of RNA
   editing on protein structure and function. AIDD also
   performs data visualization as part of the automated pipeline and
   produces publication-ready heatmaps, volcano and violin plots, bar
   charts and Venn diagrams.
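
   RNA editing prediction from variant calls, as described above, rests
   on the fact that inosine is read as guanosine during sequencing, so
   A-to-I editing shows up as A>G mismatches (T>C when the transcript
   lies on the minus strand). The sketch below illustrates that
   filtering idea only. The function names and the tuple layout are
   hypothetical and are not taken from AIDD's own scripts, and a real
   analysis would also need to exclude known genomic SNPs.

```python
# Hypothetical helper, not part of AIDD: flags variant calls whose
# substitution is consistent with ADAR (A-to-I) editing.

def is_candidate_a_to_i(ref: str, alt: str, strand: str) -> bool:
    """Return True if a single-nucleotide variant matches the A-to-I signature.

    ref, alt : reference and alternate alleles as reported on the
               genome's forward strand (as in a VCF record)
    strand   : '+' or '-' strand of the transcript overlapping the site
    """
    ref, alt = ref.upper(), alt.upper()
    if strand == "+":
        return (ref, alt) == ("A", "G")
    if strand == "-":
        # An A>G change in a minus-strand transcript appears as T>C
        # on the forward strand.
        return (ref, alt) == ("T", "C")
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


def count_candidates(variants) -> int:
    """Count A-to-I candidates in an iterable of (ref, alt, strand) tuples."""
    return sum(is_candidate_a_to_i(r, a, s) for r, a, s in variants)
```

   Taking strand into account matters because a variant caller reports
   alleles relative to the reference genome, not to the transcript.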

AIDD availability and hardware requirements

   The AIDD pipeline is built in an Oracle VirtualBox
   (https://www.oracle.com/virtualization/virtualbox/index.html,
   accessed 12 August 2020) virtual machine based on the Ubuntu 18.04.2
   LTS (Bionic Beaver) 64-bit PC (AMD64) desktop image
   (http://releases.ubuntu.com/18.04/, accessed 12 August 2020) and
   contains all tools necessary for transcriptome-level analysis
   (Fig. 1). The distributed VirtualBox image is ~20 GB in size and
   is publicly available for download via a Google Drive link
   (https://drive.google.com/open?id=1XOWh9H-v1nA6_Vl53PI6G2gKaVoZX6ls,
   accessed 12 August 2020). An up-to-date detailed description of the
   included software tools, the AIDD manual, and a step-by-step tutorial
   are distributed via our GitHub site
   (https://github.com/RNAdetective/AIDD, accessed 12 August 2020).

Fig. 1.

   Flow chart of the tools and steps used in the automated workflow
   carried out by the AIDD pipeline. The analysis begins with gathering
   relevant RNA-seq data files from the NCBI SRA database, followed by
   read alignment using HISAT2 with Ensembl annotations. Transcriptome
   assembly is then performed by StringTie. Downstream expression analysis
   can be performed using multiple tools, including DESeq2, edgeR and
   topGO. Variant calling to detect RNA editing events, including A-to-I
   editing, is performed using tools implemented in GATK, and statistical
   analysis of the effect of RNA editing is performed using custom R
   scripts.

   Tailored toward a novice user with no or minimal experience in
   computational analyses, AIDD is designed to run automatically with
   limited user input through a customizable bash script that controls
   multiple computational tools, including HISAT2 and GATK, among others,
   to comprehensively analyse RNA-seq datasets. AIDD can be deployed on
   almost any modern laboratory, classroom or office computer capable of
   running Ubuntu 18.04 in a VirtualBox environment. To shorten the early
   learning curve, the pipeline is set up to run with default parameters
   directly “out of the box”, and includes commented-out examples in the
   form of an R Markdown file that the user can choose to deploy as a
   step-by-step tutorial.
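
   To make the idea of a single script driving several tools more
   concrete, the sketch below assembles, for one sample, the kind of
   command sequence that the alignment, assembly and variant calling
   steps of the workflow involve. It is an illustration under stated
   assumptions rather than AIDD's actual bash script: the function
   name, the paired-end input and the file naming scheme are
   hypothetical, only basic and commonly used options of each tool are
   shown, and the preprocessing that GATK normally expects before
   variant calling on RNA-seq alignments is omitted.

```python
from typing import List


def build_commands(sample: str, index: str, reference: str,
                   annotation: str, threads: int = 4) -> List[List[str]]:
    """Return the per-sample commands, in execution order, as argument lists.

    File names such as '<sample>_1.fastq' are an assumed naming scheme
    chosen for this illustration, not the one used by AIDD.
    """
    fq1, fq2 = f"{sample}_1.fastq", f"{sample}_2.fastq"
    sam, bam = f"{sample}.sam", f"{sample}.sorted.bam"
    return [
        # Align paired-end reads against a pre-built HISAT2 index.
        ["hisat2", "-p", str(threads), "-x", index,
         "-1", fq1, "-2", fq2, "-S", sam],
        # Coordinate-sort the alignments and write them as BAM.
        ["samtools", "sort", "-@", str(threads), "-o", bam, sam],
        # Assemble transcripts, guided by the reference annotation.
        ["stringtie", bam, "-G", annotation, "-o", f"{sample}.gtf"],
        # Call variants that can later be screened for editing events.
        ["gatk", "HaplotypeCaller", "-R", reference,
         "-I", bam, "-O", f"{sample}.vcf"],
    ]
```

   Returning argument lists instead of running the tools directly keeps
   the sketch free of side effects; each list could be passed to
   subprocess.run(cmd, check=True) on a system where the tools are
   installed.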

   The minimum recommended hardware specifications include a 4 GHz
   dual-core processor (or better), 8 to 12 GB of system memory available
   to the virtual environment, and 50 GB of free hard drive space
   (https://www.ubuntu.com/download/desktop, accessed 12 August 2020),
   although at least 16 GB of system memory is recommended, and some
   applications may require more. For example, the STAR alignment tool
   needs at least 10 bytes of memory per base of the target genome, which
   for the human genome translates into at least 32 GB, and more if
   annotations are needed [24].
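
   The memory rule of thumb for STAR can be checked with simple
   arithmetic. The sketch below assumes a human genome size of roughly
   3.1 billion bases (an approximate figure used here for illustration)
   and decimal gigabytes; the function names are hypothetical.

```python
def star_min_memory_gb(genome_size_bases: float,
                       bytes_per_base: int = 10) -> float:
    """Lower-bound memory estimate in GB (10^9 bytes) for a genome.

    Applies the rule of thumb of about 10 bytes of RAM per genome base.
    """
    return genome_size_bases * bytes_per_base / 1e9


def fits_in_memory(genome_size_bases: float, available_gb: float,
                   bytes_per_base: int = 10) -> bool:
    """Return True if the estimated requirement fits in available_gb."""
    return star_min_memory_gb(genome_size_bases, bytes_per_base) <= available_gb


# Assumed approximate human genome size, for illustration only.
HUMAN_GENOME_BASES = 3.1e9
```

   With these assumptions the estimate for the human genome comes to
   about 31 GB, which is why a 16 GB machine is insufficient for this
   particular tool while a 32 GB machine sits at the lower limit.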

Included example datasets: transcriptomes of ZIKV-infected neural progenitor
cell lines and importance of ADAR gene family

   To illustrate the capabilities of AIDD, we use a publicly available
   dataset from a study by McGrath et al. [25] that contains RNA-seq
   data from three genetically distinct neural progenitor cell (NPC) lines
   infected with Zika virus (ZIKV). The authors found varying degrees of
   severity of symptoms associated with congenital Zika syndrome (CZS),
   including decreased differentiation and proliferation, and increased
   signs of apoptosis [25]. McGrath et al. also reported increased
   expression of genes involved in the innate immune response, including
   interferon alpha (IFNA) and adenosine deaminase acting on RNA (ADAR),
   during ZIKV infection (Additional file 2: Table 1 in McGrath et al.
   2017) [25]. The ADAR gene family consists of three genes, namely,
   ADAR (also referred to as ADAR1), ADARB1 (ADAR2), and ADARB2 (ADAR3).
   Only ADAR and ADARB1 have proven deaminase activity [26–28],
   catalyzing the deamination of adenosine (A) to inosine (I) that is
   seen in RNA editing [29, 30]. ADARB2 is thought to play a
   regulatory role through competition with other ADARs for substrate
   binding [29, 31]. ADARs play a prominent role in the nervous
   system [30, 32, 33], specifically in the brain [34, 35], where the
   majority of ADAR editing target genes are expressed
   [20, 26, 34, 36], including during development [37].

Running AIDD: uploading RNA-seq data into AIDD

   AIDD is designed to automatically download and convert RNA-seq datasets
   from the SRA accession numbers that the user defines in the experimental
   conditions table. For the example analysis discussed here, a subset of
   BioProject PRJNA360845 [25] was downloaded and converted to fastq
   format. Once the data are in fastq format, fastqc
   (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/,
   accessed 12 August 2020) is used for quality control. After the user
   assesses the quality of the files, fastx-Toolkit
   (http://hannonlab.cshl.edu/fastx_toolkit/, accessed 12 August 2020)
   is used to trim fastq files to ensure the best quality for alignment. In
   addition to downloading and preparing sequences, AIDD also
   automatically downloads and formats all necessary default references