Abstract

Background

As the number of RNA-seq datasets available for exploring transcriptome diversity increases, so does the need for easy-to-use, comprehensive computational workflows. Many available tools facilitate analyses of one of the two major mechanisms of transcriptome diversity, namely, differential expression of isoforms due to alternative splicing, while the second major mechanism, RNA editing due to post-transcriptional changes of individual nucleotides, remains under-appreciated. Both mechanisms play an essential role in physiological and disease processes, including cancer and neurological disorders. However, elucidation of RNA editing events at the transcriptome-wide level requires increasingly complex computational tools, in turn creating a steep entry barrier for labs that are interested in high-throughput variant calling applications on a large scale but lack the manpower and/or computational expertise.

Results

Here we present an easy-to-use, fully automated computational pipeline (Automated Isoform Diversity Detector, AIDD) that contains open source tools for the various tasks needed to map transcriptome diversity, including RNA editing events. To facilitate reproducibility and avoid system dependencies, the pipeline is contained within a pre-configured VirtualBox environment. The analytical tasks and format conversions are accomplished via a set of automated scripts that enable the user to go from a set of raw data, such as fastq files, to publication-ready results and figures in one step. A publicly available dataset of Zika virus-infected neural progenitor cells is used to illustrate AIDD’s capabilities.

Conclusions

The AIDD pipeline offers a user-friendly interface for comprehensive and reproducible RNA-seq analyses. Among the unique features of AIDD are its ability to infer RNA editing patterns, including ADAR editing, and its inclusion of Guttman scale patterns for time series analysis of such editing landscapes.
AIDD-based results show the importance of the diversity of ADAR isoforms, key RNA editing enzymes linked with the innate immune system and viral infections. These findings offer insights into the potential role of ADAR editing dysregulation in disease mechanisms, including those of congenital Zika syndrome. Because of its automated, all-inclusive features, the AIDD pipeline enables even a novice user to easily explore common mechanisms of transcriptome diversity, including RNA editing landscapes.

Keywords: High-throughput sequencing, Analysis of RNA-seq, Transcriptome, Editome, RNA editing, Isoform, Differential expression, Sequencing variants, Adenosine deaminases acting on RNA (ADAR)

Background

Transcriptome complexity and diversity, including patterns of differential isoform expression, non-canonical transcripts, diversity of non-coding RNAs, and regulation of RNA editing, including editing by adenosine deaminases acting on RNA (ADAR) enzymes resulting in A-to-I substitutions, play fundamental roles in both normal physiological function and disease mechanisms [1–4]. Due to advances in deep sequencing technologies, RNA-seq experiments have become a more affordable and therefore popular tool for studying the intricacies of molecular processes [5–8]. In fact, RNA-seq could now be considered almost routine were it not for the still substantial costs of experiments and subsequent in silico analyses [9], including those associated with data storage and handling [10]. This, along with explosive increases in the volumes of data generated in large-scale RNA-seq experiments, contributes to an ongoing demand for universal, easy-to-use computational tools capable of user-specific customization. One of the widely used platforms for high-throughput RNA-seq analyses is Galaxy, a reproducible and collaborative analytic platform that offers developers a framework for integrating and sharing their tools and workflows [11, 12].
Yet, although Galaxy is designed to be relatively easy to use, even for a beginner, performing more in-depth analysis with multi-step workflows often requires that a user possesses and/or has access to specialized bioinformatics expertise. Other challenges relate to running potentially large-scale analyses on a shared public web server, which can become time-consuming, e.g., with time to completion increasing during peak usage hours. Further, while there are hundreds of workflows currently accessible on Galaxy, many of these are quite complex, have a substantial learning curve, and/or often require user knowledge of reference genomes and file formats. This limits the types of datasets that can be analysed without deploying a custom Galaxy instance, which in turn requires specialized skills. Likewise, for tasks beyond basic transcriptome discovery analysis, the user would need to know how to install and utilize additional tools in the Galaxy instance, which somewhat hampers its usability for a potential user with only basic computing skills. We would like to note that the Galaxy Training Network (https://training.galaxyproject.org/, accessed 12 August 2020) already provides a variety of excellent tutorials to help inexperienced Galaxy users perform complex analyses [13]. These tutorials nonetheless require substantial investments of time and effort from users, which may exclude small labs lacking the necessary manpower or somewhat limit Galaxy’s usability in the classroom. In the past few years, several toolboxes have been released in an effort to address such challenges with using Galaxy [14–19]. Yet these toolkits are often designed to analyse only one specific dimension of transcriptome diversity, and/or are not fully automated and require some prior knowledge of R command-line scripting [20].
Implementation: AIDD features overview

To help overcome some of these limitations, our pipeline, the Automated Isoform Diversity Detector (AIDD), has been designed with a novice user in mind and thus can be used, for example, as an educational tool for RNA-seq-based laboratory exercises in a classroom setting with minimal prior user training. Because the pipeline is packaged in a VirtualBox environment, it is easy to install on essentially any operating system (Windows, Linux, macOS) and a broad range of hardware capable of handling a VirtualBox installation, without concerns for compatibility. Despite the simplicity of its installation, the AIDD pipeline is powerful enough to handle a broad range of RNA-seq analyses, from differential gene and isoform expression to variant calling and RNA editing analysis using dimension reduction and machine learning approaches, including Guttman scale patterns [21] for time series analysis of ADAR editing landscapes. Unlike comparable tools, AIDD offers a fully automated data analysis pipeline with a simple setup and one-click execution, while still allowing for easily customizable options to account for a wide range of experimental conditions that users may wish to include. AIDD incorporates the GATK HaplotypeCaller [22], which is currently not available from Galaxy, as a variant caller for RNA editing prediction, along with customizable R and bash scripts for detailed statistical analyses of the transcriptome, including RNA editing patterns as well as transcriptome-level differential expression combined with gene enrichment and pathway analysis. SnpEff [23] is used to add depth to the complete transcriptome analysis by predicting the impact of RNA editing on protein structure and function. AIDD also performs data visualization as part of the automated pipeline and produces publication-ready heatmaps, volcano and violin plots, bar charts and Venn diagrams.
AIDD availability and hardware requirements

The AIDD pipeline is built in an Oracle VirtualBox (https://www.oracle.com/virtualization/virtualbox/index.html, accessed 12 August 2020) virtual machine based on the Ubuntu 18.04.2 LTS (Bionic Beaver) 64-bit PC (AMD64) desktop image (http://releases.ubuntu.com/18.04/, accessed 12 August 2020) and contains all tools necessary for transcriptome-level analysis (Fig. 1). The distributed VirtualBox image is approximately 20 GB in size and is publicly available for download via a Google Drive link (https://drive.google.com/open?id=1XOWh9H-v1nA6_Vl53PI6G2gKaVoZX6ls, accessed 12 August 2020). An up-to-date, detailed description of the included software tools, the AIDD manual and a step-by-step tutorial for AIDD are distributed via our GitHub site (https://github.com/RNAdetective/AIDD, accessed 12 August 2020).

Fig. 1. Flow chart of the tools and steps used in the automated workflow carried out by the AIDD pipeline. The analysis begins with gathering relevant RNA-seq data files from the NCBI SRA database, followed by read alignment using HISAT2 with Ensembl annotations. Transcriptome assembly is then performed by StringTie. Downstream expression analysis can be performed using multiple tools, including DESeq2, edgeR and topGO. Variant calling to detect RNA editing events, including A-to-I editing, is performed using tools implemented in GATK, and statistical analysis of the effect of RNA editing is performed using custom R scripts.

Tailored toward a novice user with no or minimal experience in computational analyses, AIDD is designed to run automatically with limited user input through a customizable bash script that controls multiple computational tools, including HISAT2 and GATK, among others, to comprehensively analyse RNA-seq datasets. AIDD can be deployed on almost any modern laboratory, classroom or office computer capable of running Ubuntu 18.04 in a VirtualBox environment.
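As a rough illustration of the kind of tool chaining such a control script performs, the following Python sketch assembles, without running them, command lines for the main stages named in Fig. 1. It is not AIDD's actual script: the function name `build_commands`, the file layout, the placeholder accession, the use of `fasterq-dump` and `samtools sort`, and the specific options shown are all assumptions chosen for illustration. A real RNA-seq variant-calling workflow typically needs further preprocessing (for example, read-group assignment, duplicate marking and BAM indexing), which is omitted here.

```python
from pathlib import Path
import shlex


def build_commands(run_id, index_prefix, genome_fasta, annotation_gtf,
                   workdir="work", threads=4):
    """Return ordered command lines for one paired-end SRA run.

    Every path and option here is an illustrative assumption,
    not an AIDD default.
    """
    wd = Path(workdir)
    r1 = wd / f"{run_id}_1.fastq"
    r2 = wd / f"{run_id}_2.fastq"
    sam = wd / f"{run_id}.sam"
    bam = wd / f"{run_id}.sorted.bam"
    gtf = wd / f"{run_id}.gtf"
    vcf = wd / f"{run_id}.vcf.gz"
    t = str(threads)
    return [
        # 1. Download the run from NCBI SRA and convert it to fastq.
        ["fasterq-dump", run_id, "--split-files", "-O", str(wd)],
        # 2. Spliced alignment with HISAT2; --dta reports alignments
        #    in a form suited to downstream transcript assemblers.
        ["hisat2", "--dta", "-p", t, "-x", index_prefix,
         "-1", str(r1), "-2", str(r2), "-S", str(sam)],
        # 3. Coordinate-sort the alignments into a BAM file.
        ["samtools", "sort", "-@", t, "-o", str(bam), str(sam)],
        # 4. Transcriptome assembly guided by the reference annotation.
        ["stringtie", str(bam), "-G", annotation_gtf, "-p", t,
         "-o", str(gtf)],
        # 5. Variant calling; candidate editing sites are filtered later.
        ["gatk", "HaplotypeCaller", "-R", genome_fasta,
         "-I", str(bam), "-O", str(vcf)],
    ]


if __name__ == "__main__":
    # "SRR0000000" is a placeholder accession, not a real dataset.
    for cmd in build_commands("SRR0000000", "ref/genome_index",
                              "ref/genome.fa", "ref/genes.gtf"):
        print(shlex.join(cmd))
```

Building the commands as lists rather than as one shell string keeps each stage inspectable before anything is executed. Each list could then be passed to `subprocess.run(cmd, check=True)` so that a failing stage stops the chain.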
To shorten the early learning curve, the pipeline is set up to run with default parameters directly “out of the box”, and includes commented-out examples in the form of an R Markdown file that the user can choose to deploy as a step-by-step tutorial. The minimum hardware specifications include a 4 GHz dual-core processor (or better), 8 to 12 GB of system memory available to the virtual environment, and 50 GB of free hard drive space (https://www.ubuntu.com/download/desktop, accessed 12 August 2020), although at least 16 GB of system memory is recommended, and some applications may require more. For example, the STAR alignment tool needs at least 10 bytes of memory per base of the target genome, which for the human genome (about 3 gigabases) translates into at least 32 GB, and more if annotations are needed [24].

Included example datasets: transcriptomes of ZIKV-infected neural progenitor cell lines and importance of ADAR gene family

To illustrate AIDD’s capabilities, we use a publicly available dataset from a study by McGrath et al. [25] that contains RNA-seq data from three genetically distinct neural progenitor cell (NPC) lines infected with Zika virus (ZIKV). The authors found varying degrees of severity of symptoms associated with congenital Zika syndrome (CZS), including decreased differentiation and proliferation, and increased signs of apoptosis [25]. McGrath et al. also reported increased expression of genes involved in the innate immune response, including interferon alpha (IFNA) and adenosine deaminase acting on RNA (ADAR), during ZIKV infection (Additional file 2: Table 1 in McGrath et al. 2017) [25]. The ADAR gene family consists of three genes, namely, ADAR (also referred to as ADAR1), ADARB1 (ADAR2), and ADARB2 (ADAR3). Only ADAR and ADARB1 have proven deaminase activity [26–28], catalyzing the deamination of adenosine (A) to inosine (I) seen in RNA editing [29, 30].
ADARB2 is thought to play a regulatory role through competition with other ADARs for substrate binding [29, 31]. ADARs play a prominent role in the nervous system [30, 32, 33], specifically in the brain [34, 35], where the majority of ADAR editing target genes are expressed [20, 26, 34, 36], including during development [37].

Running AIDD: uploading RNA-seq data into AIDD

AIDD is designed to automatically download and convert RNA-seq datasets from the SRA accession numbers that the user defines in the experimental conditions table. For the example analysis discussed here, a subset of BioProject PRJNA360845 [25] was downloaded and converted to fastq format. After conversion, FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/, accessed 12 August 2020) is used for quality control. Once the user has assessed the quality of the files, the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/, accessed 12 August 2020) is used to trim the fastq files to ensure the best quality for alignment. In addition to downloading and preparing sequences, AIDD also automatically downloads and formats all necessary default references