Abstract Background The study of RNA has been dramatically improved by the introduction of Next Generation Sequencing platforms allowing massive and cheap sequencing of selected RNA fractions, also providing information on strand orientation (RNA-Seq). The complexity of transcriptomes and of their regulative pathways make RNA-Seq one of most complex field of NGS applications, addressing several aspects of the expression process (e.g. identification and quantification of expressed genes and transcripts, alternative splicing and polyadenylation, fusion genes and trans-splicing, post-transcriptional events, etc.). Moreover, the huge volume of data generated by NGS platforms introduces unprecedented computational and technological challenges to efficiently analyze and store sequence data and results. Methods In order to provide researchers with an effective and friendly resource for analyzing RNA-Seq data, we present here RAP (RNA-Seq Analysis Pipeline), a cloud computing web application implementing a complete but modular analysis workflow. This pipeline integrates both state-of-the-art bioinformatics tools for RNA-Seq analysis and in-house developed scripts to offer to the user a comprehensive strategy for data analysis. RAP is able to perform quality checks (adopting FastQC and NGS QC Toolkit), identify and quantify expressed genes and transcripts (with Tophat, Cufflinks and HTSeq), detect alternative splicing events (using SpliceTrap) and chimeric transcripts (with ChimeraScan). This pipeline is also able to identify splicing junctions and constitutive or alternative polyadenylation sites (implementing custom analysis modules) and call for statistically significant differences in genes and transcripts expression, splicing pattern and polyadenylation site usage (using Cuffdiff2 and DESeq). Results Through a user friendly web interface, the RAP workflow can be suitably customized by the user and it is automatically executed on our cloud computing environment. This strategy allows to access to bioinformatics tools and computational resources without specific bioinformatics and IT skills. RAP provides a set of tabular and graphical results that can be helpful to browse, filter and export analyzed data, according to the user needs. Keywords: RNA-Seq, Alternative splicing, transcriptome regulation, workflow Background RNA-Seq has become one of the most popular technique across the vast landscape of the next and third generation sequencing technologies [[39]1]. It can be profitably used to investigate the gene expression process, estimating both the nature and the quantity of expressed mRNAs [[40]2] by sequencing a complete transcriptome in any cell/tissue type and condition. The ability to simultaneously detect and quantify the expression profile for a large number of genes in specific physiological and pathological conditions has opened new avenues for a deeper understanding of biological processes and their regulation (e.g. genome-wide investigation of epigenetic inheritance) and paved the way for several biotechnological and biomedical applications. Indeed, RNA-Seq can identify and quantify expressed genes and transcripts providing precious biological information on the underlying gene expression mechanisms. Notably, gene expression is a highly regulated process and in some cases final products cannot be fully characterized by analyzing short reads generated by NGS platforms particularly when many alternative transcripts of remarkable length are generated due to complex co-transcriptional and post-transcriptional nuclear processing, including alternative initiation and termination of transcription and alternative splicing [[41]3]. RNA-Seq data can be analyzed by adopting several computational strategies also depending on the requested results (e.g. expression at gene and/or transcript level, investigation of alternative splicing events, alternative polyadenylation sites, etc.). However, despite recent technological advances, key transcriptome features are yet to be fully elucidated, and its scale and complexity have not yet been fully understood [[42]4]. In order to provide easy and effective access to the gene expression studies to researchers with few or limited bioinformatics skills, user-friendly automated workflows are highly demanded to provide reliable and easy interpretable results [[43]5] which also keep up with the exponential growth of sequencing technologies [[44]6]. To the authors knowledge, several pipeline tools have been implemented for RNA-seq data analysis [[45]7-[46]13], some of them proposed along with novel algorithmic approaches to refine final results [[47]7,[48]8]. However, most of them are totally lacking an easy accessible web interface for cloud computing [[49]7-[50]9]. Most of them do not allow to execute a complete pipeline from read mapping to advanced processing tasks since they only implement a few specific steps [[51]11-[52]13]. Some tools [[53]13] are implemented as Galaxy [[54]14] customized instances, extending the platform contribution to the bioinformatics community. Unfortunately, they do not feature a powerful responsive web application to browse, filter and explore the analysis results, that is, in our experience, the main interest of final users. In this paper we present the implementation of a modular analysis workflow, named RAP (RNA-Seq Analysis Pipeline), designed to analyze sequencing data in multiple steps, each one addressing a specific task. The purpose of RAP is to investigate the complex transcriptional landscape of eukaryotic transcriptomes through a computationally optimized RNA-Seq data analysis. This tool is a web application implementing an automatic but completely customizable analysis workflow able to carry out a comprehensive transcriptome analysis and provide a wide range of results, which consist of several tabular and graphical representations that can be suitably organized, filtered and browsed according to the user needs. The workflow is presented as a multi-step pipeline. The expressed isoforms are reconstructed by adopting spliced mapping algorithms to align reads to the genome and assembling them into full-length transcripts. Then, through isoforms deconvolution methods, transcripts expression is quantified. Alternative splicing is investigated by mapping reads against an exhaustive splice junctions library constructed by a comprehensive combinatorial assortments of known splicing events. Computational strategies implemented in RAP modules also allow to identify polyadenylation sites, elementary alternative splicing events (e.g. exon skipping) and chimeric transcripts. Finally, the comparative analysis of two or more experiment groups corresponding to different physiological or pathological conditions allows the detection of statistically significant changes in expression levels at gene or transcript isoform level, in the splicing pattern (e.g. differential exon skipping) and in used polyadenylation sites. This broad variety of different analysis branches is achieved by means of a highly modularized implementation and a fully generalized computational engine. Implementation RAP (RNA-Seq Analysis Pipeline) is a web application implementing a fully automated analysis workflow, designed to integrate in-house developed scripts as well as open source analysis tools into one pipeline (Figure [55]1). Using RAP the user can perform a complete RNA-Seq analysis without any specific technical competence nor directly managing the complexity of distributed computational resources. Moreover RAP also offers a web interface for results management and visualization, allowing the user to browse and filter the massive amount of data obtained from typical RNA-Seq experiments. Figure 1. Figure 1 [56]Open in a new tab Schematic description of RAP workflow. Quality check and filtering step for quality assessment (Steps 1 and 2). High quality reads are aligned to the reference genome using TopHat (Step 3). Alignments are assembled into full-length transcripts and their relative abundances are estimated by Cufflinks to (Step 4) and raw-counted by HTSeq (Step 5). Unspliced reads are filtered out after an ungapped alignment to the genome (Step 6). Remaining reads (potentially spliced) are mapped to a custom built junction library (Step 7). Reads still unmapped are scanned to identify poly(A) tags (Step 8). Cassette exons are identified and quantified by adopting a statistical tool, SpliceTrap (Step 9), and chimeric transcripts are detected by means of ChimeraScan (Step 10). After the completion of the main analysis, several differential analyses can be executed. Cuffdiff at transcript level (Step A), based on expression levels calculated by Cufflinks. DESeq at gene level (Step B), based on gene raw counts calculated by HTSeq. Differential exons (Step C), differential junctions usage (Step D) and differential polyadenylation sites (Step E) can also be calculated. RAP takes as input short-read datasets produced by Illumina sequencing platforms and supports several standard file formats (FASTQ, SRA, BAM and compressed archives). The pipeline is designed to analyze the data through a series of phases, each of them focused on a specific task. RAP integrates two widespread tools for quality assessment: FastQC [[57]15] and NGS QC Toolkit [[58]16] both providing quality checks and statistics, useful for a preliminary evaluation and filtering of raw data. After the quality control and reads filtering phase, different paths can be simultaneously followed, according to the user's request. The first path implements TopHat [[59]17]/ Cufflinks [[60]18] / Cuffdiff2 [[61]19] and/or HTSeq/DESeq [[62]20] for gene/isoform expression quantification and differential analysis. The second one detects and quantifies alternative splicing events, particularly cassette exons and their differential occurrence through SpliceTrap [[63]21]. The third path detects and quantifies splicing junctions, including novel ones as well as polyadenylation sites and their differential occurrence in different samples. The splice junctions library is built starting from reference gene annotation models and includes both known and potential splice junctions (see below for further details). Residual reads still unmapped to the genome, transcriptome and junctions are analyzed to eventually detect polyadenylation sites. PolyA tags (reads containing a stretch of A (An) at the end of the sequence) are extracted, An-trimmed and re-aligned to the genome. Following the alignment, a pattern matching procedure is also applied to possibly detect common polyadenylation signal sequences (PAS) such as AAUAAA. Finally, the last path identifies chimeric fusion transcripts through ChimeraScan [[64]22], a tool based on Bowtie [[65]23] alignments to detect putative fusion breakpoints. At present RAP is implemented for several organisms: Homo sapiens (genomes hg18 and hg19), Mus musculus (genomes mm9 and mm10), Rattus norvegicus (genome rn4), Drosophila melanogaster (genome dm3), Saccharomyces cerevisiae (genome sacCer3) and Zea mays (genomes maize2, maize3 and Mo17_v1). Additional genomes will be considered and implemented in the future, if required by users. Quality checks An effective quality check is critical for a reliable data analysis, since the read quality may affect downstream results. FastQC [[66]15] (Figure [67]1, step 1) provides a range of quality checking modules covering different aspects of raw read quality, helpful to highlight sequencing biases and contaminants. NGS QC Toolkit [[68]16] (Figure [69]1, step 2) is a suite of tools for quality check and filtering of NGS data from Illumina and Roche 454 platforms. This tool allows to filter out low quality reads and increase the overall dataset quality. High quality reads are processed to extract several information at different analysis stages. Read mapping Since RAP is specifically designed to work with eukaryotic genomes and to detect novel splicing events, a spliced aligner is required to map reads to the genome, across the introns. TopHat2 [[70]17] (Figure [71]1, step 3), using Bowtie2 [[72]24], can map reads against both genome and transcriptome (when an annotation is provided). Since full-length transcripts bring to a significant gain in both sensitivity and accuracy (e.g. for the right recognition of pseudogenes), RAP automatically includes transcript annotations into the pipeline. Reads are first mapped to known transcripts and unmapped reads (deriving from unknown transcripts or containing many miscalled bases) or poorly aligned reads are then aligned to the genome. Transcripts reconstruction and quantification After TopHat2 execution, the resulting alignment files are provided to Cufflinks [[73]18] (Figure [74]1, step 4) to generate a transcriptome assembly and to estimate the expression level of all detected isoforms. Furthermore a gene expression raw-count is also calculated with HTSeq [[75]20]. Since RAP does not allow de novo transcript assembly, Cufflinks is guided by a gene reference annotation file (the same used during the alignment). The user can select between two reconstruction algorithms provided by Cufflinks assembler. A first option is to use reference transcripts to reconstruct known isoforms and avoid the assembly of putative novel transcripts (basic assembler). A second option is to use the supplied reference annotation to guide the assembly and include in the output both reference transcripts and novel assembled genes and isoforms (RABT assembler) [[76]25]. Reconstructed transcripts are then analyzed to estimate their relative abundance, measured in FPKM (expected Fragments Per Kilobase of transcript per Million fragments sequenced). RNA-Seq experiments suffer of known sequencing bias, that can jeopardize the assumption of uniform coverage. A bias can be typically introduced by the cDNA amplification phase when random hexamers are generally used [[77]26]. Cufflinks can be used to reduce biases adopting two strategies. It estimates approximated transcripts abundance to weight reads and profiling the transcripts sequences, then abundances are re-estimated adjusting the initial approximation on the bases of detected sequencing biases (fragment bias correction). Furthermore, reads mapped to multiple genome positions are at first uniformly assigned to each position and then expression levels are re-estimated by probabilistically dividing reads on the basis of the first abundance estimation, the inferred fragment length and the fragment bias (multi reads correction). Since most of the tools for differential gene expression analysis require raw counts, RAP also includes a second approach to estimate the relative gene expression by using the HTSeq suite [[78]20] (Figure [79]1, step 5). Since reads can overlap, even partially, two or more features (e.g. exons), RAP implemented the intersection_nonempty mode provided by HTSeq to guarantee the highest number of assignments. With this mode if a read is completely mapped on a feature and partially on another, the read is assigned to the first feature. Furthermore, this mode is able to handle partial overlap on a single feature. Splice junctions detection High quality reads, obtained from NGS QC Toolkit, are also analyzed to detect splicing junctions. This phase is executed in parallel with the previous steps. Although TopHat2 is already able to detect both known and novel splicing junctions, the algorithm implemented in this phase is more focused on this task and can detect a greater number of splicing events. To reduce the computational load, unspliced reads, detected by Bowtie mapping to the reference genome, are discarded from the initial dataset (Figure [80]1, step 6). Because Bowtie2 [[81]23] is an unspliced aligner, only intra-exonic reads will be mapped in this phase and discarded. Unmapped reads may potentially present a spliced alignment on the genome easily detectable, again using Bowtie, by mapping to a custom built splice junctions library (Figure [82]1, step 7). The splice junctions library is built starting from a gene annotation model in GTF format ([83]http://www.ensembl.org/info/website/upload/gff.html), the same already used during the alignment and the assembly steps) and includes two different categories of splice junctions: known and novel. Known junctions are directly derived from RefSeq [[84]27] while novel junctions are obtained through a combinatorial exon skipping procedure by considering all compatible exon skipping patterns. In a transcript with k exons, k-1 known splice junctions can be observed. By selectively skipping the inner exons, k-2 novel splice junctions can be identified. Multiple sequential inner exons can also be skipped, picking out further junctions. This combinatorial approach exhaustively lists all [MATH: k-2×k-12 :MATH] novel compatible splice junctions and it is applied to each annotated transcript. Splice junctions already included in the known junctions set are not counted to avoid overestimation of novel junctions. From each splice junction, the flanking exons sequences are extracted, considering the 3' end of the upstream exon and the 5' end of the downstream exon, and spliced together. The library is produced considering several boundaries lengths (50, 75, 100 and 150bp) to have an optimal fitting with the user-provided read lengths. If a flanking exon is shorter than the selected boundary length, a truncated junction is extracted. Since the two boundaries are fused together each sequence is annotated with a length of source exons, to be able to further determine the fusion point even for truncated junctions. The human RefSeq junctions library contains about 200,000 known splice sites on a total of about 2 millions combinatorial junctions included in the library. Polyadenylation site detection Reads still unmapped to the genome, transcriptome and junctions are further analyzed to identify polyadenylation sites (Figure [85]1, step 8). PolyA tags (reads containing a stretch of A at the end of the sequence) are extracted, trimmed and aligned to the genome. Since the read may cover a short final exon, the sequence could also contain a splice junction. Therefore a spliced alignment with TopHat2 is adopted. Following the alignment, a parsing procedure is applied to annotate the concurrent occurrence of polyadenylation signal (PAS) sequences (i.e. AAUAAA, AUUAAA or less common variants). PAS are searched in order of frequency, from the most common to the rarest. Over the canonical polyadenylation signal (AAUAAA) a total of 10 variants are considered [[86]28]. Cassette exons identification Even though cassette exons (i.e. exon skipping events) can be recognized by the splice junction mapping phase, RAP also handles a more specialized step focused on this and other elementary alternative splicing (AS) events such as intron retention and alternative 5' or 3' splice sites. Indeed in higher eukaryotes, exon skipping is the most common AS event, accounting nearly 40% of all events [[87]29]. RAP adopts a statistical method for splicing events identification and quantification implemented by SpliceTrap [[88]21] (Figure [89]1, step 9). In order to reliably detect and quantify every potential exon-skipping event, SpliceTrap builds an exon-trio database (TXdb) capturing all known transcripts obtained from RefSeq annotations [[90]27] and breaking each transcript into all possible exon trios. Each exon trio thus leads to two sequences: an inclusion isoform with all three exons and a skipping isoform with the two flanking exons only. Reads are then aligned to the TXdb database using Bowtie and poorly covered exon trios are filtered out applying a dynamic exon-size-dependent cut-off strategy. Exon inclusion ratios are then estimated adopting a Bayesian model, given the probability of observing each fragment on a specific isoform (as function of both inclusion and exclusion isoform expression level and normalized by the isoforms lengths) [[91]21]. The exon inclusion ratio is defined as the expression level of the inclusion isoform divided by the total expression level of both isoforms (inclusion and exclusion) derived from an exon trio. SpliceTrap is also able to identify other splicing events, such as intron retention (IR), alternative donor (5' splice site) (AD) and alternative acceptor (3' splice site) (AA). Chimeric transcripts annotation A further RAP path carries out a specific data analysis to detect chimeric transcripts (Figure [92]1, step 10), i.e. RNAs encoded by a fusion gene or by two different genes through a trans-splicing event. RAP integrates ChimeraScan [[93]22], a tool based on Bowtie alignments to detected putative fusion breakpoints. ChimeraScan aligns paired-end reads to a combined genome-transcriptome reference to discard uninformative data and to estimate the insert size distribution of the library. Unmapped reads are then trimmed into smaller segments and realigned to the genome, to build a set of putative chimeric junction sequences. Potential fusion breakpoints arise from fragments that align to distinct references or distance genomic locations (according to the