Abstract

Background

   The study of RNA has been dramatically improved by the introduction of
   Next Generation Sequencing platforms allowing massive and cheap
   sequencing of selected RNA fractions, also providing information on
   strand orientation (RNA-Seq). The complexity of transcriptomes and of
   their regulative pathways make RNA-Seq one of most complex field of NGS
   applications, addressing several aspects of the expression process
   (e.g. identification and quantification of expressed genes and
   transcripts, alternative splicing and polyadenylation, fusion genes and
   trans-splicing, post-transcriptional events, etc.).

   Moreover, the huge volume of data generated by NGS platforms introduces
   unprecedented computational and technological challenges to efficiently
   analyze and store sequence data and results.

Methods

   In order to provide researchers with an effective and friendly resource
   for analyzing RNA-Seq data, we present here RAP (RNA-Seq Analysis
   Pipeline), a cloud computing web application implementing a complete
   but modular analysis workflow. This pipeline integrates both
   state-of-the-art bioinformatics tools for RNA-Seq analysis and in-house
   developed scripts to offer to the user a comprehensive strategy for
   data analysis. RAP is able to perform quality checks (adopting FastQC
   and NGS QC Toolkit), identify and quantify expressed genes and
   transcripts (with Tophat, Cufflinks and HTSeq), detect alternative
   splicing events (using SpliceTrap) and chimeric transcripts (with
   ChimeraScan). This pipeline is also able to identify splicing junctions
   and constitutive or alternative polyadenylation sites (implementing
   custom analysis modules) and call for statistically significant
   differences in genes and transcripts expression, splicing pattern and
   polyadenylation site usage (using Cuffdiff2 and DESeq).

Results

   Through a user friendly web interface, the RAP workflow can be suitably
   customized by the user and it is automatically executed on our cloud
   computing environment. This strategy allows to access to bioinformatics
   tools and computational resources without specific bioinformatics and
   IT skills. RAP provides a set of tabular and graphical results that can
   be helpful to browse, filter and export analyzed data, according to the
   user needs.

   Keywords: RNA-Seq, Alternative splicing, transcriptome regulation,
   workflow

Background

   RNA-Seq has become one of the most popular technique across the vast
   landscape of the next and third generation sequencing technologies
   [[39]1]. It can be profitably used to investigate the gene expression
   process, estimating both the nature and the quantity of expressed mRNAs
   [[40]2] by sequencing a complete transcriptome in any cell/tissue type
   and condition. The ability to simultaneously detect and quantify the
   expression profile for a large number of genes in specific
   physiological and pathological conditions has opened new avenues for a
   deeper understanding of biological processes and their regulation (e.g.
   genome-wide investigation of epigenetic inheritance) and paved the way
   for several biotechnological and biomedical applications.

   Indeed, RNA-Seq can identify and quantify expressed genes and
   transcripts providing precious biological information on the underlying
   gene expression mechanisms. Notably, gene expression is a highly
   regulated process and in some cases final products cannot be fully
   characterized by analyzing short reads generated by NGS platforms
   particularly when many alternative transcripts of remarkable length are
   generated due to complex co-transcriptional and post-transcriptional
   nuclear processing, including alternative initiation and termination of
   transcription and alternative splicing [[41]3].

   RNA-Seq data can be analyzed by adopting several computational
   strategies also depending on the requested results (e.g. expression at
   gene and/or transcript level, investigation of alternative splicing
   events, alternative polyadenylation sites, etc.). However, despite
   recent technological advances, key transcriptome features are yet to be
   fully elucidated, and its scale and complexity have not yet been fully
   understood [[42]4]. In order to provide easy and effective access to
   the gene expression studies to researchers with few or limited
   bioinformatics skills, user-friendly automated workflows are highly
   demanded to provide reliable and easy interpretable results [[43]5]
   which also keep up with the exponential growth of sequencing
   technologies [[44]6].

   To the authors knowledge, several pipeline tools have been implemented
   for RNA-seq data analysis [[45]7-[46]13], some of them proposed along
   with novel algorithmic approaches to refine final results
   [[47]7,[48]8]. However, most of them are totally lacking an easy
   accessible web interface for cloud computing [[49]7-[50]9]. Most of
   them do not allow to execute a complete pipeline from read mapping to
   advanced processing tasks since they only implement a few specific
   steps [[51]11-[52]13]. Some tools [[53]13] are implemented as Galaxy
   [[54]14] customized instances, extending the platform contribution to
   the bioinformatics community. Unfortunately, they do not feature a
   powerful responsive web application to browse, filter and explore the
   analysis results, that is, in our experience, the main interest of
   final users.

   In this paper we present the implementation of a modular analysis
   workflow, named RAP (RNA-Seq Analysis Pipeline), designed to analyze
   sequencing data in multiple steps, each one addressing a specific task.

   The purpose of RAP is to investigate the complex transcriptional
   landscape of eukaryotic transcriptomes through a computationally
   optimized RNA-Seq data analysis. This tool is a web application
   implementing an automatic but completely customizable analysis workflow
   able to carry out a comprehensive transcriptome analysis and provide a
   wide range of results, which consist of several tabular and graphical
   representations that can be suitably organized, filtered and browsed
   according to the user needs.

   The workflow is presented as a multi-step pipeline. The expressed
   isoforms are reconstructed by adopting spliced mapping algorithms to
   align reads to the genome and assembling them into full-length
   transcripts. Then, through isoforms deconvolution methods, transcripts
   expression is quantified. Alternative splicing is investigated by
   mapping reads against an exhaustive splice junctions library
   constructed by a comprehensive combinatorial assortments of known
   splicing events. Computational strategies implemented in RAP modules
   also allow to identify polyadenylation sites, elementary alternative
   splicing events (e.g. exon skipping) and chimeric transcripts. Finally,
   the comparative analysis of two or more experiment groups corresponding
   to different physiological or pathological conditions allows the
   detection of statistically significant changes in expression levels at
   gene or transcript isoform level, in the splicing pattern (e.g.
   differential exon skipping) and in used polyadenylation sites.

   This broad variety of different analysis branches is achieved by means
   of a highly modularized implementation and a fully generalized
   computational engine.

Implementation

   RAP (RNA-Seq Analysis Pipeline) is a web application implementing a
   fully automated analysis workflow, designed to integrate in-house
   developed scripts as well as open source analysis tools into one
   pipeline (Figure [55]1). Using RAP the user can perform a complete
   RNA-Seq analysis without any specific technical competence nor directly
   managing the complexity of distributed computational resources.
   Moreover RAP also offers a web interface for results management and
   visualization, allowing the user to browse and filter the massive
   amount of data obtained from typical RNA-Seq experiments.

Figure 1.

   Figure 1
   [56]Open in a new tab

   Schematic description of RAP workflow. Quality check and filtering step
   for quality assessment (Steps 1 and 2). High quality reads are aligned
   to the reference genome using TopHat (Step 3). Alignments are assembled
   into full-length transcripts and their relative abundances are
   estimated by Cufflinks to (Step 4) and raw-counted by HTSeq (Step 5).
   Unspliced reads are filtered out after an ungapped alignment to the
   genome (Step 6). Remaining reads (potentially spliced) are mapped to a
   custom built junction library (Step 7). Reads still unmapped are
   scanned to identify poly(A) tags (Step 8). Cassette exons are
   identified and quantified by adopting a statistical tool, SpliceTrap
   (Step 9), and chimeric transcripts are detected by means of ChimeraScan
   (Step 10). After the completion of the main analysis, several
   differential analyses can be executed. Cuffdiff at transcript level
   (Step A), based on expression levels calculated by Cufflinks. DESeq at
   gene level (Step B), based on gene raw counts calculated by HTSeq.
   Differential exons (Step C), differential junctions usage (Step D) and
   differential polyadenylation sites (Step E) can also be calculated.

   RAP takes as input short-read datasets produced by Illumina sequencing
   platforms and supports several standard file formats (FASTQ, SRA, BAM
   and compressed archives).

   The pipeline is designed to analyze the data through a series of
   phases, each of them focused on a specific task.

   RAP integrates two widespread tools for quality assessment: FastQC
   [[57]15] and NGS QC Toolkit [[58]16] both providing quality checks and
   statistics, useful for a preliminary evaluation and filtering of raw
   data. After the quality control and reads filtering phase, different
   paths can be simultaneously followed, according to the user's request.

   The first path implements TopHat [[59]17]/ Cufflinks [[60]18] /
   Cuffdiff2 [[61]19] and/or HTSeq/DESeq [[62]20] for gene/isoform
   expression quantification and differential analysis.

   The second one detects and quantifies alternative splicing events,
   particularly cassette exons and their differential occurrence through
   SpliceTrap [[63]21].

   The third path detects and quantifies splicing junctions, including
   novel ones as well as polyadenylation sites and their differential
   occurrence in different samples. The splice junctions library is built
   starting from reference gene annotation models and includes both known
   and potential splice junctions (see below for further details).

   Residual reads still unmapped to the genome, transcriptome and
   junctions are analyzed to eventually detect polyadenylation sites.
   PolyA tags (reads containing a stretch of A (An) at the end of the
   sequence) are extracted, An-trimmed and re-aligned to the genome.
   Following the alignment, a pattern matching procedure is also applied
   to possibly detect common polyadenylation signal sequences (PAS) such
   as AAUAAA.

   Finally, the last path identifies chimeric fusion transcripts through
   ChimeraScan [[64]22], a tool based on Bowtie [[65]23] alignments to
   detect putative fusion breakpoints.

   At present RAP is implemented for several organisms: Homo sapiens
   (genomes hg18 and hg19), Mus musculus (genomes mm9 and mm10), Rattus
   norvegicus (genome rn4), Drosophila melanogaster (genome dm3),
   Saccharomyces cerevisiae (genome sacCer3) and Zea mays (genomes maize2,
   maize3 and Mo17_v1). Additional genomes will be considered and
   implemented in the future, if required by users.

Quality checks

   An effective quality check is critical for a reliable data analysis,
   since the read quality may affect downstream results. FastQC [[66]15]
   (Figure [67]1, step 1) provides a range of quality checking modules
   covering different aspects of raw read quality, helpful to highlight
   sequencing biases and contaminants.

   NGS QC Toolkit [[68]16] (Figure [69]1, step 2) is a suite of tools for
   quality check and filtering of NGS data from Illumina and Roche 454
   platforms. This tool allows to filter out low quality reads and
   increase the overall dataset quality.

   High quality reads are processed to extract several information at
   different analysis stages.

Read mapping

   Since RAP is specifically designed to work with eukaryotic genomes and
   to detect novel splicing events, a spliced aligner is required to map
   reads to the genome, across the introns. TopHat2 [[70]17] (Figure
   [71]1, step 3), using Bowtie2 [[72]24], can map reads against both
   genome and transcriptome (when an annotation is provided). Since
   full-length transcripts bring to a significant gain in both sensitivity
   and accuracy (e.g. for the right recognition of pseudogenes), RAP
   automatically includes transcript annotations into the pipeline. Reads
   are first mapped to known transcripts and unmapped reads (deriving from
   unknown transcripts or containing many miscalled bases) or poorly
   aligned reads are then aligned to the genome.

Transcripts reconstruction and quantification

   After TopHat2 execution, the resulting alignment files are provided to
   Cufflinks [[73]18] (Figure [74]1, step 4) to generate a transcriptome
   assembly and to estimate the expression level of all detected isoforms.
   Furthermore a gene expression raw-count is also calculated with HTSeq
   [[75]20].

   Since RAP does not allow de novo transcript assembly, Cufflinks is
   guided by a gene reference annotation file (the same used during the
   alignment). The user can select between two reconstruction algorithms
   provided by Cufflinks assembler. A first option is to use reference
   transcripts to reconstruct known isoforms and avoid the assembly of
   putative novel transcripts (basic assembler). A second option is to use
   the supplied reference annotation to guide the assembly and include in
   the output both reference transcripts and novel assembled genes and
   isoforms (RABT assembler) [[76]25].

   Reconstructed transcripts are then analyzed to estimate their relative
   abundance, measured in FPKM (expected Fragments Per Kilobase of
   transcript per Million fragments sequenced).

   RNA-Seq experiments suffer of known sequencing bias, that can
   jeopardize the assumption of uniform coverage. A bias can be typically
   introduced by the cDNA amplification phase when random hexamers are
   generally used [[77]26]. Cufflinks can be used to reduce biases
   adopting two strategies. It estimates approximated transcripts
   abundance to weight reads and profiling the transcripts sequences, then
   abundances are re-estimated adjusting the initial approximation on the
   bases of detected sequencing biases (fragment bias correction).
   Furthermore, reads mapped to multiple genome positions are at first
   uniformly assigned to each position and then expression levels are
   re-estimated by probabilistically dividing reads on the basis of the
   first abundance estimation, the inferred fragment length and the
   fragment bias (multi reads correction).

   Since most of the tools for differential gene expression analysis
   require raw counts, RAP also includes a second approach to estimate the
   relative gene expression by using the HTSeq suite [[78]20] (Figure
   [79]1, step 5). Since reads can overlap, even partially, two or more
   features (e.g. exons), RAP implemented the intersection_nonempty mode
   provided by HTSeq to guarantee the highest number of assignments. With
   this mode if a read is completely mapped on a feature and partially on
   another, the read is assigned to the first feature. Furthermore, this
   mode is able to handle partial overlap on a single feature.

Splice junctions detection

   High quality reads, obtained from NGS QC Toolkit, are also analyzed to
   detect splicing junctions. This phase is executed in parallel with the
   previous steps. Although TopHat2 is already able to detect both known
   and novel splicing junctions, the algorithm implemented in this phase
   is more focused on this task and can detect a greater number of
   splicing events.

   To reduce the computational load, unspliced reads, detected by Bowtie
   mapping to the reference genome, are discarded from the initial dataset
   (Figure [80]1, step 6). Because Bowtie2 [[81]23] is an unspliced
   aligner, only intra-exonic reads will be mapped in this phase and
   discarded. Unmapped reads may potentially present a spliced alignment
   on the genome easily detectable, again using Bowtie, by mapping to a
   custom built splice junctions library (Figure [82]1, step 7).

   The splice junctions library is built starting from a gene annotation
   model in GTF format
   ([83]http://www.ensembl.org/info/website/upload/gff.html), the same
   already used during the alignment and the assembly steps) and includes
   two different categories of splice junctions: known and novel. Known
   junctions are directly derived from RefSeq [[84]27] while novel
   junctions are obtained through a combinatorial exon skipping procedure
   by considering all compatible exon skipping patterns. In a transcript
   with k exons, k-1 known splice junctions can be observed. By
   selectively skipping the inner exons, k-2 novel splice junctions can be
   identified. Multiple sequential inner exons can also be skipped,
   picking out further junctions. This combinatorial approach exhaustively
   lists all
   [MATH: <mfrac><mrow><mfenced close=")" open="("><mrow><mi>k</mi><mo
   class="MathClass-bin">-</mo><mn>2</mn></mrow></mfenced><mo
   class="MathClass-bin">×</mo><mfenced close=")"
   open="("><mrow><mi>k</mi><mo
   class="MathClass-bin">-</mo><mn>1</mn></mrow></mfenced></mrow><mrow><mn
   >2</mn></mrow></mfrac> :MATH]
   novel compatible splice junctions and it is applied to each annotated
   transcript. Splice junctions already included in the known junctions
   set are not counted to avoid overestimation of novel junctions.

   From each splice junction, the flanking exons sequences are extracted,
   considering the 3' end of the upstream exon and the 5' end of the
   downstream exon, and spliced together. The library is produced
   considering several boundaries lengths (50, 75, 100 and 150bp) to have
   an optimal fitting with the user-provided read lengths. If a flanking
   exon is shorter than the selected boundary length, a truncated junction
   is extracted. Since the two boundaries are fused together each sequence
   is annotated with a length of source exons, to be able to further
   determine the fusion point even for truncated junctions. The human
   RefSeq junctions library contains about 200,000 known splice sites on a
   total of about 2 millions combinatorial junctions included in the
   library.

Polyadenylation site detection

   Reads still unmapped to the genome, transcriptome and junctions are
   further analyzed to identify polyadenylation sites (Figure [85]1, step
   8). PolyA tags (reads containing a stretch of A at the end of the
   sequence) are extracted, trimmed and aligned to the genome. Since the
   read may cover a short final exon, the sequence could also contain a
   splice junction. Therefore a spliced alignment with TopHat2 is adopted.
   Following the alignment, a parsing procedure is applied to annotate the
   concurrent occurrence of polyadenylation signal (PAS) sequences (i.e.
   AAUAAA, AUUAAA or less common variants). PAS are searched in order of
   frequency, from the most common to the rarest. Over the canonical
   polyadenylation signal (AAUAAA) a total of 10 variants are considered
   [[86]28].

Cassette exons identification

   Even though cassette exons (i.e. exon skipping events) can be
   recognized by the splice junction mapping phase, RAP also handles a
   more specialized step focused on this and other elementary alternative
   splicing (AS) events such as intron retention and alternative 5' or 3'
   splice sites. Indeed in higher eukaryotes, exon skipping is the most
   common AS event, accounting nearly 40% of all events [[87]29]. RAP
   adopts a statistical method for splicing events identification and
   quantification implemented by SpliceTrap [[88]21] (Figure [89]1, step
   9).

   In order to reliably detect and quantify every potential exon-skipping
   event, SpliceTrap builds an exon-trio database (TXdb) capturing all
   known transcripts obtained from RefSeq annotations [[90]27] and
   breaking each transcript into all possible exon trios. Each exon trio
   thus leads to two sequences: an inclusion isoform with all three exons
   and a skipping isoform with the two flanking exons only.

   Reads are then aligned to the TXdb database using Bowtie and poorly
   covered exon trios are filtered out applying a dynamic
   exon-size-dependent cut-off strategy. Exon inclusion ratios are then
   estimated adopting a Bayesian model, given the probability of observing
   each fragment on a specific isoform (as function of both inclusion and
   exclusion isoform expression level and normalized by the isoforms
   lengths) [[91]21].

   The exon inclusion ratio is defined as the expression level of the
   inclusion isoform divided by the total expression level of both
   isoforms (inclusion and exclusion) derived from an exon trio.

   SpliceTrap is also able to identify other splicing events, such as
   intron retention (IR), alternative donor (5' splice site) (AD) and
   alternative acceptor (3' splice site) (AA).

Chimeric transcripts annotation

   A further RAP path carries out a specific data analysis to detect
   chimeric transcripts (Figure [92]1, step 10), i.e. RNAs encoded by a
   fusion gene or by two different genes through a trans-splicing event.
   RAP integrates ChimeraScan [[93]22], a tool based on Bowtie alignments
   to detected putative fusion breakpoints. ChimeraScan aligns paired-end
   reads to a combined genome-transcriptome reference to discard
   uninformative data and to estimate the insert size distribution of the
   library. Unmapped reads are then trimmed into smaller segments and
   realigned to the genome, to build a set of putative chimeric junction
   sequences. Potential fusion breakpoints arise from fragments that align
   to distinct references or distance genomic locations (according to the