Abstract

   The recently introduced Kallisto pseudoaligner has radically simplified
   the quantification of transcripts in RNA-sequencing experiments.  We
   offer cloud-scale RNAseq pipelines Arkas-Quantification, and
   Arkas-Analysis available within Illumina’s BaseSpace cloud application
   platform which expedites Kallisto preparatory routines, reliably
   calculates differential expression, and performs gene-set enrichment of
   REACTOME pathways .  Due to inherit inefficiencies of scale, Illumina's
   BaseSpace computing platform offers a massively parallel distributive
   environment improving data management services and data importing.
   Arkas-Quantification deploys Kallisto for parallel cloud computations
   and is conveniently integrated downstream from the BaseSpace
   [28]Sequence Read Archive (SRA) import/conversion application titled
   [29]SRA Import.  Arkas-Analysis annotates the Kallisto results by
   extracting structured information directly from source FASTA files with
   per-contig metadata, calculates the differential expression and
   gene-set enrichment analysis on both coding genes and transcripts. The
   Arkas cloud pipeline supports ENSEMBL transcriptomes and can be used
   downstream from the SRA Import facilitating raw sequencing importing,
   SRA FASTQ conversion, RNA quantification and analysis steps.

   Keywords: transcriptome, sequencing, RNAseq, automation, cloud
   computing

Introduction

   High-performance computing based bioinformatic workflows have three
   main subfamilies: in-house computational packages, virtual-machines
   (VMs), and cloud based computational environments. The in-house
   approaches are substantially less expensive when raw hardware is in
   constant use and dedicated support is available, but internal
   dependencies can limit reproducibility of computational experiments.
   Specifically, “superuser’” access needed to deploy container-based,
   succinct code encapsulations (often referred to as "microservices"
   elsewhere) can run afoul of normal permissions, and the maintenance of
   broadly usable sets of libraries across nodes for users can lead to
   shared code dynamically linking to different libraries under various
   user environments. By contrast, modern cloud-based approaches and
   parallel computing are forced by necessity to offer a user-friendly
   platform with high availability to the broadest audience.
   Platform-as-a-service approaches take this one step further, offering
   controlled deployment and fault tolerance across potentially unreliable
   instances provided by third parties such as [30]Amazon Web Service
   Elastic Compute Cloud (AWS EC2) and enforcing a standard for
   encapsulation of developers' services such as [31]Docker. Within this
   framework, the user or developer cedes some control of the platform and
   interface, in exchange for the platform provider handling the details
   of workflow distribution and execution. This has provided the best
   compromise of usability and reproducibility when dealing with general
   audiences. In this regard, the lightweight-container approach
   exemplified by Docker lead to rapid development and deployment compared
   to VMs. Combined with versioning of deployments, it is feasible for
   users to reconstruct results from an earlier point in time, while
   simultaneously re-evaluating the generated data under state-of-the-art
   implementations.

   Docker offers advantages for reproducible research practices, and also
   is the principal infrastructure to leading platforms such as
   [32]Illumina's BaseSpace platform, [33]Google Genomics, [34]Galaxy and
   [35]SevenBridges. Cloud computational ecosystems preserve developmental
   environments using Docker containerization framework, and improves
   bioinformatic validation. Containerized cloud applications form part of
   the global distributive effort and are favorable over local in-house
   computational pipelines because they offer rapid access to numerous
   public workflows, easy interfacing to archived read databases, and
   accelerate the upholding process of raw data.

   A major bottleneck in RNAseq analysis is the processing steps for
   importing raw data. The majority of RNAseq analysis pipelines consist
   of read preparation steps, followed by computationally expensive
   alignment against a reference. Software for calculating transcript
   abundance and assembly can surpass 30 hours of computational time
   ^[36]1. If known or putative transcripts of defined sequences are the
   primary interest, then pseudoalignment, which is defined as
   near-optimal RNAseq transcript quantification, is achievable in minutes
   on a standard laptop using Kallisto software ^[37]1. Arkas was
   developed using a simple framework, yet massively parallel, for RNAseq
   transcript quantification that would allow users to expedite
   pseudoalignment on arbitrary datasets, and significantly reduce the
   amount of required preparatory routines.

   In collaboration with Illumina (San Diego, USA) the available
   [38]BaseSpace platform was already well-suited for parallel
   transformation of raw sequencing data into analytical results.
   BaseSpace has an available application SRA Import which automates SRA
   importing and FASTQ conversion pre-processing steps. The application
   SRA Import is simple requiring the SRA accession number and limits
   imports to 25gb per application call. Arkas can ingest successfully
   imported samples avoiding all raw data handling. For example, if one
   were interested in re-analyzing an experiment from SRA with reads
   totaling 141.3GB, Arkas facilitates SRA processing and state-of-the-art
   pseudoalignment by reducing raw sequencing data to summary
   quantifications totaling 1.63GB and includes an extensive analysis
   report of less than 10MB. The total data reduction exceeds 4 orders of
   magnitude with little or no loss of user-visible information. Moreover,
   the untouched original data is never discarded unless the user
   explicitly demands it. The appropriate placement of Arkas applications
   adjacent to the origin of sequencing data removes cumbersome data
   relocation costs and greatly facilitates sequencing archive re-analysis
   using state-of-the-art pseudoalignment.

   Arkas, encapsulates Kallisto, automates the construction of composite
   transcriptomes from, quantifies transcript abundances, and implements
   reproducible rapid differential expression analysis coupled with gene
   set enrichment analysis. The Arkas workflow is versionized into Docker
   containers and publicly deployed within Illumina's BaseSpace platform
   which ingests raw RNA sequencing data and completes a full analysis in
   approximately 2 hours.

Methods

   The first step in the Arkas pipeline requires Arkas-Quantification to
   transform all the raw RNA sequencing data of an entire experiment into
   Kallisto pseudoaligned quantification output data. The second step,
   Arkas-Analysis, requires the pseudoaligned data to be input with
   respect to a comparison and control group, and returns a comprehensive
   analysis including differential expression, and gene-set enrichment.

   If the user selects the defaults, Arkas-Quantification will complete
   pseudoalignment in approximately 43–60 minutes. Arkas-Quantification
   completion time is independent of the number of samples input, but is
   restricted to node availability (AWS EC2 node availability is fairly
   high). Arkas-Analysis, will complete using a single node in
   approximately 1–1.5 hours for moderate sample group sizes (N ≤ 20), and
   under 2–2.5 hours for much larger designs.

Arkas-Quantification implementation

   Arkas is a two-step cloud pipeline. Arkas-Quantification is the first
   step, which reduces the computational steps required to quantify and
   annotate large numbers of samples against large catalogs of
   transcriptomes. Arkas-Quantification calls Kallisto for on-the-fly
   transcriptome indexing and quantification recursively for numerous
   sample directories. Kallisto quantifies transcript abundance from input
   RNAseq reads by using pseudoalignment, which identifies the
   read-transcript compatibility matrix ^[39]1 . The compatibility matrix
   is formed by counting the number of reads with the matching alignment;
   the equivalence class matrix has a much smaller dimension compared to
   matrices formed by transcripts and read coverage. Computational speed
   is gained by performing the Expectation Maximization (EM) algorithm
   over a smaller matrix.

   For RNAseq projects with many sequenced samples, Arkas-Quantification
   encapsulates expensive transcript quantification preparatory routines,
   while uniformly preparing Kallisto execution commands within a
   versionized environment encouraging reproducible protocols. The
   quantification step automates the index caching, annotation, and
   quantification associated while running the Kallisto pseudoaligner
   integrated within the BaseSpace environment. For users interested in
   quality control checks, BaseSpace offers an independent application
   FastQC which performs [40]fastqc on sequencing data. The first step in
   the pipeline can process raw reads into transcript and pathway
   collection results within Illumina’s BaseSpace cloud platform,
   quantifying against default transcriptomes such as ERCC spike-ins,
   ENSEMBL non-coding RNA, or cDNA build 88 for both Homo sapiens and Mus
   musculus; further the first step supports user uploaded FASTA files for
   customized analyses. Arkas-Quantification can support microRNAs
   (miRNA), however we encourage users to analyze miRNAs separately
   because pseudoalignment requires reducing k-mer size in the
   Target-DeBruijn Graph (TDBG) to miRNA sequence lengths (ranging from
   16–22) which can increase path ambiguities. Arkas-Quantification is
   packaged into a Docker container and is publicly available as a cloud
   application within BaseSpace.

Arkas-Analysis implementation

   Previous work ^[41]2 has revealed that filtering transcriptomes to
   exclude lowly-expressed isoforms can improve statistical power, while
   more-complete transcriptome assemblies improve sensitivity in detecting
   differential transcript usage. Based on earlier work by Bourgon et al.
   ^[42]3, we included this type of filtering for both gene- and
   transcript-level analyses within Arkas-Analysis. The analysis pipeline
   automates annotations of quantification results, resulting in more
   accurate interpretation of coding and transcript sequences in both
   basic and clinical studies by just-in-time annotation and
   visualization.

   Arkas-Analysis integrates quality control analysis for experiments that
   include Ambion spike-in controls, multiple normalization selections for
   both coding gene and transcript differential expression analysis, and
   differential gene-set analysis. If ERCC spike-ins, defined by the
   External RNA Control Consortium ^[43]4, are detected then
   Arkas-Analysis will calculate Receiver Operator Characteristic (ROC)
   plots using 'erccdashboard' ^[44]5. The ERCC analysis reports average
   ERCC Spike amount volume, comparison plots of ERCC volume amount, and
   normalized ERCC counts ( [45]Figure 1).

Figure 1. Arkas-Analysis ERCC spike-in Controls Report.

   [46]Figure 1.
   [47]Open in a new tab

   A) The Receiver Operator Characteristic plot of (detected) ERCC ratios
   in gene expression experiments. The X-axis shows the False Positive
   Rate, the Y-axis shows True Positive Rate. B) and C) shows the dynamic
   range of abundances of ERCC RNA amounts with a linear model fit, and
   ERCC RNA counts. D) shows a dispersion of mean transcript abundances
   and the estimated dispersion.

   Subsequent analyses import the data structure from SummarizedExperiment
   and creates a sub-class titled KallistoExperiment that preserves the S4
   structure and is convenient for handling assays, phenotypic and genomic
   data. KallistoExperiment includes GenomicRanges ^[48]6, preserving the
   ability to handle genomic annotations and alignments, supporting
   efficient methods for analyzing high-throughput sequencing data. The
   KallistoExperiment sub-class serves as a general-purpose container for
   storing feature genomic intervals and pseudoalignment quantification
   results against a reference genome called by Kallisto. By default
   KallistoExperiment couples assay data such as the estimated counts,
   effective length, estimated median absolute deviation, and transcript
   per million count where each assay data is generated by a Kallisto run;
   the stored feature data is a GenomicRanges object from [49]6, storing
   transcript length, GC content, and genomic intervals.

   Given a KallistoExperiment containing the Kallisto sample abundances,
   principal component analysis (PCA) is performed ^[50]7 on trimmed mean
   of M-value (TMM) normalized counts ^[51]8 ( [52]Figure 2A).
   Differential expression (DE) is calculated on the library normalized
   transcript expression values, and the aggregated transcript bundles of
   corresponding coding genes using limma/voom linear model ^[53]9 (
   [54]Figure 3A). In addition to library normalization, we wished to add
   an optional data driven normalization. In the analysis pipeline, an
   unsupervised normalization method would not require more than a two
   group experimental design which was favorable due to its simplicity.
   Alternatively, supervised data driven normalization is a specialized
   task which requires users to define batch groups, and/or additional
   experimental groups. Further, the adjusted data must be evaluated in
   the context of the experiment. In-silico normalization, using factor
   analysis, effectively removes unwanted variation driven entirely by
   data ^[55]10.

Figure 2. Arkas-Analysis Normalization Report: Normalization Analysis Using
TMM and RUV.

   [56]Figure 2.
   [57]Open in a new tab

   A) TMM normalization is performed on sample data and depicts the sample
   quantiles on normalized sample expression, PCA plot, and histogram of
   the adjusted p-values calculated from the DE analysis. Orange is the
   comparison group and green is the control group. B) A similar analysis
   is performed with RUV in-silico normalization.

Figure 3. Arkas-Analysis Differential Expression Report: DE using TMM and
RUV.

   [58]Figure 3.
   [59]Open in a new tab

   A) DE analysis using TMM normalization. The X-axis is the sample names
   (test data), the Y-axis are Gene symbols (HUGO). Expression values are
   plotted in log [10] 1+TPM. B) Similar analysis using RUV normalization.
   C) The design matrix with the RUV adjusted weights. The sample names
   are test data used in demonstrating the general analysis report output.

   The analysis report returns comprehensive visualization results. PCA
   and DE analysis of both transcripts and coding genes is performed with
   easily interpretable images ( [60]Figure 2B, [61]Figure 3B, [62]Figure
   3C). In each DE analysis FDR filtering method is defaulted to
   'Benjamini-Hochberg', if there are no resultant DE genes/transcripts
   the FDR methods is switched to 'none'. Arkas-Analysis consumes the
   Kallisto data output from Arkas-Quantification, and automates DE
   analysis using TMM normalization and in-silico normalization on both
   transcript and coding gene expression in a defaulted two group
   experimental design, allowing customized selections. One must examine
   and compare the PCA sample clustering, and sample boxplots between the
   two methods to determine the improvement of in-silico normalization (
   [63]Figure 2). If RUV improves the PCA clustering within the context of
   an experiment, and reduces the number of outliers observed in boxplots
   then it is likely that the normalization weights are useful.

   Gene set differential expression, which includes gene-gene correlation
   inflation corrections, is calculated using Qusage ^[64]11. Qusage
   calculates the variance inflation factor, which corrects the inter-gene
   correlation that results in high type 1 errors using pooled or
   non-pooled variances between experimental groups. The gene set
   enrichment is conducted using [65]Reactome pathways constructed using
   ENSEMBL transcript/gene identifiers ( [66]Figure 4 and [67]Table 1);
   REACTOME gene sets are not as large as other databases, so
   Arkas-Analysis outputs DE analysis in formats compatible with more
   exhaustive databases such as [68]Advaita. The DE files are compatible
   as a custom upload into Advaita iPathway guide, which offers an
   extensive Gene Ontology (GO) pathway analysis. Pathway enrichment
   analysis can be performed from the BaseSpace cloud system downstream
   from parallel differential expression analysis and can integrate with
   other pathway analysis software tools.

Figure 4. Arkas-Analysis Gene-Set Enrichment Plot.

   [69]Figure 4.
   [70]Open in a new tab

   Gene-Set enrichment output report, each point represents the
   differential mean activity of each gene-set with 95% confidence
   intervals. The X-axis are individual gene-sets. The Y-axis is the log
   [2] fold change.

Table 1. Arkas-Analysis Gene-Set Enrichment Statistics.

   The columns represent the Reactome pathway name corresponding to the
   depicted pathways in [71]Figure 4, the log [2]fold change, p-value,
   adjusted FDR, and an active link to the Reactome website with visual
   depictions of the gene/transcript pathway. Arkas-Analysis will output a
   similar report testing transcript-level sets.
   Pathway name Log fold
   change P.value FDR Gene URL
   R-HAS-1989781 -0.87 0.0008 0.06
   [72]http://www.reactome.org/PathwayBrowser/#/R-HSA-1989781
   R-HAS-2173796 -0.51 0.007 0.217
   [73]http://www.reactome.org/PathwayBrowser/#/R-HSA-2173796
   R-HAS-6804759 -1.62 0.009 0.217
   [74]http://www.reactome.org/PathwayBrowser/#/R-HSA-6804759
   R-HAS-381038 -0.43 0.013 0.226
   [75]http://www.reactome.org/PathwayBrowser/#/R-HSA-381038
   R-HAS-2559585 -0.4 0.032 0.341
   [76]http://www.reactome.org/PathwayBrowser/#/R-HSA-2559585
   R-HAS-4086398 -0.95 0.033 0.341
   [77]http://www.reactome.org/PathwayBrowser/#/R-HSA-4086398
   R-HAS-4641265 -0.95 0.033 0.341
   [78]http://www.reactome.org/PathwayBrowser/#/R-HSA-4641265
   R-HAS-422085 -1.17 0.04 0.361
   [79]http://www.reactome.org/PathwayBrowser/#/R-HSA-422085
   R-HAS-5467345 -0.56 0.069 0.389
   [80]http://www.reactome.org/PathwayBrowser/#/R-HSA-5467345
   R-HAS-6804754 -0.57 0.07 0.389
   [81]http://www.reactome.org/PathwayBrowser/#/R-HSA-6804754
   R-HAS-6803204 -1.19 0.081 0.389
   [82]http://www.reactome.org/PathwayBrowser/#/R-HSA-6803204
   [83]Open in a new tab

Data variance between software versions

   We wished to show the importance of enforcing matching versions of
   Kallisto when quantifying transcripts because there is deviation of
   data between versions. Due to updated versions and improvements of
   Kallisto software, there obviously exists variation of data between
   algorithm versions ( [84]Figure 5, [85]Supplementary Table 1,
   [86]Supplementary Table 2). We calculated the standardized mean
   differences, and the variation of the differences between the same 5
   samples from Kallisto (setting bootstraps = 42) versions 0.43 and
   0.43.1 ( [87]Supplementary Table 2), and found large variation of
   differences between raw values generated by differing Kallisto
   versions, signifying the importance of version analysis of Kallisto
   results.

Figure 5. Quantile-Quantile Plots of Data Variation Comparing Differences in
Kallisto Data from Versions 0.43.1 and 0.43.0.

   [88]Figure 5.
   [89]Open in a new tab

   The X-axis depicts the theoretical quantiles of the standardized mean
   differences. The Y-axis represents the observed quantiles of
   standardized mean differences.

   [90]Supplementary Table 1 shows the variation of the errors of the raw
   values such as estimated counts, effective length, and estimated median
   absolute deviation using the same Kallisto version 0.43.0. As expected,
   Kallisto data generated by the same Kallisto version had very low
   variation of errors within the same version 0.43.0 for every transcript
   across all samples. However, upon comparing Kallisto version 0.43.1 to
   version 43.0 using the raw data such as estimate abundance counts,
   effective length, estimated median absolute deviation, and transcript
   per million values, we found, as expected, large variation of data.

   [91]Supplementary Table 2 shows that there is large variation of the
   differences of Kallisto data calculated between differing versions.
   [92]Figure 5 depicts the standardized mean differences, i.e. errors,
   between Kallisto versions fitted to a theoretical normal distribution.
   The quantile-quantile plots show that the errors are marginally normal,
   with a consistent line centered near 0 but also large outliers (
   [93]Figure 5). As expected, containerizing analysis pipelines will
   enforce versionized software, which benefits reproducible analyses.

   The Dockerization of Arkas BaseSpace applications versionizes the
   Kallisto reference index to enforce that the Kallisto software versions
   are identical, and further documents the Kallisto version used in every
   cloud analysis. The enforcement of reference versions and Kallisto
   software versions prevents errors when comparing experiments.

Operation

   Arkas-Quantification instructions are provided within BaseSpace
   (details for new users can be found [94]here). Arkas is a web style
   format, but can also be launched using the command line using
   [95]BaseSpace Command Line Interface. The inputs are RNA sequencing
   samples, which may include SRA imported reads, and the outputs include
   the Kallisto data, .tar.gz files of the Kallisto sample data, and a
   report summary ( [96]Supplementary Figure 1 and [97]Supplementary
   Figure 2). Users may select for species type ( Homo sapiens or Mus
   musculus), optionally correct for read length bias, and optionally
   select for the generation of pseudoBAMs. More significantly, users have
   the option to use the default transcriptome (ENSEMBL build 88) or to
   upload a custom FASTA of their choosing. For users that wish for local
   analysis, they can download the sample .tar.gz Kallisto files and
   analyze the data locally.

   The Arkas-Analysis instructions are provided within the BaseSpace
   environment. The input for the analysis app is the Arkas-Quantification
   sample data, and the output files are separated into corresponding
   folders. The analysis also depicts figures for each respective analysis
   ( [98]Figure 1– [99]Figure 4) and the images can be downloaded as a
   HTML format.

Annotation of coding genes and transcripts

   The extraction of genomic and functional annotations directly from
   FASTA contig comments, eliding sometimes-unreliable dependencies on
   services such as [100]BioMart, are calculated rapidly. The annotations
   were performed with a run time of 2.336 seconds ( [101]Supplementary
   Table 3) which merged the previous Kallisto data from 5 samples,
   creating a KallistoExperiment class with feature data containing a
   GenomicRanges ^[102]11 object with 213782 ranges and 9 metadata
   columns. The system runtime for creating a merged KallistoExperiment
   class for 5 samples was 23.551 seconds ( [103]Supplementary Table 4).

Discussion

Complete transcriptomes enrich annotation information, improving downstream
analyses

   The choice of catalog, and the type of quantification performed,
   influence the results of sequencing analysis. ENSEMBL reference genomes
   are provided to GENCODE as a merged database from Havana's manually
   curated annotations with ENSEMBL's automatic curated coordinates.
   AceView, UCSC, RefSeq, and GENCODE have approximately twenty thousand
   protein coding genes, however AceView and GENCODE have a greater number
   of protein coding transcripts in their databases. RefSeq and UCSC
   references have less than 60,000 protein coding transcripts, whereas