Abstract

   RNA-Sequencing data offers an opportunity to enable precision medicine,
   but most methods rely on gene expression alone. To date, no methodology
   exists to identify and interpret alternative splicing patterns within
   pathways for an individual patient. This study develops methodology and
   conducts computational experiments to test the hypothesis that pathway
   aggregation of subject-specific alternatively spliced genes (ASGs) can
   inform upon disease mechanisms and predict survival. We propose the
   N-of-1-pathways Alternatively Spliced (N1PAS) method that takes an
   individual patient’s paired-sample RNA-Seq isoform expression data
   (e.g., tumor vs. non-tumor, before-treatment vs. during-therapy) and
   pathway annotations as inputs. N1PAS quantifies the degree of
   alternative splicing via Hellinger distances followed by two-stage
   clustering to determine pathway enrichment. We provide a clinically
   relevant “odds ratio” along with statistical significance to quantify
   pathway enrichment. We validate our method in clinical samples and find
   that our method selects relevant pathways (p < 0.05 in 4/6 data sets).
   Extensive Monte Carlo studies show N1PAS powerfully detects pathway
   enrichment of ASGs while adequately controlling false discovery rates.
   Importantly, our studies also unveil highly heterogeneous
   single-subject alternative splicing patterns that cohort-based
   approaches overlook. Finally, we apply our patient-specific results to
   predict cancer survival (FDR < 20%) while providing diagnostics in
   pursuit of translating transcriptome data into clinically actionable
   information. Software available at
   [29]https://github.com/grizant/n1pas/tree/master.

   Keywords: RNA-Seq, precision medicine, isoform, alternative splicing,
   systems biology, pathways, local false discovery rate, Hellinger
   distance

Introduction

   RNA-Sequencing (RNA-Seq) data offers an opportunity to enable precision
   medicine, but most methods rely on gene expression alone ([30]Sørlie et
   al., 2001; [31]Weigelt et al., 2005; [32]Peppercorn et al., 2008;
   [33]Bastien et al., 2012; [34]Prat et al., 2014). RNA-Seq, however,
   provides even greater resolution, including messenger RNA (mRNA)
   diversity for the same protein-coding genomic region – corresponding to
   distinct protein isoforms, created by alternative splicing of exons.
   Most RNA-Seq analytics ignore alternative splicing patterns despite
   recent evidence that alternative splicing is implicated in nearly a
   third of common diseases. In cancer, a tumor often displays
   dysregulation of the cellular machinery that controls alternative
   splicing ([35]Yoshida et al., 2011; [36]Kaida et al., 2012;
   [37]Ladomery, 2013; [38]Forootan et al., 2016). Yet, the clinical
   interpretation of alternative splicing patterns lies largely
   unexplored.

   This study develops methodology and conducts computational experiments
   to test the hypothesis that pathway aggregation of subject-specific
   alternatively spliced genes (ASGs) can inform upon disease mechanisms
   and predict survival, thereby providing clinical interpretation of
   alternative splicing patterns. By “alternative splicing” patterns, we
   mean that the distribution of isoforms of a certain gene differs
   between two samples. Specifically, in the context of cancer, our
   hypothesis is driven by the high likelihood that comparing non-cancer
   (“normal”) tissue to cancer tissue will unveil cell-type specific
   expression in cell-type specific pathways. This, in turn, will affect
   the proportion of the ASGs in the overall comparison between tissues
   (where the cell-type elements have changed) and will distribute in
   pathways. Two facts taken together form this opinion: (1)
   cellular-specific splicing occurs and (2) if a differentially expressed
   gene (DEG) occurs between paired samples, it is in part due to the
   change in activated pathways within the concordant cell types that have
   become cancerous, and in part due to the change of cell-type
   proportions in the cancer tissue vs. normal tissue (e.g., the stroma
   may contain more immune cells that were previously absent).

   We and others have recently developed methodological frameworks to
   clinically interpret individualized signals from molecular data
   ([39]Chen et al., 2012; [40]Yang et al., 2012; [41]Ahn et al., 2014).
   In particular, we introduced a statistical framework, N-of-1-pathways,
   to provide subject-specific interpretations of the transcriptome
   ([42]Gardeux et al., 2014; [43]Li et al., 2017a,[44]b, [45]Schissler et
   al., 2015, [46]2018). The methodology focuses on quantifying an
   individual’s dynamic transcriptional response within cellular pathways,
   along with providing uncertainty quantification for these metrics. To
   this end, paired samples (e.g., normal/tumor, before, and after
   treatment) are obtained from a patient, and gene set analysis
   ([47]Subramanian et al., 2005; [48]Goeman and Bühlmann, 2007;
   [49]Khatri et al., 2012) is conducted for the individual without the
   requirement of large cohorts.

   In this study, we propose a novel methodology to improve the clinical
   interpretation of subject-specific alternative splicing patterns
   derived from paired RNA-Seq samples. The N-of-1-pathways Alternatively
   Spliced (N1PAS) method transforms a patient’s paired-sample RNA-Seq
   isoform expression data (e.g., tumor vs. non-tumor, before-treatment
   vs. during-therapy) into a pathway enrichment profile of ASGs. N1PAS
   quantifies the degree of alternative splicing using gene-wise Hellinger
   distances followed by two-stage clustering to determine pathway
   enrichment using a robust, existing procedure testing procedure – local
   false discovery rate (locFDR). The single-subject output provides an
   interpretable odds ratios describing the overrepresentation of ASGs
   along with uncertainty quantification through locFDR.

   The article continues with some brief mathematical background and
   description of the proposed method. Then, several computational
   experiments explore and validate our proposed methods in clinical
   samples. In this proof of concept study, we demonstrate the potential
   for alternative splicing interpretation as one of the Omics signals
   which should be considered for predicting cancer survival. We also
   compare the proposed method with alternative approaches. Finally, we
   conduct extensive simulation studies to explore empirical operating
   characteristics of N1PAS. A discussion concludes the article.

Mathematical Background

   This section motivates the use of and describes two mathematical
   concepts employed in the proposed method.

Hellinger Distance

   Our method quantifies alternating splicing between a pair of samples
   using the Hellinger distance. Such an approach has been shown to be
   useful in the quantification of alternative splicing in the context of
   clustering ([50]Johnson and Purdom, 2017). Let the estimates of isoform
   (mRNA) expression for sample A be denoted as x[gA1],…, x[gAK[g]], for
   the K[g] distinct isoforms annotated to gene g. We define the relative
   isoform usage as the vector of relative proportions of each isoform,
   [MATH:
   <mrow><mi>P</mi><mi>g</mi><mi>A</mi><mo>=</mo><mrow><mo>(</mo><mrow><mf
   rac><mrow><msub><mi>x</mi><mrow><mi>g</mi><mi>A</mi><mn>1</mn></mrow></
   msub></mrow><mrow><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mi>K</mi><mi>g</mi></mrow></msubsup><mrow><msub><mi>x
   </mi><mrow><mi>g</mi><mi>A</mi><mi>k</mi></mrow></msub></mrow></mstyle>
   </mrow></mfrac><mo>,</mo><mn>...</mn><mo>,</mo><mfrac><mrow><msub><mi>x
   </mi><mrow><mi>g</mi><mi>A</mi><mi>K</mi></mrow></msub><msub><mrow></mr
   ow><mrow><msub><mrow></mrow><mi>g</mi></msub></mrow></msub></mrow><mrow
   ><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mi>K</mi><mi>g</mi></mrow></msubsup><mrow><msub><mi>x
   </mi><mrow><mi>g</mi><mi>A</mi><mi>k</mi></mrow></msub></mrow></mstyle>
   </mrow></mfrac></mrow><mo>)</mo></mrow></mrow> :MATH]
   . Where
   [MATH: <mrow><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><mi>K</mi><mi>g</mi></mrow></msubsup><mrow><msub><mi>x
   </mi><mrow><mi>g</mi><mi>A</mi><mi>k</mi></mrow></msub></mrow></mstyle>
   </mrow> :MATH]
   is the total gene expression (summed over all isoforms) for gene g. The
   Hellinger distance between two proportions derived from samples A and B
   within gene g is given by:
   [MATH: <mrow><msub><mi>H</mi><mi>g</mi></msub><mo
   stretchy="false">(</mo><msub><mi>p</mi><mrow><mi>g</mi><mi>A</mi></mrow
   ></msub><mo>,</mo><msub><mi>p</mi><mrow><mi>g</mi><mi>B</mi></mrow></ms
   ub><mo
   stretchy="false">)</mo><mo>−</mo><mfrac><mn>1</mn><mrow><msqrt><mn>2</m
   n></msqrt></mrow></mfrac><mstyle
   displaystyle="true"><munderover><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn
   >1</mn></mrow><mrow><msub><mi>K</mi><mi>g</mi></msub></mrow></munderove
   r><mrow><msup><mrow><mrow><mo>(</mo><mrow><msqrt><mrow><mfrac><mrow><ms
   ub><mi>x</mi><mrow><mi>g</mi><mi>A</mi><mi>k</mi></mrow></msub></mrow><
   mrow><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><msub><mi>K</mi><mi>g</mi></msub></mrow></msubsup><mro
   w><msub><mi>x</mi><mrow><mi>g</mi><mi>A</mi><mi>k</mi></mrow></msub></m
   row></mstyle></mrow></mfrac></mrow></msqrt><mo>−</mo><msqrt><mrow><mfra
   c><mrow><msub><mi>x</mi><mrow><mi>g</mi><mi>B</mi><mi>k</mi></mrow></ms
   ub></mrow><mrow><mstyle
   displaystyle="true"><msubsup><mo>∑</mo><mrow><mi>k</mi><mo>=</mo><mn>1<
   /mn></mrow><mrow><msub><mi>K</mi><mi>g</mi></msub></mrow></msubsup><mro
   w><msub><mi>x</mi><mrow><mi>g</mi><mi>B</mi><mi>k</mi></mrow></msub></m
   row></mstyle></mrow></mfrac></mrow></msqrt></mrow><mo>)</mo></mrow></mr
   ow><mn>2</mn></msup></mrow></mstyle></mrow> :MATH]
   (1)

   Simply stated, the Hellinger distance quantifies dissimilarity between
   the two distributions of proportions. The result is a real number H[g]
   that resides in the unit interval, with 0 indicating perfect agreement
   in isoform usage and values tending to 1 indicating an increasing
   difference in relative isoform distribution of the two samples.
   Notably, a DEG can also been alternatively spliced by still displaying
   a large Hellinger distance. Also, the Hellinger distance is symmetric
   (Equation 1) by definition. To establish a convention, if a gene is not
   expressed in both samples, we choose to record a missing value for the
   distance.

Local False Discovery Rates Through Mixture Modeling

   Efron’s local false discovery rates provides a flexible and robust tool
   for multiple hypothesis testing under correlated test statistics or
   effect sizes ([51]Efron, 2004, [52]2007, [53]2013). RNA-seq data
   quantifying gene and isoform expression are correlated, both due to the
   nature of the counting process and biological considerations. Most
   statistics (including p-values) derived from these measurements will
   also be correlated. So, we need a model that either specifically
   accounts for this co-expression or does not assume independence. Efron
   discusses the statistical issues in detail in [54]Efron (2007),
   including the close relationship of locFDR to other false discovery
   rates, such as Benjamini–Hochberg ([55]Benjamini and Hochberg, 1995).

   Local FDR results from modeling test statistics as arising from a
   two-component mixture density. Formally, let z[i] be an observed test
   statistic from i = 1,…, N testing procedures. N must be large to ensure
   quality locFDR estimates, say at least in the hundreds. But the z[i]
   need not be independent. We assume that the N-values can be sorted into
   two classes (“null” and “non-null”), occurring with prior probabilities
   of p[0] or p[1] = 1 - p[0]:
   [MATH:
   <mtable><mtr><mtd><msub><mi>p</mi><mn>0</mn></msub><mo>=</mo><mi>Pr</mi
   ><mo>{</mo><mi>n</mi><mi>u</mi><mi>l</mi><mi>l</mi><mo>}</mo><mo>.</mo>
   <msub><mi>f</mi><mn>0</mn></msub><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mtext> </mtext><mi>d</mi><mi>e</mi><mi>n</mi><m
   i>s</mi><mi>i</mi><mi>t</mi><mi>y</mi><mtext> </mtext><mi>i</mi><mi>f</
   mi><mtext> </mtext><mi>n</mi><mi>u</mi><mi>l</mi><mi>l</mi></mtd></mtr>
   <mtr><mtd><msub><mi>p</mi><mn>1</mn></msub><mo>=</mo><mi>Pr</mi><mo>{</
   mo><mi>n</mi><mi>o</mi><mi>n</mi><mi>n</mi><mi>u</mi><mi>l</mi><mi>l</m
   i><mo>}</mo><mo>,</mo><msub><mi>f</mi><mn>1</mn></msub><mtext> </mtext>
   <mo stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mtext> </mtext><mi>d</mi><mi>e</mi><mi>n</mi><m
   i>s</mi><mi>i</mi><mi>t</mi><mi>y</mi><mtext> </mtext><mi>o</mi><mi>t</
   mi><mi>h</mi><mi>e</mi><mi>r</mi><mi>w</mi><mi>i</mi><mi>s</mi><mi>e</m
   i></mtd></mtr></mtable> :MATH]

   Define the null subdensity as:
   [MATH: <mrow><msubsup><mi>f</mi><mn>0</mn><mo>+</mo></msubsup><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mo>=</mo><msub><mi>p</mi><mn>0</mn></msub><msub
   ><mi>f</mi><mn>0</mn></msub><mo stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow> :MATH]

   And the mixture density:
   [MATH: <mrow><mi>f</mi><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mo>=</mo><msub><mi>p</mi><mn>0</mn></msub><msub
   ><mi>f</mi><mn>0</mn></msub><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mo>+</mo><msub><mi>p</mi><mn>1</mn></msub><msub
   ><mi>f</mi><mn>1</mn></msub><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo stretchy="false">)</mo></mrow>
   :MATH]

   Then, define the local false discovery rate (locFDR; Equation 2) as the
   Bayes posterior probability that a case is null given z:
   [MATH: <mrow><mi>f</mi><mi>d</mi><mi>r</mi><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mo>=</mo><mi>Pr</mi><mo
   stretchy="false">[</mo><mi>n</mi><mi>u</mi><mi>l</mi><mi>l</mi><mo>|</m
   o><mi>z</mi><mo
   stretchy="false">]</mo><mo>=</mo><mfrac><mrow><msub><mi>p</mi><mn>0</mn
   ></msub><msub><mi>f</mi><mn>0</mn></msub><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow><mrow><mi>f</mi><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo></mrow></mfrac><mo>=</mo><msubsup><mi>f</mi><mn>
   0</mn><mo>+</mo></msubsup><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo
   stretchy="false">)</mo><mo>/</mo><mi>f</mi><mtext> </mtext><mo
   stretchy="false">(</mo><mi>z</mi><mo stretchy="false">)</mo></mrow>
   :MATH]
   (2)

   The definition provides a straightforward interpretation: It is the
   probability an observed value came from the null density. In practice,
   Efron indicates that a locFDR < 0.2 provides strong statistical
   evidence that the case is from the non-null distribution.

Methodology: N-of-1-Pathways Alternatively Spliced (N1PAS)

   Here, we describe our proposed method, N1PAS. The approach aims to
   transform a single subject’s paired transcriptome data into an
   interpretable, mechanism-based profile of alternatively splicing
   patterns ([56]Figure 1). We begin by computing a Hellinger distance for
   each gene (Equation 1) to quantify differential isoform usage between
   the paired samples ([57]Figure 1A). Once isoform data are transformed
   into gene-level distances, we then classify genes as either
   alternatively spliced vs. not using conventional 2-means (as in
   k-means) clustering ([58]Figure 1B). Next, we quantify an enrichment of
   ASGs within a gene set (pathway). That is, odds ratios (OR; Equation 3;
   [59]Figure 1C) compare the relative abundance of ASGs within the
   pathway vs. the genes not in the pathway:

FIGURE 1.

   [60]FIGURE 1
   [61]Open in a new tab

   Workflow of N1PAS. (A) Isoform-specific mRNA-Seq data are obtained from
   an individual. Gene-level distances between the two samples indicates
   the magnitude of alternative splicing. (B) Gene-level distances are
   aggregated across the whole transcriptome and unsupervised clustering
   classifies genes as alternatively spliced genes (ASGs; blue) and not
   (red). The vertical axis shows the count of genes with Hellinger
   distances binned together to form the histogram. Note that since this
   is a univariate setting, 2-means simply finds a threshold Hellinger
   distance to classify the genes into two groups. Gene set (pathway)
   enrichment analysis is conducted by first (C) computing the odds ratio
   (OR) to quantify the relative abundance of ASGs in the pathway vs.
   genes not in the pathway (background). ORs are calculated for each
   pathway in the data-base to produce an empirical, subject-specific
   distribution (D). A local false discovery procedure provides
   uncertainty quantification (FDR) and classified pathways as enriched
   using a simple threshold at FDR < 20%. (E) Results are tabulated to
   provide an individualized profile of alternative spliced enrichment.
   [MATH:
   <mrow><mi>O</mi><msub><mi>R</mi><mrow><mtext>pathway</mtext></mrow></ms
   ub><mo>=</mo><mfrac><mrow><mrow><mrow><mo>#</mo><mi>o</mi><mi>f</mi><mt
   ext> </mtext><mi>A</mi><mi>S</mi><mi>G</mi><mi
   mathvariant="normal">s</mi><mtext> </mtext><mi>i</mi><mi>n</mi><mtext>
   </mtext><mi>p</mi><mi>a</mi><mi>t</mi><mi>h</mi><mi>w</mi><mi>a</mi><mi
   >y</mi></mrow><mo>/</mo><mrow><msub><mrow></mrow><mrow><mo>#</mo><mi>o<
   /mi><mi>f</mi><mtext> </mtext><mi>n</mi><mi>o</mi><mi>n</mi><mo>−</mo><
   mi>A</mi><mi>S</mi><mi>G</mi><mi
   mathvariant="normal">s</mi><mtext> </mtext><mi>i</mi><mi>n</mi><mtext>
   </mtext><mi>p</mi><mi>a</mi><mi>t</mi><mi>h</mi><mi>w</mi><mi>a</mi><mi
   >y</mi></mrow></msub></mrow></mrow></mrow><mrow><mrow><mrow><mo>#</mo><
   mi>o</mi><mi>f</mi><mtext> </mtext><mi>A</mi><mi>S</mi><mi>G</mi><mi
   mathvariant="normal">s</mi><mtext> </mtext><mi>n</mi><mi>o</mi><mi>t</m
   i><mtext> </mtext><mi>i</mi><mi>n</mi><mtext> </mtext><mi>p</mi><mi>a</
   mi><mi>t</mi><mi>h</mi><mi>w</mi><mi>a</mi><mi>y</mi></mrow><mo>/</mo><
   mrow><msub><mrow></mrow><mrow><mo>#</mo><mi>o</mi><mi>f</mi><mtext> </m
   text><mi>n</mi><mi>o</mi><mi>n</mi><mo>−</mo><mi>A</mi><mi>S</mi><mi>G<
   /mi><mi
   mathvariant="normal">s</mi><mtext> </mtext><mi>n</mi><mi>o</mi><mi>t</m
   i><mi></mi><mi>i</mi><mi>n</mi><mtext> </mtext><mi>p</mi><mi>a</mi><mi>
   t</mi><mi>h</mi><mi>w</mi><mi>a</mi><mi>y</mi></mrow></msub></mrow></mr
   ow></mrow></mfrac></mrow> :MATH]
   (3)

   Then, we calculate locFDR values (Equation 2) by fitting the
   two-component mixture model to the distribution of pathway odd ratios
   ([62]Figure 1D). This whole process results in a single-subject,
   mechanistic profile of alternative splicing, along with effect size and
   statistical significance ([63]Figure 1E).

Methods for Computational Experiments

   This section describes computational experiments to explore, validate,
   and apply N1PAS. We conduct these studies using RNA-Seq data derived
   from clinical samples housed in The Cancer Genome Atlas (TCGA). Pathway
   annotations are based on the Kyoto Encyclopedia Genes and Genomes
   (KEGG; [64]Kanehisa and Goto, 2000). Survival data were also retrieved
   from the TCGA.

Data Set Acquisition, Preprocessing, Pathway Ontology

   Data sets were selected based on the availability of: (1) paired
   normal-tumor isoform-level quantification from each cancer patient, (2)
   a KEGG pathway annotated to the same cancer, and (3) survival data. Six
   TCGA data sets were identified ([65]Table 1) meeting that criteria. The
   Broad GDAC Firehose was employed to retrieve RNA-Seq data in the form
   of RSEM normalized isoform expression (downloaded 25/7/2017)^[66]1. The
   UCSC Table Browser ([67]Goldman et al., 2015) was used to map isoform
   identifiers to the corresponding HGNC host gene symbol. In total,
   73,599 isoform measurements were associated with 29,181 unique gene
   symbols. Since for many genes, it is non-trivial to estimate isoform
   levels correctly due to ambiguity in assigning reads, we used the RSEM
   adjusted expression values.

Table 1.

   Characteristics from six TCGA RNA-Seq data sets with paired
   normal-tumor data, survival data, and associated target KEGG pathway.
   TCGA Cancer Target KEGG
   pathway description Number of
   patients Number
   deceased Isoforms
   measured
   LUSC Lung squamous cell carcinoma Non-small cell lung cancer 51 32
   73,599
   LUAD Lung adenocarcinoma Non-small cell lung cancer 58 26 73,599
   PRAD Prostate adenocarcinoma Prostate cancer 52 0 73,599
   THCA Thyroid carcinoma Thyroid cancer 59 4 73,599
   UCEC Uterine corpus endometrioid carcinoma Endometrial cancer 7 2
   73,599
   BLCA Bladder carcinoma Bladder cancer 19 11 73,599
   [68]Open in a new tab

   The RNA-Seq data sets were filtered to include patients with paired
   normal-tumor data. Such patients were identified via the R library
   TCGA2STAT ([69]Wan et al., 2015). Further, clinical information
   including survival data for the subjects was queried using this
   library. The data were normalized using transcripts per million (TPM)
   to make library size adjusted comparisons between samples derived from
   the same patient.

   Genes were annotated to KEGG gene sets (pathways) using the
   Bioconductor database KEGG.db version 2.3.5, downloaded 16 Sep 2009. In
   total, 230 gene sets were downloaded. To improve efficiency of the
   algorithms developed, the 73,599 isoforms measured were filtered to
   only those that mapped to a gene annotated to a KEGG pathway (5,879
   unique genes), resulting in 18,823 isoform-level quantities for the
   5,757 genes measured in the TCGA data set. Following standard practice
   in alternative splicing analytics ([70]Johnson and Purdom, 2017) only
   genes with at least 2 and no more than 30 isoforms were retained,
   leaving 17,088 isoform measurements on 4,133 genes. Lastly, in order to
   maintain interpretability and stability, the pathways were filtered to
   have at least 15 and no more than 500 genes – resulting in 206 pathways
   considered.

KEGG Target Pathway Validation Study

   We aim to validate our methods by exploring N1PAS results within KEGG
   target pathways. Our strategy here was inspired by the work of [71]Diaz
   et al. (2017). Similar to [72]Diaz et al. (2017), we define a target
   pathway as a KEGG pathway whose description matches the disease
   associated with the RNA-Seq data set (see [73]Table 1). The validation
   study will use descriptive statistics and empirical (permutation-based)
   significance assessments to determine to what extent target pathways
   were identified as enriched. Differing from [74]Diaz et al. (2017), the
   analysis is conducted for each individual patient, and, thus patient
   heterogeneity within target pathways will also be studied. Moreover, we
   broaden the concept of target pathways to cancer pathways, as defined
   by any KEGG pathway with cancer contained in the description. In total,
   there are nine cancer pathways, five target pathways listed in
   [75]Table 1 and four additional KEGG pathways, described as pathways in
   cancer, small cell lung cancer, pancreatic cancer, and colorectal
   cancer. To form descriptive statistics, a pathway capture rate is
   defined as the proportion of patients that found the pathway
   significantly enriched at locFDR < 20% and odds ratio > 1. In the case
   of cancer pathways, a cancer pathway capture rate is the proportion of
   patients that found at least one of the cancer pathways enriched. Note
   that N1PAS was carried out using all 206 KEGG pathways and the
   detection rate of target and cancer pathways was explored to validate
   our methodology.

   The above-mentioned empirical significance assessment entails producing
   2000 null binary matrices of size N (patients) × P (pathways). The
   assessment begins with the original matrix of alternatively splicing
   calls – 1 indicating that a pathway had both an odds ratio greater than
   1 and locFDR was less than 0.20, and 0 otherwise. Then each patient’s
   values (rows) are shuffled. This procedure will preserve the number of
   0’s and 1’s within each patient while disrupting the correspondence
   across individuals and pathways. Thus, each column forms a synthetic
   null pathway with values not tied to any particular pathway annotation
   while preserving the patient distributional characteristics. Once
   shuffled, a null pathway is randomly selected and the number of 1’s
   (significant pathways) are counted to calculate the null capture rate.
   The procedure is repeated 2000 times to form an approximate null
   distribution for the capture rate for a single pathway. An empirical
   p-value is the proportion of null pathways with a higher capture rate
   than the observed target capture rate for a given data set. To assess
   the capture rate for at least one of nine selected pathways (mimicking
   the cancer-annotated capture rate), the capture rate of at least one of
   nine pathways is computed by selecting nine columns at random, without
   replacement, and counting a “success” as at least one “1” in the vector
   of 9 values. This procedure is repeated 2000 to produce a null
   distribution corresponding to the chance of capturing at least 1 out of
   9 cancer-annotated pathways. The empirical p-value for this assessment
   is proportion of times the null capture rate is larger than the
   observed cancer capture rate. The assessment is limited as it makes no
   adjustment for the fact that smaller pathways correspond to more
   variable odds ratios (and therefore different probabilities of
   alternative splicing calls).

   There are caveats to using local FDR for testing pathways in the KEGG
   database. The locFDR implementation in R (locfdr) contains default
   parameters that assume there are a large number of tests (at least
   1000). Since there are approximately 200 pathways under consideration,
   custom configurations needed to be developed. Based on practical
   experience with the pathway odds ratio distributions, the model fits
   reasonably well using the following heuristic procedures: (1) filter
   outlier odds ratios and set FDR[loc] to 0 for those pathways, (2) the
   number of breaks in the histogram is set to 25, (3) nulltype is set to
   “Central Matching” to provide less conservative results (to compensate
   for small sample issues), and (4) the mixture density estimate’s
   degrees of freedom is set to 4. In larger ontologies, program defaults
   should be adequate, but the model fit should always be inspected. It is
   important to note that the number of genes in each pathway does not
   play into any locFDR calculations. We chose to forgo any formal
   inferences at the gene level as it is rate prohibitive to manually
   inspect and adjust the fit at the first stage of clustering ([76]Figure
   1B). Instead, we employ a 2-means strategy to classify genes as ASGs.

Disease Subtyping Pipeline Using N1PAS Single-Subject Metrics

   We now describe a pipeline to produce disease survival subtypes from
   the output of our proposed method ([77]Figure 2). The three pipeline
   inputs include: (i) paired-sample isoform measurements, (ii) a database
   of pathway annotations, and (iii) survival data. First, odds ratios and
   locFDR values are calculated using N1PAS (see [78]Figure 2A) for the N
   patients in the data set ([79]Table 1). Next, the odds ratios across
   all P pathways in the database are aggregated into an N × P matrix
   ([80]Figure 2B). To reduce noise from non-informative pathway signal,
   the pathways are filtered to only pathways in which at least one
   patient is significantly enriched with ASGs (FDR[loc] < 20%). This
   produces a new odds-ratio matrix ([81]Figure 2C) with P′pathways (P
   minus the number of filtered pathways). Now, patients are clustered
   (unsupervised) into two groups (for simplicity and potential clinical
   utility) using only the odds ratios for a single pathway ([82]Figure
   2D). To this end, we used partitioning around medoids, a robust version
   of k-means with the number of groups a priori set to two. For each
   clustering (one per pathway), Kaplain–Meier estimates ([83]Klein and
   Moeschberger, 1997) of the survival curves are computed ([84]Figure
   2E). A log-rank p-value assesses whether these two curves are distinct.
   Moreover, we devise the construction of an empirical null distribution
   of log-rank p-values for a pathway ([85]Figure 2F). This procedure is
   similar to the empirical p-value approach above. We begin with
   shuffling patient’s odds ratios across the P′pathways. Thus, this forms
   a synthetic null pathway with values not tied to any particular pathway
   annotation while preserving the patient distributional characteristics.
   The patients were then clustered using these null values, and survival
   log-rank p-values were computed. This process was repeated 2000 times
   to produce an empirical null distribution of odds ratios. The observed
   log-rank p-value for each pathway is compared to this null distribution
   (red line in [86]Figure 2F). This results in an empirical p-value for
   every pathway under consideration. Then the empirical p-values are
   adjusted using a Benjamini–Hochberg FDR adjustment (FDR[BH]). Pathways
   withFDR[BH] < 20% are identified as survival-relevant pathways
   ([87]Figure 2G). Clusters resulting using the N1PAS metrics within
   relevant pathways provide disease subtypes with distinct survival
   curves ([88]Figure 2H) with the added benefit of simple diagnostic
   tools – relevant pathway odds ratio thresholds as depicted in
   [89]Figure 2I.

FIGURE 2.

   [90]FIGURE 2
   [91]Open in a new tab

   Workflow of the disease-subtyping pipeline. Single-subject N1PAS
   pathway-level metrics (A) are aggregated for the patients in the cohort
   (B). Then, the dimension is reduced by filtering out pathways that are
   not significantly enriched in any patient (C). For each prioritized
   pathway remaining, patients are clustered (unsupervised) into two
   subtypes (D). Two Kaplan-Meier survival curve estimates are produced
   along with a log-rank p-value (E). An empirical distribution of
   p-values is computed to assess significance (F). The top-prioritized
   pathways are called survival relevant pathways (G). (H) displays the
   survival curves for patient clusters derived in relevant pathways. The
   curves presented here depict the results from a clustering that passed
   the empirical p-value threshold from (E). There is no difference in the
   clustering approach – just an illustration of a successfully identified
   survival-relevant pathway. Lastly, decision-friendly cutoffs are
   provided to inform at the point of care (I).

Alternative Approaches to Survival Subtyping

   To explore whether pairing and aggregating alternative spliced genes at
   the pathway level improves the detection of survival-relevant pathways,
   we modify the above pipeline by systematically modifying the input to
   the binary clustering in three ways: (1) use only the tumor isoform
   expression (as opposed to N1PAS odds ratios and locFDR), (2) use the
   difference in tumor isoform expression from the normal isoform
   expression, and (3) use the Hellinger distances corresponding to each
   of the 4133 KEGG-associated genes The first method differs from N1PAS
   in that for each pathway W with I[W] associated isoforms (all isoforms
   relating to the G genes annotated W), all I[W] isoform measurements are
   included in the call to the PAM clustering algorithm. The second method
   uses both the tumor and normal isoform expression, but is only
   concerned with differential expression not alternative splicing
   patterns. The third method quantifies differential isoform usage, but
   does not aggregate these patterns at the pathway level (as in N1PAS).
   As we defined the Hellinger distance only when both samples expressed
   the gene, care was needed to form a complete 51 × 4133 matrix of
   Hellinger distances. Specifically, any missing gene-wise distances were
   imputed with a patient’s mean Hellinger distance across all genes. Each
   method produces two patient clusters as in [92]Figure 2D and survival
   curves are fit for each of the 206 pathways. To construct an empirical
   null of survival p-values, each patient’s I[W] isoforms or genes were
   shuffled to disrupt correspondence with the specified pathway’s true
   annotations (similar to the other permutated distributions above) and
   two clusters are determined. Relevant pathways are identified using the
   FDR-corrected empirical p-values as shown in [93]Figure 2G.

Simulation Study of N1PAS Empirical Operating Characteristics

   To investigate the performance of our N1PAS methodology in practice, we
   conducted a series of Monte Carlo evaluations. We examined how two
   different inputs affect the test’s operating characteristics: (1)
   number of expressed genes, G, in the pathway and (2) proportion, π, of
   ASGs within the pathway over the background percentage of ASGs. We
   study the empirical false positive rates and (statistical) power to
   detect an enriched pathway based on permutations of patient-specific
   Hellinger distances for all 246 TCGA patients ([94]Table 1) while
   varying the two above inputs. Our focus lies in providing practitioners
   guidance to calibrate what effect size N1PAS can reliably detect. The
   pathway odds ratio serves as an effect size in N1PAS. This effect size
   corresponds to the proportion of ASGs within a pathway (π), relative to
   the background level of ASGs. We explain the details of the simulation
   below.

   To avoid over-simplistic parametric and statistical assumptions (e.g.,
   independent isoform counts) and to anchor simulation results to our
   studied setting, we restricted our Monte Carlo experiments to
   permutated TCGA patient-specific Hellinger distances for genes
   annotated to KEGG pathways. First, Hellinger distances within each gene
   were computed for all 246 patients across the 6 TCGA data sets. This
   produces patient-specific distributions of distances that are then used
   to classify genes into two groups: ASG or not (as in [95]Figure 1B) for
   each patient. The proportion of ASGs across all genes is what we refer
   to as the patient-specific background level and denote this quantity
   asπ[all]. Next, for each patient, we shuffle the gene labels to
   approximate a ‘null’ distribution of Hellinger distances as the values
   do not aggregate meaningfully into pathways. Then the number of
   expressed genes G (approximate gene set size) is selected from the set
   {15, 30, 50, 100} and an effect size π is selected from the set {0,
   0.05, 0.10, 0.15, 0.20}. Next, we randomly select one of the 206 KEGG
   pathways that has at least G expressed genes (but not more than 5 genes
   larger than the G, to give an approximate gene set size) to label
   Hellinger distances with specific gene labels to induce the effect size
   π. Call this selected pathway the specified pathway. To induce the
   specified effect size, G^∗(π + π[all]) gene labels within the specified
   pathway were randomly chosen and assigned randomly sampled Hellinger
   distances from the ASG group. The remaining genes were randomly
   assigned values from the non-ASG group. Then N1PAS was run on the
   permuted, modified Hellinger distance data for all 206 KEGG pathways.
   This process was completed 100 times across the 4 × 5 simulation
   configurations for a total of 2000 runs per patient. This results in
   246 patients × 2000 runs per patient for a total of 492,000 simulated
   N1PAS runs.

Results

   This section describes observations from the target KEGG pathway
   validation study across six TCGA data sets ([96]Table 1), an
   application of the disease-subtyping pipeline to the TCGA LUSC data set
   of 51 lung squamous cell carcinoma patients, and the Monte Carlo
   studies. Throughout, we focus on the interpretation of single-subject
   results to showcase the unique insights made possible by N1PAS.
   Overall, the results highlight the vast heterogeneity of splicing
   dysregulation among cancer patients, despite having the same disease.

Target KEGG Pathway Validation Study

   The frequency of significantly enriched pathways of ASGs varies greatly
   from data set to data set and patient to patient. [97]Table 2 compiles
   the results of the KEGG target pathway validation study. We only scored
   pathways with at least 15 expressed genes (in either sample from a
   patient). Of the 206 KEGG, the number of pathways scored across all
   data sets was fairly constant (with a median of approximately 185
   pathways). The locFDR significance threshold appears somewhat severe
   with the median percent significantly enriched ranging from 3 to 7.8%
   across data sets. So, it is quite difficult in this small ontology to
   call a pathway as enriched with ASGs. One should keep in mind that
   pathway can have a high proportion of ASGs yet still not be
   significantly enriched, especially if alternative splicing is rampant
   through the entire transcriptome.

Table 2.

   Summary for target capture validation study and empirical assessment.
   TCGA (N) Target KEGG
   pathway Median # scored
   pathways Median # of
   hit pathways Target pathway capture
   % (p-value) Median target
   pathway rank Cancer pathway capture
   % (p-value)
   LUSC (51) Non-small cell lung cancer 186 11 4% (ns) 58 47% (p = 0.03)
   LUAD (58) Non-small cell lung cancer 183 10 9% (ns) 49 52% (p = 0.001)
   PRAD (52) Prostate cancer 185 6 15% (p < 0.001) 28 42% (p < 0.001)
   THCA (59) Thyroid cancer 185 9 20% (p < 0.001) 37 39% (ns)
   UCEC (7) Endometrial cancer 182 14 9% (p < 0.038) 44 57% (ns)
   BLCA (19) Bladder cancer 181 11 47% (p < 0.001) 11 68% (p = 0.004)
   [98]Open in a new tab

   206 KEGG pathways were used as the input. There is one target pathway
   per data set and nine cancer pathways in the KEGG ontology (9/206 =
   0.044). A “pathway capture rate” is the proportion of patients with the
   target pathway (or set of pathways) found significantly enriched at
   FDRloc < 20% and odds ratio > 1. A “cancer pathway capture rate” is the
   proportion of patients with at least one of the nine cancer pathways
   that was found to be enriched. The “null pathway” capture is the rate
   of a randomly-constructed “pathway” being enriched and, similarly, the
   null “cancer pathway” is the rate of at least one of nine null pathways
   found enriched. See section Methods for Computational Experiments for
   details. The “median # scored pathways” contains the median of the
   distribution of the number of scored pathways across patients (as the
   pathway must have at least 15 expressed genes, this varies from patient
   to patient). Similarly, the “median target pathway rank” contains the
   median rank across the patients for the corresponding target pathway.
   ns, not significant.

   The target pathway capture rate (i.e., the proportion of patients with
   a significantly enriched target pathway) is greater than expected for
   most data sets. The KEGG pathway bladder cancer was the most
   often-captured target pathway, with 9 of the 19 (47%) Bladder Cancer
   (BLCA) patients significantly enriched. To put this in context, a
   random null pathway for the BLCA data was never captured at such a high
   rate in 2000 empirically-derived pathways ([99]Table 2; p < 0.001). In
   fact, the target capture rate is greater than expected for all data
   sets except for the two lung cancer data sets (LUSC and LUAD).
   Moreover, the rate of cancer pathway capture is higher than expected
   from four of the six data sets.

   While the capture rates on the surface appear lukewarm, we view the
   results as unveiling the need for subject-specific metrics. If one
   relied on cohort-based methods, the heterogeneity in splicing patterns
   would be missed. To explore this notion, we clustered patients into two
   groups using the odds ratios within target pathway. [100]Figure 3
   illustrates these grouping across the six data sets. Many patients show
   an enrichment of ASGs within patients (OR > 1) while others do not.
   Perhaps some patient’s disease mechanisms are not in a well-studied
   pathway and could be helped by innovative strategies.

FIGURE 3.

   [101]FIGURE 3
   [102]Open in a new tab

   Boxplots of patient-specific odds ratios (OR) of the target KEGG
   pathway for the six TCGA data sets. Patients (indicated by points) were
   clustered using only the OR of the target pathway into two groups to
   unveil transcriptional response subtypes. Note that some patients
   exhibit less alternatively spliced genes within the target pathway than
   the background transcriptome (see LUAD, LUSC, and UCEC data sets; OR <
   1).

Applying the Disease-Subtyping Pipeline to the Non-small Cell Lung Cancer
(LUSC) Data Set

   The low target-capture rate (3%) and high cancer pathway capture rate
   (47%) for the LUSC data set ([103]Table 3, first row) and apparent
   impoverishment of ASGs for some patients ([104]Figure 3) presents an
   interesting dilemma. Is the target pathway useful for these data? Or
   are there other, more interesting pathways related to patient outcomes?
   For this quandary, we applied the survival-subtyping pipeline to the 51
   LUSC paired normal-tumor isoform expression, using KEGG pathways, and
   clinical survival data (32 deaths observed). Following the pipeline
   workflow illustrated in [105]Figure 2, the 206 KEGG pathways meeting
   the filtering criteria were scored for each of the 51 patients. Next,
   the odds ratios were aggregated into an N × P matrix ([106]Figure 2B).
   P is determined by the number of pathways scored in at least one
   patient. For these data, there are 174 such pathways. The pathways were
   further filtered to reduce noise to include only the 101 pathways found
   significantly enriched in at least one patient ([107]Figure 2C). Next,
   each pathway is assessed one at a time to determine if unsupervised
   clustering of patients, based on odds ratios, into disease subtypes
   produces distinct survival curves ([108]Figure 2D–[109]G). [110]Table 3
   displays the discovered survival-relevant pathways ([111]Diaz et al.,
   2017). Survival curves for the top-four pathways are provided in
   [112]Figure 4.

Table 3.

   Non-small cell lung cancer (LUSC) pathways selected by subtyping
   pipeline.
   Rank KEGG description Log-rank
   p-value Empirical p-value
   (2000 reps) # genes % significant
   enrichments % genes shared w/
   target pathway
   1 Bladder cancer 0.0001 0.0005 42 18% 52%
   2 Melanoma 0.002 0.003 71 24% 47%
   3 Staphylococcus aureus infection 0.003 0.004 55 18% 0%
   4 Non-small cell lung cancer^∗ 0.005 0.007 54 4% 100%
   5 Renal cell carcinoma 0.006 0.009 70 18% 36%
   [113]Open in a new tab

   Top-five ranked pathways out of 101 KEGG pathways enriched for at least
   one of the 51 patients (FDR < 20%). Pathways are ranked by log-rank
   p-value between survival curves corresponding to clustering of
   enrichment odds ratios. As shown by the high prioritization of the
   biologically relevant LUSC target pathway in this table, the proposed
   method using ORs of spliced pathways outperforms alternate analytical
   processing of alternative splicing signal as input to clustering as
   shown in [114]Table 5. The columns headings explained: The “log-rank
   p-value” corresponds to the standard Kaplan-Meier p-value (unadjusted)
   for the pathway-based clustering. The “empirical p-value” results from
   our permutation test. The “# genes” is the number of annotated genes
   from KEGG. “% significant enrichments” column contains the percentage
   of subjects that were identified as having the significant enrichment
   of ASGs. The “% genes shared w/target pathway” contains the percent of
   genes annotated to the target pathway for each top-rank pathway ^∗ =
   Target pathway. Bolded values denote the summary statistics of the LUSC
   target pathway.

FIGURE 4.

   [115]FIGURE 4
   [116]Open in a new tab

   LUSC patient (N = 51, 32 deaths) survival curves. Subgroups determined
   by unsupervised clustering into two groups (Better and Worse) using the
   odds ratio for the corresponding KEGG pathway. (A) Bladder cancer
   pathway (N[Better] = 28), (B) Melanoma pathway (N[Better] = 33), (C)
   Staphylococcus aureus infection pathway (N[Better] = 21), and (D)
   Non-small cell lung cancer pathway (N[Better] = 33). Circles indicate
   censorship. All log-rank p < 0.01.

   Thus, the N1PAS odds ratios and locFDR values were able to predict
   survival in the LUSC data set, as five pathways were found atFDR[BH] <
   20% ([117]Table 3). Interestingly, four of these pathways relate to
   cancer with the fourth-ranked pathway being the LUSC target pathway.
   Surprisingly, the third-ranked pathway is the staphylococcus aureus
   infection KEGG pathway, which may present an orthogonal explanation for
   a poor survival outcome.

   [118]Figure 5 displays the odds ratio distribution within the top-four
   relevant pathways split by patient clusters (subtypes). Patient
   subtypes have been annotated as Better or Worse based on inspection of
   the survival curves. For the three cancer-annotated pathways, a higher
   odds ratio of enrichment with ASGs is associated with a better survival
   outcome. This paradoxical observation may be the result of drug
   efficiency directed at known cancer biology. The reverse pattern lies
   in the staphylococcus-related pathway – more abundant alternative
   splicing in the pathway results in poor survival. Simple diagnostic
   rules can be found by inspection of the boxplots in [119]Figure 5. For
   example, a patient-specific odds ratio less than 1 in the non-small
   cell lung cancer pathway (target pathway; [120]Figure 5D) indicates a
   poorer prognosis.

FIGURE 5.

   [121]FIGURE 5
   [122]Open in a new tab

   Boxplots overlaid with points indicating the odds ratios of pathway
   enriched with ASGs. Points indicate values for the 51 LUSC patients,
   split by unsupervised cluster assignment into two groups (“Better” and
   “Worse” as interpreted by survival data) for pathways (A) “Bladder
   cancer” (N[Better] = 28), (B) “Melanoma” (N[Better] = 33), (C)
   “Staphylococcus aureus infection” (N[Better] = 21), and (D) “Non-small
   cell lung cancer” (N[Better] = 33). Circles indicate censorship. Since
   the clustering is done on a single vector of odds ratios for each
   pathway, an odds ratio threshold can separate the patients (red dashed
   lines; midpoint between clusters).

   One practical issue with subtyping a patient using multiple
   survival-relevant pathways is that different clusterings may disagree
   with prognosis (Better using one pathway’s odd ratios and Worse using
   another’s). This will require care to uncouple for the patient at hand.
   For example, a patient may exhibit a worse prognosis in the
   staphylococcus aureus infection pathway, but a better prognosis in the
   cancer-annotated pathways. Hypothetically, this patient may then
   respond well to the standard treatment in combination with an
   innovative treatment to address the dysregulation in the staphylococcus
   pathway.

   To gain insight into subtype overlap within our LUSC case study, we
   explore agreement across the top-four survival-relevant pathways
   ([123]Table 3). To quantify the agreement, we compute the Jaccard index
   asJ[1,2] =
   [MATH:
   <mrow><mfrac><mrow><mo>|</mo><msub><mtext>G</mtext><mn>1</mn></msub><mo
   >∩</mo><msub><mtext>G</mtext><mn>2</mn></msub><mo>|</mo></mrow><mrow><m
   sub><mrow><mtext>|G</mtext></mrow><mn>1</mn></msub><mo>∪</mo><msub><mte
   xt>G</mtext><mn>2</mn></msub><mo>|</mo></mrow></mfrac></mrow> :MATH]
   , where G[1], G[2] are the sets of patients clustered using the 1st and
   2nd top pathways (for either the Better or Worse subtype). [124]Table 4
   displays the Jaccard indices for all pairs of top ranked pathways for
   both subtypes. The quantities suggest that agreement is stronger for
   the Better subtype (average Jaccard index of 0.4942, compared to the
   average Worse index of 0.3255). This is interesting as it may indicate
   that individual lung cancer patients display unique dysregulated
   pathways (motivating precise treatments). We also observe that
   subtyping based on the three cancer-associated pathways (ranks 1, 2, 4)
   generally agree well for the Better patients. But these Better subtypes
   from cancer pathways agree poorly with the staphylococcus aureus
   infection subtyping. The Worse subtypes generally agree less well than
   the Better subtypes, and with a similar trend in disagreement with the
   staphylococcus pathway. This suggests a distinct survival-related
   signal in this pathway from the cancer-annotated dysregulation. Of
   course, conflicting subtypes will complicate clinical application and
   slow any decision process, thus limiting our proposed approach.

Table 4.

   Subtype agreement across the top-four LUSC survival-relevant pathways.
   Subtype J[12] J[13] J[14] J[23] J[24] J[34]
   Better  0.605 0.289 0.605 0.370 0.784 0.313
   Worse   0.464 0.159 0.464 0.147 0.636 0.083
   [125]Open in a new tab

   The Jaccard index measures the proportion of overlapping patients
   selected in either the Better or Worse group by each pair of top-ranked
   pathways ([126]Table 3). The first, second, and fourth pathways are all
   associated with cancer annotated genes and their comparisons show
   highest agreement scores.

Comparing Alternative Approaches to Survival Subtyping

   As detailed in section Alternative Approaches to Survival Subtyping,
   the input data to the clustering step of our proposed subtyping
   pipeline was modified using three straightforward alternative
   approaches for the 51 LUSC patients: (1) using tumor isoform
   expression, (2) using the difference in isoform expression between
   tumor and normal samples, and (3) Hellinger distance data for the 4133
   genes annotated to KEGG pathways. [127]Table 5 summaries the pathway
   rankings and statistical significance of survival prediction for the
   three alternative approaches. We’ll discuss the results of each method
   in turn.

Table 5.

   Complementary study of LUSC survival-relevant pathway rankings by
   subtyping pipeline using three alternative analytical transformations
   (first column; clustering inputs) before clustering.
   Clustering input Rank KEGG description Log-rank
   p-value Empirical
   p-value Empirical
   FDR[BH]
   Tumor isoform expression only 1 Phenylalanine metabolism 0.0017 0.005
   0.515
   2 Complement and coagulation cascades 0.0024 0.0030 0.515
   3 Pentose phosphate pathway 0.0106 0.0245 0.750
   4 Vasopressin-regulated water reabsorption 0.0112 0.018 0.750
   … … … … …
   64 Non-small cell lung cancer (target pathway) 0.130 0.217 0.750
   Difference in tumor and normal isoform expression 1 Phenylalanine
   metabolism 0.0014 0.003 0.618
   2 RNA degradation 0.0034 0.0140 0.747
   3 GnRH signaling pathway 0.0037 0.017 0.747
   4 Glycolysis/Gluconeogenesis 0.0130 0.0285 0.747
   … … … … …
   144 Non-small cell lung cancer (target pathway) 0.571 0.6025 0.854
   Hellinger distance within genes across isoform expression 1 Chronic
   myeloid leukemia < 0.001 0.0005 0.103
   2 Osteoclast differentiation 0.0035 0.0060 0.618
   3 Steroid biosynthesis 0.0111 0.0150 0.762
   4 Progesterone-mediated oocyte maturation 0.0130 0.0155 0.762
   … … … … …
   30 Non-small cell lung cancer (target pathway) 0.1186 0.136 0.899
   [128]Open in a new tab

   In order to demonstrate that the proposed method evaluated in
   [129]Table 3 outperforms straightforward analytical transformations
   (first column, clustering inputs), we conducted these additional
   substudies (section Methods for Computational Experiments). The
   non-small cell lung cancer pathways of KEGG is the most relevant to
   LUSC survival and is ranked higher (#4, [130]Table 3) in the proposed
   method than in the alternate ones presented here. Ranked pathways of
   the 206 KEGG pathways with at least 15 and no more than 500 genes.
   Empirical p-values are constructed by comparing to survival curves
   generated from clustering on permutations of the clustering input data
   expression. Bolded values denote the rank of the LUSC target pathway.

   The tumor-expression method clusters solely on the isoform expression
   and uses no dimension reduction techniques prior to clustering. This
   could perhaps provide more information in the cluster assignments. No
   pathways, however, were found to be statistically significant at FDR <
   50%. But this approach uses only half of the expression data input into
   N1PAS; thus, it may suffer reduced ability to detect survival-relevant
   pathways based on that fact alone. The top pathways’ descriptions
   appear to be less relevant than the results for N1PAS ([131]Table 3).
   The non-small cell lung carcinoma target pathway is ranked relatively
   poorly (64 out of 206 pathways, compared to the fourth-ranked pathway
   using N1PAS in [132]Table 3). These results seem to imply that
   clustering on the dynamic, individualized metrics of differential
   isoform usage within pathways provides higher resolution for survival
   prediction than tumor isoform expression alone.

   Next, we use the difference in tumor and normal isoform expression as
   input to the binary clustering of LUSC patients. The top-ranked
   survival-relevant pathway, phenylalanine metabolism, agrees with the
   tumor-only results. The other top pathways disagree, suggesting a
   distinct signal. The non-small cell lung carcinoma target pathway’s
   rank dropped substantially in this approach to a rank of 144 out of 206
   KEGG pathways when compared to the N1PAS results reported in [133]Table
   3. No pathways were found to significantly produce separate survival
   curves at FDR < 50%.

   Finally, we use the Hellinger distances for each gene across the 51
   LUSC patients to cluster patients. This signal is somewhat nearer to
   N1PAS and is concerned with alternative splicing patterns. N1PAS takes
   these distances one step further by aggregating the signal into
   pathway-level enrichment of ASGs (by operating with the odds ratios;
   [134]Figure 1C). The non-small cell lung carcinoma target pathway rank
   is also low (30 out of 206 KEGG pathways) when compared to the N1PAS
   method reported in [135]Table 3. The top hit pathway, chronic myeloid
   leukemia, was significant at FDR < 20% – in contrast to the two other
   alternative methods.

   Put together, we see that the use of N1PAS odds ratios and significance
   assessment increases the ability to find survival-relevant subtypes
   within pathways. The results also suggest that alternative splicing
   analyses within pathways present a complementary viewpoint to
   expression-based workflows.

Simulation Results

   Our simulation results provide insight into N1PAS empirical operating
   characteristics, specifically simulated false-positive rates and
   statistical power. Among the 492,000 N1PAS runs, there were 625
   algorithm failures (0.127%) due to misfit odds ratio mixture modeling.
   This may happen in practice, although rarely, and the parameters of
   locFDR will have to be manually adjusted. Here, we discarded these runs
   from the results. Each patient displays a different Hellinger distance
   distribution with their own background level of ASGs (π[all]). The
   average π[all] across the 246 patients is 0.1914, but vary from 0.1088
   to 0.4417 with the middle 50% of the rates vary from 0.1613 to 0.2122.

   Empirical false-positive error rates correspond to the π = 0 simulation
   setting (no pathway is specified to have an enrichment of ASGs). ASGs
   can aggregate in pathways by chance in this setting. We calculated the
   simulated false-positive rate as the number of detected pathways among
   the 206 KEGG pathways under permuted patient-specific Hellinger
   distances. We performed 100 simulation replicates per patient and
   computed the patient-specific average false-positive rates. Further, we
   pooled these mean rates within each TCGA data set and computed the
   empirical standard error of the mean to assess variability across
   patients. Finally, we computed pointwise 95% confidence intervals for
   the data set mean false-positive rate using a normal approximation
   centered at the overall mean with the observed standard error.
   [136]Figure 6 displays these mean false-positive rate estimates. The
   patient-specific average false-positive rates vary tightly around
   0.04–0.0525 with centers between 0.045 and 0.050. This is interesting
   since the decision threshold used in N1PAS was locFDR > 0.20, and we
   observe rates much lower than this specification. This trend in
   false-positive rates persists across TCGA data sets. In sum, the
   simulated false-positive rate data suggests adequate (if not
   conservative) method performance with respect to false discovery rates.

FIGURE 6.

   FIGURE 6
   [137]Open in a new tab

   Extensive simulations of false-positive rates under the assumption of
   locFDR < 20% shows conservative 4–5.25% false-positive rates for the
   N1PAS method. Empirical false-positive rates are represented as dots
   for each TCGA data set, based on simulated patient-specific Hellinger
   distance data. Results reported for π = 0 (see text). Horizontal bars
   are 95% confidence intervals around the overall mean false-positive
   rate (across patients within the data set). For example, the confidence
   interval for LUSC was constructed by first finding the average
   patient-specific false-positive rate for all 51 patients (51 patients ×
   100 replicates/patient = 5100 simulations for LUSC). Then, the standard
   error of the overall mean was estimated using the sample standard
   deviation across patients and a normal approximation is used to
   determine the confidence bounds.

   Empirical power estimates, as detection rates, from our simulations are
   graphed in [138]Figure 7. Power calculations correspond to simulations
   withπ > 0. Here we call power the observed detection rate of a
   specified pathway, with an induced enrichment of ASGs. [139]Figure 7
   presents the results as a function of the ASG proportion above
   background (π), and stratifies power curves by expressed genes in
   pathway (G). We increment the effect size π from 0.05 to 0.20 to
   calibrate how sensitively N1PAS can detect an enriched pathway. 95%
   confidence intervals for mean power were constructed at each pair (π G)
   within each data set as above in the false-positive rate study.

FIGURE 7.

   [140]FIGURE 7
   [141]Open in a new tab

   Extensive simulations show substantial statistical power of N1PAS
   method in detecting pathways enriched with Alternatively Spliced Genes
   (ASGs) when the specified pathway’s proportion (π) is 15% higher than
   subject-specific background level of ASGs (π[all]). Approximately
   400,000 simulations were conducted to assess power by varying expressed
   genes in the specified pathway (G), proportions π, and gene labels
   assignment for patient-specific Hellinger distances. Empirical power
   estimates are displayed as jittered dots for each TCGA data set, based
   on the detection rate of a specified pathway. Horizontal bars are 95%
   confidence intervals around the overall mean power (across patients
   within the data set, stratified by G and π). Power curves were
   constructed using linear interpolation between pointwise power
   estimates (dots) at fixed values of simulation configurations.

   Patterns in [142]Figure 7 show a trend toward increasing power while
   increasing the effect size π. N1PAS rarely detects a 5% enrichment of
   ASGs over the background level. At 10%, however, detection is possible
   but highly variable. At 15% ASGs above background, power becomes more
   reliable with 75% of average power estimates above 0.9. At 20%, the
   specified pathway in almost always detected (minimum patient-specific
   power = 0.9916). The trends in simulated power are generally consistent
   across TCGA data sets at extreme values of π, 0 or 0.2. There are,
   however, some differences in central tendency and variability across
   TCGA data sets atπ = 0.1, 0.15. For example, PRAD patients show more
   power to detect a specified pathway with a 15% ASG enrichment than
   BLCA. As one may expect, power estimates are more variable with a
   smaller number of expressed genes G for immediate values of π (e.g.,
   0.1 or 0.15).

   In general, these results suggest that N1PAS procedure exhibits good
   false-positive error control and excellent power, at least under the
   settings chosen for these simulations.

Discussion

   This study creates a first look at personalized alternatively splicing
   patterns within pathways. As such, these patterns are likely
   complementary to other ‘Omics measures, and thus the proposed N1PAS
   strategy could become an additional tool to bridge the gap between
   RNA-Seq data and clinical translation. For example, one could
   investigate whether splicing events occur in a coordinated way in
   response to a stimulus or perturbation like a medication. A practical
   limitation is the relative rarity of paired RNA-Seq data (compared to
   single-sample expression data). It’s true that paired unaffected and
   cancer tissues are currently uncommon. Yet novel experimental designs
   and corresponding analyses will drive data collection protocols.
   Indeed, as the National Institute of Health and National Cancer
   Institute announced ([143]Collins and Varmus, 2015), there is interest
   in promoting and developing creative new assays and analytics for
   predicting individualized disease mechanisms and treatments. The N1PAS
   methods is a proof of concept that demonstrates, with the blessing of
   high dimensionality and integration of external knowledge, that signal
   can aggregate within gene sets of paired samples and improve their
   mechanistic interpretation.

   To make clear of a potential clinical use of N1PAS within the survival
   subtyping pipeline of Omics signals, imagine the following scenario: a
   patient suffering from non-small cell lung cancer consents to paired
   tumor-normal RNA-sequencing. Patient-specific N1PAS odds ratios are
   first computed in the five survival-relevant pathways. Each
   survival-relevant pathway has an associated odds ratio threshold (red
   dashed lines in [144]Figure 5). Based on this threshold, the patient is
   stratified into either the Better or Worse survival groups. This
   informs on both prognosis and on patient-specific disease mechanisms.
   Our proposed method, however, only considers one pathway at a time and
   therefore only partially explains survival, compared to using all the
   odds ratios. Future studies could improve survival prediction through
   aggregation of pathway metrics and potentially have greater clinical
   utility in terms of prognosis in addition to combining multiple ‘Omics
   measurements.

   There are, of course, limitations and caveats to the methodology. The
   proportion clustering approach seeks to quantify differential relative
   isoform usage and not differential gene expression. However, a gene
   could be differentially expressed based on the magnitudes of the sum
   across isoforms (typical gene expression) as well as exhibit
   differential isoform usage. As such, the signal obtained by N1PAS may
   not be purely alternative splicing as traditional DEGs may still
   contribute. Another issue is that different isoforms may not always be
   indicative of differential protein structure or activity and the
   biological impact may be minimal in these situations. The model does
   not currently account for any noise that may be present in RNA-Seq
   measurements, an important consideration in the N-of-1 setting. Along
   those lines, the estimated proportions p[gA] in Equation (1) depend on
   the initial accuracy of the isoform abundances. This accuracy depends
   on several aspects such as read depth, number of expressed isoforms,
   and the specific method that has been used for estimating the
   transcript abundance. Therefore, the estimated proportions may have
   different variances and other statistical properties. Our method does
   not explicitly account for these differences. An important extension of
   our model could include this uncertainty in estimating the proportions
   for a more holistic and realistic formulation.

   We acknowledge some limitations to the computational experiments in the
   study. The significance of the survival curves has been assessed on a
   single retrospective dataset and could be over-optimistic. Future
   studies should include prospective independent datasets. Our choice of
   pathway database is outdated and more current databases may provide
   more informed discoveries. One could imagine a variety of alternative
   methods to form comparisons against N1PAS. For example, p-value
   aggregation methods (such as Fisher’s method) could be explored. Care
   must be taken, however, as these methods often assume independent
   measurements. Future studies could develop more sophisticated and novel
   p-value aggregation approaches in this setting. Additionally, it would
   be interesting to vary the pathway definitions across several databases
   in future studies. Lastly, this proof of concept study was designed to
   demonstrate the utility of alternative splicing signals in absence of
   other signal to avoid confounders. Therefore, in future studies, this
   method should be opportunistically combined with any ‘Omics as well as
   clinical signals in order to translate clinically useful predictions
   with high accuracy.

Conclusion

   We proposed a single-subject methodology, N-of-1-pathways Alternatively
   Spliced, to quantify differential mRNA isoform usage within biological
   pathways from paired-sample RNA-Seq data. A target pathway validation
   study on paired normal-tumor samples from TCGA reveals that in most
   data sets the identified pathways generally concur with the annotated
   disease (bladder cancer pathway was more likely to be enriched with
   ASGs). More than just providing validation, this study also highlights
   the vast patient-to-patient heterogeneity in alternative splicing
   patterns. This heterogeneity actually motivates our N-of-1 approach:
   despite having the same disease, patients vary greatly in their
   splicing dysregulation. Our identification of subject-specific splicing
   dysregulation offers targets for personalized interventions and
   monitoring plans. Next, we applied our N1PAS single-subject metrics to
   predict survival within a novel subtyping pipeline. The output of this
   pipeline contains easy-to-interpret diagnostics to enable precision
   medicine from transcriptome data. Finally, we showed adequate
   statistical power and false-positive rates in simulation studies.

Author Contributions

   AS and YA convinced the foundational concepts, designed experiments,
   and explored the results. AS developed the methodology and drafted the
   manuscript. AS and DA conducted the computational experiments. YA
   contributed sections. AS and CK developed the figures. All authors
   contributed to manuscript revision, read and approved the submitted
   version.

Code Availability

   All code used to conduct this study are freely available at
   [145]https://github.com/grizant/nof1-splice. An R package to conduct
   N1PAS, including vignettes, is freely available at
   [146]https://github.com/grizant/n1pas/tree/master.

Conflict of Interest Statement

   The authors declare that the research was conducted in the absence of
   any commercial or financial relationships that could be construed as a
   potential conflict of interest.

Acknowledgments