Graphical abstract
graphic file with name fx1.jpg
[32]Open in a new tab
Highlights
* •
ACSNI uses deep neural network to decompose transcriptomic
measurements
* •
ACSNI maps pathway components to target biological or clinical
phenotypes
* •
ACSNI robustly extracts regulatory output of transcription factors
* •
ACSNI enables annotation of genes with unknown functions
The bigger picture
Methods that group genes into functional units to quantify pathway
activities are critical in the analysis of biological systems. Although
many components of biological pathways have been described in detail,
these tend to be limited to well-studied genes. In contrast, the
majority of possible components remain unexplored. Here, we present a
machine-learning tool for constructing and predicting tissue-specific
components of biological pathways from large biological datasets. Our
algorithm, ACSNI, can tackle incomplete pathway descriptions and
enhance current pathway analysis methods' performance. We anticipate
that, by dissecting the complex signals in biological data in a
flexible and context-specific manner, ACSNI can facilitate the full
characterization of physiological systems of interest.
__________________________________________________________________
ACSNI software presents a flexible approach to discover unknown
components of a biological pathway. We demonstrate through simulation
and biological user cases that ASCNI has superior performance in
inferring pathway activities compared to other approaches. In
particular, it is able to identify unknown components of pathways in a
tissue-specific manner, facilitating the full characterization of
physiological systems of interest.
Introduction
One feature common to all cells is the dynamic ability to coordinate
activities through many pathways that receive and process signals from
the environment and different cell regions.[33]^1 Hence, there has been
a persistent interest in developing pathway analysis approaches that
group genes into functional units and elucidate molecular mechanisms.
Most approaches, especially the techniques of pathway enrichment
analysis and network-based modeling, require accurate and comprehensive
pathway descriptions with annotated regulatory components (genes).
However, the identities of the components within these biological
pathways and their regulatory interactions remain either wholly unknown
or partially understood.[34]^2 Even among well-studied pathways,
detailed annotations are scarce for their disease- and tissue-specific
components. This unmet need argues for an expanded effort to identify
these unknown components.
Genetic, biochemical, and biophysical techniques such as small
interfering RNAs, small molecular inhibitors, immune precipitation, and
gel filtration are common procedures for identifying molecular
functions and pathway components.[35]3, [36]4, [37]5 The first step
toward designing these experiments is the identification of candidate
genes and their interacting partners. The large volume of public gene
expression profile (GEP) datasets, such as the tissue RNA expression
profiles in the Genotype-Tissue Expression (GTEx)[38]^6 and The Cancer
Genome Atlas (TCGA)[39]^7 projects, allow for initial inference of
pathway circuits and prediction of their unknown components in a
tissue- and disease-specific manner. One popular approach applied to
these large datasets involves using pairwise correlation to represent
the relationships between genes.[40]^8 Visualizing these interactions
as a network annotated with external functional information such as
STRING[41]^9 and GeneMANIA[42]^10 reveals interactions of known pathway
genes and helps to discover additional related genes. However, these
approaches make assumptions that do not account for the multivariate
nature of gene regulation.[43]^2 For example, the transcription of a
single gene proceeds through the binding of transcription factors on
enhancer regions and the subsequent recruitment of many co-activators
and complexes that modify chromatin structures and promote the assembly
of the basal transcriptional machinery.[44]^11 These complexities are
not easily captured by simple pairwise analysis between genes and the
resultant binary interaction networks.
Another approach for identifying unknown components of pathways
involves first estimating a summarized pathway activity from GEPs based
on prior knowledge of the target pathway using gene set enrichment
analysis tools, such as gene set variation analysis (GSVA),
pathway-level analysis of gene expression (PLAGE), single-sample gene
set enrichment analysis (ssGSEA), and the combined Z score (Z
score),[45]12, [46]13, [47]14, [48]15 then correlating this compact
value with the rest of the genes to identify new components of the
target pathway. However, genes function in more than one pathway, and
pathways contain subprocesses,[49]^1 meaning that a single variable
representation of a pathway is unlikely to capture the target pathway's
dynamics.
Here we provide software termed automatic context-specific network
inference (ACSNI), which leverages artificial intelligence for the
reconstruction of a biological pathway and aids the discovery of
pathway components and classification of the crosstalk between pathways
in specific tissues. The method draws on the principle that cell
signaling networks are organized into several small, highly connected
modules (herein called subprocesses) that combine hierarchically into
larger units to regulate cellular functions.[50]^16 ACSNI combines the
widely used and general-purpose open-source TensorFlow-Keras library
with a probabilistic ensemble approach in a manner that adapts to the
particular needs of systems biology projects. For example, it has
functions that enable flexible adjustment for gene weights within the
pathway and linking pathway subprocesses to a phenotype. We demonstrate
the utility of ACSNI through three different use cases, including the
investigation of the mTOR signaling pathway in kidney tumor and normal
samples, exploring the ATF2 network in the aorta, and identification of
HOXA transcript antisense RNA myeloid-specific 1 (HOTAIRM1) target
genes in kidney tissues. We find that ACSNI generates validated and
detailed, tissue-specific pathway circuitry and components. The ACSNI
predictions can guide researchers in designing targeted experiments to
study the molecular mechanisms that underlie a biological state.
Results
Overview of the ACSNI method
ACSNI is a three-step method to infer tissue-specific components
(genes) of a pathway from tissue GEPs. It performs (1) unsupervised
derivation of pathway subprocesses, (2) estimation of subprocess-gene
interaction scores, and (3) inference of tissue-specific components of
the target pathway ([51]Figure 1A; see [52]experimental procedures).
ACSNI requires two input datasets: a matrix of expression values, with
genes listed in rows and tissue samples listed in columns, and a file
detailing gene set membership, representing prior knowledge of gene
functions, such as a gene set from the Pathway Interaction Database
(PID)[53]^17 and the MSigDB database.[54]^18 Depending on the research
question, ACSNI can take an additional file containing weights (as
integer values) for the genes in the second input. These data may
represent transcription factor (TF) activity level or kinase function.
In the first step of ACSNI, we derive multivariate pathway
representation that defines the activity of the subprocesses in each
sample (step i, [55]Figure 1A). The step is performed with a deep
neural network (DNN) that decomposes the expression of the gene set
into pathway subprocesses under the assumption that genes function in
multiple subprocesses. In the second step, the subprocess-gene
interaction analysis is conducted on the whole transcriptome using a
linear model. We extract the model's coefficient of determination as
subprocess-gene interaction scores (step ii, [56]Figure 1A). We then
classify the interaction scores to identify the tissue-specific
components of the pathway (step iii, [57]Figure 1A). The final output
of ACSNI can be further mined to identify biologically relevant pathway
subprocesses and components.
Figure 1.
[58]Figure 1
[59]Open in a new tab
Overview of the ACSNI method and the robust signal reconstruction
capabilities
(A) ACSNI reduces the expression of a gene set into a small number of
subprocesses and derives corresponding gene interactions, while
constraining the optimization to reduce technical noise. Given two
inputs, the expression matrix, and the binary gene set membership
(pathway representation), the algorithm starts by splitting the
transcriptome into two parts: (1) expression of the genes in the gene
set and (2) the expression of the rest of the transcriptome
(transposed). It then extracts the subprocess activities across samples
(step i: W) from the expression profiles of the gene sets, interacts
the subprocesses with the expression of the rest of the transcriptome
to extract subprocess-gene interaction scores (step ii: N), and
classifies the scores to infer extended network of the pathway
represented by the gene set (step iii: P).
(B) Box plot of the percentage of the simulated signal recovered from
the analysis of 50 gene sets against five GTEx tissue expression
samples. Ex represents the gene set signal, In the expected signal, and
Rn the random signal.
(C) Box plot depicting the effect of random noise (log normal) on the
percentage of gene set signal recovered from the analysis of 50 gene
sets against five GTEx tissue expression samples.
(D) Box plot showing the false discovery rates (FDR) from the analysis
of 50 gene sets against five GTEx tissue expression samples (estimated
as the number of genes from the random signal divide the total number
of predicted genes in B).
(E) Scatterplot of the relationship between FDR and size of the gene
set.
(F) Box plot of Jaccard index of the similarity of predicted genes
between two independent split or with one shuffled split (Null) across
50 curated gene sets.
(G) Box plot comparing the Jaccard index of predicted genes between two
independent expression splits for randomly generated (R, n = 20) and
curated (C, n = 50) gene sets.
See also [60]Figures S1 and [61]S4.
Performance evaluation
To assess the performance of ACSNI in reconstructing signaling networks
and their subprocesses, we utilized pseudo-simulated and real
expression datasets. It is natural to ask to what extent the ACSNI
predictions improve on current methods for inferring the component of a
pathway from GEPs. Unfortunately, there are no existing tools that
provide such a granular view of pathway subprocesses. Therefore, we
focused our performance evaluations on reconstruction of simulated
signals and concordance between two cohorts of similar samples.
ACSNI has robust signal reconstruction capabilities with low type I error
rate
In biological systems (i.e., tissues or organs), multiple signaling
pathways are executed concurrently with a certain level of dispersion
(i.e., a measure of pathway activity) across a group of samples. ACSNI
has been designed to extract differences in pathway activity in a
cohort of samples and detect unknown components of the pathway. To
assess how well the algorithm achieves this aim, we first evaluated its
ability to reconstruct a known signal using pseudo-simulated GEPs under
different parameter settings (see [62]experimental procedures). We
simulated random GEPs with specific expected signals and noise signals,
as described in [63]datasets for simulation studies. Then, for each GEP
and the corresponding gene set, we tested the null hypothesis for no
difference in the proportion of the predicted genes overlapping the
expected or random signals. To examine the effects of different ACSNI
parameters, we ran the analysis under different parameter settings of
classification threshold (Alpha: 0.01, 0.05), latent space dimension
(Percentage: 5, 10, 15, 20, 25, 30), feature selection threshold
(Median absolute deviation: 1.2, 1.3, 3), and regularization (yes, no),
repeating the analysis 50 times with three ensemble models each to
obtain more stable results (see [64]experimental procedures). Testing
for the overlap between the predictions and the expected signals
revealed that ACSNI recovered most of the expected signals (average
percentage of overlap = 98%, n = 250, [65]Figure 1B) compared with 3.5%
random overlaps. The concordance of ACSNI predictions with the expected
signals is due to the robustness of the DNN extraction core that
dissects the gene set expression patterns into subprocesses (see
[66]experimental procedures). As expected, ASCNI's ability to recover
the expected signal is negatively associated with the level of
technical noise ([67]Figures 1C and [68]S1A, see [69]experimental
procedures). These results suggest that the quality of the prediction
is dependent on the level of technical noise (examples, errors due to
RNA quality, fragmentation, and amount of input) in the GEP
measurements. On average, more than 30% or 45% of the expected signal
was recoverable at 10% and 5% noise, respectively ([70]Figures 1C and
[71]S1A). Previous studies indicated that in RNA-sequencing (RNA-seq)
data sequenced to sufficient depth, the level of technical noise is
generally lower than 10%, while the rest of the expression variation is
explained by biology and experimental conditions.[72]^19^,[73]^20 Thus,
we believe that ACSNI retains its predictive power when analyzing
current real expression datasets across various experiments and
conditions (noise level <10%). Note that latent space dimension,
classification threshold, and feature selection threshold had no effect
on ACSNI's ability to recover expected signals.
To measure the type I error rate, we divided the number of random
predictions (false positives) by the total number of predictions (total
positives) in the above simulation and assessed the impact of different
parameter settings. We observed a median false discovery rate (FDR) of
0.0045 (range, 0.003–0.005) for default parameter settings across three
feature selection thresholds ([74]Figure 1D; see [75]experimental
procedures). This observation indicates that ACSNI has a low (<1%) rate
of false predictions. Applying a conservative feature selection
threshold (i.e., genes with median absolute deviation greater than 3
across the cohort) before running ACSNI resulted in tighter control of
the FDR (median = 0.0038; range, 0–0.01) ([76]Figure 1D). We next
tested how key parameters, including classification threshold,
dimension of the latent space, and gene set size, affected the overall
type I error rate of ACSNI. Relaxing the classification threshold 0.01
and 0.05 significantly increased the FDR (Wilcoxon test p = 3.4 ×
10^−13, [77]Figure S1B). Although regularization did not affect the FDR
([78]Figure S1C), increasing the dimension of the latent space (a
percentage of the gene set size) significantly increased the FDR, but
using 5% made the FDR unstable ([79]Figure S1E). We found a positive
correlation (Pearson's r = 0.93) between gene set size and FDR
([80]Figure 1E). These associations are expected because larger gene
sets and genes with a low variability are more likely to result in
aberrant predictions. However, even in the case of a larger gene set
(>80, [81]Figure 1E) we expect the FDR to be less than 1% (see
[82]experimental procedures).
We then tested whether the approach was a robust predictor of pathway
components across different cohorts of similar samples. To this end, we
randomly sampled two non-overlapping groups of 290 samples from the 580
samples of the GTEx healthy adult lung dataset to generate a training
cohort and a test cohort. We also randomly shuffled the test cohort to
generate a null model (see [83]experimental procedures). Next, we
individually analyzed these three cohorts (training, test, and null
model cohorts) with ACSNI using 50 gene sets from the MSigDB
database[84]^18 and quantified the similarity of the inferred circuitry
between the training cohort and the test cohort compared with the
similarity between the training cohort and the null model. Predicted
circuitry from the training cohort was significantly more similar to
the test cohort compared with the null model (t test, p = 4.6 × 10^−7,
[85]Figure 1F). We repeated the above analysis using 20 randomly
assigned gene sets and observed no difference between the real cohorts
and the null model, as the random inputs lacked underlying regulatory
mechanisms. Consistently, manually curated gene sets show a
significantly higher similarity between training cohort and test cohort
compared with the simulated gene sets (t test, p = 1.3 × 10^−3,
[86]Figure 1G). The level of similarity between two expression cohorts
is independent of the type of pathway and the size of the gene set
([87]Figure S2A). Combined, these results indicate that ACSNI
predictions are reproducible, with datasets of similar biological and
regulatory information.
ACSNI identified the mTOR crosstalk with KLF6 and EPAS1 signals in ccRCC
To evaluate the performance of ACSNI on real data, we simulated a
condition whereby known context-dependent signaling crosstalk was
expected. Specifically, a recurrent event in clear-cell renal cell
carcinoma (ccRCC) is the hyperactivation of the mTOR signaling that
promotes oncogenic metabolic programs.[88]^21^,[89]^22 The crosstalk
between mTOR, KLF6, and EPAS1 (HIF2A) signaling contributes to this
metabolic program that drives ccRCC progression. KLF6 is a zinc finger
DNA-binding TF activator of mTOR signaling that co-regulates lipid
metabolism, cell growth, and cell fate in ccRCC.[90]^23 mTOR regulates
transcription of EPAS1,[91]^24 which in turn activates a superenhancer
that supports KLF6 expression,[92]^23 creating a complex feedback loop
between mTOR, KLF6 and EPAS1 signaling.
Consistent with the above, we hypothesized that mTOR signaling network
derived from GEPs in ccRCC patients could enrich for both KLF6 and
EPAS1 signals. To test this hypothesis, we collected data of 73 ccRCC
samples with matched normal adjacent tissues from the TCGA ccRCC
project (KIRC, n = 146). We applied ACSNI to the RNA-seq data using a
publicly available curated mTOR signaling gene set from the PID
(n = 67).[93]^17 We derived ten subprocesses of mTOR signaling that
were strongly associated with 1,166 genes ([94]Figure S2B). We next
evaluated the biological processes associated with mTOR signaling in
this context. Using gene ontology analysis, we found a strong
enrichment of catabolic processes and amino acid biosynthesis
ontologies ([95]Figure 2A), consistent with the role of mTOR signaling
in cell metabolisms.[96]^25 To assess whether the ACSNI-predicted mTOR
signaling captured KLF and EPAS1 signal, we first defined KLF6 and
EPAS1 signals as differentially expressed (DE) genes from publicly
available gene perturbation datasets: (1) KLF6 knockout (KLF6-KO, DE =
1,378 genes) and (2) EPAS1 knockout (EPAS1-KO, DE = 303 genes) in the
786 ccRCC cell line.[97]^23^,[98]^26 The inferred network had
significant over-representation of KLF6 (1.47-fold, chi-squared p =
1.763 × 10^−12, [99]Figure 2B) and EPAS1 (2.07-fold, chi-squared p =
1.009 × 10^−3, [100]Figure 2B) signals compared with background
enrichment (i.e., all genes in the expression dataset). Inspecting the
ACSNI subprocesses and their associated genes, we found unique
associations with KLF6 and EPAS1 signals ([101]Figure 2C).
Subprocesses, w3, w7, and w9 were exclusive to KLF6 signal and w0 was
exclusive to EPAS1, while w4, w5, and w2 captured both signals but have
relatively more enrichment of EPAS1 signal ([102]Figure 2C). Three
subprocesses (w8, w6, and w1) appear to capture signals unrelated to
KLF6 and EPAS1. These observations demonstrate that ACSNI models the
complexities of a signaling network as represented by an experimentally
curated gene set.
Figure 2.
[103]Figure 2
[104]Open in a new tab
ACSNI identified pathway components and signaling crosstalk
(A and B) (A) Bar plot of the top 25 significantly enriched biological
processes associated with the predicted 1,166 genes mTOR signaling
components. Bars are ordered from top to bottom according to FDR
values. (B) Association of ACSNI-predicted mTOR signaling genes (TCGA)
with DE genes in KLF6 and EPAS1 knockout in ccRCC cells, divided into
datasets (KLF6: 786 = GEO: [105]GSE115763, EPAS1: 786 = GEO:
[106]GSE115389). Chi-squared tests (Chisq, p) and ACSNI-predicted (P)
and background (B) are indicated on top of the panel. DE and nDE
represent differentially or non-differentially expressed genes at
adjusted p value of <0.05, respectively.
(C) Bar plot comparing enrichment of DE genes from EPAS1-KO (red) and
KLF6-KO (blue) across the different ACSNI-derived subprocesses of mTOR
signaling (TCGA).
(D) (Left) Heatmap of the coefficient of determination (R^2) of the
linear model of ACSNI-derived mTOR activity and disease status (cancer
or normal adjacent tissues) across ten subprocesses. The higher the
R^2, the more significantly related is the subprocess to disease
status. (Right) Expression of mTOR subprocess w8-associated genes in
ccRCC cell lines following the inhibition of mTOR pathway with
everolimus (Eve) compared with vehicle control (Cont) (GEO:
[107]GSE106819). Within the plot differential expressions at FDR < 0.05
(DE) are indicated (upregulated, red; downregulated, blue; unchanged,
white).
(E) Ranked dot plot of the ratio of transcription factor (TF) ChIP
density at the promoter regions (± 1 kb from TSS) of the
ACSNI-predicted ATF2 signaling genes relative to background genes in
artery aorta. DNA-binding domains are highlighted in red.
(F) Bar plot comparing ratio of TF ChIP density at the promoter regions
(±1 kb from TSS) of the ACSNI-predicted ATF2 signaling genes across the
different ACSNI-derived subprocesses from artery aorta.
(G) Gene ontology analysis of biological processes associated with
predicted HOTAIRM1 genes and DE genes in HOTAIRM1.
See also [108]Figures S2–S4.
We next investigated the specificity of ACSNI by comparing our
predictions with KLF6 and EPAS1 signals from non-ccRCC cells lines as
non-specific controls (including endothelial cells [human umbilical
vein endothelial cells—HUVECs] and primary peripheral blood mononuclear
cells [PB-CD34^+]). We found an insignificant over-representation for
DE genes (n = 119) derived from EPAS1-KO in HUVECs (0.91-fold,
chi-squared p = 7.482 × 10^−1, [109]Figure S2C), underlining the
advantage of detecting context-specific signaling crosstalk. We
observed a significant enrichment for DE genes (n = 8,487) derived from
KLF6-KO in PB-CD34^+ (1.2-fold, chi-squared p = 1.56 × 10^−9,
[110]Figure S2C), suggesting that the mTOR-KLF6 axis may be applicable
to multiple tissues. However, these data had an unusually high level of
DE genes (60%, 8,487 out of 14,260), which may be related to the fact
that the experiment was performed in primary cells. To further validate
our predictions, we applied ACSNI to an independent dataset of GEPs of
tumor and normal adjacent tissues from ccRCC patients (n = 22, GEO:
[111]GSE102101)[112]^27 using the same mTOR gene set described above.
We evaluated the robustness of the initial predictions against the new
predictions using a randomization test (see [113]experimental
procedures). We found that 19% (98/516, only genes measured in both
datasets) of the predicted genes among these data overlap with the KIRC
predictions ([114]Figure S3A, left panel), which is significantly
higher than would be expected by random chance (empirical p = 0.0, 95%
confidence interval 23–44) ([115]Figure S3A, right panel). There was no
enrichment of KLF6 and EPAS1 signals in the corresponding null models
([116]Figure S3C; see [117]experimental procedures), confirming that
our predictions and validation were robust. As an additional layer of
performance evaluation, we next compared ACSNI global predictions
against predictions based on pathway activities estimated from GSVA,
PLAGE, ssGSEA, and Z score (see [118]experimental procedures).[119]12,
[120]13, [121]14, [122]15 ACSNI recovered the strongest enrichment of
KLF6 (1.91 mean fold enrichment) and EPAS1 (2.17 mean fold enrichment)
signaling crosstalk with mTOR pathway compared with mean fold
enrichment of 1.71 (KLF6) and 1.44 (EPAS1), 1.17 (KLF6) and 1.16
(EPAS1), 0.77 (KLF6) and 0.55 (EPAS1), and 0.94 (KLF6) and 0.89 (EPAS1)
mean fold enrichment for GSVA, PLAGE, ssGSEA, and Z score, respectively
([123]Figure S3B). These results indicate that our framework can
provide additional context-specific information to aid the
interpretation of GEPs.
ACSNI-derived mTOR subprocesses predict clinical traits in ccRCC
The identification of biologically or clinically essential pathway
subprocesses is a crucial aim of many signaling network analyses. ACSNI
provides additional functions to map the derived network to a target
biological or clinical variables. To demonstrate the utility of this
function, we analyzed the distribution of the mTOR subprocesses across
the tumor and normal adjacent tissue from the TCGA datasets (variable:
disease status). We found that three subprocesses (w8, w6, and w3) were
strongly associated with the disease status ([124]Figure 2D, left
panel). Interestingly, two of these (w8 and w6) were not associated
with KLF6 nor EPAS1 signal ([125]Figure 2C), suggesting that they may
harbor new insight into mTOR-dependent mechanisms of ccRCC development.
To investigate this hypothesis, we first sought to confirm the
regulation of the w8-associated genes (the top disease-associated
subprocesses) by mTOR. To this end, we analyzed published GEP data
(GEO: [126]GSE106819)[127]^28 which utilized the mTOR inhibitor,
everolimus, to characterize the differential gene expression changes
that occur due to mTOR signaling in ccRCC cell lines. Among the 13
genes associated with w8 in ACSNI predictions, 77% (10/13 genes) were
significantly altered by mTOR inhibition ([128]Figure 2D, right panel),
confirming that w8 is within the mTOR signaling network. Of the total
ten w8 genes altered by everolimus, 70% (GINS2, CCDC130, PRR22, CYP4F2,
PGLS, LY9, and BCHE) have been identified as essential ccRCC genes in a
CRISPR genome-wide loss-of-function screen[129]^29 ([130]Figure S2D).
These observations suggest that these genes may co-regulate oncogenic
programs in ccRCC through mTOR signaling.
ACSNI identified ATF2-dependent targets of bZIP transcription factors in
artery aorta
Identification of functional TFs and their regulatory output is
essential for understanding gene-regulatory mechanisms in many cellular
processes.[131]^30 ACSNI can provide information about the
context-specific regulatory output of TFs in a gene set. As an example,
here we focus on cAMP response element-binding proteins (CREBs) and
CREB-like TFs that dominate ATF2 signaling (10 out of 13 TFs). These
genes represent a large family of TFs that bind cAMP-responsive
elements and share a basic leucine zipper DNA-binding (bZIP) domain.
Complex dimerization between these TFs increases the selectivity of
bZIP-DNA interactions and binding diversity, allowing them to regulate
multiple vascular functions.[132]^31 We hypothesized that ATF2 network
circuitry derived from GEPs of artery aortas should enrich targets of
the bZIP TFs.
We analyzed the ATF2 TF signaling gene set from the PID (n =
31)[133]^17 in artery aorta (n = 433) GEPs from the GTEx
project.[134]^6 We produced four subprocesses associated with 683
genes, representing the context-specific activity of bZIP TFs
([135]Figure S2E). Using an independent dataset of human TF chromatin
immunoprecipitation (ChIP)-sequencing measurements from primary human
aortic endothelial cell lines from chip-atlas
([136]https://chip-atlas.org), we evaluated the enrichment of the
available eight TFs (bZIP = 4, others = 4) at the promoter regions (±1
kb from transcription start site [TSS]) of the predicted genes compared
with the background (enrichment = mean peak density of predicted
genes/mean peak density of background genes). ACSNI-predicted genes
were strongly enriched for bZIP transcription factor binding (ratio >
1) but depleted of other families of TFs (ratio < 1) ([137]Figure 2E).
We observed no significant enrichment of bZIP TF for predictions from a
null model derived by random shuffling of the expression matrix
([138]Figure S3E; see [139]experimental procedures), confirming that
our predictions and validation were robust. On average, the predicted
genes had significantly higher binding of JUNB (Wilcoxon test, p =
2.083 × 10^−9), CEBPD (p = 1.278 × 10^−4), NFE2L2 (p = 1.168 × 10^−3),
and JUN (p = 1.041 × 10^−2) compared with the background genes.
Conversely, we observed significantly lower binding for EP300 (p =
7.582 × 10^−32), RELA (p = 7.132 × 10^−10), ERG (p = 1.841 × 10^−5),
and IRF1 (p = 3.148 × 10^−2) compared with the background genes.
Interrogating the ACSNI-derived subprocesses of ATF2 signal and their
associated genes revealed unique associations with different bZIP
dimerization patterns ([140]Figure 2F). Specifically, w0, w1, and w3
were exclusive to CEBPD, JUN, and JUNB, respectively. While w2 enriched
for NFE2L2, JUNB, and CEBPD, with higher JUNB, w4 enriched for all four
bZIP TFs measured in these data ([141]Figure 2F). These observations
suggest that ACSNI can segregate different TF signals into specific
subprocesses of the W matrix by leveraging external knowledge to anchor
RNA expression to TF activities underlying complex cell processes.
To benchmark our results, we analyzed the enrichment of these TF
signals on the genes predicted based on pathway activities estimated by
GSVA, PLAGE, ssGSEA, and Z score (see [142]experimental procedures).
Predictions derived from pathway activity scores generated from these
methods showed high enrichment rank for non-bZIP TFs (EP300, RELA, ERG,
and IRF1) and largely low in signal for the bZIP TFs (JUNB, CEBPD,
NFE2L2, and JUN) ([143]Figure S3D). Compared with ACSNI, the
alternative approach based on these methods showed
non-significant enrichment for all the TFs regardless of the
DNA-binding domain. These results demonstrate how ACSNI predicts
context-specific TF targets and enables discovery of pathway-level
regulatory mechanism and interpretation of expression profiles.
ACSNI annotates the function of HOTAIRM1 using de novo derivation of gene
sets
ACSNI also provides tools to help annotate genes with unknown
functions, such as long non-coding RNAs (lncRNAs). To demonstrate this
utility, we focused on the annotation of the lncRNA HOTAIRM1.
Specifically, HOTAIRM1 is a natural antisense transcript of the HOXA1
gene that plays roles in kidney differentiation and regulation of
HIF1-dependent angiogenic pathways.[144]^32 We hypothesized that the
ACSNI-inferred network of HOTAIRM1 could reveal its kidney-specific
functions.
To this end, we leveraged the GTEx project of GEPs in normal kidney
tissues and investigated the network of HOTAIRM1. Because this case
deals with a single gene, we utilized the ACSNI-derive module to first
generate a de novo gene set (n = 50) before running two iterations of
ACSNI (see [145]experimental procedures). We identified 729 genes that
were strongly associated with the network of HOTAIRM1 in kidney. To
evaluate the functional relevance of the predictions, we initially
analyzed published data (GEO: [146]GSE136604)[147]^32 that utilized
knockdown of HOTAIRM1 to characterize the differential gene expression
changes that occur due to its activity in human kidney cells
(HOTAIRM1-KO). We then cross-referenced the ACSNI-predicted genes to
establish whether the predictions enriched the lncRNA activity compared
with random backgrounds. For the genes measured in both datasets (n =
390), the DE genes identified in the HOTAIRM1-KO data (n = 85) showed a
significant overlap with ACSNI-predicted HOTAIRM1 genes (12/85, 14.11%;
empirical p = 8 × 10^−5, only genes measured in both datasets,
[148]Figure S3F). To assess the biological processes regulated by
HOTAIRM1 in this context, we performed gene ontology analysis on the
ACSNI-predicted genes compared with the ontologies associated with the
DE genes in the HOTAIRM1-KO. The results showed strong concordance in
the enrichment of ontologies involved in vascular functions and
angiogenesis ([149]Figure 2G). We also observed an enrichment of
metabolic processes, suggesting that HOTAIRM1 may play a role in cell
metabolism in the kidney ([150]Figure 2G).
We next evaluated the robustness of the predictions by comparing the
enrichment with predictions from a null model derived by random
shuffling of the expression matrix. We found no significant enrichment
of genes differentially expressed in the HOTAIRM1-KO data
([151]Figure S2F), confirming that our predictions and validation using
the non-shuffled data did not occur by chance. Correlation and
differential gene expression analysis are the two commonly used methods
for identifying functional lncRNA targets from GEPs. Thus, we compared
ACSNI predictions with direct correlation or differential expression
test for pairwise gene interaction analysis on the same input
expression data above (see [152]experimental procedures). ACSNI
predictions recovered the most overlap with HOTAIRM1 signal (14.11%)
compared with zero overlap and 1.19% overlap for differential
expression-based and correlation approaches, respectively
([153]Figure S3G). Consistently, genes predicted by both methods were
not enriched in ontologies involved in vascular functions and
angiogenesis. These results suggest that ACSNI can leverage the
annotation of protein coding genes to provide functional annotation for
poorly annotated genes.
Discussion
Our knowledge of biological pathways and their components is far from
complete. This situation limits the accuracy of the current pathway and
network analysis methods that depend on accurate and detailed pathway
annotations. ACSNI enables the identification of tissue-specific
components of biological pathways from gene expression data. This
approach overcomes the limitations of using composite scores or
representative values, which are unable to fully capture the process or
handle the inherent variability in GEPs within a gene set. By contrast,
ACSNI can capture pathway modules while reducing the impact of
technical noise. This capability was demonstrated in our identification
and validation of mTOR signaling crosstalk with KLF6 and EPAS1 in
ccRCC. ACSNI can also reveal the context-specific output of TFs in a
gene set and identify lncRNA network in expression data. This utility
is especially vital in mechanistic studies where the target is
context-specific gene regulation. By using prior knowledge and
interrogating associated expression variability in GEPs, ACSNI can
identify biological network within a specific tissue and has the
flexibility to apply weights to the prior information.
Using an autoencoder the core of ACSNI decomposes GEP into multiple
pathway subprocesses, which are then used to infer unknown pathway
components. Other dimension-reduction methods, including principal
component analysis (PCA) and non-negative matrix factorization (NMF),
could also theoretically be used to estimate pathway subprocesses from
GEPs. The key limitation of these methods is that the mapping from the
gene expression space to the low-dimensional representation is
restricted to be linear.[154]^33 However, an accurate model of the
activity of biological pathways encoded in GEPs requires a mix of both
linear and non-linear transformations. Our comparative analysis showed
that replacing the autoencoder in step 1 with PCA or NMF resulted in no
enrichment of the expected signal ([155]Figure S4C) or extraction of
non-specific signals ([156]Figure S4D). Therefore, the autoencoder
approach used for step 1 is more capable of extracting biologically
meaningful results compared with simple linear reductions ([157]Figures
S4C and S4D), consistent with previous studies.[158]^34
One potential limitation is a lack of prior knowledge in the form of a
gene set; this can be partially remedied by ACSNI-derive, which enables
de novo generation of gene sets as demonstrated in the annotation of
HOTAIRM1 functions. In addition to the availability of prior
information, ambiguous annotations, sample size, gene set size, and
expression variance can all affect the performance of ACSNI. Currently,
ACSNI depends on RNA expression across a cohort of samples to infer
pathway circuitry and will miss post-translation interactions, since
RNA levels may not correlate with protein output.[159]^35 Despite these
limitations, ACSNI supplements the existing methods to identify pathway
components and extract biological meaning from transcriptomic data and
can be applied to study processes that have little prior knowledge.
Independent validation of the inferred network is an essential step in
the discovery of pathway components and their functional interactions.
Ideally, such analysis should involve independent genetic perturbation
data, reporter assays, or DNA occupancy analyses as demonstrated in our
use cases. However, genetic perturbation or DNA occupancy data may not
always be available. Literature mining can be an alternative approach,
but the results may be confounded by publication bias. Our results show
that this framework effectively infers biological networks from GEP
datasets and enables the discovery of novel functional interactions.
Analysis of publicly available GEP data using ACSNI will lead to new
insights into the molecular basis of a phenotype and its deregulation
in disease.
It is well established that signaling pathways are coordinated by
complex interdependences between chromatin structure, DNA
modifications, RNA abundance and modifications, and protein abundance
and modifications. While ACSNI includes the option to account for one
additional piece of regulatory information as an integer variable, this
is unlikely to model the complete interdependences. We plan to further
develop the ACSNI software by expanding the utilization of other
genomic information.
Conclusion
ACSNI offers the cell and molecular biologist a flexible approach to
infer potential pathway components that can be used to guide functional
analysis in a target signaling network. The ACSNI package provides
command-line functions (ACSNI-run and ACSNI-derive) for predicting the
components of a signaling pathway from GEPs using prior knowledge. It
also provides a function (ACSNI-get) to describe the correlation
between the derived pathway subprocesses and biological traits. The
ACSNI package includes a function (ACSNI-split) to process GEPs for
bootstrap analysis. Users can install the Python package and its
dependencies through the pip package installer (pip install ACSNI).
Experimental procedures
Resource availability
Lead contact
Further information and requests for the implementation and
troubleshooting the ACSNI software should be directed to and will be
fulfilled by the lead contact, Chinedu Anthony Anene
([160]a.anene@qmul.ac.uk).
Materials availability
This study did not generate new materials.
Data and code availability
This paper analyzes existing, publicly available data. The ACSNI Python
package is publicly available at GitHub,
[161]https://github.com/caanene1/ACSNI, and can be installed through
the pip python installer. The code used to process and generate the
figures reported in this paper is available at GitHub,
[162]https://github.com/caanene1/ACSNI. Any additional information
required to reproduce this work is available on the package
documentation at GitHub, [163]https://github.com/caanene1/ACSNI, and
from the lead contact.
Publicly available datasets and software
REAGENT OR RESOURCE SOURCE IDENTIFIER
Deposited data
__________________________________________________________________
TCGA ccRCC dataset Cancer Genome Atlas Research Network, 2013[164]^7
[165]https://portal.gdc.cancer.gov
GTEx expression datasets Lonsdale et al., 2013[166]^6
[167]https://gtexportal.org/home/v8_RNASeQ_v1.1.9
Validation ccRCC dataset Yao et al., 2017[168]^27
[169]https://www.ncbi.nlm.nih.gov/geo
[170]GSE102101
KFL6-ko data in ccRCC cells Syafruddin et al., 2019[171]^23
[172]https://www.ncbi.nlm.nih.gov/geo
[173]GSE115763
EPAS1-ko data in ccRCC cells Zou et al., 2019[174]^36
[175]https://www.ncbi.nlm.nih.gov/geo
[176]GSE115389
KFL6-ko data in blood cells Adelman et al., 2019[177]^26
[178]https://www.ncbi.nlm.nih.gov/geo
[179]GSE121560
EPAS1-ko data in endothelial cells Yoo et al., 2015[180]^37
[181]https://www.ncbi.nlm.nih.gov/geo
[182]GSE62974
mTOR-inhibition data in ccRCC cells Kornakiewicz et al., 2018[183]^28
[184]https://www.ncbi.nlm.nih.gov/geo
[185]GSE106819
HOTAIRM1-ko data in kidney cells Hamilton et al., 2020[186]^32
[187]https://www.ncbi.nlm.nih.gov/geo
[188]GSE136604
TF ChIP-Seq datasets Oki et al., 2018[189]^38
[190]https://chip-atlas.org
Cancer-cell dependency data Tsherniak et al., 2017[191]^29
[192]https://depmap.org/portal
Pathway interaction gene sets Schaefer et al., 2009[193]^17
[194]https://github.com/NCIP/pathway-interaction-database/tree/master/d
ownload
MSigDB v7.2 Liberzon et al., 2011[195]^18
[196]https://www.gsea-msigdb.org/gsea/msigdb
__________________________________________________________________
Software and algorithms
__________________________________________________________________
ASCNI This paper [197]https://github.com/caanene1/ACSNI
Trimmomatic v0.39 Bolger et al., 2014[198]^39
[199]http://www.usadellab.org/cms/?page=trimmomatic
HISAT2 v2.1.0 Pertea et al., 2016[200]^40
[201]http://daehwankimlab.github.io/hisat2
HTSeq v0.11.1 Anders et al., 2015[202]^41
[203]https://htseq.readthedocs.io/en/release_0.11.1
Python v3.8.6 PSF [204]https://www.python.org
R v4.0.3 CRAN [205]https://www.r-project.org
[206]Open in a new tab
Method details
ACSNI algorithm
We introduce ACSNI, an algorithm to infer the components of a
biological pathway from bulk tissue expression profiles and prior
knowledge of the pathway ([207]Figure 1A). Let P denote the activity of
the target pathway in a given biological sample and K (minimum of 4
genes) a curated set of genes associated with P based on current
knowledge of their biological functions (gene set).[208]^42 We assume
that P is organized into several small, highly connected subprocesses
(S[i–n]) that combine hierarchically into larger units.[209]^16 A
natural model is to have
[MATH: P=∑i=1nSi, :MATH]
(Equation 1.1)
where S[i] is the contribution of the i^th subprocess to the pathway P
and n is the total number of subprocesses.
While the hierarchical model is a faithful description of cell
signaling pathways, a practical issue is that the number of
subprocesses (n) is unknown a priori. Our best estimate of n is the
current knowledge of the target biological pathway (P), as specified by
the gene set K. To address this issue, we let n depend on the
cardinality of the gene set |K|, defined as a percentage of |K| and
estimated using a recent algorithm explicitly developed for
autoencoders (described below).[210]^43 Alternatively, n can be
specified by the user based on their knowledge of the target pathway.
For each k ∈ K, we have an estimate of its expression level across a
cohort of N samples indexed by j. We let V[k] be the scaled expression
vector for gene k and V[kj] is the expression of k in the j^th sample;
determined by RNA-seq. Since genes act in multiple pathways and
subprocesses,[211]^44 it is the additive change in expression within
subprocesses that leads to the difference in pathway activity. Thus, we
can model s[n] in the sample cohort as a function of the expression of
k:
[MATH: Si=∑j=1|K<
mo>|kjwj+bj, :MATH]
(Equation 1.2a)
where
[MATH: bj :MATH]
is a constant error term for the j^th sample, w[j] is the influence of
the j^th gene on the subprocess S[i] and k[j] is the expression value
of k gene in the j^th sample. To account for additional prior
knowledge, we include an optional weight adjustment to our model below:
[MATH: Si=∑j=1|K<
mo>|kjwj⋅rj+bj, :MATH]
(Equation 1.2b)
where r denotes the gene-level weights of the new information, such as
TF activity, protein family, or kinase group.
Unsupervised derivation of pathway subprocesses (S[n])
ACSNI starts from the RNA expression values for the K genes and
estimates the activities of the subprocesses across the samples. To
this end, we implement a DNN for estimating S[nj] through a pair of
encoder and decoder layers (autoencoder):
[MATH: Vk¯=d[f(Vk)
], :MATH]
(Equation 1.3)
where f (encoder) is a function of the expression to the pathway (P), d
(decoder) is a function of the pathway to the expression, and
[MATH: V¯k
:MATH]
is the reconstructed version of the expression vector. We train the
network to find a solution to the optimization problem:
[MATH:
min|d,f|<
mover accent="true">Vk¯−d[f(Vk)
]||,
:MATH]
(Equation 1.4)
where ||. || is the l1 and l2 norms that disallow the identity map and
forces the model to learn a sparse representation. The variation across
the samples informs the compression and reconstruction of the
expressions for all k ∈ K. We restrict feature selection to the removal
of uninformative genes that have a median absolute deviation below a
threshold (user-defined value >1, default 2.5). [212]Figure S4A
illustrates the mapping between the biological system, mathematical
model and DNN layers.
To determine the optimal number of dimensions (i.e., the unknown n
subprocesses in [213]Equation 1.1), we implement a recent algorithm
explicitly developed for estimating latent space dimension in
autoencoders.[214]^43 Specifically, we determine the optimal latent
space for each gene set expression profile by an iterative process,
including:
* 1
Run the autoencoder above using 50% of the gene set's cardinality
(50 × |K|/100) and extract the latent structure.
* 2
Apply the Bahadur and Paffenroth algorithm to this latent structure
to estimate the optimal dimension.
* 3
Fix and run ACSNI with the estimated dimension.
In step 1, we initiated the search at 50% of the gene set based on
simulation analysis that showed that although increasing the dimension
of the latent space does not affect the ability to recover the expected
signal ([215]Figure S1D), it comes at the cost of increasing the FDRs
([216]Figure S1E), which we want to minimize. Latent dimension below
50% of the gene set size appear to achieve the optimal balance between
FDR and recovery of expected signals. Since our ultimate aim is to
prioritize biological components without inflating false discovery, a
small dimension is preferred.
Other dimension-reduction methods, including PCA and NMF, have been
applied to gene expression datasets and could theoretically be used to
estimate pathway subprocesses (S[nj]). Therefore, we also implement PCA
and NMF, two linear dimension-reduction methods, into the ACSNI
software, allowing users to cross-compare the results.
Estimation of subprocess-gene interaction scores
The neural network above minimizes the error between the input
expression vector
[MATH: Vk :MATH]
and the reconstructed version (
[MATH: V¯k
:MATH]
). However, it is, not the values of
[MATH: V¯k
:MATH]
that are of interest to us but rather the subprocesses in each sample.
Thus, after training the autoencoder, we extract the learned weights (W
matrix) of the first hidden layer, which represents the reduction of
the gene set expression profiles to latent structures. W has a
dimension of m × n (m = the total number of samples and n = the number
of estimated subprocesses) and is the activity of the subprocesses
defined in [217]Equation 1.2a or [218]1.2b.
For the rest of genes in the transcriptome, the strength of the linear
relationship between their expression and the vector of W above
corresponds to per-gene interaction between individual latent vectors
and the gene. We model this interaction as
[MATH: a=βSi+ε, :MATH]
,(Equation 1.5)
where
[MATH: a :MATH]
is the expression profile of a gene in the rest of the transcriptome, β
is the slope, S[i] is the activity of i^th subprocess, and
[MATH: ε :MATH]
is the intercept, estimated automatically.
Next, for each run of [219]Equation 1.5, we extract the R^2 (here
called interaction score e); this can be interpreted as the coefficient
of determination for the proportion of the variance in that gene's
expression across the samples that is predictable from each vector of
the W matrix, defined as
[MATH: e=1−∑(aj−ajˆ)2
∑(a
j−a¯)2
, :MATH]
(Equation 1.6)
where
[MATH: e :MATH]
is the interaction score,
[MATH: aj :MATH]
is the expression value for sample j,
[MATH: ajˆ :MATH]
is the predicted expression value for sample j and
[MATH: a¯
:MATH]
is the mean expression value.
The [220]Equations 1.5 and [221]1.6 were estimated using the Linear
model function implemented in the sklearn Python package, using default
parameters. We define the matrix of interactions scores (e) with genes
in the row and subprocesses in the column as the (N matrix), which is
used for prediction of unknown pathway components below.
Further, we extract the learned weights of the second hidden layer
(Code), which represents the reduction of the subprocesses to the
pathway. Code has a dimension of m × n (m is the number of estimated
subprocesses and n is the number of estimated higher pathway units) and
is the activity of the pathway defined in [222]Equation 1.1. These data
can be used for diagnostics and dimension plots.
Inference of tissue-specific components of the target pathway
Having estimated the multiple interactions between the subprocesses of
the target pathway and genes in the rest of the transcriptome, we
classify each subprocess independently and combine the predictions into
a network. This process intuitively captures the signal contained in
the gene set and thus represents a robust measure of the
context-specific circuitry of the target pathway. Let e[n] be the
scores for the interaction between the i^th subprocess and the genes in
the rest of the transcriptome (bound 0–1).
[MATH: e :MATH]
has a beta distribution parameterized by α and β ([223]Figure S1F; see
also [224]analysis of probability distribution),
[MATH: α=(1−μσ2<
/mfrac>−1μ)μ2, :MATH]
(Equation 1.7)
[MATH: β=α(1μ−1),
:MATH]
(Equation 1.8)
where
[MATH: μ :MATH]
is the mean interaction across the rest of the transcriptome with
variance
[MATH: σ2 :MATH]
.
For a future experiment, it is the regularized incomplete beta function
I[j](
[MATH: α, β :MATH]
) that defines the impact of a subprocess on the j gene and in turn the
pathway network. However, we find a cumulative distribution function to
be an impractical network measure and therefore discretize it by
defining the interactions to be I[j](
[MATH: α, β :MATH]
) < 0.01 (P matrix). The cutoff can be interpreted as the probability
threshold to reject the null hypothesis that such an extreme value is
expected less than 1% of the time. Simulation analysis showed that the
0.01 cutoff reduces the FDR without affecting the ability to recover
the expected signal compared with the traditional 0.05 for rejecting a
null hypothesis ([225]Figure S1B). The user can also adjust the cutoff
depending on their aims.
Next, for each subprocess ([226]Equation 1.2a or [227]1.2b), we extract
the genes that reach the above threshold as the individual subprocess
network. Finally, we represent our global network for the pathway
([228]Equation 1.1) as the union of all the genes predicted across the
subprocesses, thus a global network across all subprocesses. Since the
neural network optimization process is stochastic, we implement an
ensemble learning and combine predictions using summary statistics.
Here, we count the number of times a gene is predicted across the
ensemble and retain genes predicted in at least two models. We suggest
using at least an ensemble of three models to increase the stability of
the predictions.
Mapping subprocess activity to external variable
Here, we define a subprocess significance measure as a function SS that
assigns a non-negative value (0–1) to each subprocess; the higher is
SS[w], the more biologically significant is subprocess w for the given
variable. We derive SS using logistic regression for categorical
variable or simple linear regression for numerical variables,
implemented as the ACSNI-get module (see code at
[229]https://github.com/caanene1/ACSNI).
Architecture and hyperparameters
The neural network used in ACSNI consists of two fully connected layers
and a mirror image of this neural network as a decoder ([230]Figure S4A
shows the map between the biology, equations, and the network layers).
We used x neurons in the first layer, and (x/2) neurons in the second
layer of the network. x is automatically determined, as described in
[231]estimation of subprocess-gene interaction scores, or is user
defined based on prior knowledge. During the subprocess-gene
interaction analysis, we extracted the learned weights of the first
layer (autoencoder) or the loading of the linear dimension-reduction
controls (PCA and NMF controls) and used these to predict gene
expression values. We used RMSprop optimizer with learning rate of 1 ×
10^−3 for training. Activities of the neurons were normalized using
layer normalization that calculates the normalization statistics for
all units in the same layer. The RELU function, defined as RELU(x) =
max(0, x), was used as a non-linear activation in the two layers. In
the mirror image, the Sigmoid function, defined as Sigmoid(x) = 1/1 +
e^−x, was used as the activation function. We train the model for 3,000
epochs and configure the checkpoint to save at the end of every epoch,
if it is the best model seen (i.e., the lowest mean squared error in
the ACSNI objective function [1.4]). Regularizers l1 and l2 in the
ACSNI objective function were set to 1 × 10^−16 and 1 × 10^−9,
respectively. ACSNI is implemented in Python 3 and relies on the
tensorflow, sklearn, numpy, and pandas packages.
Requirements
The validity of the approach is related to the quality of the starting
gene set; the higher the quality of the sources that link the genes in
a pathway, the better the subprocess estimates. The changeable and
partially arbitrary nature of gene annotations should be considered
before treating gene sets as units of biological processes. It is a
further requirement that the sample cohort should minimally contain 50
samples and come from the same type of tissue.
De novo generation of gene sets (ACSNI-derive)
The requirement for a pre-annotated gene set is a challenge because
many biological processes remain completely unknown. For example, few
gene sets are available for the pathways of non-coding RNAs. To address
this problem, we develop the ACSNI-derive tool for de novo creation of
a gene set given a single gene. First, the expression of the gene of
interest is correlated with the expression of other genes (this can
also be set for a specific biotype only, for example, lncRNA,
microRNAs, or protein coding). The topmost correlated targets are then
selected to construct a search space and negative controls (top 50 with
r > 0.6 and r < −0.6). We search this space using multiple iterations
of ACSNI and extract genes predicted in more than 60% of the iterations
in the presence of the target as the functional gene set (see code at
[232]https://github.com/caanene1/ACSNI).
Null models
For the evaluation of randomness in the ACSNI predictions, we randomly
shuffled the gene expression values of each sample. This process
generates random expression profiles without altering the samples'
library sizes. We then apply the ACSNI model as described before under
the assumption that the predictions must be false positives,
originating from an uninformative pattern extraction in the W matrix
and interaction prediction (P matrix). This process is included in the
ACSNI software and generated automatically.
Datasets for simulation studies
To evaluate how well ACSNI reconstructs a known signal and its FDR, we
created pseudo-simulated RNA-seq data. We did not use purely synthetic
data because we wanted to ensure that our simulation reflects the
expected results from real RNA-seq datasets. Specifically, we extracted
the RNA expression values from the GTEx project (here called expression
data) and 50 gene sets from the MSigDB database.[233]^18 The expression
data have genes in rows and samples in columns (here called rows and
columns, respectively). The gene sets are sets of gene names (we refer
to their expression profiles as gene set signal). Next, we generated a
pair of 50 simulated expression data and gene set in three independent
data transformations:
* 1
For a given gene set, split the expression data into two groups:
(1) expression of genes in the gene set (data 1) and (2) expression
of the rest of the genes (data 2).
* 2
.Individually shuffle the column values of data 2, such that the
row values are randomly assigned without altering the original
column sums (this is the random signal).
* 3
For each row in data 1, individually reverse the ordering of the
values (right to left), whereby the last column gets the expression
in the first column and the first column gets the expression in the
last column (this is the expected signal). This signal is expected
because ACSNI is invariant to the direction and maintains the order
of the columns.
* 4
Create ten versions of data 1 with different log-normal noise
levels (5%, 6%, 7%, 8%, 9%, 10%, 20%, 30%, 40%, and 50%) randomly
added to each value (this is the set of expected-noise signals).
This signal can be interpreted as the technical noise caused by
RNA-seq and alignment errors. Each of the ten versions has one
level of noise and independent of the other.
* 5
Finally, for the given gene set merge the transformed datasets from
steps 2 to 5 with data 1 to form one expression matrix (this is the
pseudo-simulated dataset with known signals). Output the row names
of data 1 as the new gene set, creating expression and gene set
pairs.
Note that these datasets were analyzed with ASCNI as pairs because each
expected signal is specific to a given gene set. We provide details of
the ACSNI parameters used to analyze the simulated datasets in the
[234]results section.
Quantification and statistical analysis
Comparison of methods for estimating pathway activity
Other methods are available for estimating sample-wise pathway activity
scores that can be used to infer pathway components comparable with
global ACSNI predictions. To benchmark the global predictions, we
compared performance on the validation datasets (genetic perturbations
or DNA occupancy data) with four alternative approaches: GSVA, PLAGE,
ssGSEA, and Z score.[235]12, [236]13, [237]14, [238]15 These methods
estimate a single activity score for each gene set and individual
sample. In the presented mTOR and ATF2 examples, we used the version of
GSVA, PLAGE, ssGSEA, and Z score implemented in the GSVA R package to
estimate pathway activity for each sample. We then correlated the
pathway activities from each method with the expression of each gene
across the samples. The gene-wise correlation coefficients were
classified at >0.4 and < −0.4 to predict pathway genes (see code at
[239]https://github.com/caanene1/ACSNI). We did not apply the ACSNI
custom classification method to correlation coefficients because our
approach required a beta distribution, and the correlation coefficients
have a different distribution ([240]Figure S4B).
For the single gene example (HOTAIRM1 annotation using ACSNI-derive
module), we compared the ACSNI prediction with the pairwise correlation
analysis and differential expression analysis between discretized
expression groups (HOTAIRM1-high: >median versus HOTAIRM1-low:
0.6 and <−0.6 or two-tailed adjusted p values of <0.05
(see code at [241]https://github.com/caanene1/ACSNI).
We estimated the subprocesses activity encoded in the gene set
expression profiles by PCA and NMF dimension reduction to compare
ACSNI-step 1 with linear dimension reduction. We used the default PCA
and NMF functions implemented in the sklearn Python package. The
optimal number of dimensions was determined by the number of components
required to explain 95% variation in the expression profiles of the
gene set. The loading for each component was extracted (i.e.,
subprocess activity per sample) and used in the rest of the ACSNI
calculations.
For all methods, the input data were either RSEM or transcripts per
million (TPM), which are the preferred data transformations and are
also used in ACSNI.
RNA-seq analysis
For GEO RNA-seq datasets, the raw reads were filtered to remove the
adaptors and the low-quality reads (Q < 20) using Trimmomatic.[242]^39
Filtered reads were aligned to the human reference genome GRCh38/hg38
assembly using HISAT2 (v2.1.0) in default settings.[243]^40 The counts
in different genomic features were then generated using HTSeq
(v0.11.1)[244]^41 on human GRCh38 reference annotation (GENCODE Release
32). The expression levels were normalized by TPM. Differential
expression analyses between two groups were performed using the limma R
package. The DE genes were defined at adjusted p value of <0.05.
Gene ontology analysis
The R ClusterProfiler and the human Bioconductor annotation database
(org.Hs.eg.db) were used to investigate the biological processes
associated with the predicted components of the studied pathway.
Ontologies were filtered for redundancies by semantic similarity
analysis[245]^45 and considered significant at q value <0.01 or
otherwise stated in the figure legends.
Randomization test
We utilized a non-parametric randomization test to evaluate the
significance of the overlap between the genes predicted by ACSNI and
genes in a validation signal, or between two ACSNI predictions. For the
comparator variable, we randomly reassigned the labels and calculated
the overlap statistic with the target labels. We repeated this process
10,000 times and counted the number of times the overlap statistic was
greater than the observed statistic. We then estimated the empirical
probability (empirical p) as the count divided by the total number of
randomizations (i.e., 10,000). The R implementation of the
randomization test analysis is available at
[246]https://github.com/caanene1/ACSNI.
Analysis of probability distribution
The R fitdistrplus and logspline packages were used to investigate the
interaction scores' distribution ([247]Equation 1.6). We computed
descriptive parameters of the score vector and visualized the
skewness-kurtosis plot to assess the possible distribution. We then
fitted the set of possible distributions to the scores and compared
their density, quantiles, and probabilities with theoretical values. We
selected the best distribution for the data based on the lowest Akaike
information criterion (AIC) across all possible distributions (i.e., we
used distribution with the lowest AIC). The R script for distribution
analysis is available at [248]https://github.com/caanene1/ACSNI.
Acknowledgments