Abstract
With the increasing availability and dropping cost of high-throughput
technology in recent years, many-omics datasets have accumulated in the
public domain. Combining multiple transcriptomic studies on related
hypothesis via meta-analysis can improve statistical power and
reproducibility over single studies. For differential expression (DE)
analysis, biomarker categorization by DE pattern across studies is a
natural but critical task following biomarker detection to help explain
between study heterogeneity and classify biomarkers into categories
with potentially related functionality. In this paper, we propose a
novel meta-analysis method to categorize biomarkers by simultaneously
considering the concordant pattern and the biological and statistical
significance across studies. Biomarkers with the same DE pattern can be
analyzed together in downstream pathway enrichment analysis. In the
presence of different types of transcripts (e.g., mRNA, miRNA, and
lncRNA, etc.), integrative analysis including miRNA/lncRNA target
enrichment analysis and miRNA-mRNA and lncRNA-mRNA causal regulatory
network analysis can be conducted jointly on all the transcripts of the
same category. We applied our method to two Pan-cancer transcriptomic
study examples with single or multiple types of transcripts available.
Targeted downstream analysis identified categories of biomarkers with
unique functionality and regulatory relationships that motivate new
hypothesis in Pan-cancer analysis.
Keywords: biomarker categorization, differential expression,
meta-analysis, pan-cancer, transcriptomics
Introduction
The revolutionary advancement of high-throughput technology in recent
years has generated large amounts of omics data of various kinds (e.g.,
genetics variants, gene expression and DNA methylation, etc.), which
improves our understanding of human disease and enables the development
of more effective therapies in personalized medicine ([39]Richardson et
al., 2016). As more studies are conducted on a related hypothesis,
meta-analysis, by combining evidence from multiple studies, has become
a popular choice in genomic research to improve upon the power,
accuracy, and reproducibility of individual studies ([40]Ramasamy et
al., 2008; [41]Begum et al., 2012; [42]Tseng et al., 2012). One of the
main purposes of transcriptomics studies is to identify genes or RNAs
that express differently between two or more conditions (e.g., diseased
patients vs. healthy controls), also known as differential expression
(DE) analysis or candidate biomarker detection. Many meta-analysis
methods have been developed or applied to DE analysis, including
combining p-values ([43]Fisher, 1992) or effect sizes ([44]Choi et al.,
2003) and rank-based approaches ([45]Hong et al., 2006). One may refer
to [46]Tseng et al. (2012) for an overview of the major meta-analysis
methods in transcriptomic studies and [47]Ma et al. (2019) for an
overview of available software tools. Yet, a majority of conventional
meta-analysis methods only generate a list of differentially expressed
genes with strong aggregated evidence without further investigating in
what studies are the genes differentially expressed.
Study or population heterogeneity always exists and has been critical
to biomarker detection ([48]Di Camillo et al., 2012). For example, The
Cancer Genome Atlas (TCGA) consortium completed a Pan-Cancer Atlas of
multi-platform molecular profiles spanning 33 cancer types in an effort
to provide insights into the commonalities and differences across tumor
lineages ([49]Weinstein et al., 2013; [50]Hoadley et al., 2018). When
meta-analysis is performed on Pan-cancer transcriptomic studies, we
expect to see both DE genes common in all tumor types as well as genes
differentially expressed in some tumor types but not others. Biomarker
categorization according to their DE patterns across studies is
demanding in genomic studies for three reasons. First, biomarkers that
share unique cross-study DE patterns are potentially involved in
related functions ([51]Berger et al., 2018). Such unique categories of
genes with similar function can be used to generate new biological
hypotheses. Second, biomarker categorization can make high dimensional
genomic data more tractable. For example, in cancer transcriptomic
studies, which frequently detect thousands of DE genes, downstream
analysis methods such as pathway enrichment analysis or network
analysis cannot be applied directly. By partitioning the original large
set of DE genes into smaller subsets, biomarker categorization
facilitates more focused downstream analysis. Third, RNA sequencing
(RNA-seq) technology has led to an explosion of transcriptomic studies
profiling both coding (i.e., mRNA) and noncoding RNAs (i.e., miRNA,
rRNA, lncRNA, etc.) ([52]Di Bella et al., 2020). Joint analysis of
different RNA types with the same cross-study DE patterns can improve
understanding of their regulatory relationships, which may lead to
inferences about the underlying mechanisms of complex human diseases
like cancer.
[53]Li and Tseng (2011) first proposed an adaptively weighted Fisher
(AW-Fisher) method for biomarker categorization that assigns a binary
weight of 0 or 1 to each study and searches for the pattern of weights
that minimizes the aggregate statistics for each gene. Though the
method incorporates statistical significance by combining two-sided
p-values across studies, it does not take into account the direction of
regulation (e.g., up-regulated or down-regulated). Other methods
incorporate biomarker categorization within the Bayesian framework and
combine one-sided p-values or Bayesian posterior probabilities ([54]Ma
et al., 2017; [55]Huo et al., 2019) but not the magnitudes of effect
sizes. In practice, biological significance (i.e., large effect size)
and statistical significance (i.e., small p-value) do not always occur
in tandem (depending on sample size and variance) though they are
equally important in interpreting study results ([56]Sullivan and
Feinn, 2012; [57]Solla et al., 2018).
In this paper, we propose a novel meta-analysis method to detect and
categorize biomarkers by simultaneously considering concordant pattern
(i.e., direction of regulation), biological and statistical
significance across studies. In addition, we develop a permutation test
to assess the uncertainty of the proposed statistics and to control the
false discovery rate (FDR). When only coding genes are included, after
categorization we perform downstream pathway enrichment analysis with
topological information on each category of genes for more biological
insights ([58]Figure 1A). In the presence of diverse RNAs, we jointly
analyze all RNA species in the same category using miRNA/lncRNA target
enrichment analysis and lncRNA-mRNA and miRNA-mRNA causal regulatory
network analysis ([59]Figure 1B). We show by simulation that our method
detects both concordant and discordant biomarkers and assigns the
correct weights. We apply our method to two Pan-cancer transcriptomic
data examples: (1) Pan Gynecologic cancer (Pan-Gyn) data with coding
genes only; (2) Pan Kidney cancer (Pan-Kidney) data that include mRNA,
miRNA as well as lncRNA. The identified biomarker categories show
unique functionality and informative regulatory relationships and could
suggest new hypotheses about mechanisms underlying exclusive and shared
features of different cancer types.
FIGURE 1.
[60]FIGURE 1
[61]Open in a new tab
Conceptual framework of our method. (A) The scenario with mRNA (or
coding genes) only. The heatmap shows the gene expression of all
samples from three studies. Rows refer to genes sorted by the specified
weight category, columns refer to samples, and solid white lines are
used to separate different conditions (control vs. case). Colors of the
cells correspond to scaled expression level. The green/red indicates
lower/higher expression. Pathway enrichment analysis is applied to
genes belonging to the same weight category with topological
information to visualize the cross-study DE patterns at the molecular
level. (B) The scenario with diverse RNA species (e.g., mRNA, miRNA,
and lncRNA). The three heatmaps show the expression of different types
of transcripts of all samples from three studies, sorted by weight
category. In the presence of multiple types of RNA species, we will
perform integrative analysis on all the transcripts belonging to the
same weight category together. Possible downstream analysis includes
miRNA/lncRNA target enrichment analysis and lncRNA-mRNA and miRNA-mRNA
causal regulatory network analysis.
Materials and Methods
Popular Meta-Analysis Methods
[62]Tseng et al. (2012) reviewed the major types of meta-analysis
methods for DE gene detection in microarrays and classified the methods
into four main classes: combining p-values, combining effect sizes,
combining ranks, and direct merging. We will discuss selected
meta-analysis methods from the first two classes that are relevant to
our proposed method.
Combining P-Values
Fisher’s method ([63]Fisher, 1992)
The conventional Fisher’s method combines log transformed p-value from
each study with the statistic
[MATH:
TFisher=-2∑k=1Klog(pk), :MATH]
which follows a χ^2 distribution with 2K degrees of freedom under the
null hypothesis (i.e., genes not differentially expressed in all
studies), where K is the number of studies and p[k] is the p-value of
study k, 1 ≤ k ≤ K.
Stouffer’s method ([64]Stouffer, 1949)
The Stouffer’s method proposes inverse normal transformation of p-value
with the statistic
[MATH:
TStouffer
∑k=1KΦ-1(1-pk<
/mi>)/K, :MATH]
which follows a standard normal distribution under the null, where
Φ^−1(x) is the inverse cumulative distribution function of the standard
normal distribution.
Adaptively weighted fisher’s method (AW-Fisher) ([65]Li and Tseng, 2011)
Fisher’s method does not differentiate DE in a single study or multiple
studies as long as their aggregate contribution to the final statistics
remains the same. To overcome this and better explain the between study
heterogeneity, [66]Li and Tseng (2011) introduced an AW-Fisher’s method
as a modification of the original Fisher’s method. The AW-Fisher method
considers
[MATH: U(w→)=-2∑k=1Kwklog(pk)
:MATH]
for each gene, where
[MATH: w→=(w1,…,wK<
mo>) :MATH]
and each w[k] is a binary weight of 0 or 1 assigned to each study k.
Denote by
[MATH: p(U(w→)) :MATH]
the p-value when the weight
[MATH: w→ :MATH]
is given, the AW-Fisher statistic is defined as:
[MATH:
TAW=minw→p
mrow>(U(w→))
:MATH]
, where the optimal weight
[MATH: (w1^,<
/mo>…,wK^) :MATH]
that minimizes the p-value indicates the subset of studies that
contribute to the aggregate statistics and naturally categorizes the
biomarkers. There is no closed-form distribution for AW-Fisher
statistics under the null, so permutation tests and importance sampling
is used to obtain the p-value and control the FDR.
Combining Effect Size
Fixed effect model (FEM) and random effect model (REM) ([67]Choi et al.,
2003)
Fixed effect model (FEM) combines effect sizes across all studies for
each gene using a simple liner model:
[MATH: Tk=μ+εk,εk∼N<
mrow>(0,sk2) :MATH]
, where μ is the overall mean and the within-study variance
[MATH: sk2 :MATH]
represents the sampling error conditioned on study k. The combined
point estimate of μ is a weighted average of study-specific effect
sizes, where weights are equal to the inverse of
[MATH: sk2 :MATH]
. FEM will prioritize concordant genes with the same directionality
across all studies.
When strong between studies heterogeneity exists and the underlying
population effect size is assumed to be unequal across studies, an REM
is given hierarchically as
[MATH:
Tk=θk+εk,εk∼N<
mrow>(0,sk2
mn>);θk=μ+δk,δk∼N<
mrow>(0,τ2),<
/mo> :MATH]
where between-study variance τ^2 represents the additional source of
variability between studies. A homogeneity test can be performed to
test whether τ^2 is zero or not, and determine the appropriateness of
FEM or REM. Like FEM, REM also prioritizes concordant genes but with
more flexibility across studies. Neither of FEM nor REM produces
biomarker categorization results.
Remarks
P-value combination methods are powerful for detecting genes that have
non-zero effects in at least one study (HS[B] alternative hypothesis
setting as in [68]Chang et al. (2013) without considering the
magnitudes and directionality of effects across studies. Thus, p-value
methods cannot distinguish concordant genes (i.e., upregulated or
downregulated in all studies) from discordant genes (i.e., upregulated
in some studies but downregulated in others). In contrast, effect size
combination methods take directionality into account but favor only
concordant genes. Even so, discordant genes can still be of interest
in, for example Pan-cancer analysis, to understand between tumor
heterogeneity. We, therefore, propose a new meta-analysis method that
incorporates both p-value and effect size combination methods, and
considers concordant pattern as well as biological and statistical
significance simultaneously to assist biomarker detection and
categorization. Here we will introduce our method namely BCMC
(Biomarker Categorization in Meta-analysis by Concordance).
New Meta-Analysis Method for Biomarker Detection and Categorization
Suppose there are K transcriptomic studies, each study k (1 ≤ k ≤ K)
measures the gene expression of n[k] samples and G genes. We use gene
expression as example to introduce our method though the method is
ready to analyze other types of transcripts such as miRNA and lncRNA.
Our objective in meta-analysis is to detect candidate genes
differentially expressed between the case (e.g., patients diagnosed
with disease) and control (e.g., healthy subjects) group in multiple
studies and categorize the detected genes by their DE patterns across
studies. We first perform DE analysis using popular methods such as
limma ([69]Ritchie et al., 2015) for microarray or DESeq2 ([70]Love et
al., 2014) for RNA-seq in each study and obtain the summary statistics
including effect size estimates (log2 fold change or LFC[gk]) and
p-values (p[gk]) for each gene g (1 ≤ g ≤ G) in each study k. Effect
sizes and p-values represent biological and statistical significance,
respectively, and can be treated as DE evidence for single studies. The
smaller the p-value and the larger the magnitude of effect size, the
more likely a gene will be a DE gene in the study. In meta-analysis,
concordance (i.e., a gene having the same sign of effect size in
different studies) is regarded as additional piece of DE evidence. We
define gth gene as being up-regulated in kth study when LFC[gk] > 0
(i.e., having higher expression in case group) and being down-regulated
when LFC[gk] < 0 (i.e., having higher expression in control group).
When integrating multiple transcriptomic studies, DE genes may be
altered in study-specific patterns. For example, some genes are
differentially expressed in all studies while others are only
differentially expressed in specific subset of studies. Meta-analysis
methods also have different groups of targeted biomarkers as reflected
by different statistical hypothesis settings. The null hypothesis for
each gene in meta-analysis is commonly defined as: H[0]: θ[g1] = ⋯ =
θ[gK] = 0, where θ[gk] represents the true effect of gene g in study k.
Depending on the types of targeted biomarkers, three alternative
hypotheses have been proposed in the meta-analysis literature
([71]Birnbaum, 1954; [72]Tseng et al., 2012; [73]Song and Tseng, 2014).
The first setting (HS[A]) aims to detect DE genes that have non-zero
effect in all studies, i.e., θ[gk] ≠ 0 for all k. The second setting
(HS[B]) aims to detect DE genes that have non-zero effect in at least
one study, i.e., θ[gk] ≠ 0 for some k. The third setting (HS[r]) aims
to detect DE genes that have non-zero effect in at least r studies,
i.e.,
[MATH: ∑k=1KI{θgk≠
mo>0}≥r :MATH]
. As we show next, our method generally follows HS[r] setting with
specifically r = 2 (i.e., we detect DE genes that have non-zero effect
in at least two studies).
To detect DE genes and categorize them by cross-study DE patterns, we
propose the following two aggregate statistics for each gene that
combines DE evidence across up-regulated studies or down-regulated
studies, respectively:
[MATH: Tg(w→g+)+=∑LF
Cgk>0<
/mrow>;LFC<
/mi>gk′>0;k≠k′(wgk+wgk′+LFCgkLFC
gk′|log10<
/msub>pgk+log10pgk′|)∑kwgk+ :MATH]
[MATH: Tg(w→g-)-=∑LF
Cgk<0<
/mrow>;LFC<
/mi>gk′<0;k≠k′(wgk
-wgk′-LFC
gkLFCgk′|log
10pgk
+log10<
/msub>pgk′<
/mrow>|)∑wgk
-k, :MATH]
where
[MATH:
wgk+
:MATH]
and
[MATH:
wgk-
:MATH]
are binary weights of 0 or 1 assigned to the kth study for gth gene,
indicating whether a study is selected for inclusion in aggregate
statistics or not, +/− indicate upregulation or downregulation part,
[MATH: w→g+=(wg1+,…,wgK
+) :MATH]
and
[MATH: w→g-=(wg1-,…,wgK
-) :MATH]
. LFC[gk] is the log[2] fold change and p[gk] the corresponding p-value
for gene g in study k obtained from single study DE analysis.
For gth gene,
[MATH: Tg(w→g+)+ :MATH]
aggregates the information of single study summary statistics
(including both p-value and effect size) over up-regulated studies
(i.e., those studies with LFC[gk] > 0), while
[MATH: Tg(w→g-)- :MATH]
aggregates that over down-regulated studies (i.e., those studies with
LFC[gk] < 0). The binary weights are used to indicate what studies to
include to the aggregate statistics and the optimal weights that
maximize the statistics will be searched for each gene. In the proposed
aggregate statistics, we simultaneously account for concordant patterns
(whereLFC[gk] and LFC[gk’] have the same sign), biological significance
(estimated as the product of LFC[gk]) and statistical significance
[estimated as the sum of log[10](p[gk])]. This will encourage combining
studies with the same directionality to find the best evidence for DE,
which is consistent with the purpose of meta-analysis to identify more
reproducible genes in multiple studies. Similar statistics have been
proposed for concordant and discordant analysis of orthologous genes
between a pair of species ([74]Domaszewska et al., 2017). From the
formula, we can see that the proposed statistic is essentially a
weighted average of all study pairs with effect sizes in the same
direction. A weighted average of all studies instead of study pairs is
an alternative approach but it tends to exclude studies with moderate
effect sizes or p-values (see a toy example in [75]Supplementary Table
1).
By default, we assume
[MATH:
wgk+=0 :MATH]
for studies with LFC[gk] < 0 and
[MATH:
wgk-=0forLF
Cgk>0<
/mrow> :MATH]
to avoid conflict between the two statistics. When no studies are
up-regulated or down-regulated for a particular gene, we suppress the
corresponding
[MATH: Tg(w→g+)+ :MATH]
or
[MATH: Tg(w→g-)- :MATH]
to zero and assign zero weights. The statistics aggregates over study
pairs so we need to choose at least two studies to make it meaningful.
When only one study is up-regulated or down-regulated, we also suppress
the corresponding
[MATH: Tg(w→g+)+ :MATH]
or
[MATH: Tg(w→g-)- :MATH]
to zero.
We then search for the optimal weights to identify the subset of
studies that maximize each of the two aggregate statistics. Such
optimal weights describe the DE patterns of each gene across studies
and provide natural categorization of all genes with potential
biological interpretation. The corresponding maximum statistics are
defined as:
[MATH:
Rg+
=maxw→g+∈WTg(w→g+)+
;Rg-=<
/mo>maxw→g-∈WTg(w→g-)-
, :MATH]
where W is the pre-defined searching space of weights with
aforementioned restrictions. The resulting optimal weights are denoted
as
[MATH: w→g
+* :MATH]
and
[MATH: w→g
-* :MATH]
. The biomarkers are then categorized according to the distribution of
optimal weights among studies by merging the information of
[MATH:
wg+*
:MATH]
and
[MATH:
wg-*
:MATH]
, i.e., the final weights
[MATH: w→g*=1→∘w→g
+*+-1→∘w→g
-* :MATH]
For example, concordantly up-regulated genes with
[MATH: w→g
+*=(0,0,1,1,1) :MATH]
and
[MATH: w→g
-* :MATH]
= (0,0,0,0,0) will be in one category [
[MATH: w→g*=(0,0,1,1,1) :MATH]
], while concordantly down-regulated genes with
[MATH: w→g
+*=(0,0,0,0,0) :MATH]
and
[MATH: w→g
-* :MATH]
= (0,0,1,1,1) will be in the other category [
[MATH: w→g*=(0,0,-1,-1,-1) :MATH]
]. Note that the proposed statistics can describe both up-regulated and
down-regulated patterns in the same gene, thus also allowing the
detection of discordant genes. In cases both patterns exist and we want
to find a dominant pattern in the discordant gene, we can further
define R[g] = max (
[MATH: Rg+,Rg-<
/mrow> :MATH]
) and use the corresponding
[MATH: w→g
+* :MATH]
or
[MATH: w→g
-* :MATH]
for biomarker categorization.
To assess the uncertainty of
[MATH: Rg+ :MATH]
and
[MATH: Rg- :MATH]
and determine DE in meta-analysis, we develop a permutation-based test
to calculate the p-value and FDR adjusted p-value (also known as
q-value) of the statistics. We permute group labels (i.e., case or
control group) in each study B times and calculate the maximum
statistics in each permuted dataset. For each gene, we obtain two
p-values corresponding to
[MATH: Rg+ :MATH]
and
[MATH: Rg- :MATH]
, respectively:
[MATH: pg(Rg+)+=∑b=1B
∑g′=1
GI{R
g′+(b)≥R
g+}+1B*G+
1; :MATH]
[MATH: pg(Rg-)-=∑b=1B
∑g′=1
GI{R
g′-(b)≥R
g-}+1B*G+
1, :MATH]
where
[MATH: Rg′+(b) :MATH]
and
[MATH: Rg′-(b) :MATH]
are the maximum statistics for gth gene in bth (1 ≤ b ≤ B) permutation.
The value of one is added to both numerator and denominator to avoid
zero p-values. After p-values are generated, we further estimate the
proportion of null genes π[0] as:
[MATH: π^0+=
∑g=1G
I{pg<
mo
stretchy="false">(Rg+)+ϵ
A}G*
ℓ(A);π^0-=
∑g=1GI{pg<
mo
stretchy="false">(Rg-)-ϵ
A}G*
ℓ(A),
mrow> :MATH]
normally we choose A = [0.5,1] and ℓ(A) = 0.5 to estimate the null
proportion, following the guidance in the previous methods and the
literature of FDR ([76]Storey, 2002; [77]Storey and Tibshirani, 2003;
[78]Li and Tseng, 2011). In most cases, the density of p-values beyond
0.5 is fairly flat, implying most null p-values are located in this
region. In practice, depending on the problem, other common choices of
A = [0.05,1] or A = [0.025,1] can also be applied. The optimal A can be
empirically determined by minimizing some loss function, we do not
discuss further here and refer readers to [79]Storey (2002), [80]Storey
and Tibshirani (2003) for more details.
Then, q-values can be calculated as
[MATH: qg(Rg+)+=π^0+<
mo>∑b=1B∑g′=1
GI{R
g′+(
b)≥Rg+}+1B*∑
mo>g′=1
GI{R
g′+≥
mo>Rg+
}+1,
mrow> :MATH]
[MATH: qg(Rg-)-=π^0-<
mo>∑b=1B∑g′=1
GI{R
g′-(
b)≥Rg-}+1B*∑
mo>g′=1
GI{R
g′-≥
mo>Rg-
}+1 :MATH]
Likewise, p-value and q-value of the dominant pattern statistics R[g]
(i.e., p[g(R[g])] and q[g(R[g])]) can be obtained in the same way. In
real data application, we determine DE in meta-analysis using the
permuted p-value or q-value for the dominant pattern. Note that
p-values and q-values of a zero
[MATH:
Rg+
:MATH]
or
[MATH:
Rg-
:MATH]
are equal to one.
Downstream Analysis on Each Identified Categories of Biomarkers
Each transcriptomic study was carefully assessed for inclusion to
meta-analysis using objective criteria or systematic quality control
methods ([81]Kang et al., 2012). When only expression of mRNA data is
available for the K selected transcriptomic studies, we applied our
meta-analysis and identified multiple categories of mRNAs at certain
BCMC p-value or q-value cutoffs, each with a unique DE pattern across
the studies. DE analysis is useful to narrow down targets but focusing
on single gene change across datasets is not sufficient. We still need
to conduct further investigation on whether mRNAs belonging to the same
category contain unifying biological theme. For each unique category of
mRNAs, we then performed pathway enrichment analysis to gain more
insights into their unique functions (section “Pathway Enrichment
Analysis of mRNA Expression”). When expression data of mRNA, miRNA and
lncRNA are all available, we applied our meta-analysis method to each
type of transcripts separately and then analyzed each unique category
of differentially expressed mRNA, miRNA, and lncRNA (those with the
same weight or same cross-study DE pattern) together. Specifically, we
performed miRNAs/lncRNAs target gene enrichment analysis (section
“miRNAs/lncRNAs Target Gene Enrichment Analysis”) and LncRNA-mRNA and
miRNA-mRNA causal regulatory network analysis (section “LncRNA-mRNA and
miRNA-mRNA Causal Regulatory Network Analysis”).
Pathway Enrichment Analysis of mRNA Expression
For each category of mRNAs with unique DE pattern across the studies,
we looked for biological pathways that are enriched in each category of
genes more than would be expected by chance. The enriched pathways for
each category can infer the unique biological functions only associated
with specific study subsets and help generate new hypotheses. The
p-value for the enrichment of a pathway was calculated using Fisher’s
exact test ([82]Upton, 1992) and multiple testing was corrected by
Benjamini-Hochberg (BH) procedure ([83]Benjamini and Hochberg, 1995).
Multiple popular pathway databases were used including Gene Ontology
(GO) ([84]Ashburner et al., 2000), Kyoto Encyclopedia of Genes and
Genomes (KEGG) ([85]Kanehisa et al., 2017), Oncogenic signaling
Pathways ([86]Sanchez-Vega et al., 2018) and Reactome ([87]Fabregat et
al., 2016). Pathways in each pathway database was carefully selected
for their relatedness to the problem of interest and small pathways
(e.g., pathway size <10) were filtered out for the lack of power. For
pathways with topological information available (e.g., pathways in
KEGG), we apply the R package “Pathview” ([88]Luo and Brouwer, 2013),
to display the study-specific information (e.g., weights, effect sizes,
etc.) on relevant pathway topology graphs.
miRNAs/lncRNAs Target Gene Enrichment Analysis
Going beyond the traditional central dogma, non-coding RNAs such as
micro-RNA (or miRNA) and long non-coding RNAs (lncRNA) play important
regulatory roles in mRNAs expression ([89]Bartel, 2004; [90]Hubé and
Francastel, 2018). To understand whether miRNA/lncRNA target at mRNAs
in the same category with unique cross-study DE pattern, we analyzed
each unique category of mRNA, miRNA and lncRNA of the same cross-study
DE pattern together and performed miRNA/lncRNAs target gene enrichment
analysis on each category. Specifically, for each unique category, we
first used the miRTarBase database ([91]Chou et al., 2018) and
LncRNA2Target v2.0 database ([92]Cheng et al., 2019) to obtain common
target genes of each miRNA and lncRNA in this category. We then looked
for miRNA/lncRNA with target genes enriched in the gene list falling in
the same category more than would be expected by chance. The p-value
for the enrichment of miRNA/lncRNA was calculated using Fisher’s exact
test ([93]Upton, 1992) and multiple testing was corrected by BH
procedure ([94]Benjamini and Hochberg, 1995).
LncRNA-mRNA and miRNA-mRNA Causal Regulatory Network Analysis
In addition to target gene enrichment analysis, we are also interested
in investigating the causal regulatory relationship among the various
types of transcripts in the same category using network analysis. For
each unique category of mRNA and lncRNA with the same cross-study DE
pattern, we followed the MSLCRN pipeline to perform module-specific
lncRNA-mRNA regulatory network analysis ([95]Zhang et al., 2019). The
MSLCRN pipeline starts by using WGCNA ([96]Langfelder and Horvath,
2008) to construct lncRNA-mRNA co-expression networks and identify
modules that contain both lncRNA and mRNA. For each lncRNA-mRNA module,
parallel IDA ([97]Le et al., 2016) is then applied to learn the causal
structure and estimate the causal effect of lncRNA on mRNA. IDA
consists of two main steps. It first uses a parallel version of the PC
algorithm ([98]Spirtes et al., 2000; [99]Kalisch and Bühlman, 2007;
[100]Le et al., 2016), commonly used approach for learning the causal
structure of a Bayesian network, to obtain the directed acyclic graphs
(DAGs) for each module. Then, the causal effect of lncRNAs on mRNAs
(i.e., the lncRNA ≥ mRNA directed edges in the DAG) are estimated by
applying do-calculus ([101]Pearl, 2000), causal calculus that uses
Bayesian conditioning to generate probabilistic formulas for the causal
effect. Lastly, the module-specific causal regulatory networks are
integrated to form the global lncRNA-mRNA causal regulatory network and
visualized using Cytoscape ([102]Shannon et al., 2003). In constructing
the regulatory network, we use absolute values of the causal effects
cutoffs to assess the regulatory strengths and confirm the regulatory
relationships. More details on the use of MSLCRN to infer causal
regulatory network can be found in [103]Zhang et al. (2019).
Module-specific miRNA-mRNA causal regulatory networks can be obtained
in a similar way using the same tool.
Simulation
We conduct simulation studies to evaluate the performance of our method
in biomarker detection and categorization when compared to AW-Fisher
([104]Li and Tseng, 2011), FEM and REM methods ([105]Choi et al.,
2003). Only power is assessed for FEM and REM methods since they do not
categorize biomarkers. We assume a total of G = 2000 genes expressed in
K = 5 studies, each study has a total sample size of n = 100, evenly
split into control and case groups
[MATH:
(ncase<
/msub>=ncontrol=n2=50
mrow>) :MATH]
. The details on how data are simulated are described below:
* 1.
We generate 800 genes with 40 gene clusters (20 genes in each
cluster) and another 1,200 genes that do not belong to any cluster.
The cluster indexes for each gene g (1 ≤ g ≤ 2000) is randomly
sampled.
* 2.
For genes in cluster c (1 ≤ c ≤ 40) and study k(1 ≤ k ≤ 5), we
first generate a covariance matrix according to inverse Wishart
distribution
[MATH: Σck′
mo>∼W-1<
/mn>(Ψ,60) :MATH]
, where Ψ = 0.5I[20×20] + 0.5J[20×20], I is the identity matrix and
J is the matrix with all elements equal to one. Then, we
standardized
[MATH: Σck′
mo> :MATH]
into Σ[ck] to make sure all the diagonal elements are one.
* 3.
We sample baseline gene expression levels of the 20 genes in
cluster c for sample i in study k by
[MATH:
(Xgc1ik
′,…,Xgc20ik′)T∼
MVN(0,Σck
) :MATH]
, where 1 ≤ i ≤ n and 1 ≤ k ≤ K. For those 1200 genes that are not
in any cluster, we sample the baseline gene expression level
independently from
[MATH:
N(0,σk2<
/msubsup>) :MATH]
, where 1 ≤ k ≤ 5 and σ[k]∼Unif(σ−0.2,σ + 0.2) with σ = 2.
* 4.
Denote by δ[gk] ∈ {0,1,−1} that gene g is non-DE, up-regulated or
down-regulated in study k. We assume the first 800 genes to be DE
genes divided into four mutually exclusive parts:
+ (1)
Concordantly up-regulated genes (N = 225): randomly sample
δ[gk] ∈ {0,1,−1} such that
[MATH: ∑kI{δgk=1}≥2 :MATH]
and
[MATH: ∑kI{δgk=-1}≤1 :MATH]
.
+ (2)
Concordantly down-regulated genes (N = 225): randomly sample
δ[gk] ∈ {0,1,−1} such that
[MATH: ∑kI{δgk=-1}≥2 :MATH]
and
[MATH: ∑kI{δgk=1}≤1 :MATH]
.
+ (3)
Discordant genes with both up-regulated and down-regulated
patterns (N = 150): randomly sample δ[gk] ∈ {0,1,−1} such that
[MATH: ∑kI{δgk=1}≥2 :MATH]
and
[MATH: ∑kI{δgk=-1}≥2 :MATH]
.
+ (4)
Other genes that are DE in only one study without any
concordant patterns (N = 200): we randomly sample δ[gk] ∈
{0,1,−1} such that
[MATH: ∑k|δgk|=1 :MATH]
.
* 5.
To simulate effect size for DE genes in each study (when δ[gk] ≠
0), we sample from a uniform distribution μ[gk]∼Unif(1,3). The gene
expression level X[gik] are assumed to be
[MATH:
Xgik′ :MATH]
for control samples and
[MATH:
Xgik=
mo>Xg(i+n/2)k′+μgk⋅δgk :MATH]
for case samples, where 1 ≤ g ≤ 2000, 1 ≤ i ≤ n/2, and 1 ≤ k ≤ 5.
To assess power and biomarker categorization performance, we focus on
DE genes in the first three categories of genes with concordant
patterns in at least two studies (N = 600). We also simulate additional
scenario with smaller sample size and variance: n = 20 & σ = 1, results
are included in the Supplement ([106]Supplementary Figure 1 and
[107]Supplementary Table 2).
[108]Figure 2 shows the number of true DE genes detected among the top
genes ranked by p-value for each method. BCMC is more powerful than
AW-Fisher and FEM/REM by detecting more true DE genes among the top
ranked genes. [109]Table 1 summarizes the number of true DE genes
detected as well as with correct weight pattern in each of the three
categories of DE genes identified by each method. BCMC and FEM detect
more true DE genes than AW-Fisher for concordant genes. Due to the
model restriction, FEM and REM fail to detect most discordant genes.
AW-Fisher is equally powerful as BCMC in detecting discordant genes,
however, it ignores the directionality of effects, and thus assigns the
incorrect weights to genes with both up-regulated and down-regulated
patterns (basically they fail to distinguish w = −1 from w = 1). Our
method detects these discordant DE genes while at the same time assigns
the correct weights categorizing these genes.
FIGURE 2.
[110]FIGURE 2
[111]Open in a new tab
Plot of the number of true DE genes vs. top ranked genes by p-value of
each method.
TABLE 1.
Summary of number of true DE genes detected and with correct weight
patterns by the four methods in each of the three categories of DE
genes described in the simulation setting.
Methods BCMC
__________________________________________________________________
AW-Fisher
__________________________________________________________________
FEM REM
DE Gene categories Number of true DE genes Number of true DE genes with
correct weight Number of true DE genes Number of true DE genes with
correct weight
Concordant up (N = 225) 206 116 195 106 203 151
Concordant down (N = 225) 210 119 195 108 201 144
Discordant (N = 150) 148 135 148 0 47 2
Total (N = 600) 564 370 538 214 451 297
[112]Open in a new tab
Real Data Application
Gene Expression Analysis in Pan-Gynecologic (Pan-Gyn) Studies
We applied our method to the gene expression data of TCGA Pan-Gyn
studies including high-grade serous ovarian cystadenocarcinoma (OV),
uterine corpus endometrial carcinoma (UCEC), cervical squamous cell
carcinoma and endocervical adenocarcinoma (CESC), uterine
carcinosarcoma (UCS), and invasive breast carcinoma (BRCA) ([113]Berger
et al., 2018). [114]Berger et al. (2018) identified 23 genes (e.g.,
BRCA1, PTEN, TP53, etc.) that were mutated at higher frequency across
all Pan-Gyn cancers than non-Gyn cancers, highlighting the similarities
across Pan-Gyn cohort. We focused on 19 of these genes and split
samples in each study into a mutation “carrier” group and a mutation
“non-carrier” group depending on whether subjects gained mutations in
at least one of the genes ([115]Supplementary Figure 2). Since no or
very few samples were assigned to the mutation carrier group for UCS
(N[mutation] = 0) and UCEC (N[mutation] = 8), we excluded those two
studies and restricted our meta-analysis to only three gynecologic
cancer types (i.e., number of studies K = 3) including OV (mutation
carrier vs. non-carrier: 217/90), BRCA (692/408) and CESC (109/197).
The purpose is to detect differentially expressed genes between
mutation carrier and non-carrier groups and categorize them according
to their cross-study DE patterns. We found the overall survival
differed significantly between the two groups for each cancer type
([116]Supplementary Figures 3–[117]5). This implied the differentially
expressed biomarkers between these two groups can have potential
prognostic values related to mutational processes and serve as optimal
therapeutic intervention targets ([118]Helleday et al., 2014;
[119]Lawrence et al., 2014).
The RNA-seq data in Transcripts Per Million (TPM) values of each cancer
type were downloaded from LinkedOmics ([120]Vasaikar et al., 2018). We
first merged the three datasets by matching the gene symbols and
removed genes with mean TPM < 5. A total of 9,900 mRNAs remained and
were log[2] transformed for analysis. We performed DE analysis by limma
([121]Ritchie et al., 2015) and obtained the p-value and LFC from each
of the three studies. We then performed meta-analysis using BCMC and
the other methods.
All methods detected thousands of DE genes at both q-value cutoffs (for
BCMC, q-value for dominant pattern was used so we focused on concordant
genes only), which is common in Pan-cancer studies ([122]Table 2). It
becomes imperative task to partition these DE genes into smaller
subsets by cross-study DE patterns before performing downstream
analysis. BCMC categorized these DE biomarkers (q < 0.05) into eight
groups according to the optimal weight assignments, each displaying a
unique expression pattern across the different studies ([123]Figure 3
and [124]Supplementary Table 3). We then merged genes with equal |
[MATH: w→g*<
/msubsup> :MATH]
| into the same group (i.e., genes with
[MATH: w→g*<
/msubsup>=(0,1,1) :MATH]
and those with
[MATH: w→g*<
/msubsup>=(0,-1
mrow>,-1) :MATH]
are merged into the same group, allowing both up-regulated and
down-regulated genes in the same pathway) and performed pathway
enrichment analysis on each of the four merged groups using four
pathway databases: GO ([125]Ashburner et al., 2000), KEGG
([126]Kanehisa et al., 2017), Oncogenic ([127]Sanchez-Vega et al.,
2018) and Reactome ([128]Fabregat et al., 2016). The top 100 pathways
enriched by each category have little overlap partly validating our
speculation in motivation that the different categories of biomarkers
may play different functional roles ([129]Figure 4). For example, top
pathways for
[MATH: |w→g*<
/msubsup>|=(1,0,1) :MATH]
(i.e., DE in OV and CESC but not in BRCA) are mainly involved in cell
junction and adhesion related functions ([130]Supplementary Table 4 in
[131]Supplemental File 1). Top pathways for
[MATH: |w→g*<
/msubsup>|=(1,1,0) :MATH]
(i.e., DE in OV and BRCA but not in CESC) are mainly involved in immune
and defense response. [132]Figure 5 shows the topology of one example
KEGG pathway “Antigen processing and presentation” enriched by the
genes with
[MATH: |w→g*<
/msubsup>|=(1,1,0) :MATH]
. The highlighted DE genes showed strong DE signals (signed LFC) in OV
and BRAC but not in CESC. These genes colocalized and interacted with
each other as a functional unit inside the pathway.
TABLE 2.
Summary of numbers of DE genes detected by each method at different
cutoffs for the Pan-Gyn study example. For BCMC, q-values for the
dominant pattern are used.
Methods
__________________________________________________________________
q-value BCMC AW-Fisher FEM REM
q < 0.05 1,345 3,113 2,866 983
q < 0.15 3,931 4,743 4,342 1,641
[133]Open in a new tab
FIGURE 3.
[134]FIGURE 3
[135]Open in a new tab
Heatmap of standardized expression values of differentially expressed
genes (BCMC q < 0.05) sorted by weight patterns for the Pan-Gyn cancer
example.
FIGURE 4.
[136]FIGURE 4
[137]Open in a new tab
Venn diagram of top 100 pathways enriched by each of the four
categories [
[MATH: |wg
*|=(0,1,1),(1,0,1),(1,1,<
/mo>0),and
(1,1,1)<
/mrow>; :MATH]
corresponding to OV, BRCA and CESC, respectively] for the Pan-Gyn study
example.
FIGURE 5.
[138]FIGURE 5
[139]Open in a new tab
Visualization of the topology plot of a KEGG pathway “Antigen
processing and presentation” enriched by the genes with
[MATH: |wg
*|=(1,1,0) :MATH]
(corresponding to OV, BRCA and CESC, respectively) for the Pan-Gyn
example. Each box that represents a gene is split into three parts to
represent the three studies. Colors indicate the signed LFC of the
mapped DE genes in the three studies.
These unique gene sets of different cross-cancer DE patterns and the
associated pathways enriched help gain more insights into the
homogeneous and heterogenous molecular mechanism of different
Gynecologic cancer and assist the development of useful diagnostic and
therapeutic strategies common or specific to cancer types.
Understanding commonality and difference in drug targets can also guide
the drug repurposing strategy in cancer drug development ([140]Li et
al., 2021).
Integrative Analysis of mRNA, lncRNA, and miRNA in Pan-Kidney Studies
We also used BCMC to perform integrative analysis of three different
types of transcripts (mRNA, lncRNA, and miRNA) in the TCGA Pan-Kidney
cohort including kidney chromophobe (KICH), kidney renal clear cell
carcinoma (KIRC), and kidney renal papillary cell carcinoma (KIRP).
LncRNA and miRNA have been found playing important regulatory roles on
gene expression in kidney cancers ([141]Linehan et al., 2010;
[142]Linehan, 2012; [143]Ricketts et al., 2018). The integrative
analysis of these multi-omics data provides additional insights into
the biological mechanism underlying the multiple histologic subtypes of
kidney cancers. We aimed to detect the differentially expressed
biomarkers (mRNA, miRNA, or lncRNA) that drive the progression of
kidney cancer by comparing samples from early pathologic stage (stage I
and II) to late stage (stage III and stage IV) for three kidney cancer
types (i.e., number of studies K = 3) and investigating the regulatory
relationships among these biomarkers. Number of subjects in the two
pathologic stages of each kidney cancer available in mRNA, miRNA and
lncRNA expression data were summarized in [144]Supplementary Table 5.
We downloaded mRNA (in Reads Per Kilobase of transcript per Million
mapped reads or RPKM) and miRNA (in Reads Per Million mapped reads or
RPM) sequencing data from LinkedOmics ([145]Vasaikar et al., 2018) and
lncRNA sequencing data (in RPM) from The Atlas of Noncoding RNAs in
Cancer (TANRIC) ([146]Li et al., 2015) for all the three kidney cancer
subtypes. We first merged the three subtypes by matching RNA
symbols/IDs. We then separately filtered each of the three types of
biomarkers by removing mRNAs with mean RPKM < 5, lncRNAs with mean RPM
< 0.1, and miRNAs with mean RPM = 0, followed by log[2] transformation.
A total of 15,332 mRNAs, 2,415 lncRNAs and 719 miRNAs remained for
analysis. We performed DE analysis by limma ([147]Ritchie et al., 2015)
in each study and then meta-analysis to categorize biomarkers according
to cross-study DE patterns for each RNA species. For different types of
RNA belonging to the same category, we further performed miRNA target
gene enrichment analysis and lncRNA-mRNA causal regulatory network
analysis to understand their complex interacting relationships in
kidney cancer.
Both BCMC and AW-Fisher methods detected thousands of differentially
expressed biomarkers (including mRNA, lncRNA, and miRNA) at both
q-value cutoffs with high proportion of overlap ([148]Table 3).
Biomarkers detected by BCMC tend to have both significant p-values and
large effect sizes in the studies indicated by optimal weights
([149]Supplementary Figure 6). These biomarkers (q < 0.05) were
partitioned into eight categories by different weight patterns
([150]Supplementary Table 6). We merged biomarkers with the same
[MATH: |w→g*| :MATH]
into the same group. We focused on the group with
[MATH: |w→g*<
/msubsup>|=(1,1,1) :MATH]
to understand the common multi-omics regulatory among all histologic
subtypes of kidney cancer and performed downstream analysis. In miRNA
target gene enrichment analysis, we found the target gene sets of two
DE miRNAs “miR-655” and “miR-326” were enriched in the DE gene list in
the same group (p < 0.05; [151]Supplementary Table 7 in the
[152]Supplementary File 1), implying the potential regulatory
relationship between different biomarker types consistent in all kidney
cancer subtypes. The gene ATAD2 targeted by miR-655 was reported as a
prognostic marker for kidney disease ([153]Chen et al., 2017). In
causal network analysis, we identified two lncRNA-mRNA regulatory
networks ([154]Supplementary Figure 8 and [155]Supplementary Table 8).
[156]Figure 6 shows the network with two hub lnRNAs, the hub lncRNA
ENSG00000267449 and several mRNAs belonging to the ribosomal protein
family in the same network were found consistently differentially
expressed in all three subtypes, implying their potentially joint role
in promoting the development of kidney cancers ([157]Zhou et al., 2015;
[158]Dolezal et al., 2018).
TABLE 3.
Summary of number of differentially expressed biomarkers among each of
the three RNA species detected by each method at different cutoffs for
the Pan-Kidney study example. For BCMC, q-values for the dominant
pattern are used.
Type of biomarkers mRNA
__________________________________________________________________
lncRNA
__________________________________________________________________
miRNA
__________________________________________________________________
q-value BCMC AW-Fisher BCMC AW-Fisher BCMC AW-Fisher
q < 0.05 7,317 9,472 764 1,281 239 283
Intersection 6,391 622 206
q < 0.15 11,810 11,440 1,468 1,464 358 358
Intersection 10,057 1,244 292
[159]Open in a new tab
FIGURE 6.
[160]FIGURE 6
[161]Open in a new tab
One example lncRNA-mRNA regulatory network identified from biomarkers
with |
[MATH:
wg*|=(1,1,1) :MATH]
(corresponding to KICH, KIRC, and KIRP, respectively) for the
Pan-Kidney example. The circle shapes represent lncRNAs highlighted in
green and diamond shapes represent mRNAs highlighted in purple. The
arrows indicate the network relationships between lncRNAs and mRNAs.
These results demonstrate the power of our method to detect biomarkers
of different types in Pan-cancer meta-analysis and to categorize them
into functionally relevant biomarkers by DE patterns, which could
suggest commonalities and differences in underlying mechanisms of
multiple cancer types.
Discussion
In this paper, we proposed a novel meta-analysis method for candidate
biomarker detection in multiple transcriptomic studies that further
categorizes biomarkers by concordant patterns as well as by biological
and statistical significance across studies. Numerous downstream
analysis tools including pathway analysis and causal network analysis
are applied to each category of biomarkers with either single or
multiple types of RNA species. Simulations and real data application to
two Pan-cancer multi-omics studies showed the advantage of our method
in classifying differentially expressed biomarkers into classes with
unique biological functions and relationships that can be further
investigated in future studies.
Meta-analysis is a set of statistical analytical methods and tools that
combine multiple related studies to improve power and reproducibility
over a single study. In recent years, we have witnessed the development
of many useful meta-analysis methods applied to genomic studies for
different biological purposes ([162]Choi et al., 2003; [163]Shen and
Tseng, 2010; [164]Li and Tseng, 2011; [165]Huo et al., 2016, [166]2020;
[167]Kim et al., 2016, [168]2018; [169]Zhu et al., 2017; [170]Ma et
al., 2019; [171]Zeng et al., 2020). Genomic data is usually of high
dimension and the between study heterogeneity is large due to both
technological and cohort effects. In addition to improving power,
post-hoc categorization of biomarkers into smaller subsets by
cross-study patterns for subsequent analysis is important in genomic
meta-analysis. Our meta-analysis method that aggregates over both
p-value and effect size is a fast and intuitive solution for this
purpose. Compared to other popular meta-analysis methods that include
biomarker categorization, our method considers concordant pattern, and
biological and statistical significance simultaneously. By calculating
statistics separately for up-regulated and down-regulated parts, we can
detect both concordant genes that have consistent patterns across all
studies and discordant genes that are up/down regulated in some studies
while down/up regulated in others. Both of these kinds of genes can be
of interest in Pan-cancer analysis. For example, high expression of
some genes might worsen the prognosis of all cancer types, while high
expression of other genes might worsen prognosis for some cancers but
be beneficial to other cancer types.
Our method also applies to the scenario when there is more than one RNA
species present and proposes to jointly analyze different types of
biomarkers under the same category for more biological insights. As
more omics data are accumulated in the public domain, similar
strategies can be applied for integrative analysis, for example with
epigenomic (e.g., DNA methylation, histone modification), proteomic and
metabolomic data. Unique features of each omics data type need to be
addressed and will be considered as a future direction to extend our
method.
Like most other two-stage meta-analysis methods, our method is based on
summary measures such as p-values and log[2] fold changes from each
study. In addition, the method assigns a single optimal weight to each
gene without quantifying the uncertainty in weight assignment. A more
comprehensive Bayesian hierarchical model can be applied to raw data
and summary measures to better capture the stochasticity and provide
soft weight assignment. Our method requires the DE genes to be
concordant in at least two studies to be detected, consistent with the
purpose of meta-analysis in prioritizing more reproducible biomarkers.
As the number of studies becomes large, the likelihood of being
differentially expressed in only one study decreases. Thus, we expect
the method to perform well as the number of studies increases. Since
the method relies on summary measures, increasing the number of studies
will not materially increase the computational burden. Additionally,
use of more sophisticated parallel computing techniques will improve
the speed of permutation tests. An R package called “BCMC” is available
at [172]https://github.com/kehongjie/BCMC to implement our method.
Data Availability Statement
Publicly available datasets were analyzed in this study. This data can
be found here: [173]https://github.com/kehongjie/BCMC.
Author Contributions
ZY and HK developed the method, performed the analysis, and wrote the
manuscript. TM supervised the project and took the lead in editing the
manuscript. SC, RC-C, XH, JZ, JD, and DM contributed to manuscript
writing and polishing. All authors provided critical feedback and
helped shape the research, analysis and manuscript.
Conflict of Interest
The authors declare that the research was conducted in the absence of
any commercial or financial relationships that could be construed as a
potential conflict of interest.
Footnotes
Funding. Research reported in this publication was supported by the
National Institute on Drug Abuse (NIDA) of National Institute of Health
under award number 1DP1DA048968-01 to SC and TM, and the University of
Maryland Faculty-Student Research Award (FSRA) to ZY, HK, and TM.
Supplementary Material
The Supplementary Material for this article can be found online at:
[174]https://www.frontiersin.org/articles/10.3389/fgene.2021.651546/ful
l#supplementary-material
[175]Click here for additional data file.^ (55KB, xlsx)
[176]Click here for additional data file.^ (1.4MB, docx)
References