Abstract
Approaches systematically characterizing interactions via
transcriptomic data usually follow two systems: (i) coexpression
network analyses focusing on correlations between genes and (ii) linear
regressions (usually regularized) to select multiple genes jointly.
Both suffer from the problem of stability: A slight change of
parameterization or dataset could lead to marked alterations of
outcomes. Here, we propose Stabilized COre gene and Pathway Election
(SCOPE), a tool integrating bootstrapped least absolute shrinkage and
selection operator and coexpression analysis, leading to robust
outcomes insensitive to variations in data. By applying SCOPE to six
cancer expression datasets (BRCA, COAD, KIRC, LUAD, PRAD, and THCA) in
The Cancer Genome Atlas, we identified core genes capturing interaction
effects in crucial pan-cancer pathways related to genome instability
and DNA damage response. Moreover, we highlighted the pivotal role of
CD63 as an oncogenic driver and a potential therapeutic target in
kidney cancer. SCOPE enables stabilized investigations toward complex
interactions using transcriptome data.
__________________________________________________________________
A statistical model identifies core genes and pathways from
transcriptome data robust to changes of data.
INTRODUCTION
Understanding the process of pathogenesis and discovering previously
unidentified therapeutic targets require discovery of the underlying
driver genes in relevant pathways ([32]1–[33]3). However, determination
of the “driver” role of a gene through experimental investigation has
only been possible for a handful of genes because of the time-consuming
and expensive nature of such experiments. Thus, in silico analysis to
narrow down candidates of potential genes is vital. Current methods of
identifying driver genes involve multiomics data ([34]4) and often use
known biological pathways ([35]5). Among multiomics data,
transcriptomes, i.e., gene expression data, play a pivotal role in
biological processes and are the most available form of omics data for
many diseases including cancers. As such, analyzing transcriptomic data
is usually the first step in omics-directed characterization of
diseases.
In practice, selecting differentially expressed (DE) genes by
contrasting expression levels in disease and control tissues has been
broadly used for the exploration of biological mechanisms of various
diseases. Largely because of its simplicity, single-gene-based DE
analysis is the most popular method adapted by many researchers
([36]6). From the perspective of systems biology, it is natural to
expect that advanced models analyzing multiple genes jointly should
lead to additional in-depth understanding of disease pathology.
Unfortunately, instability of such complex models involving multiple
genes appears to be a serious problem; in many situations, current
methods do not generate consistent results. For instance, in a typical
coexpression network–based analysis, a gene network is built with its
nodes representing genes and edges based on their coexpression. The
genes that are highly connected with other genes in the network, called
“hub” genes, are expected to be important in pathology ([37]7). As
such, many pipelines discovering driver genes incorporate information
from coexpression networks and these hub genes into the next phase of
multiomics approaches ([38]8–[39]10). It has, however, been noted that
hub genes are not stable, and they are not guaranteed to be driver
genes ([40]11).
Regularized multiple regression methods, which optimize an objective
function by adding a regularization term to a likelihood, are widely
used in many domains ([41]12–[42]14) including biomarker selection
using transcriptomic data. LASSO (least absolute shrinkage and
selection operator) and ridge regression are two representative methods
of this nature ([43]15, [44]16). The choice of regularization plays a
notable role in the information supplied by the final model: Ridge
regression will lead to a model containing a large number of genes
([45]15), which may confer a high predictive power at the cost of
little meaningful information for functional characterization; LASSO,
in contrast, retains fewer genes ([46]16) but is inherently unstable in
the presence of highly correlated variables ([47]17, [48]18), which is
unfortunately the case of transcriptome data. While a logistic LASSO
regression can usually identify significant variables in determining
case and control, it also tends to provide completely different
outcomes with different parameterizations or, even by running a
similarly parameterized model multiple times over, slightly different
data ([49]17, [50]19). That is why, historically, as far as we
understand, there have been few efforts using such feature selection
methods in identifying underlying genes from transcriptomic data.
Coexpression network analyses and regularized multiple regressions form
disconnected fields, which are by themselves unable to produce stable
results offering insights into disease pathology. We propose the
Stabilized COre gene and Pathway Election (SCOPE), a new tool to
reliably discover candidate genes and pathways using transcriptomic
data. The new framework represents a synergy between coexpression
network analysis and regularized multiple regressions, with two layers
of stabilization integrated.
To assess the theoretical properties of SCOPE, we first conducted a
simulation study where various scenarios of signal-to-noise ratio,
nonlinearity, interaction, and coexpression structures are considered
and evaluated. The results showed SCOPE’s advantage in most
configurations over state-of-the-art regularization methods.
As a proof of concept in real data, we applied SCOPE to six cancer
datasets from The Cancer Genome Atlas (TCGA) ([51]20) [breast invasive
carcinoma (BRCA), colon adenocarcinoma (COAD), kidney renal clear cell
carcinoma (KIRC), lung adenocarcinoma (LUAD), prostate adenocarcinoma
(PRAD), and thyroid carcinoma (THCA) with 1,041-111, 480-70, 483-54,
387-37, 458-50, and 444-53 “primary tumor”–“normal tissue” samples,
respectively] to identify novel core genes and their related pathways.
Thorough comparisons were carried out against standard methods
including LASSO (for the selection step only) and differential
expression (DE) analysis as well as differential coexpression
(DiffCoEx) analysis. As expected, the core genes selected by
SCOPE-Stabilized LASSO are stable with respect to small changes of the
input datasets. Despite being significantly fewer than the typical set
of genes identified by a standard LASSO, the SCOPE-identified core
genes remain highly predictive. Notably, as another line of evidence at
the pathway level, pathways identified by SCOPE show significant
within- and pan-cancer overlap. In contrast, standard DiffCoEx analysis
led to significantly lower overlaps across and within cancers.
Moreover, as a confirmation by using connectivity analysis, we found
that the core genes play central roles in the shared pathways.
To discover previously unknown insights into cancer pathology based on
the stably identified core genes and pathways, we have carefully
annotated the data based on our insights into cancers and from the
literature. Notably, we shed light on the critical role of CD63 in the
Von Hippel–Lindau tumor suppressor (VHL)–hypoxia-inducible factor 1
subunit α (HIF1A)/hypoxia-inducible factor 2α (HIF2A)–vascular
endothelial growth factor A (VEGFA) protein (VHL-HIF-VEGF/VEGFR) axis
([52]21), a putative key driver that governs tumorigenesis in kidney
cancer. These discoveries may provide interesting insights into the
mechanism of cancer, identifying concrete targets for further
experimental follow-ups.
RESULTS
The SCOPE framework
SCOPE begins by conducting multiple regression by using a stabilized
extension of LASSO, here termed SCOPE-Stabilized LASSO, which uses a
bootstrap of multiple LASSO models ([53]Fig. 1, A and B), leading to a
handful of core genes robust to statistical instability ([54]Fig. 1C).
These core genes differ from hub genes in that they are not identified
because of their interconnectedness but because of their power in
prediction while still being stable across random samples. These core
genes are then used as seed genes for further coexpression and DiffCoEx
analysis ([55]Fig. 1D), constructing core gene networks (CGNs). These
CGNs are then piped into pathway enrichment analysis ([56]Fig. 1E). The
pathways learned from each CGN are lastly intersected to provide
another level of stabilization ([57]Fig. 1E). A high-level pseudo-code
is included in [58]Fig. 1F, and the detailed algorithms and design
considerations are provided in Materials and Methods and the
Supplementary Materials. This framework incorporates both optimizations
brought by multiple regressions and gene-gene interactions identified
by coexpression analysis while retaining stability in large part due to
two levels of stabilization.
Fig. 1. Overview of SCOPE framework.
[59]Fig. 1.
[60]Open in a new tab
(A) Expression data are split multiple times randomly into typical
training/test splits with a consistent phenotype ratio as in the
original data. (B) LASSO models are trained on each of the splits, and
the genes selected in each split are recorded. (C) Selected genes are
ordered by the frequency of occurrence in the splits. On the basis of a
cutoff θ[thr] (dashed green line), core genes are identified and used
to identify the CGNs in (D). (D) CGNs are identified, indicated by
genes circled in orange and blue dashed lines. Null distributions of
both DiffCoEx and coexpression are used to identify genes significantly
interacting with the identified core genes. (E) Pathway enrichment
analysis is conducted for each CGN, and the overlap between
CGN-directed pathways will be identified. Last, substantially
overlapped pathways and core genes will be the output. (F) The
algorithm in a simplified high-level pseudo-code. The full version of
the algorithm is presented in the Supplementary Materials.
SCOPE’s outcome is robust to its most key tuning parameters
The SCOPE framework uses several parameters that may be tuned to
produce biologically relevant results. θ[thr] ( ∈ [0,1]) determines the
number of core genes identified by SCOPE, with higher values reducing
the number of core genes selected. r[thr] ( ∈ [0,1]) and
[MATH: rthrD(∈[0,1]) :MATH]
, which are the coexpression and DiffCoEx percentile thresholds,
respectively, are used to determine the cutoff of significance of
secondary genes used to construct CGNs. The number of iterations
n[iter](∈
[MATH: Z+ :MATH]
) and the sample split proportion s[prop] ( ∈ [0,1]) are the parameters
relevant to the SCOPE-Stabilized LASSO step of the framework.The
outcome of SCOPE is highly robust to reasonable changes in its
parameters’ values. Figure S1 gives a visual indication of the
influence of these parameters on the overall pathway overlap score
(POS; a measure of the scale of shared pathway enrichment across
multiple datasets ranging from 0 to the number of datasets) across the
six cancers studied in the TCGA database. Evidently, the maximum POS
remains robust throughout changes in all parameters (fig. S1, A to D),
except for θ[thr] where stringent values lower the POS lightly but with
an observable trend (fig. S1E). Thus, θ[thr] may be tuned on the basis
of the situation and upon observing the frequency distribution of the
selected genes and upon the feasibility of experimental follow-up.
Simulation study unveils the performance of SCOPE in identifying core genes
and related pathways in simulations
The main simulations were conducted using 670 samples of whole-blood
tissue expression from the Genotype-Tissue Expression (GTEx) Consortium
([61]22) to compare the performance of SCOPE, Adaptive Elastic-Net
([62]23), and randomized LASSO ([63]17). Overall, simulations uncovered
SCOPE’s distinct competitive advantage over other methods in
discovering core genes and their related pathways. Simulations were
conducted under a variety of scenarios simulating noise-to-signal
ratios, linear and nonlinear phenotypes, and correlation structures,
which are detailed further in Materials and Methods. In each
simulation, we first set up “gold-standard” core pathways and then
selected core genes related to these pathways that are either (i)
highly or (ii) lowly correlated with these pathways (Materials and
Methods). These core genes and a few randomly selected additional genes
were then deemed as “causal genes,” which were used to generate a
binary phenotype through both linear and nonlinear models consisting of
interactions. The three methods are then evaluated on their ability to
identify both causal genes and core pathways in terms of F1 scores. We
present the results using two different cutoffs: The first is based on
the top 10 pathways that may reflect the practice of looking at the top
pathways for experimental validations ([64]Fig. 2 with a breakdown of
prediction metrics provided in tables S1 and S2); the second is based
on pathways identified with false discovery rate (FDR) < 0.05 that may
be statistically rigorous (fig. S2 with a further breakdown in table
S3). Since the above comparisons focus on the performance measured only
by the final outcome (causal genes and core pathways), to characterize
the contribution of multiple steps, we also analyzed the ability in
identifying core genes, the intermediate outcome (fig. S3 with a
further breakdown in table S4). Simulations were also run on smaller
random subsets of the data as 500 samples (figs. S4 to S6 and tables S5
to S8) and 250 samples (figs. S7 to S9 and tables S9 to S12).
Fig. 2. Simulations comparing the performance of SCOPE, Adaptive Elastic-Net,
and randomized LASSO.
[65]Fig. 2.
[66]Open in a new tab
F1 score {= TP/[TP + 0.5 × (FP+ FN)]} calculated for the accuracy of
Adaptive Elastic-Net, randomized LASSO, and SCOPE models in identifying
causal genes and pathways simulated in the gene expression data with
670 samples. (A) and (B) indicate the ability of SCOPE to identify
causal genes with better accuracy, particularly in scenarios with
higher correlations of the core genes in both linear and nonlinear
phenotypes, respectively. (C) and (D) demonstrate the ability of SCOPE
to identify core pathways among the top 10 pathways enriched using each
method with higher accuracy. Note that in the nonlinear model, we
assumed that the genes participating in interactions are known as a
priori; otherwise, the powers of all three methods are close to zero.
Please see detailed justifications in Materials and Methods.
Under both the linear and nonlinear models, SCOPE-Stabilized LASSO was
able to consistently perform competitively with the other methods and
had a distinct advantage in identifying causal genes in the presence of
highly correlated core genes ([67]Fig. 2, A and B). By inspecting the
corresponding performance for core genes (fig. S3), especially in the
case of highly correlated core genes, one can see that the performance
of all models is lowered. Evidently, SCOPE can better identify highly
correlated core genes in contrast to other methods, which gain power
more through the discovery of causal genes that are not core genes.
While randomized LASSO performs similarly to SCOPE-Stabilized LASSO in
the low-correlation scenario and the Adaptive Elastic-Net performs
relatively well in the presence of a nonlinear phenotype with highly
correlated core genes, SCOPE-Stabilized LASSO performs the best, on
average, across all scenarios. In real data analysis, one is unaware of
the level of correlations and the linearity of the phenotype;
therefore, SCOPE-Stabilized LASSO would be the best tool of choice.
Pathway enrichment revealed that SCOPE was able to better identify core
pathways ([68]Fig. 2, C and D) when considering the top pathways
enriched. In this scenario, randomized LASSO performs the poorest
because of the lower number of genes identified in comparison to the
Adaptive Elastic-Net and SCOPE. However, the larger number of genes
identified by Adaptive Elastic-Net, which led to the larger number of
false-positive causal genes (thus a lower F1 score in identifying the
same), enabled Adaptive Elastic-Net to achieve a similar performance to
SCOPE in the linear model ([69]Fig. 2C) but still provided an edge to
the more comprehensive coexpression network analysis of SCOPE in the
nonlinear scenario ([70]Fig. 2D).
SCOPE stably selected considerably fewer core genes while retaining
predictive power
SCOPE-Stabilized LASSO identified significantly fewer genes among
different splits of the data ([71]Fig. 3A, vertical red dashed lines).
The consistently selected genes are named in [72]Fig. 3A with details
in table S13. In contrast, on the same data, genes selected by a
standard LASSO are much more numerous and vary widely from around 10 to
45 genes ([73]Fig. 3A, colored distributions), documenting the
instability of gene selection by standard LASSO.
Fig. 3. Comparison of SCOPE to standard LASSO in stability and predictive
accuracy.
[74]Fig. 3.
[75]Open in a new tab
(A) Histograms of the number of genes selected by standard LASSO
(colored distributions) in comparison to SCOPE (vertical red dashed
lines) for each cancer. The thresholds chosen for SCOPE-selected core
genes were varied: θ[thr] = 0.90 for BRCA and θ[thr] = 0.75 for KIRC,
LUAD, COAD, PRAD, and THCA. These thresholds resulted in 5 to 10 core
genes being identified per cancer, identified in the six panels for
each cancer. (B) Prediction metrics for SCOPE (core genes) in
comparison to standard LASSO in terms of accuracy = [True Positives
(TP) + True Negatives (TN)]/[TP + TN + False Positives (FP) + False
Negatives (FN)], sensitivity, and specificity. (C) Prediction
accuracy = (TP + TN)/(TP + TN + FP + FN) for two independent microarray
datasets for lung cancer was obtained. In the case of SCOPE, the same
core genes identified and indicated in table S1 were used. For standard
LASSO, multiple sets of genes selected by independent LASSO runs in the
TCGA LUAD dataset were used to assess the varied distribution (due to
instability), and for top LASSO, the top genes in each standard LASSO
equal in number to those selected by SCOPE-Stabilized LASSO were used.
Prediction measures are calculated on the basis of the true labels of
the data (tumor/normal) and the predicted labels on the test data
sampled from the original TCGA data.
Despite being much smaller in number, the predictive power (in
predicting normal/tumor phenotypes) of SCOPE-selected core genes is
close to that obtained by standard LASSO. In-sample validation shows
that the few genes identified by SCOPE confer almost the same, and
sometimes even higher, predictive power compared with the many genes
selected by standard LASSO ([76]Fig. 3B). Restricting the LASSO to use
the top (indicated by the highest absolute coefficients) genes, equal
in number to the number of genes used by SCOPE, reveals poorer
predictive power in comparison with both the standard LASSO and
SCOPE-Stabilized LASSO.
We also resorted to external data validation using two microarray
datasets ([77]24, [78]25) based on the set of core genes identified by
SCOPE-Stabilized LASSO (listed in table S13) and the multiple runs of
standard LASSO (mirroring in practice the range that different people
might achieve on the basis of the genes that they ended up identifying)
as well as the top genes identified in each of the standard LASSO runs.
Evidently, the predictive accuracy remains close to one using standard
LASSO ([79]Fig. 3C) and higher in accuracy than using the top genes
identified by any single LASSO model. Internal and external validation
highlighted the ability of SCOPE-Stabilized LASSO to identify a highly
predictive handful of genes that are comparable in prediction accuracy
to the many-fold larger number of genes selected by a standard LASSO
model. The small margin also indicates that, while models including
more genes may be slightly more predictive, they may not all be vital
to tumorigenesis. Furthermore, such a large number of genes could be
extremely costly to experimentally validate and thus are not an ideal
outcome of in silico methods, a problem relieved by SCOPE-Stabilized
LASSO selection.
The consistency and stability of SCOPE-Stabilized LASSO over standard
LASSO were demonstrated by looking at the replicability over multiple
runs on the same data (with different splits of training/testing
samples). The comparison was conducted by randomly splitting the data
into training and testing samples 100 times, using a different random
seed for each split. This reflects the effect of choosing a different
training sample in a typical usage scenario. Proportions of runs in
which genes were selected are shown in table S14, illustrating the high
level of stability obtained by SCOPE-Stabilized LASSO over standard
LASSO.
SCOPE identified pan-cancer pathways, focusing on DNA replication and repair
Via standard coexpression analysis of gene networks, the core genes
selected by the SCOPE-Stabilized LASSO were used to form their
corresponding CGNs, which, in turn, were used to identify pathways
based on pathway enrichment analysis (Materials and Methods). The
pathways that are identified by multiple CGNs are the output of SCOPE
(table S15). Many of these pathways fall into the categories of “cell
growth and death,” “replication and repair,” and “folding, sorting, and
degradation.” These pathways are highly related to cancer cell
immortality and cancer genome damage response. A similar protocol was
also conducted using DiffCoEx analysis by looking at pathways
identified by multiple modules (table S16).
To further assess the stability of SCOPE, we analyzed the sharing of
core genes between cancers. At the gene level, MT-CO2 was identified as
a core gene by SCOPE in four of the six cancers. This gene produces the
cytochrome c oxidase subunit 2 protein, which is essential in a
mitochondrial process associated with oxidative phosphorylation.
Besides MT-CO2, no other core gene is shared among cancers, indicating
that different cancers may have different core genes if one does not
look at higher levels such as pathways.
We then characterized the sharing of pathways across cancers. To
quantify the extent of sharing, we first formed a within-cancer
statistic, π[cancer], which denotes the proportion of genes contained
in CGNs out of the total number of genes in each pathway. Then, the POS
was calculated as the summation of the π[cancer] values over all six
cancers. Higher values intuitively indicate higher overlap of the
pathway across cancers. Contrasting the results of the three tools,
namely, SCOPE, DE, and DiffCoEx, despite their substantial sharing in
terms of pathways identified ([80]Fig. 4A), showed quite different
landscapes in terms of sharing between cancers. Evidently, SCOPE
identified both cancer-specific and pan-cancer pathways characterized
by its two-spike distribution of POS: The cluster at the low-POS end
stands for cancer specific, while the cluster at the high-POS end
indicates pan-caner pathways ([81]Fig. 4B). In contrast, both DE and
DiffCoEx distributions have only one spike at the low-POS spectrum
([82]Fig. 4, C and D). These POS distributions are further detailed in
table S17 for SCOPE, DiffCoEx, and DE, evidencing the potential
drawback of DiffCoEx and DE in their inability to repetitively identify
key pathways that could be universally vital in cancers. This
distinction between SCOPE and standard methods suggests that SCOPE’s
stability in discovering pathways can reveal key pathways even in
different cancers.
Fig. 4. Comparison of SCOPE to alternative methods on pathway
identifications.
[83]Fig. 4.
[84]Open in a new tab
(A) Pathways identified by DiffCoEx, DE, and SCOPE are compared for
uniqueness and sharing. (B to D) POS, which indicates the level of
enrichment of a pathway across multiple cancers, is contrasted among
the three methods. (B) SCOPE uncovers both cancer-specific (notable by
the spike in lower POS) and pan-cancer shared pathways (notable by the
spike in higher POS), while both DiffCoEx (C) and DE (D) appear to be
more cancer specific than SCOPE as evidenced by the lower distribution
of POS.
We then annotated pathways that were identified by SCOPE to check
whether they were relevant. Investigating pathways related to the
hallmarks of cancer ([85]26) and the proportion of genes in each of
these pathways by each of the three methods ([86]Fig. 5, A to F)
reveals that SCOPE identifies these hallmarks across cancers quite
significantly. [87]Figure 5G looks at the POS for these hallmark
pathways across the different cancers. SCOPE stands out by identifying
the highest proportion of genes involved in these pathways.
Fig. 5. Comparison of proportion of genes in each cancer (π) identified by
SCOPE in contrast to DiffCoEx and DE in pathways related to hallmarks of
cancers.
[88]Fig. 5.
[89]Open in a new tab
Pathways shown are (A) DNA replication (has03030), (B) base excision
repair (hsa03410), (C) nucleotide excision repair (hsa03420), (D)
homologous recombination (hsa03440), (E) cell cycle (hsa04110), and (F)
p53 signaling pathway (hsa04115). (G) Comparison of POS across the
three methods for the same pathways (A to G). POS is calculated as the
sum of π values across the cancers for each pathway. Higher values
indicate higher discovery of genes related to each pathway across
cancers.
By analyzing the literature further, we realized that the pan-cancer
pathways revealed by SCOPE are biologically meaningful. Overlapped
pathways enriched using overrepresentation analysis on the Kyoto
Encyclopedia of Genes and Genomes (KEGG) database, excluding any
pathways that had no enrichment for one or more cancers, immediately
reveal the high enrichment of pathways related to regulating the
universal level of DNA/RNA/protein and, notably, the pathways related
to DNA repair. The most notable characteristic of cancer is the
unlimited growth of cancer cells, which also links to the cell cycle
pathway ([90]26). Mechanistically, they need readily available supplies
of materials for cell growth and replication, e.g., more DNA
replication, more RNAs transcribed by RNA polymerase and spliced by the
spliceosome, and more proteins translated by ribosome. The increased
supplies are also likely due to less RNA degradation, less protein
degradation by the proteasome, and more N-glycan biosynthesis for
N-linked glycosylation, one of the most abundant protein modifications
that play a critical role in tumorigenesis ([91]27). In addition,
increased DNA replication accumulates errors as DNA mutations.
Mutations inactivating tumor suppressor genes can further accelerate
the accumulation of mutations, partially through defective DNA damage
repair, and result in genome instability, a hallmark of all cancers
([92]26). Hence, this result demonstrates that the core genes
identified by SCOPE-Stabilized LASSO are stably connected with pathways
essential to tumor growth and/or associated with the fundamental
hallmarks of any type of cancer.
Pan-cancer pathways exhibit contrastive interaction patterns centered by core
genes
To further confirm the roles of the core genes in their discovered
pathways, we calculated the correlations between a core gene and all
the genes in the corresponding pathway. The core genes exhibit highly
disruptive patterns in the coexpression network. Taking the nucleotide
excision repair pathway network across multiple cancers as an example,
the coexpression networks fundamentally differ in structure and
intensity with respect to the core genes ([93]Fig. 6, A and B). Despite
different cancers using different core genes, the correlations between
the core genes and the other genes in the same pathway are universally
higher or lower in the tumor tissue. Many of the core genes are not DE
([94]Fig. 6B), indicating that core genes may contribute to cancers by
disrupting their interactions, although their own expression levels are
not significantly altered. These results further strengthen the role
that core genes appear to play in the pathology of these cancers.
Fig. 6. Example of the roles of core genes in a pan-cancer pathway uncovered
by SCOPE.
[95]Fig. 6.
[96]Open in a new tab
The nucleotide excision repair pathway (hsa03420) is used in this
example. Core genes (light blue) are in the center of the network with
the genes in this pathway (gray) arranged in a circle. Pearson’s
correlation coefficients are indicated as edges ranging from −1 (red)
to +1 (blue). The names of the genes and their correlations with the
core genes are noted in table S18. Boxplots contrast the distributions
of two sets of correlations (tumor versus normal) along with the P
value for the Kolmogorov-Smirnov test, with the null hypothesis being
that the two samples were drawn from the same distribution. (A) Core
genes are DE, and (B) core genes are not DE.
Other coexpression networks centralized by other core genes are
provided in figs. S10 to S14 and further detailed in table S18, showing
switched (opposing) correlation patterns (and, in some cases, an
absence of correlations) when contrasting tumor tissue to normal
tissue. These switched correlations appear to indicate that the core
genes identified by the SCOPE-Stabilized LASSO method are highly
connected genes that are indicators of the proper functioning of these
pathways, if not responsible for mediating these pathways.
In addition to the above pan-cancer shared pathway analysis, SCOPE also
identified cancer-specific pathways, some of which show contrastive
connectivity patterns. For instance, in breast cancer, glyoxylate and
dicarboxylate metabolism, fatty acid degradation, and regulation of
lipolysis in adipocytes were found (fig. S15, A to C) ([97]28). For
colon cancer, bile secretion, mineral absorption, and proximal tubule
bicarbonate reclamation were highlighted (fig. S15, D to F). Out of the
SCOPE-identified colon cancer–specific pathways, 90% are classified as
being in the category of metabolism, while other cancers do not show
such patterns.
In-depth annotation reveals hypothetical CD63-centered mechanism in kidney
cancer
Among the five core genes selected by SCOPE in kidney cancer (KIRC),
CD63 plays an indispensable role in VEGFR2 activation in response to
VEGF ([98]29). Notably, aberrant activation of the VEGF-VEGFR axis is a
pivotal driver in kidney cancer since more than 60% of patients with
kidney cancer harbor VHL mutations ([99]30). Inactivated VHL fails to
degrade HIF α subunits (HIFα) in kidney cancer cells. The accumulation
of HIFα induces the transcription of hypoxia-related genes and
activation of hypoxia signaling in the presence of oxygen. As a key
downstream target of HIFα, VEGF expression and secretion further cause
autocrine or paracrine activation of the VEGFR signaling pathway
([100]21). Hence, CD63 probably plays an oncogenic role in kidney
cancer. Consistently, high mRNA level of CD63 associates with adverse
prognosis in patients with KIRC (P = 0.0019; [101]Fig. 7A, 1). In
contrast, there is no such relationship in the other five cancer types
(fig. S16). In agreement with CD63’s role in the activation of VEGFR
signaling pathway, which is driven by VHL mutations in KIRC, the
association is more significant in VHL-mutated cohorts (P = 0.0006;
[102]Fig. 7A, 2) than in VHL–wild-type cohorts (P = 0.216;
[103]Fig. 7A, 3). Along this line, high expression of CD63 in kidney
tumors correlates with a hypoxia gene signature assessed by two
different scores ([104]Fig. 7, B and C). CD63 is also known as a marker
of exosomes, extracellular vesicles secreted by cells ([105]31). In
agreement with the fact that exosomes can contribute to metastasis
([106]32), CD63 shows a tendency to be correlated with metastasis in
patients with KIRC although barely above the significant cutoff of 0.05
(P = 0.0565; [107]Fig. 7D). In particular, CD63 knockout mice are
viable, fertile, and almost normal except for an altered water balance,
such as increased urinary flow, water intake, reduced urine osmolality,
and a higher fecal water content ([108]33). This does not only suggest
that CD63 plays a critical and specific role in kidney pathology, and
consequently in kidney tumorigenesis, but also hints that CD63 can be a
therapeutic target for kidney cancer with minimal systemic toxicity.
Anti-CD63 antibodies were reported to suppress allergy ([109]34) or
inhibit metastasis ([110]35) in vivo. It will be worth exploring
whether anti-CD63 antibodies are able to improve the potency of
targeted therapy or immunotherapy and inhibit metastasis in patients
with kidney cancer. Nevertheless, this example suggests that the core
genes selected by SCOPE may help exert bona fide biological functions
in the mechanisms of cancer.
Fig. 7. Hypothetical role of CD63 in kidney cancer.
[111]Fig. 7.
[112]Open in a new tab
EXP < 0 indicates samples in which the expression level of the gene
CD63 is lower than the arithmetic mean of the expression levels of the
gene across all samples, while EXP > 0 indicates a value higher than
mean expression levels. (A) Survival plots of patients considering
differing expression of CD63 in (1) all patients in the KIRC dataset,
(2) patients with VHL mutation, and (3) patients with VHL wild type
(WT). (B) and (C) indicate that a higher expression of CD63 correlates
with a higher expression of hypoxia-related genes profiled by two
([113]73, [114]74) well-known hypoxia gene signatures, while (D)
indicates the relationship between CD63 and metastasis in kidney
cancers. (E) (1 and 2) Connectivity network suggesting the role that
CD63 plays in the arginine and proline metabolism pathway with key
genes involved in the pathway [data and figures of (A) to (E) are
derived from the cBioPortal website ([115]www.cbioportal.org/)].
Among all genes connected with CD63, SAT2 is the one with the most
significantly differential correlations in tumor and normal tissues.
SAT2 mRNA level shows a negative correlation with CD63 in normal tissue
while exhibiting an almost opposite correlation in tumors
([116]Fig. 7E, 1 and 2). Many other genes in the pathway of arginine
and proline metabolism, such as AGMAT, DAO, ALDH4A1, PRODH2, GATM, and
HOGA1, also show similar patterns of switched correlations with CD63
([117]Fig. 7E, 1 and 2). The altered correlations in tumors uncovered
by SCOPE hint that these genes may play critical roles in kidney
tumorigenesis. In agreement with this hypothesis, agmatinase, encoded
by AGMAT, is diminished in kidney cancer samples, whereas AGMAT mRNA is
most abundant in human liver and kidney ([118]36). Moreover, SAT2, DAO,
ALDH4A1, PRODH2, GATM, and HOGA1 are ubiquitously expressed in the
kidney based on the Human Protein Atlas ([119]37, [120]38). However,
other genes belonging to the pathway of arginine and proline metabolism
were not identified or were only shown negligible correlation
differences by SCOPE, such as SAT1, NOS1, CKM, CKB, and ARG2, and do
not show obvious overexpression in the kidney ([121]37–[122]39). The
distinct tissue specificity of two groups of genes in the same pathway
of arginine and proline metabolism validates that SCOPE was able to
identify altered coexpression patterns in specific cancer types. In
contrast, neither DE nor DiffCoEx uncovered significant enrichment of
these pathways in KIRC, further strengthening the ability of SCOPE in
uncovering such biologically relevant pathways.
DISCUSSION
Current methods of driver gene identification use multiomics data,
particularly mutation data in collaboration with known biological
pathways. Transcriptomic data are seldom used for the identification of
driver genes. This is in part due to the inability to determine
causality using methods such as DE, DiffCoEx, and coexpression
networks. Gene expression data alone, while conveniently available, are
infrequently used for this purpose and rather directed toward biomarker
discovery. Our proposed method of stabilizing the LASSO such that it
identifies consistent predictors followed by coexpression and pathway
analysis enables researchers to identify the core genes and pathways by
taking advantage of the synergy between two disconnected fields: linear
feature selection and nonlinear coexpression network analysis. This
provides a method for experimentalists to narrow down candidate genes
using more cost-effective expression data. Furthermore, only a handful
of such core genes are selected, thus providing experimentalists an
ideal scenario of being able to study these few genes extensively.
SCOPE uses both coexpression and DiffCoEx in building the CGNs that
represent the units for enrichment. While it is usual for coexpression
to be typically studied, DiffCoEx is less used and the combination even
less so. The intuition here is that while genes coexpressed in both
tumor and normal tissues are clearly interacting with the core genes
identified, differentially coexpressed genes are even more so due to
the differences in their behavior between the two phenotypes. Thus,
combining both types of interactions leads to a better constructed
network, identifying more interesting groups of genes to be studied.
Discovering enriched pathways connected with core genes may provide a
rationale for targeted therapy against certain cancer types. A series
of genes in the ferroptosis pathway, including ACSL4, MAP1LC3B, ATG5,
PRNP, NCOA4, PCBP1, LPCAT3, VDAC3, FTH1, SLC39A14, SLC40A1, and
SLC11A2, showed significantly changed patterns of correlation with
PSMG4 in the PRAD dataset (table S19). Ferroptosis is a programmed cell
death driven by iron-dependent phospholipid peroxidation and reactive
oxygen species generation ([123]40). Since excessive iron contributes
to ferroptosis, PCBP1 and FTH1, which regulate iron metabolism and
storage, are considered negative regulators of ferroptosis. ATG5,
MAP1LC3B, and NCOA4 initiate autophagy and consequently promote iron
release from degraded iron-bound proteins. SLC40A1, SLC39A14, and PRNP
export iron from cells and reduce ferroptosis, whereas SLC11A2
regulates iron release to the cytoplasm and may enhance ferroptosis.
ACSL4, LPCAT3, and VDAC3 regulate the mechanism of phospholipid and
NADH oxidation and play roles as positive regulators of ferroptosis
([124]41). In particular, AIFM2, a critical ferroptosis suppressor
identified in 2019 ([125]42, [126]43), shows reduced expression in
prostate cancer (PRAD) [logFC (fold change) = −0.9008]. All these data
indicate that ferroptosis inducers might be potent in patients with
PRAD. Consistently, recent work has reported the induction of
ferroptosis as a new therapeutic strategy for advanced prostate cancer
([127]44). Neither DiffCoEx nor DE highlighted the ferroptosis pathway
as significant in PRAD, while SCOPE was able to highlight this pathway
uniquely and significantly in PRAD.
There are also many other pan-cancer analyses. A weighted gene
co-expression network analysis (WGCNA) ([128]45)–based approach
([129]46) identified multiple hallmarks of cancer stratifying different
tumors contrary to SCOPE, where the pathways and hallmarks that are
shared by different cancers are identified. Another study conducting
survival analysis based on the TCGA database ([130]47) identifies
unique prognostic tumor-specific genes that are also cancer hallmark
genes and remarks on their tumor specificity. However, the shared
pathways identified by SCOPE may highlight that cancer hallmarks may be
induced at a pathway level even if the same hallmark genes are not
clearly expressed in each type of tumor. A mutation-based approach to
pan-cancer network analysis ([131]48) identifies 16 significant
subnetworks that span across multiple pathways with previously
identified roles in cancer, further contributing to the hypothesis of
shared pathways explored by SCOPE.
An inherent limitation of transcriptomic data is that most biological
functions are performed by proteins, not mRNAs. One example is the p53
signaling pathway in BRCA, which is significantly enriched by the gene
pairs of MT-CO2 with CDK4, AIFM2, or CHEK2 (table S20). Furthermore,
another core gene, CD300LG, shows switched correlations with TP53 and
CASP9, although the respective CGN is not enriched for the p53
signaling pathway. Although TP53 (encoding p53) showed altered
correlations with CD300LG and MT-CO2, the putative transcriptional
targets of p53, such as CDKN1A and MDM2 ([132]49), did not show
significant changes of correlation. It implies that the p53
transcriptional activity was not significantly changed in the presence
of significantly changed mRNA level of TP53. This implication was
further supported by two facts: (i) The regulation of p53 activity is
dominant at the posttranslational level ([133]49), not at the mRNA
level; (ii) 35% of patients in the TCGA-BRCA database harbor TP53
mutations, and most TP53 mutations abolish the transcriptional activity
of p53. We looked for top transcription factor binding sites in the
promoters of these genes (CASP9, CDK4, AIFM2, and CHEK2) provided by
QIAGEN through GeneCards ([134]50) and found CCAAT/enhancer binding
proteins (C/EBPs) bound to these promoters. Since the
phosphatidylinositol 3-kinase (PI3K)–AKT–mTOR signaling pathway is
highly mutated in the BRCA database [fig. S17; obtained from cBioPortal
in the TCGA-BRCA database ([135]51, [136]52)] and is able to regulate
the transcriptional activity of C/EBPs ([137]53), a reasonable
explanation is that a hyperactivated PI3K-AKT-mTOR axis induces the
mRNA expression of these targets via C/EBPs as the transcriptional
factor (but not TP53) in patients with BRCA. Nevertheless, with more
data of cancer at the protein level, such as The Pathology Atlas
([138]38) and The Cancer Proteome Atlas Portal ([139]54, [140]55),
SCOPE may be substantially empowered to provide more valuable insights
into the aberrant connections in tumor cells.
To recap, we have presented SCOPE, a method stabilizing gene selection
and coexpression network analysis, which is able to identify core genes
and pathways underlying cancers. Its effectiveness has been
demonstrated by various analyses from three angles (i.e., selection of
few, stable, and predictive genes; pan-cancer shared pathways; and the
role of core genes in connectivity analysis). Moreover, in-depth
annotations have revealed the pivotal role of CD63 on tumorigenesis in
kidney cancer and the potential therapeutic application of anti-CD63
antibody on patients with kidney cancer. As a proof of concept, we have
only contrasted cancer and normal tissues in this work. However, the
statistical framework is applicable to any case/control settings. In
the future, we will adapt SCOPE to analyze clinically important
qualities such as whether a patient will respond to medical treatments
such as immunotherapy, paving the way to the application of precision
medicine in more applications.
MATERIALS AND METHODS
SCOPE-Stabilized LASSO selection
While LASSO has proven versatile in many applications, statistically,
it has become apparent that in the presence of multiple correlated
features, it may be inconsistent in its selection of features, even in
multiple random samplings of the same data ([141]56). This has led to a
number of new methods being proposed that are all modifications of the
original LASSO such as adaptive LASSO ([142]56), random LASSO
([143]57), and bolasso ([144]19). A seminal work by Meinshausen and
Bühlmann ([145]17) discusses stability paths, which obtain the
selection probabilities of each feature by subsampling along all
possible values of the tuning parameter with randomized LASSO, which
introduces a random penalty λ for each feature. While such methods are
statistically proven and can lead to sound results, they appear to be
seldom used in the field of genomics.
In SCOPE, a simpler solution to the inconsistency of variable selection
in LASSO is proposed and applied in the form of a bootstrapped LASSO
([146]Fig. 1, A and B). While simple in design, it produces consistent
results that are highly predictive. In the case of this paper, where
the phenotype is binary (tumor or normal sample), a logistic LASSO
regression model is trained multiple times by subsampling from the same
dataset. Genes that were selected in most of the models (over a
threshold proportion θ[thr]) are used to build a final logistic
regression model for which the final accuracy will be assessed. This
stable subset of genes is proposed to be the “core” genes of the
disease. These core genes can then be used in other predictive models
or for further downstream analysis; SCOPE uses a coexpression-based
pathway analysis using these selected core genes.
The SCOPE-Stabilized LASSO used in this analysis features the consensus
of 1000 training-test splits of a 70%-30% split ratio (each with a
consistent case/control ratio of the full dataset). Each LASSO model
trained was tuned for the optimal value of λ using 10-fold
cross-validation. The thresholds used for the different datasets of the
real analysis are detailed in Results.
Coexpression and pathway analysis
It is assumed that core genes interact with multiple other genes that
may be involved in pathways responsible for disease mechanisms. To
identify these genes, we conducted both coexpression and DiffCoEx
analysis ([147]Fig. 1D). To claim a gene as being significantly
coexpressed with a core gene, we required a null distribution for the
correlations (coexpression) between pairs of random genes. To this end,
we drew random pairs of genes and calculated the Pearson correlation
coefficients of these pairs. Using this distribution, we obtain the
(r[thr]=) 97.5th percentiles for both positive and negative
correlations. This allowed us to identify genes that are significantly
coexpressed with core genes. Each set of genes thus identified
(secondary coexpressed genes, along with their corresponding core
gene), termed CGNs, was then tested for pathway enrichment.
To reflect the fact that some critical genes are not so highly
coexpressed but are significantly differentially coexpressed when
contrasting cancer and normal tissues, we also obtained the genes that
are significantly differently coexpressed with the core gene between
tumor and normal tissues ([148]Fig. 1D). As in the case of the
coexpression analysis, a null distribution of the DiffCoEx values
(|corr[case] − corr[control]|) was obtained, and the (
[MATH: rthrD= :MATH]
) 97.5th percentile was used to select significantly differentially
coexpressed secondary genes. These secondary genes from DiffCoEx
analysis were also added to the CGNs for pathway analysis below.
Pathway enrichment is typically used to assess whether a particular set
of genes overlap with known biological pathways significantly higher
than by chance. There are many databases containing such pathways, and
SCOPE uses the KEGG ([149]58, [150]59) database because of its
comprehensiveness and popularity. Overrepresentation analysis ([151]60)
is used to identify the statistical significance, and the R package
WebGestaltR ([152]61, [153]62) was used for testing pathway enrichment
against the KEGG database. This analysis results in several pathways
enriched (at FDR ≤ 0.05) for each CGN (comprising of genes both
coexpressed and differentially coexpressed) underlying the focal core
gene. SCOPE then discovers pathways that are commonly influenced by
CGNs (seeded by different core genes).
For a single disease such as a cancer, an index for the level of
sharing of a pathway (across multiple CGNs within a cancer), π[cancer],
is defined as the number of coexpressed genes (including the core gene)
found to be enriched in this pathway divided by the total number of
genes in the pathway. When multiple diseases are jointly analyzed
(e.g., the six cancers used here as a demonstrating example), SCOPE
will further discover pathways common to all diseases (table S17). The
summation of this single cancer-specific index (π[cancer]) over all the
cancers is noted as the POS. Intuitively, a higher POS indicates a
higher overlap of the pathway across different cancers.
Methods compared to SCOPE
Least absolute shrinkage and selection operator
The primary benchmark and point of comparison is a traditional L[1]
regularized logistic regression model, which uses the addition of the
absolute value of the coefficients to promote sparsity in the loss
function. Given n number of samples and p number of features/variables,
the regularized loss function of a logistic regression model takes the
form
[MATH: minβ∑i
mi>=1n[yix<
/mi>iTβ−log(1+exiTβ)]+λ∑j=<
/mo>1p|βj| :MATH]
(1)
where λ is the tuning parameter controlling the trade-off between
sparsity and accuracy. Ten-fold cross-validation is typically used to
tune for this parameter, and the R package glmnet ([154]63, [155]64)
enhanced by glmnetUtils ([156]65) was used to fit LASSO models here
(Supplementary Materials). Genes selected by a traditional LASSO and
genes selected by the SCOPE-Stabilized LASSO step are compared in
Results.
Adaptive Elastic-Net
The Adaptive Elastic-Net ([157]23) is essentially a merging of the
popular elastic-net ([158]66) (which combines L[1] and L[2]
regularization) and the adaptive LASSO ([159]56), which assigns
data-dependent weights to the coefficients in the L[1] penalty. The
data-dependent weights are calculated using a standard elastic-net
model, and these estimates are denoted by
[MATH: β^(enet) :MATH]
. The Adaptive Elastic-Net estimates are then given by
[MATH: β^^=(1
mn>+λ2n){arg minβ∣∣y−Xβ∣∣22+<
msub>λ2∣∣β∣∣22+<
msubsup>λ1∗<
mstyle displaystyle="true"
scriptlevel="0">∑j=1
mn>pw^j∣βj∣}
:MATH]
where
[MATH:
w^
j=[∣β^j
msub>(enet)∣]−γ
:MATH]
and γ is a positive constant. The R package gcdnet ([160]67) is used to
tune and fit the Adaptive Elastic-Net models used here, while the
elastic-net weights were obtained using the glmnet package.
Randomized LASSO
The randomized LASSO modifies the penalty λ of a typical LASSO model to
a randomly chosen value in the range
[MATH: [λ,λα] :MATH]
where α ∈ (0,1] (default = 0.8) is defined as the weakness. Assuming
W[j] to be independent, identically distributed (IID) random variables
in [α,1], the randomized LASSO estimator is
[MATH: β^^=arg
minβ(∣∣y−Xβ∣∣22+<
mi mathvariant="normal">λ∑j=1
mn>p∣βj∣Wj
) :MATH]
Randomized LASSO, as described in ([161]17), also follows the
additional step of stability selection by selecting only variables that
are above a certain threshold (π[thr], default = 0.8) across random
subsamples. The implementation of randomized LASSO in the monaLisa
([162]68) R package was used for comparison purposes here.
DiffCoEx analysis
Standard network-based methods of analysis of expression data typically
use the interconnectedness of genes (in the form of coexpression) to
identify important networks (or “modules”) of genes. However, in a
case-control setting, the use of DiffCoEx can prove more informative
because of the contrastive nature of the analysis. DiffCoEx ([163]69),
one of the most popular methods extending the popular WGCNA
coexpression network analysis tool, was chosen for comparison. DiffCoEx
identified differentially coexpressed modules (Supplementary Materials)
that were used in pathway enrichment similar to how CGNs were used for
pathway enrichment and for studying pathway overlaps.
Differential expression
An edgeR-limma–based pipeline ([164]70) was used to normalize the data
to log[2]–counts per million values, and a linear model incorporating
weights from voom to correct for the mean-variance relationship was
used to statistically detect the DE of genes in each of the cancers.
The pipeline was run using default values for all parameters as
described in the workflow.
Data generation and model fitting for simulations
A number of different factors such as signal-to-noise ratio,
linear/nonlinear effects on phenotype, and different coexpression
structures were considered in generating the data required for the
simulations. The GTEx ([165]22) was used as the source of expressions.
A random sample of 15 pathways from the KEGG pathway database was
considered as candidates of gold-standard core pathways. Each
simulation begins by randomly sampling p (p = 3,5,10) number of core
pathways from the available pathways. Then, the Pearson correlation of
all genes in the 15 pathways is calculated pairwise for each of the
genes in the core pathways. The absolute sums of these values are then
calculated, providing an empirical estimate of the interaction of each
gene with the core pathways. Then, g[c] (g[c] = 5,10,15) number of core
genes are selected on the basis of (i) the highest interacting genes
and (ii) the lowest interacting genes with the core pathways. A further
g[e] (g[e] = 0,3,5) extra causal genes are randomly selected from all
the remaining genes of the 15 pathways, and the combined set of genes,
{g[c], g[e]}, was used to generate the phenotype.
Phenotypes are generated using both linear and nonlinear models:
1) Linear:
[MATH: Yinitiallinear=∑j=1gc+g
mi>eβjXj
:MATH]
, where β~Unif( − 10,10)
2) Nonlinear:
[MATH: Yinitialnonlinear=∑j=1gc+g
mi>e∑i=1jβj,i<
/mrow>Xj⊙Xi
:MATH]
, where β~Unif( − 10,10) and ⊙ represents element-wise multiplication
To ensure that a consistent signal-to-noise ratio (s[nr] = 0.7,0.8,0.9)
is achieved and a binary phenotype is produced, both Y[initial] values
undergo the following transformations to obtain the respective
phenotypes. Let
[MATH: σg2=var(Yinitial) :MATH]
. Then,
[MATH: σerror2=<
/mo>σg2[1snr2<
mo>−1] :MATH]
and Y[noisy] = Y[initial] +
[MATH: ϵ :MATH]
where
[MATH: ϵ∼N(0,σerror2) :MATH]
. Then, let
[MATH: p=11<
/mn>+exp(−Ynoisy) :MATH]
. Last, Y[i]~Bin(1, p[i]). Ten replicates of each unique combination of
parameters were obtained, resulting in a total of 3240 simulations.
The above procedure generates data ready for analysis. The analytic
procedure is generally the same as what we did for real data analysis.
The only alteration is on the interaction term in the nonlinear case.
In practice, one has to rely on regularized regression to select causal
genes out of many candidates. Following this procedure, when analyzing
the data, around 700 terms (genes) from the randomly selected pathways
were included in the regularized regression to test the ability of
feature selections. However, in the nonlinear cases, if we put all
potential combinations of all noisy genes to the model, the power will
be close to zero. As such, we assume that the candidates under
interactions are known and only put the interaction terms between these
genes into the regression. This might still be close to the practice as
users should have a rough idea of which genes are under interactions;
without which, such feature selection methods would not work well.
Nevertheless, we make the interacting genes available for all three
competitive methods to ensure the fairness of the comparison.
Real data analysis
Data source and processing
The TCGA program initiated by the National Cancer Institute and the
National Human Genome Research Institute in 2006 ([166]20) offers a
wealth of omics data on 32 different cancers and their subtypes at 68
primary sites. This includes RNA sequencing data that provide a
snapshot of the transcriptomic landscape of the tumor site and of solid
normal tissue close to the tumor site.
Six cancers [breast invasive carcinoma (BRCA), kidney renal clear cell
carcinoma (KIRC), lung adenocarcinoma (LUAD), colon adenocarcinoma
(COAD), prostate adenocarcinoma (PRAD), and thyroid carcinoma (THCA)]
were chosen primarily because of the large samples of expression data
available and the inclusion of normal tissue from the same patients. A
breakdown of the sample sizes and disease status is given in table S21.
The raw count data were downloaded from the TCGA data portal and then
converted to transcripts per million values using gene lengths obtained
through the biomaRt ([167]71, [168]72) package. Phenotype (tumor or
normal) was determined on the basis of the sample type column provided
in the database, and “primary tissue” was considered as cases and
normal tissue as controls. Any other sample types such as “metastasis”
were discarded. Models were then fitted on these processed data. Two
additional datasets (lung cancer associated), E-MTAB-6043 ([169]24) and
MTAB-6699 ([170]25), were downloaded from ArrayExpress and used to
externally validate the predictive accuracy of core genes selected by
SCOPE and alternative methods.
Acknowledgments