Abstract

Background

   Copy number aberrations (CNAs) in cancer affect disease outcomes by
   regulating molecular phenotypes, such as gene expressions, that drive
   important biological processes. To gain comprehensive insights into
   molecular biomarkers for cancer, it is critical to identify key groups
   of CNAs, the associated gene modules, regulatory modules, and their
   downstream effect on outcomes.

Methods

   In this paper, we demonstrate an innovative use of sparse canonical
   correlation analysis (sCCA) to effectively identify the ensemble of
   CNAs, and gene modules in the context of binary and censored disease
   endpoints. Our approach detects potentially orthogonal gene expression
   modules which are highly correlated with sets of CNA and then
   identifies the genes within these modules that are associated with the
   outcome.

Results

   Analyzing clinical and genomic data on 1,904 breast cancer patients
   from the METABRIC study, we found 14 gene modules to be regulated by
   groups of proximally located CNA sites. We validated this finding using
   an independent set of 1,077 breast invasive carcinoma samples from The
   Cancer Genome Atlas (TCGA). Our analysis of 7 clinical endpoints
   identified several novel and interpretable regulatory associations,
   highlighting the role of CNAs in key biological pathways and processes
   for breast cancer. Genes significantly associated with the outcomes
   were enriched for early estrogen response pathway, DNA repair pathways
   as well as targets of transcription factors such as E2F4, MYC, and ETS1
   that have recognized roles in tumor characteristics and survival.
   Subsequent meta-analysis across the endpoints further identified
   several genes through the aggregation of weaker associations.

Conclusions

   Our findings suggest that sCCA analysis can aggregate weaker
   associations to identify interpretable and important genes, modules,
   and clinically consequential pathways.

Introduction

   Cancer genomes have an enriched burden of somatic copy number
   aberrations [[28]1–[29]3] (CNAs) such as DNA copy number gains and
   losses that harbor or are proximal to important oncogenes and tumor
   suppressor genes controlling cell growth and division [[30]4]. The CNAs
   may directly regulate cellular growth pathways and other gene sets
   impacting key biological outcomes by altering gene expressions at the
   RNA level that are important for tumorigenesis and outcomes. Thus,
   identification of such driver CNAs and understanding the mechanism
   through which they affect downstream gene expression and the resulting
   effects on cancer outcomes are crucial for effective disease management
   and control.

   This past decade has witnessed significant advances in our ability to
   measure large volumes of data on CNA and gene expressions from tumor
   samples. These data offer unprecedented opportunities to identify
   biomarkers associated with disease outcomes. These opportunities have
   accelerated a shift towards the development of novel genomics-based
   prognostic and therapeutic tools. For example, CNAs are associated with
   poor survival in breast cancer patients [[31]5]; mutations in PIK3CA
   are associated with poor survival in certain estrogen receptor
   (ER)-positive breast cancers [[32]6,[33]7]. Oncotype DX, the
   FDA-approved score based on the expression of 21 genes, and MammaPrint,
   a score based on the expression of 70 genes, are associated with poor
   survival in breast cancer patients and are used for making treatment
   decisions in estrogen-positive early-stage breast cancers and all
   early-stage breast cancers, respectively [[34]8–[35]11]. Identifying
   such biomarkers has been pivotal to advancing cancer care in the past
   decade. However, due to the low prevalence of specific CNAs and since
   existing gene expression scores are relevant for only specific
   subgroups of cancer patients, current biomarkers can only be used for
   managing the disease of a small segment of patients. Given the growth
   and aging of the US population and, with this, the projected 50%
   increase in cancer incidence over the next three decades [[36]12],
   further research is urgently needed to identify biomarkers for the
   effective management of all types of cancers. In this article we
   present a novel statistical and computational approach to address this
   need, focusing on CNAs and gene expressions in breast cancer.

   Standard methods for identifying biomarkers examine the association
   between individual genomic features and outcomes (such as survival) and
   select the top-ranking biomarkers after adjusting for multiple
   comparisons [[37]13–[38]15]. Penalized regression methods [[39]16],
   including the causal modeling with expression linkage for complex
   traits (Camelot) approach [[40]17], relate multiple CNAs and gene
   expressions to an outcome via a multivariable regression model to
   select biomarkers by imposing sparsity via L[1] or L[2] penalties.
   These methods prioritize biomarkers based on their individual effects
   on the outcome without leveraging the putative biological relationships
   between CNAs and gene expressions. It is well-understood that genes and
   their products rarely act in isolation but work with other genes or
   their products to form networks or pathways to address specific
   biological functions [[41]18]. Thus, a comprehensive understanding of
   gene networks at DNA and RNA levels and the resulting impact on disease
   outcomes is warranted for identifying clinically relevant biomarkers.
   To this end, the piecewise linear regression spline (PLRS) method
   [[42]19], the lots of lasso (Lol) method [[43]20–[44]22], the weighted
   correlation network analysis [[45]23], and Oncodrive-CIS [[46]24]
   identify significant CNA-gene expression pairs but do not provide
   comprehensive maps of regulatory modules that are essential for gaining
   insights into biomarkers reflecting disease biology. The relationship
   between CNAs and downstream networks or modules of gene expressions and
   the resulting effect on cancer outcomes have not been fully explored.

   The key challenges in analyzing such modules are the large
   dimensionality of the CNA and gene expression data sets and the
   sparsity of CNAs. The latter issue leads to considerable challenges for
   modeling large and sparse matrix of CNAs in relation to large gene
   expression matrix to identify sets of somatic changes that regulate the
   expression levels of gene modules. A pragmatic strategy to overcome
   this challenge would be to view this as a sparse canonical correlation
   analysis (sCCA) [[47]25,[48]26] problem to produce sparse latent
   variables representing biologically relevant CNA sets and gene
   expression modules to achieve maximal correlation between the CNA and
   gene expression matrices and relate the resulting gene expression
   modules to disease outcomes to identify the biomarkers of interest.
   This paper is based on the thesis that substantial biological and
   clinical relevance for cancer outcomes can be effectively captured
   using a sparse or smaller number of CNA sets and gene expression
   modules. But this requires innovative statistical and computational
   application of sCCA to extract such sparse latent variables
   effectively.

   We present a two-step analysis framework that aims to map sets of CNAs
   that regulate modules of gene expressions to affect cancer-related
   outcomes. In the first step, sCCA is used to identify gene expression
   modules that are regulated by CNA sets in an unsupervised manner that
   does not employ the disease outcomes. Given the gene modules, in the
   second step, we use a multivariable generalized linear model framework
   to isolate genes within a particular gene expression module that are
   associated with breast cancer-related outcomes. The key advantage of
   our proposed approach is that it is particularly amenable to
   interpretation as it not only identifies the genes whose expression
   levels are associated with outcomes but also identifies the set of CNAs
   that potentially regulates them.

   We apply our approach to analyze data on 1904 breast cancer patients
   from the METABRIC study [[49]7,[50]27] for whom data are available on
   CNAs, gene expressions, and various clinical information related to
   breast cancer (See Supplementary Methods in [51]S1 File for details on
   the study). As an example, analysis, we focus on two clinical outcomes:
   estrogen receptor status, which is a binary variable, and overall
   survival, which is a censored variable measured as months elapsed from
   the date of study entry to the date of death or, if the patient is
   alive, the end of the study. Through extensive downstream analysis we
   demonstrate that the genes identified to be associated to the clinical
   outcomes have plausible independent evidence to be biologically
   relevant for the outcome of interest. Further, we meta-analyzed the
   results across six different clinical outcomes to identify genes that
   were potentially associated with multiple outcomes. We additionally
   validated the gene modules identified from the METABRIC study in an
   independent data on individuals in a study of breast invasive carcinoma
   from The Cancer Genome Atlas (TCGA) that includes CNAs, gene
   expressions, clinical variables, and outcomes in 1,077 patients. Both
   the data sets were obtained from the cBioPortal catalog [[52]28].

Overview of methods

   To describe our approach, we assume that we have individual level data
   for n individuals on p copy number aberrations (CNA) and g
   gene-expressions.

Step 1. Identifying Gene modules through sCCA

   As mentioned before, the large dimensionality of the data sets is a
   major limitation in modeling the association between CNA and gene
   expressions. Here we assume that, despite the large numbers of CNA and
   genes (n << min(p,g)), the association pattern between them can be
   described by mapping small sets of CNA to small sets of genes, which
   essentially denotes biologically relevant gene modules associated to
   sets of CNA. The association can arise due to physical proximity
   (cis-association) or it may be distal (trans-association).

   A natural way to view this would be by using sparse canonical
   correlation analysis (sCCA) [[53]25,[54]26] problem to produce sparse
   latent variables representing biologically relevant CNA sets and gene
   expression modules to achieve maximal correlation between the CNA and
   gene expression matrices.

   We first aim to identify gene modules regulated by CNA, by mapping
   groups of CNA to groups of associated gene expressions using sparse
   canonical correlation analysis (sCCA). sCCA identifies approximate
   orthogonal gene modules that are regulated by CNA. This step is
   agnostic of any phenotypic information or outcomes. For n individuals,
   let G^n×p be the matrix for p sites of copy number aberration (CNA)
   sites with G[ij] being the number of insertion or deletions for
   individual i at site j, and E^n×g be the normalized gene-expression
   levels for g genes across n individuals. Sparse canonical correlation
   analysis (sCCA) identifies sparse linear combinations of CNA (u^p×1;
   termed CNA component) and gene-expressions (v^g×1; termed gene
   component) such that the correlation between Gu and Ev is maximized
   i.e.,
   [MATH:
   <mo>(</mo><mrow><mi>u</mi><mo>,</mo><mi>v</mi></mrow><mo>)</mo><mo>=</m
   o><mrow><mi>a</mi><mi>r</mi><mi>g</mi><mi>m</mi><mi>a</mi><mi>x</mi></m
   row><mspace width="0.25em"></mspace><msup><mrow><mover
   accent="true"><mrow><mi>v</mi></mrow><mo>˜</mo></mover></mrow><mrow><mi
   >T</mi></mrow></msup><mspace width="0.25em"></mspace><msup><mrow><mover
   accent="true"><mrow><mi>E</mi></mrow><mo>˜</mo></mover></mrow><mrow><mi
   >T</mi></mrow></msup><mspace width="0.25em"></mspace><mover
   accent="true"><mrow><mi>G</mi></mrow><mo>˜</mo></mover><mover
   accent="true"><mrow><mi>u</mi></mrow><mo>˜</mo></mover> :MATH]
   [MATH: <mi mathvariant="normal">u</mi><mi
   mathvariant="normal">n</mi><mi mathvariant="normal">d</mi><mi
   mathvariant="normal">e</mi><mi mathvariant="normal">r</mi><mspace
   width="0.25em"></mspace><mo>||</mo><mover
   accent="true"><mrow><mi>u</mi></mrow><mo>˜</mo></mover><msub><mrow><mo>
   ||</mo></mrow><mrow><mn>1</mn></mrow></msub><mo>≤</mo><msub><mrow><mi>c
   </mi></mrow><mrow><mi>u</mi></mrow></msub><mo>;</mo><mo>||</mo><mover
   accent="true"><mrow><mi>v</mi></mrow><mo>˜</mo></mover><msub><mrow><mo>
   ||</mo></mrow><mrow><mn>1</mn></mrow></msub><mo>≤</mo><msub><mrow><mi>c
   </mi></mrow><mrow><mi>v</mi></mrow></msub><mspace
   width="0.25em"></mspace><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">n</mi><mi mathvariant="normal">d</mi><mspace
   width="0.25em"></mspace><mo>||</mo><mover
   accent="true"><mrow><mi>u</mi></mrow><mo>˜</mo></mover><msub><mrow><mo>
   ||</mo></mrow><mrow><mn>2</mn></mrow></msub><mo>=</mo><mn>1</mn><mo>,</
   mo><mo>||</mo><mover
   accent="true"><mrow><mi>v</mi></mrow><mo>˜</mo></mover><msub><mrow><mo>
   ||</mo></mrow><mrow><mn>2</mn></mrow></msub><mo>=</mo><mn>1</mn> :MATH]

   where ||.||[h] denotes the L[h] norm and
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold-italic">G</mi></mrow><mo>˜</mo></mover> :MATH]
   (or
   [MATH: <mover accent="true"><mrow><mi
   mathvariant="bold-italic">E</mi></mrow><mo>˜</mo></mover> :MATH]
   ) denotes the normalized version of the corresponding matrix. The
   subsequent pairs of sCCA components are obtained similarly by matrix
   deflation and under the constraint of being uncorrelated or orthogonal
   to the previous components. Ideally each pair of sCCA components
   selects a sparse set of CNA sites that regulate the expression of a
   sparse set of genes across the genome, denoted by the non-zero elements
   in u and v respectively. Overall, the sCCA aggregates multiple
   associations between the selected CNA and genes and hence represents
   principal regulation or association patterns. Additionally, due to the
   orthogonality constraint each pair of sCCA component reflect
   approximately an independent or orthogonal pattern of regulation. c[u]
   (and c[v]) represent the sparsity parameters for the CNA and gene
   components respectively. To facilitate interpretation, we choose the
   sparsity parameters such that there is no overlap between the CNA
   selected in the components. (See Supplementary Methods in [55]S1 File
   for details).

Step 2. Association with outcomes

   Given the gene-modules identified in Step 1, we now identify the genes
   within these modules that are associated with the outcomes of interest.

Univariate outcomes

   Let
   [MATH: <msub><mrow><mover
   accent="true"><mrow><mi>E</mi></mrow><mo>˜</mo></mover></mrow><mrow><mi
   >k</mi></mrow></msub> :MATH]
   be the n×r[k] matrix of normalized gene-expressions for the genes
   selected in sCCA component k, where r[k] = ||v[k]||[0] and v[k] denotes
   the gene-component of the kth sCCA component. We use the following
   generalized linear model to associate the r[k] genes to a phenotype y
   as
   [MATH:
   <mi>g</mi><mo>[</mo><mi>E</mi><mo>(</mo><mrow><mi>y</mi></mrow><mo>)</m
   o><mo>]</mo><mo>=</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></mr
   ow></msub><mo>+</mo><msub><mrow><msub><mrow><mover
   accent="true"><mrow><mi>E</mi></mrow><mo>˜</mo></mover></mrow><mrow><mi
   >k</mi></mrow></msub><mi>β</mi></mrow><mrow></mrow></msub><mo>,</mo>
   :MATH]

   where g[.] is a canonical link function and β, β[0] are regression
   parameters. For each gene module identified by sCCA, we perform the
   association analysis, record gene-specific p-values, and obtain the
   false discovery rates (FDR). Genes with FDR < 0.05 are declared to be
   significantly associated with the outcome.

Multivariate outcomes

   If multiple, potentially correlated, outcomes are available for the
   individuals, we can meta-analyze results across the multiple outcomes
   to identify genes that are possibly associated to more than one
   outcome. Let p[1], p[2],…,p[s] be the univariate p-values for a
   particular gene for s outcomes, from the previous univariate
   association analysis. These p-values are likely to be correlated due to
   potential correlation between the outcomes. We perform a
   Cauchy-transformed meta-analysis [[56]29] which has been shown to
   maintain correct false positive rate even in the presence of
   correlation [[57]30,[58]31]. Specifically, we transform each p-value to
   a Cauchy variable as
   [MATH:
   <msub><mrow><mi>c</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><m
   i mathvariant="normal">t</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">n</mi><mspace
   width="0.25em"></mspace><mo>(</mo><mi>π</mi><mo>(</mo><msub><mrow><mi>p
   </mi></mrow><mrow><mi>i</mi></mrow></msub><mo>−</mo><mn>0.5</mn><mo>)</
   mo><mo>)</mo> :MATH]

   The test statistic is the unweighted mean of these transformed
   variables which follows a standard Cauchy distribution, under the null
   hypothesis of no association, irrespective of the correlation between
   the outcomes [[59]29].
   [MATH:
   <mi>T</mi><mo>=</mo><mfrac><mrow><mn>1</mn></mrow><mrow><mi>s</mi></mro
   w></mfrac><mrow><munderover><mo
   stretchy="false">∑</mo><mrow><mi>i</mi><mo>=</mo><mn>1</mn></mrow><mrow
   ><mi>s</mi></mrow></munderover><mrow><msub><mrow><mi>c</mi></mrow><mrow
   ><mi>i</mi></mrow></msub></mrow></mrow><mo>∼</mo><mi>C</mi><mi>a</mi><m
   i>u</mi><mi>c</mi><mi>h</mi><mi>y</mi><mo>(</mo><mn>0,1</mn><mo>)</mo>
   :MATH]

   The overall p-value can be calculated by inverting the cumulative
   density function of the standard Cauchy distribution.

Results

   We used publicly available de-identified individual data from
   cBioPortal ([60]https://www.cbioportal.org/). More information on the
   funding source and data use restrictions from cBioPortal has been
   detailed in [61]https://www.cbioportal.org/about. Hence no prior IRB
   approval was required for our study. Since we are performing secondary
   analysis of publicly available data, we did not directly collect
   participation consent. The details for participation consent has been
   outlined in the original papers by Pereira et al. [[62]7] and TCGA. We
   started with 1,904 individuals who had complete data at 22,544 CNA
   sites and expression level data for 24,360 genes. Sparse canonical
   correlation analysis (See [63]Methods) identified 14 gene modules
   through the sCCA components. In this article, we will use the terms
   modules primarily to denote the collection of genes selected in a gene
   component and set to denote the CNA sites selected in a CNA component.
   Across the 14 gene modules, sCCA selects 824 genes, whose expression
   levels are regulated by 1,851 CNA sites overall ([64]Table 1). In
   general, for each sCCA component, the CNA component selects CNA sites
   located in a small sub-region within a chromosome ([65]Fig 1A). Our
   sCCA analysis was agnostic of the physical location of the CNA in the
   genome and hence the sCCA algorithm is not guided or biased towards
   selecting positionally proximal CNA. However, due to the high
   correlation between nearby CNA, each CNA component selects a smaller
   subregion in chromosome of high correlated CNA which might have
   regulatory effects on the gene selected in the corresponding gene
   component. For example, the 115 CNA sites selected in CNA component 4,
   were located on chromosome 17q11.2-q21.32 region. The corresponding
   gene components can then be viewed as the gene module having strong
   association to or being potentially regulated by the selected CNA sites
   and can possibly mediate their effects. In general, we expect the
   regulatory structures captured by the sCCA components to be
   approximately independent. However, we notice that the expression
   levels of genes selected in gene component 8 has a higher correlation
   with the CNA selected in CNA component 2 ([66]Fig 1B). It is to be
   noted that CNA components 2 & 8 defined highly proximal regions in
   chromosome 8. Hence, the correlation between gene module 8 & CNA set 2
   is not unexpected due to LD and/or possible long range regulatory
   activity. This indicates that genes modules 2 and 8 are possibly
   coregulated by the CNA selected in the corresponding CNA components.
   Overall, a CNA component defines a chromosomal subregion which has
   potentially multiple independent regulatory effects on the gene module
   identified by the corresponding gene component. The advantage of the
   sCCA in this application is that it can aggregate multiple, possibly
   weaker association to select groups of CNA associated with genes
   modules (See S1 and S2 Tables in [67]S2 File for full list of CNA and
   genes selected).

Table 1. Description of the 14 gene modules and CNA components identified
through sCCA.

   CNA components
   Gene components (modules)
   sCCA component Number of CNA selected Chromosome Number Genomic
   location of selected CNA (Mb) Number of Genes selected Genes on
   different chromosome Distal genes on same chromosome
   1 139 5 72.17–119.54 70 24 18
   2 107 8 135.45–145.01 59 0 14
   3 131 1 153.20–156.96 56 0 6
   4 115 17 41.86–45.50 67 9 4
   5 145 22 35.64–44.57 63 0 6
   6 130 16 65.28–77.25 58 0 6
   7 125 22 17.95–24.63 58 0 10
   8 125 8 115.95–134.48 59 0 24
   9 145 7 100.40–127.61 71 0 17
   10 151 9 121.82–131.49 74 0 1
   11 129 16 1.06–3.72 64 1 11
   12 117 1 9.65–16.64 63 0 12
   13 147 3 168.00–193.44 71 0 11
   14 145 17 58.16–69.14 62 0 1
   [68]Open in a new tab

Fig 1. Results from sCCA analysis.

   [69]Fig 1
   [70]Open in a new tab

   (A) Chromosomal subregions identified by 14 CNA components. For each
   CNA component the region between the most distal CNA selected in that
   component is marked. (B) Average squared correlation between the CNA
   selected in CNA components and genes selected in Gene components.
   Correlation between Genes in components 2 and CNA in component 8 (and
   vice versa) are observed due possible correlation between selected CNA
   and long-range regulatory effects. Similar correlation is observed for
   components 5 & 7 as well.

Gene modules capture cis and trans effects

   Through the identification of gene modules, we capture regulatory
   effects of CNA. In general, we found that most of the associations
   aggregated by the sCCA components identified effects of CNA sites on
   nearby (cis) gene expression. On average, 44.8% of the genes selected
   in each sCCA component also has a CNA in or near the same gene selected
   in the respective CNA component. This is expected since cis effects are
   known to be much stronger compared to distal (trans) effects and would
   have a direct regulatory effect on the expression level of a nearby
   gene. However, several examples of distal (trans) regulatory effects on
   expressions of genes on different chromosome were also identified in
   the gene modules ([71]Fig 2A and 2B). On average, 3.2% of the genes
   selected in the gene components were on a different chromosome than the
   corresponding CNA component. Further, on average 15.9% of the genes
   selected in the gene components were more than 10Mb away from the
   sub-region of chromosome selected by the corresponding CNA component,
   indicating long range regulatory effects ([72]Table 1). For example,
   among the 67 genes selected in component 4, 9 genes are on different
   chromosomes and an additional 4 genes are outside of the region
   17q11.2-q21.32, which contains the CNA selected by CNA component 4. We
   found possible mechanistic explanations for several such distal
   associations in existing genomic and profiling data. For example, gene
   component 4, selects atlastin GTPase 3 gene (ATL3) on chromosome 11.
   ATL3 is a downstream target for transcription factor Signal Transducer
   and Activator of Transcription 3 (STAT3) in ENCODE transcription factor
   database [[73]32,[74]33]. Interestingly, a copy number aberration of
   STAT3 was selected among the CNA sites in CNA component 4, which
   suggests a possible cis-mediation mechanism for the association of this
   and other nearby CNA sites with ATL3.

Fig 2.

   [75]Fig 2
   [76]Open in a new tab

   Examples of trans-associations identified in using the genes selected
   through sCCA in (A) Gene component 1 (B) Gene component 4. Several
   genes in a chromosome different from that of the selected CNA is
   identified. Further, numerous distal genes (> 10Mb) on the same
   chromosome are detected as well.

Evidence of coregulation

   To further validate whether the genes selected by the 14 significant
   sCCA components had any overall evidence of biological coregulation as
   well, we used large-scale transcription factor databases from the
   ENCODE study [[77]32] and existing ChIP-chip, ChIP-seq, and several
   other transcription factor binding site profiling experiments (ChEA)
   [[78]34,[79]35]. We report proportion of genes in each of the 14 gene
   modules which are downstream targets of at least 20 transcription
   factors across the 181 transcription factors and their downstream
   targets reported in ENCODE (S3 Table in [80]S2 File). We found that,
   across the 14 gene module identified though sCCA components, on an
   average 58.2% of the genes were downstream targets for more than 20
   transcription factors. For ChEA, which reports data on 202 TFs and
   their downstream targets, we found similarly that on average 56.1%
   genes were downstream targets for more than 20 transcription factors.
   Although exploratory in nature, this analysis provides suggestive
   evidence that a large proportion of the genes in each gene module might
   have evidence of being coregulated (directly or indirectly) by TFs and
   the sCCA analysis can independently identify such patterns of
   coregulation.

Replication of gene modules using TCGA breast invasive carcinoma data

   Selection of genes and CNA can be influenced and biased if there are
   systematic biases and batch effects. So, we investigated whether the
   gene modules and CNA sites identified through sCCA, were replicable in
   an external dataset. For that, we used the TCGA breast invasive
   carcinoma data (See Supplementary Methods in [81]S1 File for details on
   the study), which reports data on CNA sites and gene expression in
   primary breast tumor tissue for 1077 breast cancer patients. We adopted
   a resampling-based procedure to test whether the sCCA components
   represented gene modules and CNA sets that had stronger association
   than expected at random. For a given gene module (selected through a
   gene component), we evaluated whether the observed average squared
   correlation between these genes and CNA selected in corresponding CNA
   component were higher than what is expected at random. Similarly, for a
   set of CNA (selected through a CNA component), we evaluated whether the
   observed average squared correlation between these CNA and genes
   selected in corresponding gene component were higher than what is
   expected at random. We found that among the gene modules and CNA sets
   selected in METABRIC and present in TCGA, the average correlation for
   all the 14 components were significantly (p-value < 0.05) higher than
   expected (S1 Fig in [82]S1 File). Further 10 of these components were
   strongly significant as well (p-value < 0.001). Such a result is not
   unexpected as the sCCA components include a majority of stronger cis
   effects. Further, this also suggests that the sCCA components in
   METABRIC possibly captured true effects replicable across different
   datasets and not potential artefacts and batch effects within METABRIC
   (See Supplementary Methods in [83]S1 File).

Association with breast cancer related outcomes

   Given the 14 gene modules obtained through sCCA analysis, we
   investigated whether these gene modules were associated with 7
   different types of breast cancer outcomes ([84]Table 2). At a lenient
   cutoff of FDR < 0.05, we found that 562 genes across the 14 modules
   were associated with at least one of the outcomes (S4 Table in [85]S2
   File). Further, at a stringent exome-wide cutoff of p-value < 2.5 х
   10^−06, we found 94 genes associated with at least 1 outcome.
   Subsequently, through several downstream analysis we investigated
   whether the genes that are significant for a given outcome indeed had
   external evidence of association to breast cancer related outcomes.
   Here we demonstrate the results for two distinct types of outcomes.

Table 2. Description of the seven breast cancer related outcomes analyzed,
and the number of genes associated significantly.

   Outcome % cases Median survival Significant Genes (FDR < 0.05)
   Estrogen Receptor status (ER) 76.6 - 210
   Progesterone receptor status (PR) 52.9 - 237
   Human Epidermal growth factor Receptor 2 status (HER2) 12.4 - 255
   Grade (Grade 3 vs Grade lower than 3) 47.5 - 65
   Lymph Nodes Examined to be present (present vs absent) 47.8 - 12
   Age at diagnosis (< 50 years) 78.4 - 100
   Overall Survival - 154.2 73
   [86]Open in a new tab

Estrogen receptor (ER)

   Of the 1,904 individuals in the sample, 1,459 (76.6%) individuals had
   ER positive status. We performed logistic regression-based association
   tests of the 14 significant gene modules. Across the components we
   found that 210 genes were significant at an FDR < 0.05 and 36 genes
   were significant with p-value < 2.5 х 10^−06. Among the genes
   significantly associated with ER status, we identified known breast
   cancer related genes such as Microtubule Associated Protein Tau (MAPT),
   whose expression is highly associated with low sensitivity to taxanes
   that are important drugs for breast cancer treatment [[87]36], and
   Macrophage migration inhibitory factor (MIF), a pro-inflammatory
   cytokine whose blockade reduces the aggressiveness of invasive breast
   cancer [[88]37]. Further our sCCA-based model provides a potential
   explanation of the related biological mechanism. For example, among
   genes selected in gene component 4, we found that Dynein Axonemal Light
   Intermediate Chain 1 (DNALI1), on chromosome 1 is associated with ER
   status (p-value = 4.8 х 10^−04), being trans-regulated by CNA sites on
   chromosome 17 selected in CNA component 4 ([89]Fig 3A). DNALI1 is a
   downstream target for transcription factors STAT3 and UBTF, both of
   which are selected in CNA component 4. Further, there is evidence of
   physical interactions between the proteins resulting from DNALI1 and
   UBTF in large protein interactions databases as well [[90]38]. This
   indicates the possibility that DNALI1 mediates the effects of the CNA
   sites in chromosome 5 selected by CNA component 1, on ER status. Thus,
   not only we identify the genes whose expression levels are associated
   with breast cancer outcomes, we also additionally identify which CNA
   potentially regulate such genes.

Fig 3. Association analysis with breast cancer related outcomes.

   [91]Fig 3
   [92]Open in a new tab

   (A) DNALI1 gene on chromosome 1 associated with ER status and
   trans-regulated by a CNA of transcription factor CHD1 on chromosome 5.
   (B) Trans regulation of CD2BP2 gene on chromosome 16p11.2 by a CNA in
   ZNF263 gene located in chromosome 16q13.3 which are approximately 27Mb
   apart. (C) Association of CD2BP2 expression level with overall survival
   probability. Expression levels have been dichotomized as high and low
   using 75-th percentile as cut-off. (D) p-values of 72 genes identified
   to be strongly associated (p-value < 2.5 х 10^−06) with multiple
   outcomes, across the 7 outcomes. P-values < 1 х 10^−12 are collapsed to
   1 х 10^−12 for the ease of viewing.

   Through pathway enrichment analysis ([93]Table 3), we found that the
   genes significantly associated to ER status at FDR < 0.05, were
   enriched for hallmark pathways [[94]39] like early response to
   estrogen, DNA repair and MYC targets, MYC being a well-known oncogene
   [[95]40]. Further, in pathways curated from chemical and genetic
   perturbation experiments, we found that the genes were enriched for
   genes highly positively co-expressed with BRCA1 and BRCA2, two genes
   well reported to be involved in breast cancer [[96]41–[97]43]. Further,
   the genes were also enriched for targets of several transcription
   factors, like NELFE [[98]44,[99]45], E2F4 [[100]46,[101]47] and CREB1
   [[102]48,[103]49] which are known to play key roles in several cancers,
   including breast cancer. However, overlap of the identified significant
   genes with key cancer related pathways suggest a possible mechanistic
   explanation for the outcome. For example, of the 210 significant genes,
   7 genes (ADCY9, ABAT, MAPT, SLC9A3R1, CANT1, BCL2, FAM102A) are in the
   early estrogen response hallmark pathway [[104]39]. These 7 genes are
   identified as part of gene modules 4, 11, 14, 7 and 10. This indicates
   that the CNA selected in the corresponding CNA components, which
   regulates these gene components respectively, as shown in sCCA
   analysis, significantly changes estrogen response and can possibly be
   causal for ER status. So essentially perturbations in early estrogen
   response hallmark pathway can occur due to CNA in chromosomal
   sub-regions defined by components 4, 11, 7 and 14. This
   interpretability is a key advantage of our analysis approach.

Table 3. Different categories of pathways enriched for the 210 genes
associated (FDR < 0.05) with ER status.

   Category Pathway Adjusted p-value Genes in pathway Genes overlap
   GO Cellular macromolecule localization 5.7 x 10^−09 1886 39
   Intracellular protein transport 1.2 x 10^−07 1156 28
   Cellular response to DNA damage stimulus 4.4 x 10^−07 841 23
   Catalytic complex 4.7 x 10^−10 1552 36
   Adenyl Nucleotide binding 3.9 x 10^−06 1536 30
   Hallmark Estrogen Response (early) 4.9 x 10^−03 200 7
   DNA repair 4.9 x 10^−03 149 6
   E2F targets 9.3 x 10^−03 200 6
   MYC targets 9.3 x 10^−03 200 6
   MTORC1 signaling 4.5 x 10^−02 200 5
   Curated Gene sets
   NIKOLSKY_BREAST_CANCER_8Q23_Q24_AMPLICON 8.2 x 10^−14 157 17
   PUJANA_BRCA1_PCC_NETWORK 2.3 x 10^−08 1617 34
   CHARAFE_BREAST_CANCER_LUMINAL_VS_MESENCHYMAL_UP 4.3 x 10^−07 446 17
   VANTVEER_BREAST_CANCER_ESR1_UP 5.7 x 10^−05 208 15
   PUJANA_BRCA2_PCC_NETWORK 5.4 x 10^−04 423 12
   Immunologic Signatures GSE4984_GALECTIN1_VS_VEHICLE_CTRL_TREATED_DC_DN
   1.4 x 10^−06 198 12
   GSE2770_UNTREATED_VS_IL12_TREATED_ACT_CD4_TCELL_2H_DN 3.8 x 10^−06 200
   12
   GSE19825_NAIVE_VS_IL2RAHIGH_DAY3_EFF_CD8_TCELL_DN 1.8 x 10^−05 200 11
   TF targets ENCODE: NELFE 6.5 x 10^−46 9442 173
   ENCODE: E2F4 2.8 x 10^−34 12626 180
   ENCODE: CREB1 5.7 x 10^−33 12289 177
   ChEA: EGR1 3.1 x 10^−09 5000 82
   ChEA: ELF3 2.2 x 10^−07 1760 40
   [105]Open in a new tab

   We further benchmarked whether the 210 genes found to be significant
   using this sCCA based approach against 161 gene selected through a
   standard sparse logistic regression (See Supplementary Methods in
   [106]S1 File). Assessing the model fit for these two approaches in TCGA
   data, we found that the BIC of the model including genes significant
   through the sCCA-based analysis was substantially lower than the model
   that included the genes selected through sparse logistic regression.
   This indicates that our approach not only provides a better
   interpretability of the overall genetic mechanism but also produces
   comparable or better model fit for the data.

Overall survival (OS)

   Of the 1,904 individuals in the sample, 1,109 (76.6%) individuals died
   during the study, with median survival time being 154 months
   approximately. In a cox proportional hazard model, we found 73 genes to
   be significant (FDR < 0.05) across the 14 components. Notably, several
   interesting distally regulated genes are identified in our analysis.
   For example, in sCCA component 11, we found that the expression of
   CD2BP2 gene on chromosome 5 is associated with the overall survival.
   This gene is differentially regulated in T47D cells of breast cancer
   patients in response to tamoxifen [[107]50], a widely used hormonal
   therapy drug for breast cancer [[108]51]. The transcription start site
   for CD2BP2 is over 27 Mb downstream from the subregion of chromosome 5
   selected through the CNA component 11. The corresponding selected set
   of CNA in component 11 contains a TF ZNF263, which has a long-range
   regulatory effect on CD2BP2 on the same chromosome [[109]32] and
   indicates that possibly CD2BP2 mediates the effects of the selected CNA
   producing significant change in overall survival probability ([110]Fig
   3B and 3C).

   A comprehensive pathway enrichment analysis ([111]Table 4) reveals that
   the selected genes are enriched in gene-sets and pathways defined by
   several breast cancer related perturbation experiments. For example, we
   found a significant enrichment of the genes associated to OS, in the
   genes related to adipogenesis. Enrichment was found among genes up
   regulated in early primary breast tumors expressing ESR1 vs the ESR1
   negative ones. In addition, the genes significantly associated to OS
   were enriched for targets of several key cancer TFs like ELK1, YY1 and
   RUNX3 [[112]52–[113]54]. As highlighted above, through the sCCA
   components and the subsequent cox PH association model, we not only
   identify which genes are associated with OS but also detect the CNA
   sites regulating these gene expressions and, hence affecting OS.

Table 4. Different categories of pathways enriched for the 73 genes
associated (FDR < 0.05) with overall survival (OS).

   Category Pathway Adjusted p-value Genes in pathway Genes overlap
   GO RNA metabolic process 4.2 x 10^−03 1542 14
   Macromolecule catabolic process 4.2 x 10^−03 1366 13
   Cellular component disassembly 1.2 x 10^−02 537 8
   Hallmark Adipogenesis 2.8 x 10^−02 200 4
   Curated Gene sets DIAZ_CHRONIC_MEYLOGENOUS_LEUKEMIA_UP 1.2 x 10^−03
   1397 14
   Reactome: Metabolism of RNA 1.3 x 10^−03 668 10
   NIKOLSKY_BREAST_CANCER_17Q21_Q25_AMPLICON 2.1 x 10^−02 332 6
   Immunologic Signatures GSE3982_NEUTROPHIL_VS_TH1_DN 4.8 x 10^−04 199 7
   GSE3982_EOSINOPHIL_VS_NKCELL_DN 1.5 x 10^−03 197 5
   GSE27786_NKCELL_VS_NEUTROPHIL_UP 8.1 x 10^−03 199 5
   TF targets ENCODE: RUNX3 8.1 x 10^−16 11816 63
   ENCODE: ELK1 1.5 x 10^−14 11349 61
   ENCODE: YY1 1.5 x 10^−12 12289 177
   ChEA: ETS1 8.0 x 10^−08 1359 20
   ChEA: PADI4 9.4 x 10^−03 877 9
   [114]Open in a new tab

Multiple outcomes

   We further meta-analyzed results across all the seven breast cancer
   related outcomes to identify genes that are possibly associated to
   multiple outcomes. Since the outcomes are correlated and as a result
   the association p-values across the outcomes for each gene are
   correlated, it is difficult to use standard meta-analysis for this. In
   fact, the effect size estimates for association models pertaining to
   different types of outcomes (binary and survival), would complicate the
   interpretation of effect size based meta-analysis. Here, we used the
   Cauchy combination test to meta-analyze results across the seven
   outcomes. 72 genes were identified to be significant at the exome-wide
   p-value threshold of 2.5 х 10^−06 ([115]Fig 3D). At FDR < 0.05, we
   found 508 genes to be significantly associated with the set of 7
   outcomes. Although majority of these associations were driven by
   significant associations with at least one outcome and weaker
   association with several others, 13 genes were also identified (Cauchy
   combination FDR < 0.05) which had no significant association (FDR >
   0.05) to any single outcome but had possibly weaker association with
   multiple outcomes. For example, we found E4F1 gene (Cauchy combination
   test p-value = 0.028), a gene that a can induce cell growth arrest
   [[116]55], to be significant (FDR < 0.05) through the Cauchy
   combination approach which has nominal associations with the
   presence/absence of lymph nodes (p-value = 0.0083), overall survival
   (p-value = 0.011) and HER2 status (p-value = 0.029). DDX5 (Cauchy
   combination test p-value = 0.021), a gene well reported to be
   associated to breast cancer and regulates DNA replication and cell
   proliferation [[117]56], had multiple weaker associations with the
   grade of tumor (p-value = 0.017), overall survival (p-value = 0.004)
   and HER2 status (p-value = 0.022). We conducted pathway enrichment
   analyses with the 508 genes that were found significant at FDR < 0.05
   in meta-analyses. Similarly, as before, the results show that the
   numerous relevant pathways and gene-sets related to breast cancer are
   significantly enriched for these genes.

Discussions

   Extensive research has established that CNA are indeed important for
   several cancer types and subtypes, especially in breast cancers
   [[118]7]. However, the intermediate mechanisms and processes via which
   CNA impact breast cancer related outcomes have not been conclusively
   established and merits further research. In this article we have
   outlined a novel and generalizable analytic approach to identify how
   CNA regulate expression levels of gene modules that ultimately
   influence several breast cancer related outcomes. Our approach involves
   two steps: using sparse canonical correlation analysis to identify gene
   modules associated with sets of CNA, followed by testing association
   between the gene modules and breast cancer related outcomes. We further
   carried out a meta-analysis across different types of outcomes to
   identify genes with multiple associations. Extensive downstream
   analysis shows that the genes identified through our analysis have key
   relevance for breast and other cancers that have also been noted in
   other studies. Unlike these other studies, our approach additionally
   identifies CNA sets that possibly regulate the genes which in turn
   bring about changes in outcomes related to breast cancer. Below we
   summarize the advantages of our analysis and its potential
   generalizability.

Identifying and interpreting gene modules through sCCA

   The identification of gene modules using sCCA, agnostic of the clinical
   or disease outcome, is a key advancement that we propose over existing
   work on this topic. Numerous methods have been developed to identify
   significant associations between individual pairs of CNA and gene
   expressions. However, recently several authors have hypothesized that
   the effects of somatic genetic variants like CNA are cascaded through
   complex intermediate gene network and modules to bring about phenotypic
   change [[119]57]. Thus, identifying individual CNA-gene expression
   associations, although informative, cannot provide deeper insight into
   the gene modules that can potentially be impacted by CNA. Through our
   joint analysis approach in sCCA, we map groups of CNA to gene modules,
   which essentially identifies broad gene sets potentially regulated by
   CNA. In fact, existing transcription factor databases show that the
   gene modules thus detected through sCCA have suggestive evidence of
   coregulation, which indicates that such groupwise mapping approach can
   identify patterns of biological regulation as well.

   One of the key interpretations of the gene modules is that they
   represent approximately independent regulatory patterns due to the
   orthogonality condition imposed by sCCA. Thus, in principle, the first
   step in our analysis, identifies key distinct biological regulatory
   processes that are impacted within primary breast tumor tissue. The
   advantage that sCCA provides over identifying single CNA-gene pairs is
   that, since sCCA can aggregate multiple possibly weaker association
   within the identified components, it can be statistically more powerful
   in identifying gene modules. In fact, transcriptomic studies have shown
   that at moderate sample size, similar to that of METABRIC study, sCCA
   can outperform standard pairwise regression in identifying broad
   downstream gene modules regulated by genetic variants. However, we
   acknowledge that there will possibly be a plethora of regulatory
   associations beyond the ones identified through sCCA, which can be
   identified in a larger sample.

Association analysis using multivariable regression

   The subsequent association analysis using gene modules is relatively
   standard and has been used commonly. However, interpreting the
   association p-values needs caution since the association model is a
   joint regression. In general, single gene vs outcome tests are
   different from this since the joint model additionally adjusts for
   correlation between the genes and reports the p-value conditional on
   the gene module. This multivariable regression framework identifies the
   genes that are associated to the clinical outcome while adjusting for
   the correlation within the module. While association between and
   outcome and a single gene can arise either due to true causality or due
   to correlation of the tested gene with the true causal gene, our
   approach accounts for the dependency and hence significant association
   can potentially be causal.

Meta-analysis of results from correlated clinical outcomes

   The multiple outcome meta-analysis [[120]29] demonstrates another
   advantage of our approach. In general, complex diseases like cancers,
   and in particular breast cancer, can have numerous biomarkers of
   disparate types. Combining results across the biomarkers can highlight
   overall important genes and genes affecting multiple biomarkers.
   However, meta-analyzing association results across them is not
   straightforward since the results are correlated. Further for different
   types of biomarkers (continuous, discrete, binary, survival etc.) the
   effect sizes have disparate interpretation and hence standard
   meta-analysis might not be appropriate. However, the Cauchy combination
   approach alleviates these problems since it is based on p-values and
   not on the effect size. Further, due to the correlation agnostic
   property of Cauchy distributions, meta-analysis of correlated p-values
   controls false positive rates. In principle, using this approach,
   meta-analysis of different types of biomarkers and outcomes are
   possible, as long as the univariate association mean model is correctly
   specified.

Generalizable analysis

   The approach that we adopted here is highly generalizable as an overall
   analysis framework. The first step pertains to groupwise mapping and
   identification of modules followed the association analysis in the
   second step. Although we adopted an sCCA approach here for its ease of
   interpretation, several other methods mapping sets of genetic variants
   to gene modules can be used instead. For example, methods for
   biclustering and matrix factorization can be adapted in the first step
   to identify gene modules. In fact, groupings based on functional
   annotation can also be incorporated to further strengthen the
   mechanistic interpretation of the identified modules. The subsequent
   association analysis can also be customized to address scientific
   questions of interest. However, one of the major advantages of our
   pipeline is that, in principle, the two steps can be performed on
   separate datasets as well. For example, in current studies large scale
   genomic and transcriptomic data for many individuals are available
   while detailed information on phenotypes and traits might not be
   available. Thus, the identification of gene modules can be performed
   using such data while the subsequent association analysis can be
   carried out in a separate data using generalized meta-analysis
   [[121]58].

   Although unique in its use of sCCA, our approach is not the only method
   that aims to identify disease related gene sets. Several other methods
   have been devised to for this purpose including, but not limited to,
   DriverNet [[122]59], DawnRank [[123]60], Prodigy [[124]61],
   PersonaDrive [[125]62], especially for different oncological
   applications. However, our approach has several important distinctions
   from these methods. One of the key distinctions is in the motivation.
   While majority of the aforementioned methods aim to distinguish driver
   mutations and genes from passenger ones, we aim to broadly characterize
   the regulatory associations between somatic CNA and gene expression.
   The main goal of our approach is to identify the genes whose
   expressions are regulated by CNA and have association with breast
   cancer related outcomes. Furthermore, in terms of the input data
   required, the aforementioned methods rely on prior knowledge on
   gene-networks (directed or undirected). In contrast, our method does
   not require such prior information and hence can be used to formulate
   downstream hypothesis relating the gene-sets to breast cancer or to
   prioritize gene-sets for further analyses.

   Our analysis approach currently has several limitations. To improve
   interpretability and enforce partitioning, we have chosen the
   regularization parameter c[u] to be such that the CNA components have
   no overlap for selected CNA. However, despite that our algorithm can
   select physically proximal and correlated CNA in different components.
   Although approximately the CNA components remain orthogonal, the
   effects on the gene modules can still be correlated due to this feature
   (for example CNA components 2 and 8 or 5 and 7). Further research is
   warranted on the parameter settings to obtain desirable performance and
   interpretation. Additionally, the optimal number of components in sCCA
   is chosen heuristically by maximizing the iterative ratio of canonical
   correlations rather than using any significance or enrichment tests. In
   the current set up, although recent statistical tests have been
   proposed for canonical correlation loadings, formulating analytical
   tests of significance has proven difficult and methods based on sCCA
   have mostly resorted to resampling methods [[126]63]. In the future,
   research on Bayesian formulations coupled with sequential testing is
   warranted to perform tests of significance for regularization and
   clustering methods which can indicate the optimal number of components
   and parameter settings. Further, it remains to be explored whether
   other regularization techniques like fused Lasso [[127]64] which are
   more suitable for correlated settings, produce more interpretable and
   precise results. Additionally, sCCA only explores linear interactions
   between CNA and gene expressions. Although this assumption has been
   used in multiple different contexts, in the future, research is
   required to understand whether linearity of interactions is a valid
   assumption and whether possible kernelization is required to introduce
   nonlinear effects.

   Together, our analysis provides a comprehensive understanding of the
   impact of CNA can impact different breast cancer outcomes via
   regulation of intermediate gene modules. If a particular gene is
   significantly associated with a breast cancer related outcome, we can
   identify which set of CNA of which genomic subregion regulates it using
   the identified gene modules. Further, overlap of the significant genes
   with several breast cancer related pathways identifies the genes within
   the predefined biological process is differentially regulated by CNA to
   bring about phenotypic change. In future, as larger studies emerge with
   a greater coverage and spectrum of molecular phenotypes, a more
   comprehensive insight as to the intermediate regulatory mechanism will
   take shape. For that, it will be imperative to move beyond identifying
   single variant-gene or variant-outcome associations and conceptualize
   associations in context of networks and modules. The broad intuition of
   this analysis framework can further be extended to multi-view data sets
   and can be useful in integrative analysis of multi-omics data.

Supporting information

   S1 File. Details and description of methods, description of study
   samples used for secondary analyses.

   Including S1 Fig and corresponding caption.

   (DOCX)
   [128]Click here for additional data file.^ (458.2KB, docx)
   S2 File. Supplementary tables.

   (XLSX)
   [129]Click here for additional data file.^ (47.7KB, xlsx)

Data Availability

   The data and code are available from the following sources: METABRIC:
   [130]https://www.cbioportal.org/study/summary?id=brca_metabric TCGA
   Breast Cancer:
   [131]https://www.cbioportal.org/study/summary?id=brca_tcga Code for
   analysis: [132]https://github.com/diptavo/METABRIC_analysis ShinyGO:
   [133]http://bioinformatics.sdstate.edu/go/ FUMA:
   [134]https://fuma.ctglab.nl/ CHEA:
   [135]https://maayanlab.cloud/Harmonizome/dataset/CHEA+Transcription+Fac
   tor+Targets ENCODE:
   [136]https://maayanlab.cloud/Harmonizome/dataset/ENCODE+Transcription+F
   actor+Targets.

Funding Statement

   DD was supported by R01 grant from the National Human Genome Research
   Institute [1 R01 HG010480-01; PI Dr. Nilanjan Chatterjee]. AS and JS
   were supported by R01 grant from the National Cancer Institute [7 R01
   CA197402-05; PI: Jaya Satagopan]. The funders had no role in study
   design, data collection and analysis, decision to publish, or
   preparation of the manuscript.

References