Graphical abstract

   graphic file with name ga1.jpg
   [43]Open in a new tab

   Keywords: Regulatory network, Breast cancer, Multi-omics, Integrative
   network-based approach, GWASs

Abstract

   Although genome-wide association studies (GWASs) have successfully
   identified thousands of risk variants for human complex diseases,
   understanding the biological function and molecular mechanisms of the
   associated SNPs involved in complex diseases is challenging. Here we
   developed a framework named integrative multi-omics network-based
   approach (IMNA), aiming to identify potential key genes in regulatory
   networks by integrating molecular interactions across multiple
   biological scales, including GWAS signals, gene expression-based
   signatures, chromatin interactions and protein interactions from the
   network topology. We applied this approach to breast cancer, and
   prioritized key genes involved in regulatory networks. We also
   developed an abnormal gene expression score (AGES) signature based on
   the gene expression deviation of the top 20 rank-ordered genes in
   breast cancer. The AGES values are associated with genetic variants,
   tumor properties and patient survival outcomes. Among the top 20 genes,
   RNASEH2A was identified as a new candidate gene for breast cancer.
   Thus, our integrative network-based approach provides a genetic-driven
   framework to unveil tissue-specific interactions from multiple
   biological scales and reveal potential key regulatory genes for breast
   cancer. This approach can also be applied in other complex diseases
   such as ovarian cancer to unravel underlying mechanisms and help for
   developing therapeutic targets.

1. Introduction

   Genome-wide association studies (GWASs) have identified thousands of
   risk single nucleotide polymorphisms (SNPs) for human complex diseases
   [44][1], [45][2], [46][3]. However, understanding the biological
   function and molecular mechanisms of the associated SNPs involved in
   complex diseases is challenging. Breast cancer represents a typical
   example of such status quo. Breast cancer is one of the most common
   diagnosed cancers and leading causes of cancer-related death in females
   [47][4]. GWASs have identified over 7000 SNPs for breast cancer based
   on the results collected from the NHGRI GWAS Catalog (release on
   2020-08-26) [48][1] and published GWASs [49][5], [50][6]. Most of these
   SNPs are located within non-coding genomic regions and their functions
   need to be interpreted. The genetic information flow from genome to
   traits goes through various intermediate molecular layers, including
   genome, epigenome, transcriptome, proteome, and metabolome. Different
   type of molecules works together as an interactive system to affect
   biological processes [51][7], [52][8]. Therefore, integration of
   multiple omics data might offer an unprecedented opportunity to
   illuminate the complex pathogenesis [53][9]. For example, Zhang et al.
   used datasets from multiple biological scales to construct regulatory
   networks and identified TYROBP as a key gene in late-onset Alzheimer's
   disease [54][10]. Therefore, integrating GWASs findings with multiple
   layers of functional data might provide new insight into the complex
   pathogenesis of breast cancer.

   There are increasing evidence showing that genetic variants can perturb
   the regulatory network and consequently contribute to the changing of
   traits [55][11], [56][12]. Incorporating multi-omic data into network
   analyses has the advantage of constructing regulatory networks by
   combining biological signals from different scales [57][11], [58][12],
   [59][13], [60][14], and therefore can enhance the understanding of the
   molecular mechanisms underlying the pathophysiology of diseases.
   Several previous studies have constructed regulatory networks to
   unravel potential mechanisms for complex diseases by using integrative
   approaches. For example, Jang et al. [61][15] used the chromatin
   interactome and protein interactome for combinatorial regulatory
   variants to find driving genes in breast and liver cancers. Hsiao et
   al. [62][16] performed a network analysis on the gene level and the
   gene set level. However, these studies didn’t consider the prior
   information of GWASs signals. Castro et al. [63][17] created a
   TF-driven gene regulatory network for breast cancer by combining GWASs
   variants and eQTLs. Although these studies have proved the
   effectiveness of integrative network-based models for identification of
   new driver genes or biological pathways, no studies have constructed a
   full-scale network by using molecular interactions across multiple
   biological scales to provide systemic insight into breast cancer.

   Therefore, in this study, we developed a framework named integrative
   multi-omics network-based approach (IMNA) to capture genetic-driven
   regulatory networks and predict key regulatory genes for breast cancer.
   We combined the interactions and functional relationship data from
   multiple biological scales, including GWASs, eQTLs, epigenomic
   elements, transcriptome, protein interactome and chromatin long-range
   interactions. We also developed an abnormal gene expression score
   (AGES) signature based on the gene expression deviation of the top
   ranked-order candidate genes and found that the AGES signature was
   correlated with genomic variants and clinical properties in breast
   cancer patients. Moreover, we identified RNASEH2A as a novel breast
   cancer-associated gene. Our method will benefit further investigation
   of molecular pathogenic mechanisms in diseases.

2. Method

2.1. Multi-omics datasets used in this study

   The GWAS datasets for breast cancer came from the Breast Cancer
   Association Consortium (BCAC). The study design, samples
   characteristic, genotyping and quality control within the datasets have
   been described previously [64][18], [65][19], [66][20]. Briefly, the
   BCAC comprised 52,675 cases and 49,436 controls from 52 studies,
   including 41 European ancestry, 9 Asian ancestry and 2 African-American
   ancestry studies [67][18], [68][19]. The samples were genotyped using a
   custom Illumina Infinium iSelect array (iCOGS) comprising 211,155 SNPs
   and imputed to 2.5 million SNPs by IMPUTE2 [69][21] with the 1000
   Genomes Project March 2012 release as the reference.

   The multiple functional genomic data included cis-eQTLs, cis-mQTLs,
   chromosome long-range interactions and epigenomic regulatory
   annotations in relevant tissues/cells. The cis-eQTLs data in breast
   tissues came from the Genotype-Tissue Expression (GTEx) [70][22]and the
   Cancer Genome Atlas (TCGA) [71][23] database. Cis-mQTLs dataset was
   obtained from Pancan-meQTL
   ([72]http://bioinfo.life.hust.edu.cn/Pancan-meQTL/) [73][24]. Chromatin
   interactions were extracted from 4D genome
   ([74]http://4dgenome.research.chop.edu/) using the disease-related cell
   lines (MCF7 and GM12878) [75][25]. The Assay for Transposase-Accessible
   Chromatin followed by sequencing (ATAC-seq) and 15-state chromatin
   states were used for epigenomic regulatory annotation. ATAC-seq
   datasets from 5 cell types (HCC1806, MDA-MB-231, MCF-7, ZR-75-1, T47D)
   were downloaded from Cistrome Data Browser [76][26]
   ([77]http://cistrome.org/db) ([78]Supplementary Table 1). 15-state
   chromatin states of 3 cell types (breast vHMEC, breast myoepithelial
   cells and HMEC mammary epithelial cell) were downloaded from the
   Roadmap Epigenomics project. We merged “Enhancer”, “Genic enhancer” and
   “Bivalent enhancer” into one type of regulatory element to simplify the
   interpretation of regulatory elements.

   We collected 6 expression-based prognostic and predictive gene sets
   including MammaPrint, PAM50, OncotypeDX, Wang-76, Zhang-15, and Cell
   cycle from previous publications ([79]Supplementary Table 2, [80]Fig.
   1c) [81][27], [82][28], [83][29], [84][30], [85][31], [86][32]. These
   gene sets are independent of anatomical markers and could reveal
   molecular characteristics of breast cancer and contribute to breast
   cancer prognosis, diagnosis, and therapeutics [87][33], [88][34].

Fig. 1.

   [89]Fig. 1
   [90]Open in a new tab

   An overview of the integrative genomics network-based approach. a)
   Extraction of SNPs from GWASs and annotation SNPs by multiple
   tissue-specific functional omics datasets. b) Construction of SNP-gene
   bipartite with SNP-gene mapping pairs and calculation of degree
   centrality for each gene node based on the network topology. c)
   Construction of functional interaction network. Gene nodes in the
   networks are extracted from the bipartite. Key genes in weighted
   molecular networks are captured by using enrichment analysis. d)
   Identification of key genes in the network. Scores of each gene from
   multiple biological networks are combined to get composite scores. The
   importance of genes is evaluated by the order of their composite
   scores.

   Gene expression profiles for breast cancer were derived from the GEO
   (the Gene Expression Omnibus) database, including [91]GSE37751
   (n = 108), [92]GSE3744 (n = 47), [93]GSE86374 (n = 159), [94]GSE21422
   (n = 19), [95]GSE70947 (n = 296), [96]GSE29044 (n = 109) and
   [97]GSE14999 (n = 129). We selected these datasets by searching the GEO
   database using keywords as “breast cancer”, organisms as Homo sapiens,
   study type as “Expression profiling by array”. Datasets containing
   breast tumors and adjacent normal breast tissue samples with total
   sample size more than 100 and the number of controls more than 30 at a
   single platform were included. Datasets for one or several specific
   molecular subtypes were not included in our analysis. With the above
   process, [98]GSE37751, [99]GSE86374, [100]GSE70947, [101]GSE29044 and
   [102]GSE14999 were chosen. We also included [103]GSE27562 which
   contained human blood samples. [104]GSE3744 and [105]GSE21422 were used
   to investigate gene expression profiles between early-stage tumor
   samples and late-stage tumors by Zhang et al. [106][35], we also
   included these two datasets in our analyses. In addition to the
   above-mentioned microarray data, RNA-seq data of breast tumor samples
   and controls ([107]GSE115577 and TCGA BRCA dataset) were also included.

   The genomic variants, mRNA expression, clinical information for TCGA
   Breast Invasive Carcinoma (TCGA-BRCA) dataset were downloaded from
   cBioPortal [108][37]. Samples with missing clinical information and
   male samples were removed before analyses.

2.2. Construction of SNP-gene bipartite network

2.2.1. Construction of SNP-gene mapping pairs

   To capture disease-associated SNP-gene mapping pairs, we extracted SNPs
   with their GWASs P-values < 5 × 10^−5 from white populations. We
   excluded SNPs located in the extended MHC region (chr6: 25–35 Mb). For
   the SNPs located in the gene-body regions, we annotated these SNPs and
   corresponding genes as SNP-gene mapping pairs. For the other SNPs,
   their target genes were assigned through multiple aspects, including
   cis-eQTLs, cis-mQTLs, promoter-enhancer interactions, and regulatory
   elements annotation. The significant threshold for cis-eQTL was set as
   P-value < 0.05 after multiple-testing corrections by Benjamini and
   Hochberg (BH) procedure. The collection of statistically significant
   cis-meQTLs pairs for breast cancer has been included in the analysis.
   For chromatin interactions, chromatin states and ATAC-seq peaks were
   firstly used to get the enhancer region information and annotate the
   interacting regions. Then we collected the interaction pairs for which
   there were SNPs located in enhancer in one locus that was paired to a
   gene promoter in the other locus. All of the above results were merged
   together to obtain the final SNP-gene mapping pairs. To investigate the
   function of the target genes, we used clusterProfiler R package
   [109][38] to perform gene sets enrichment analysis in KEGG gene sets.

2.2.2. Construction of SNP-gene bipartite network

   We constructed a bipartite network based on the SNP-gene mapping pairs
   as edges connecting SNP and gene nodes. To analyze the structure of the
   bipartite, we calculated the degree centrality of each SNP and gene
   node by using NetworkX package Version 2.1
   ([110]http://networkx.github.io/) in python. In the bipartite, the
   degree centrality for a node v is defined in Eq. 1, where the sets U
   are SNP nodes with n nodes and the sets V are gene nodes with m nodes,
   deg(v) is the degree of a node v. The degree of the node v is the total
   number of edges linking to in the opposite node set.
   [MATH: <mrow><msub><mi mathvariant="normal">d</mi><mrow><mfenced
   close=")" open="("><mrow><mi
   mathvariant="normal">v</mi></mrow></mfenced></mrow></msub><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mi
   mathvariant="normal">d</mi><mi mathvariant="normal">e</mi><mi
   mathvariant="normal">g</mi><mo stretchy="false">(</mo><mi
   mathvariant="normal">v</mi><mo stretchy="false">)</mo></mrow><mi
   mathvariant="normal">m</mi></mfrac><mo>,</mo><mspace
   width="5.0pt"></mspace><mi mathvariant="normal">f</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mspace
   width="5.0pt"></mspace><mi mathvariant="normal">v</mi><mo>∈</mo><mi
   mathvariant="normal">U</mi></mrow> :MATH]
   (1)
   [MATH: <mrow><msub><mi mathvariant="normal">d</mi><mrow><mfenced
   close=")" open="("><mrow><mi
   mathvariant="normal">v</mi></mrow></mfenced></mrow></msub><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><mi
   mathvariant="normal">d</mi><mi mathvariant="normal">e</mi><mi
   mathvariant="normal">g</mi><mo stretchy="false">(</mo><mi
   mathvariant="normal">v</mi><mo stretchy="false">)</mo></mrow><mi
   mathvariant="normal">n</mi></mfrac><mo>,</mo><mspace
   width="5.0pt"></mspace><mi mathvariant="normal">f</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mspace
   width="5.0pt"></mspace><mi mathvariant="normal">v</mi><mo>∈</mo><mi
   mathvariant="normal">V</mi></mrow> :MATH]
   (1)

   The degree centralities for all genes in the bipartite network are
   normalized to 1–2 (Eq. [111](2)).
   [MATH: <mrow><mi mathvariant="normal">N</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi
   mathvariant="normal">m</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">l</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">z</mi><mi mathvariant="normal">e</mi><mi
   mathvariant="normal">d</mi><mspace width="5.0pt"></mspace><mi
   mathvariant="normal">s</mi><mi mathvariant="normal">c</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mi
   mathvariant="normal">e</mi><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msub><mi
   mathvariant="normal">d</mi><mrow><mfenced close=")" open="("><mrow><mi
   mathvariant="normal">v</mi></mrow></mfenced></mrow></msub><mo>-</mo><ms
   ub><mrow><mi mathvariant="normal">m</mi><mi
   mathvariant="normal">i</mi><mi
   mathvariant="normal">n</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">V</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow><mrow><msub><mrow><mi
   mathvariant="normal">m</mi><mi mathvariant="normal">a</mi><mi
   mathvariant="normal">x</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">V</mi><mo
   stretchy="false">)</mo></mrow></msub><mo>-</mo><msub><mrow><mi
   mathvariant="normal">m</mi><mi mathvariant="normal">i</mi><mi
   mathvariant="normal">n</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">V</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow></mfrac><mo
   linebreak="badbreak">+</mo><mn>1</mn></mrow> :MATH]
   (2)

   where
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">m</mi><mi
   mathvariant="normal">a</mi><mi
   mathvariant="normal">x</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">V</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow> :MATH]
   is the max value of degree centrality in the gene set V and
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">m</mi><mi
   mathvariant="normal">i</mi><mi
   mathvariant="normal">n</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">V</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow> :MATH]
   is the minimum degree centrality.

2.3. Construction of functional interaction network based on SNP-gene
bipartite network

   We constructed two types of functional interaction networks based on
   PPI and GIANT network [112][14]. A total of 11,255,250 known
   protein-protein interactions (PPIs) for human were derived from the
   STRING database [113][39]. To reduce the complexity of network
   structure and improve calculation speed, we set the threshold of
   interaction as confidence score over 0.4, which is the default
   threshold of the STRING database. The confidence score of interactions
   was used as weights of edges in the PPI network. The GIANT networks
   identify tissue-specific interactions by using a Bayesian framework to
   integrate heterogeneous public data [114][14]. For each tissue, only
   edges with evidence supporting a tissue-specific functional interaction
   were included in the network ([115]https://hb.flatironinstitute.org/).
   A total of 10,857,540 interactions from the mammary gland tissue were
   used in our analysis. Similarly, the confidence scores of the
   tissue-specific functional interaction were also used as weights of
   edges in the GIANT networks. The SNP-gene bipartite provided
   topological and GWAS signal-driven information on SNP-gene directly
   regulatory relationships. We extracted subnetworks from the PPI and
   GIANT networks that only contain genes in the bipartite and their
   directly connected genes.

2.4. Identification of key drivers

   We evaluated the regulatory relationships between genes by considering
   three types of regulatory networks. The information from networks was
   incorporated to improve the coverage of functional association between
   genes. We performed key driver analysis (KDA) to identify key driver
   genes [116][40], [117][41], [118][42]. The above mentioned 6
   expression-based prognostic and predictive gene sets (MammaPrint,
   PAM50, OncotypeDX, Wang-76, Zhang-15, and Cell cycle) were firstly
   collected. For each gene, we extracted 1-layer neighborhoods directly
   connected with it in the PPI or GIANT network as the subnetwork.
   Fisher’s exact test was used to test whether genes in this subnetwork
   were enriched in these 6 gene sets. The background included all genes
   in the 6 gene sets. The enrichment P-values were -log10 transformed and
   normalized to a scale of 0 to 1 using the min–max normalization method.
   For each gene, the normalized value was set as the enrichment score.
   For a given network, each gene has 6 enrichment scores in the current
   study for breast cancer.

   To incorporate with the GWAS signals (i.e., the SNP-gene bipartite
   network), we defined a signature score for each gene as shown in Eq.
   [119](3):
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">S</mi><mi
   mathvariant="normal">S</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">g</mi><mo>,</mo><mi mathvariant="normal">G</mi><mo
   stretchy="false">)</mo></mrow></msub><mo
   linebreak="goodbreak">=</mo><mfrac><mrow><msub><mrow><msubsup><mo>∑</mo
   ><mrow><mi mathvariant="italic">Si</mi></mrow><mrow><mi
   mathvariant="italic">Sn</mi></mrow></msubsup><mrow><mo
   stretchy="false">(</mo><mi mathvariant="normal">E</mi><mi
   mathvariant="normal">S</mi></mrow></mrow><mrow><mfenced close=")"
   open="("><mrow><mi mathvariant="normal">g</mi><mo>,</mo><mi
   mathvariant="normal">S</mi><mi
   mathvariant="normal">i</mi></mrow></mfenced></mrow></msub><mo>×</mo><ms
   ub><mi mathvariant="normal">D</mi><mrow><mfenced close=")"
   open="("><mrow><mi
   mathvariant="normal">g</mi></mrow></mfenced></mrow></msub><mrow><mo
   stretchy="false">)</mo></mrow></mrow><mi>n</mi></mfrac><mo>,</mo><mspac
   e width="5.0pt"></mspace><mi mathvariant="normal">f</mi><mi
   mathvariant="normal">o</mi><mi mathvariant="normal">r</mi><mspace
   width="5.0pt"></mspace><mi mathvariant="normal">s</mi><mo>∈</mo><mi
   mathvariant="normal">S</mi><mo>,</mo><mspace width="5.0pt"></mspace><mi
   mathvariant="normal">g</mi><mo>∈</mo><mi
   mathvariant="normal">G</mi></mrow> :MATH]
   (3)

   where
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">S</mi><mi
   mathvariant="normal">S</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">g</mi><mo>,</mo><mi mathvariant="normal">G</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow> :MATH]
   is the signature score for gene g in a given network (PPI or GIANT);
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">E</mi><mi
   mathvariant="normal">S</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">g</mi><mo>,</mo><mi mathvariant="normal">S</mi><mi
   mathvariant="normal">i</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow> :MATH]
   is the enrichment score for gene g in the gene set i; n is the total
   number of gene sets (here equals to 6 for breast cancer).
   [MATH: <mrow><msub><mi mathvariant="normal">D</mi><mrow><mfenced
   close=")" open="("><mrow><mi
   mathvariant="normal">g</mi></mrow></mfenced></mrow></msub></mrow>
   :MATH]
   is the normalized degree centralities for gene g in the SNP-gene
   bipartite network. If a gene did not exist in the bipartite network,
   its gene weight
   [MATH: <mrow><msub><mi mathvariant="normal">D</mi><mrow><mfenced
   close=")" open="("><mrow><mi
   mathvariant="normal">g</mi></mrow></mfenced></mrow></msub></mrow>
   :MATH]
   was set to 1. Again, the signature scores were normalized to a scale of
   0 to 1.

   Lastly, we combined the signature scores for each gene from different
   networks (e.g., PPI or GIANT). For a gene, the average of signature
   scores from different networks was combined and normalized to get a
   final score as Eq. [120](4):
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">S</mi></mrow><mrow><mfenced close=")"
   open="("><mrow><mi
   mathvariant="normal">g</mi></mrow></mfenced></mrow></msub><mo
   linebreak="goodbreak">=</mo><mfrac><msub><mrow><msubsup><mo>∑</mo><mrow
   ><mi mathvariant="italic">Gi</mi></mrow><mrow><mi
   mathvariant="italic">Gn</mi></mrow></msubsup><mi
   mathvariant="normal">S</mi><mi
   mathvariant="normal">S</mi></mrow><mrow><mfenced close=")"
   open="("><mrow><mi mathvariant="normal">g</mi><mo>,</mo><mi
   mathvariant="normal">G</mi><mi
   mathvariant="normal">i</mi></mrow></mfenced></mrow></msub><mi>N</mi></m
   frac><mo>,</mo><mspace width="5.0pt"></mspace><mi
   mathvariant="normal">f</mi><mi mathvariant="normal">o</mi><mi
   mathvariant="normal">r</mi><mspace width="5.0pt"></mspace><mi
   mathvariant="normal">s</mi><mo>∈</mo><mi
   mathvariant="normal">S</mi><mo>,</mo><mspace width="5.0pt"></mspace><mi
   mathvariant="normal">g</mi><mo>∈</mo><mi
   mathvariant="normal">G</mi></mrow> :MATH]
   (4)

   where
   [MATH: <mrow><msub><mrow><mi mathvariant="normal">C</mi><mi
   mathvariant="normal">S</mi></mrow><mrow><mo stretchy="false">(</mo><mi
   mathvariant="normal">g</mi><mo
   stretchy="false">)</mo></mrow></msub></mrow> :MATH]
   is the composite score of gene g, N is the number of networks. In this
   study, N equaled to 2 (refers to the PPI and GIANT network). The
   composite score across networks was indicated as the numerical value of
   genes (0–1). Composite scores were ranked according to values and
   provided a criterion for determining the significance of genes in
   biological mechanisms for a given disease.

2.5. Gene expression profiling and heat maps

   Limma package (Version 3.30.13) [121][43], [122][44] in R was used to
   detect differentially expressed genes between tumor and normal samples.
   P-values were adjusted for multiple-testing by the BH procedure. For
   plotting heat maps, hierarchical clustering was performed in R to
   cluster samples on the gene expression profiles by centroid.

2.6. Survival analysis and Kaplan-Meier plots

   The survival analysis for utilization of multiple gene expression
   signatures was performed using the TCGA-BRCA dataset. We developed an
   abnormal gene expression score (AGES) for each sample i as Eq.
   [123](5):
   [MATH: <mrow><msub><mrow><mi>AGES</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mo>=</mo><mfrac><mrow><msub><
   mrow><mi mathvariant="normal">C</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub><mo>-</mo><mi
   mathvariant="normal">μ</mi></mrow><mrow><mi
   mathvariant="normal">σ</mi></mrow></mfrac></mrow> :MATH]
   (5)

   where
   [MATH: <mrow><msub><mrow><mi
   mathvariant="normal">C</mi></mrow><mrow><mi
   mathvariant="normal">i</mi></mrow></msub></mrow> :MATH]
   represents the counts of genes whose expression in sample i was above
   the third quartile or below the first quartile of the expression values
   in all samples. μ is the mean value, and σ is the standard deviation.
   Here we only included genes with the top 20 composite scores in AGES.
   The individuals were divided into two groups of high (above the third
   quartile) and low (below the first quartile) according to the AGES.
   Univariate survival analyses between two groups were calculated by
   using Kaplan-Meier curves and the log-rank test in R. The P-values were
   determined by using the log-rank test, and the significance threshold
   was set as P-value < 0.05.

3. Result

3.1. Identification of disease-associated gene nodes in SNP-gene bipartite
network

   We integrated GWASs results with multiple functional omics data,
   including cis-eQTLs, cis-mQTLs, chromatin interactions, chromatin
   states and ATAC-seq, to construct SNP-gene pairs ([124]Fig. 1a). After
   annotation, we obtained 7,500 SNP-gene mapping pairs, including 6,647
   SNPs and 274 protein-coding genes. To investigate the functional
   mapping structure between disease-associated SNPs and genes, we
   constructed the mapping pairs as a bipartite network. About 0.037% of
   SNPs were connected with at least one gene (6,647 in total 18,134,195
   SNPs). Meanwhile, 1.35% of protein-coding genes were mapped to at least
   one SNP (274 in total 20,345 genes).

   In order to find the contribution of the GWAS signal-driven genes to
   biological mechanisms, we performed pathway enrichment analysis for
   these genes by using canonical pathways from KEGG ([125]Figure
   Supplementary Fig. 1). The top-ranked pathways included cancer-related
   pathways, such as gastric cancer, thyroid cancer, acute myeloid
   leukemia, pancreatic cancer, and breast cancer. We also found several
   signaling pathways, such as MAPK signaling pathway, TGF-beta signaling
   pathway and WNT signaling pathway, as well as cellular processes
   pathways, like cell cycle pathway and homologous recombination.

   We incorporated all SNP-gene mapping pairs into a bipartite network
   with two types of nodes as SNPs and genes, and edges for SNP-gene
   functional mapping pairs ([126]Fig. 1b). To investigate the relevance
   between the bipartite topology structure and regulatory function of
   nodes, we calculated the degree centrality for each gene and SNP in the
   bipartite network ([127]Supplementary Tables 3 and 4). Genes with
   higher degree centrality values connected with more nodes in SNP sets.
   Likewise, SNPs with high degree centrality values connected with
   numerous genes. These results suggested that degree centrality
   coincided with the genetic variant-disease association and reflected
   potential genetic perturbation to genes in the mechanism of disease.

3.2. Subnetworks and key genes of biological tissue-Specific networks

   To reveal key genes based on disease-associated information and further
   investigate regulatory mechanisms of those genes in diseases, we
   constructed two types of tissue-specific gene interaction networks for
   those genes in the bipartite network and performed KDA on these
   networks ([128]Fig. 1c). For each network, we identified key genes of
   which the neighbor-genes were enriched in the gene expression-based
   signatures. The full list of KDA results for genes in the two types of
   networks are in [129]Supplementary Table 5. We calculated a composite
   score for each gene based on the PPI and GIANT networks to provide a
   quantitative evidence to evaluate the importance of regulatory function
   ([130]Fig. 1d). The full list of composite scores are provided in
   [131]Supplementary Table 5 and the top 20 genes are summarized in
   [132]Table 1.

Table 1.

   Top 20 key genes in breast cancer based on composite scores by
   considering criteria from 3 types of network.
   Key gene Gene name Score[PPI] Score[GIANT] Composite score
   CCND1 Cyclin D1 1.000 0.579 1.000
   CDKN2A Cyclin dependent kinase inhibitor 2A 0.735 0.768 0.952
   MYC MYC proto-oncogene, bhlh transcription factor 0.728 0.708 0.910
   ESR1 Estrogen receptor 1 0.760 0.430 0.754
   XRCC5 X-ray repair cross complementing 5 0.184 1.000 0.750
   CHEK2 Checkpoint kinase 2 0.606 0.534 0.722
   XRCC6 X-ray repair cross complementing 6 0.352 0.697 0.664
   EXO1 Exonuclease 1 0.399 0.648 0.663
   RNASEH2A Ribonuclease H2 subunit A 0.346 0.644 0.627
   BRCA2 BRCA2 DNA repair associated 0.286 0.632 0.581
   ADSL Adenylosuccinate lyase 0.000 0.847 0.536
   PPP2R1B Protein phosphatase 2 scaffold subunit Abeta 0.380 0.344 0.459
   JUND Jund proto-oncogene, AP-1 transcription factor subunit 0.277 0.437
   0.452
   PES1 Pescadillo ribosomal biogenesis factor 1 0.113 0.589 0.445
   PSAT1 Phosphoserine aminotransferase 1 0.064 0.625 0.436
   ADCY3 Adenylate cyclase 3 0.134 0.551 0.434
   SEPT9 Septin 9 0.041 0.633 0.426
   STAT3 Signal transducer and activator of transcription 3 0.389 0.277
   0.422
   HDGF Heparin binding growth factor 0.000 0.658 0.417
   EP300 E1A binding protein p300 0.431 0.218 0.411
   [133]Open in a new tab

   Footnotes: PPI: protein–protein interaction.

   To assure the robustness of our method, we performed IMNA by using the
   subsets of gene expression-based signatures as follows: For a total of
   6 gene sets, one gene set was removed at each time and the remaining 5
   gene sets were used to perform analysis. Jaccard index was used to
   evaluate the similarity of the top 20 genes between different
   conditions. We found the relatively high similarity (Jaccard index
   >0.48) between the pairwise comparison of all conditions ([134]Figure
   Supplementary Fig. 2a). We also detected the same trend (Jaccard index
   >0.42) for the similarity of the top 100 genes ([135]Figure
   Supplementary Fig. 2b).

3.3. Identification of the effects of top key genes for breast cancer

   We further examined the expression levels and genomic variations of the
   top 20 ranked-order genes to investigate the performance of these genes
   in the pathogenesis of breast cancer. Compared with normal samples,
   these 20 genes showed significantly abnormal expression levels in
   breast tumor samples, as well as peripheral blood mononuclear cells
   (PBMCs) from breast cancer patients (adjust P-value < 0.05 in at least
   3 datasets, [136]Fig. 2a, [137]Supplementary Table 6). The expression
   levels of these 20 genes were progressively changed during disease
   progression ([138]Fig. 2b). The range of genetic alteration frequencies
   (including copy-number alterations (CNA) and mutation frequency) of
   these 20 genes were between 0.9% and 21% in breast cancer ([139]Fig.
   2c). Among these 20 genes, 9 genes have been identified as cancer
   driver genes in breast cancer and/or multiple other tumors according to
   the COSMIC database ([140]http://cancer.sanger.ac.uk/cosmic), including
   CDKN2A, CCND1, MYC, ESR1, CHEK2, BRCA2, STAT3, SEPT9 and EP300.
   Moreover, in the top 20 genes, CCND1, MYC, ESR1 and BRCA2 were found in
   the KEGG breast cancer pathway. In addition to the breast cancer
   pathway, some of the top 20 genes were found in other breast
   cancer-related pathways. For example, 2 genes are in the MAPK signaling
   pathway (MYC and JUND), 3 in the PI3K-Akt pathway (CDKN2A, CCND1 and
   CHEK2), and 3 in the p53 pathway (MYC, CCND1 and PPP2R1B).

Fig. 2.

   [141]Fig. 2
   [142]Open in a new tab

   The top 20 rank-ordered genes are correlated with breast cancer,
   disease pathogenesis and genetic variants. The top 20 genes are
   differentially expressed between breast cancer samples and normal
   samples. The top 20 gene expression levels in a) various datasets. b)
   breast ductal carcinoma in situ (DCIS) and invasive ductal carcinomas
   (IDCs) ([143]GSE21422). Samples were clustered by gene expression
   levels of the 20 genes by centroid. c) Genetic alternation frequencies
   of the top 20 genes in TCGA-BRCA patient samples.

   Furthermore, to investigate the overall impact of the 20 ranked-order
   genes expression levels on breast cancer outcomes, we calculated the
   AGES to represent the abnormal expression levels of the 20 genes in
   samples and evaluated the relevance of the AGES values to tumorigenesis
   and clinical survival. We found that the AGES value was significantly
   positively correlated with CNA fraction (Pearson correlation,
   P-value = 1.40 × 10^−15, r = 0.25, [144]Fig. 3a) in the TCGA-BRCA
   dataset. The AGES was associated with several tumor clinical
   characteristics by investigating the clinical information in the
   TCGA-BRCA dataset. More breast cancer tumors with ER-negative and
   triple-negative showed significantly higher AGES values (Chi-squared
   test, P-value < 0.05, [145]Fig. 3b-c). We also evaluated the relevance
   between the AGES values and breast patient survival outcomes.
   Categorizing patients in the first tercile into the low group and the
   third tercile into the high group separately, the low group had better
   significant disease-free survival and distant metastasis-free survival
   outcomes compared to the high group (P < 0.05, [146]Fig. 3d-f). These
   results indicate that the AGES signature may contribute to breast
   cancer clinical properties, and might be an effective marker of patient
   survival.

Fig. 3.

   [147]Fig. 3
   [148]Open in a new tab

   The relevance of the AGES values to tumorigenesis, clinical
   characteristics and clinical survival. a) AGES values are significantly
   positively correlated with linear copy-number alterations (CNA). b,c)
   The AGES values are associated with tumor clinical characteristics.
   Significant P-values are determined by Chi-squared test (P < 0.01). b)
   ER-negative. c) Triple-negative. d,e,f) Kaplan-Meier survival curves
   showed that the AGES are predictive of survival outcomes for breast
   cancer. d) disease-free survival in TCGA-BRCA dataset. e) distant
   metastasis-free survival in [149]GSE11121. f) distant metastasis-free
   survival in [150]GSE20685.

3.4. Identification of novel breast cancer-associated genes

   To verify the associations between the top 20 genes and breast cancer,
   we used various databases to assess functions in breast cancer through
   text mining of published articles and integrating disease databases
   ([151]Table 2), including PolySearch2
   ([152]http://polysearch.cs.ualberta.ca/polysearch), COREMINE
   ([153]https://www.coremine.com/medical/) and MalaCards
   ([154]https://www.malacards.org/). The top 20 genes have been linked to
   breast cancer or neoplasm by using at least one of the tools. We
   noticed that RNASEH2A (ribonuclease H2 subunit A) was not previously
   verified as breast cancer-related gene. We further evaluated whether
   its expression level was related to tumorigenesis and clinical
   survival. We found that RNASEH2A expression levels were significantly
   up-regulated in breast cancer samples compared to normal tissues
   (adjusted P-value < 0.05, [155]Fig. 4a). Univariate survival analyses
   revealed that low RNASEH2A mRNA expression was significantly associated
   with better overall and relapse-free survival (P = 1.3 × 10^−5 and
   P < 1.0 × 10^−16, [156]Fig. 4b-c) in KM-plotter [157][45].
   Co-expression analysis was performed for RNASEH2A with the most
   well-evaluated cell proliferation markers, including MKI67, PCNA, MCM2,
   MCM3, MCM4, MCM5, MCM6 and MCM7) [158][46], by using the breast cancer
   samples in TCGA (n = 1100). We observed that these genes were
   positively co-expressed with each other, and RNASEH2A had positive
   correlations with the other 8 genes (rho ≥ 0.3, P-value < 0.01)
   ([159]Figure Supplementary Fig. 3).

Table 2.

   In Silico Validation for the top 20 ranked genes for breast cancer.
     Gene   PolySearch2 (Z Score) COREMINE (Statistical significance)
   MalaCards (Score)
   CCND1            49.43         5.12 × 10^−5                         86.94
   CDKN2A           4.59          7.97 × 10^−4                         33.58
   MYC              26.82         1.59 × 10^−4                         72.97
   ESR1             30.11         3.78 × 10^−6                        1165.88
   XRCC5            13.23         1.51 × 10^−2                           –
   CHEK2            14.01         1.30 × 10^−4                        1461.53
   XRCC6              –           1.43 × 10^−2                           –
   EXO1             1.23          0.135                                  –
   RNASEH2A         4.71          0.502                                  –
   BRCA2            11.95         3.39 × 10^−6                        1500.88
   ADSL             2.69          0.350                                  –
   PPP2R1B          5.42          2.44 × 10^−2                           –
   JUND               –           7.16 × 10^−2                           –
   PES1             2.39          1.83 × 10^−2                           –
   PSAT1            1.45          1.14 × 10^−2                           –
   ADCY3            8.19          0.335                                  –
   SEPT9              –           0.193                                  –
   STAT3            20.75         3.05 × 10^−4                         42.45
   HDGF               –           0.232                                  –
   EP300            1.82          1.72 × 10^−3                         50.25
   [160]Open in a new tab

   Footnotes: - indicates that the value is not available.

Fig. 4.

   [161]Fig. 4
   [162]Open in a new tab

   The expression level of RNASEH2A related to patient outcome in breast
   cancer. a) Box plots show increased expression of RNASEH2A in breast
   tumor samples in 7 independent studies. b, c) Kaplan-Meier survival
   analyses show the differences in overall survival (OS) and relapse-free
   survival (RFS) between breast cancer patients with high or low RNASEH2A
   mRNA levels. P-values are calculated by using the log-rank analysis.

3.5. Application of the method to ovarian cancer

   To evaluate the reliability of IMNA, we applied our method to ovarian
   cancer. The GWAS summary dataset was derived from the Ovarian Cancer
   Association Consortium (OCAC), which comprised 18,174 cases with
   epithelial ovarian cancer and 26,134 controls of European ancestry from
   43 studies [163][47], [164][48]. The multiple functional data included
   cis-eQTLs, chromosome long-range interactions and epigenomic regulatory
   annotations, since mQTLs data was not available yet. After the similar
   procedures as breast cancer, we obtained 5,386 SNP-gene mapping pairs,
   including 4,419 SNPs and 149 protein-coding gene, and constructed the
   SNP-gene bipartite network. KEGG pathway enrichment analysis was
   perfomred for investigating the GWAS signal-driven genes to biological
   mechanisms ([165]Figure Supplementary Fig. 4 [166]Supplementary Fig.
   4a).

   We collected 6 expression-based ovarian cancer-related gene sets from
   the published studies [167][49], [168][50], [169][51], [170][52],
   [171][53], [172][54]. To construct regulatory networks, 11,255,250 PPI
   interactions were derived from STRING and 6,4594,988 edges from ovary
   tissue of GIANT were used. In the weighted regulatory networks, we
   performed KDA to identify key genes. The full list of gene composite
   scores was listed in [173]Supplementary Table 7. CDK6 and MLLT10 have
   been identified as cancer driver genes according to the COSMIC
   database. By manually checking in the PubMed, 7 of the top 20 genes
   were reported to be relevant to ovarian cancer, including MAPT
   [174][55], BECN1 [175][56], PRC1 [176][57], CBX1 [177][58], CDK6
   [178][59], WWOX [179][60], and PNPO [180][61]. We performed KEGG
   pathway enrichment analysis for the top 100 genes ([181]Figure
   Supplementary Fig. 4b), and found that the top-ranked pathways included
   the cell cycle pathway, several cancer-related pathways (renal cell
   carcinoma, proteoglycans in cancer and small cell lung cancer), and
   important signaling pathways (PI3K-Akt signaling pathway, VEGF
   signaling pathway and chemokine signaling pathway). These results
   indicated that our method can also apply to other types of complex
   diseases.

4. Discussion

   Although GWASs have identified thousands of diseases susceptibility
   variants, the causal genes and their molecular functional mechanisms
   are largely unknown [182][62]. Network-based approaches are helpful to
   understand the mechanisms of complex diseases and capture genetic
   variants perturbation in the gene regulatory network [183][8],
   [184][63]. We applied IMNA by combining tissue-specific regulatory
   networks from diverse biological scales, including GWAS signals,
   cis-eQTLs, cis-meQTLs, long-range interactions and epigenomic
   regulatory annotations, multi-gene predictive and prognostic signatures
   and regulatory networks to identify key genes supported by genetic and
   biological processes. The concentration of functional
   disease-associated SNPs on genes reflects the importance of gene
   regulatory roles in diseases. Predictive and prognostic signatures
   provide intrinsic molecular characteristics of breast cancer based on
   gene expression analysis [185][64]. The identified key genes derived
   from multiple tissue-specific regulatory networks can regulate other
   disease-associated genes and influence disease susceptibility. We used
   the AGES signature to present the correlations between the expression
   levels of the top 20 genes and several tumor characteristics, which
   demonstrates the utility of the AGES signature for genetic variants,
   tumor properties and patient survival. We also applied this method for
   ovarian cancer to demonstrate the effectiveness of this pipeline on
   other types of diseases.

   This approach used integrative multi-omics methods that benefited the
   research of complex diseases. First, we derived functional data to
   reveal the mapping between SNPs and genes in different respects
   (cis-eQTLs, cis-mQTLs, long-range interactions, chromatin states and
   ATAC-seq) from diseases-related tissues or cell types. Second,
   knowledge-driven signatures revealed molecular characteristics of
   breast cancer from the gene expression level. Intrinsic subtyping,
   prognostic and predictive signatures of breast cancer based on gene
   expression profiling have been found for correlation with molecular
   subtypes, including proliferation, prognosis, and therapeutic response
   [186][33], [187][34], [188][64], [189][65]. Third, since biological
   molecules interact within and across multiple biological levels, we
   constructed disease networks from genomic, transcriptomic and
   proteomics scales based on data-driven gene sets and identified the key
   genes [190][8], [191][11]. The network models could provide a
   relatively complete understanding of biological interactions [192][12],
   [193][13], [194][33], [195][66]. Therefore, our integrative
   network-based approach unveiled the biological mechanism of diseases
   from different scales. Gene expression profiles and survival analyses
   verified the ability of key genes for prognosis in breast cancer.

   Besides the known breast cancer genes, our study also identified a
   potential new regulator, RNASEH2A, for breast cancer. RNASEH2A encodes
   the main catalytic subunit of ribonuclease H2 (RNase H2), which
   mediates the removal of lagging-strand Okazaki fragment RNA primers
   from the DNA:RNA duplex during DNA replication and degrades the RNA of
   RNA:DNA hybrids [196][67]. RNASEH2A is frequently overexpressed in a
   variety of cancers, including breast, bladder, brain, prostate,
   seminomas, and leukemia [197][68]. RNASEH2A could promote DNA
   replication and cancer cell proliferation, and may regulate signaling
   pathways responsible for cell proliferation and apoptosis [198][68],
   [199][69], [200][70].

   Although we have discussed the advantages of IMNA, there are still
   limitations in our work. First, as a GWASs-based method, other
   potential important genes without genetic support could be missed in
   our method. Second, this method constructed models by using genetics
   and molecular data from static biological systems. Due to the lack of
   dynamic biological data in disease processes, we could not present the
   dynamics of network to investigate the disease development currently.
   Third, our method incorporates breast cancer-related signatures
   information as prior knowledge to construct network. The results might
   be influenced by selections of prior knowledge inputs compared with
   other methods which ignored pathway information. However,
   disease-related gene sets are more stable in different
   datasets/platforms compared to individual genes or variants [201][8],
   [202][71]. The information we captured might have better generalization
   capability. Finally, we used molecular interactions from multiple
   diverse datasets in different biological scales. Since different
   molecular subtypes for many datasets (e.g., eQTL) are not available
   currently, we only applied the model for overall breast cancer. If
   datasets for different molecular subtypes are available in future, the
   method can be used to characterize key regulator genes for different
   molecular subtypes.

   In conclusion, we utilized IMNA to interpret the GWAS signals of breast
   cancer and highlight key genes in the diseases-associated regulatory
   networks. Our findings provide a global perspective to understand the
   molecular underpinnings in pathogenesis of breast cancer, and point out
   candidate therapeutic targets. IMNA can also apply to other complex
   diseases to unveil underlying mechanisms and help for investigating
   therapeutic targets.

   URLs. IMNA, an integrative genomics network-based approach,
   [203]https://github.com/xjtugenetics/IMNA.

CRediT authorship contribution statement

   Yi-Xiao Chen: Conceptualization, Methodology, Software, Data curation,
   Writing - original draft, Writing - review & editing. Yu Rong: Data
   curation, Validation. Feng Jiang: Investigation, Resources. Jia-Bin
   Chen: Software, Data curation. Yuan-Yuan Duan: Data curation. Shan-Shan
   Dong: Formal analysis, Writing - original draft, Writing - review &
   editing. Dong-Li Zhu: Resources. Hao Chen: Visualization. Tie-Lin Yang:
   Resources. Zhijun Dai: Resources. Yan Guo: Supervision, Writing -
   review & editing, Methodology.

Declaration of Competing Interest

   The authors declare that they have no known competing financial
   interests or personal relationships that could have appeared to
   influence the work reported in this paper.

Acknowledgments