Abstract

   Multiple types of high throughput genomics data create a potential
   opportunity to identify driver patterns in ovarian cancer, which will
   acquire some novel and clinical biomarkers for appropriate diagnosis
   and treatment to cancer patients. To identify candidate driver genes
   and the corresponding driving patterns for resistant and sensitive
   tumors from the heterogeneous data, we combined gene co-expression
   modules with mutation modulators and proposed the method to identify
   driver patterns. Firstly, co-expression network analysis is applied to
   explore gene modules for gene expression profiles through weighted
   correlation network analysis (WGCNA). Secondly, mutation matrix is
   generated by integrating the CNV data and somatic mutation data, and a
   mutation network is constructed from the mutation matrix. Thirdly,
   candidate modulators are selected from significant genes by clustering
   vertexs of the mutation network. Finally, a regression tree model is
   utilized for module network learning, in which the obtained gene
   modules and candidate modulators are trained for the driving pattern
   identification and modulators regulatory exploration. Many identified
   candidate modulators are known to be involved in biological meaningful
   processes associated with ovarian cancer, such as CCL11, CCL16, CCL18,
   CCL23, CCL8, CCL5, APOB, BRCA1, SLC18A1, FGF22, GADD45B, GNA15, GNA11,
   and so on.

Introduction

   Ovarian cancer is known as a complex genomic disease. During
   tumorigenesis, many factors contribute to pathological gene expression
   changes, such as genomic affection and epigenomic affection. Hence,
   acquiring the etiopathogenesis and chemoresponse of cancer faces a
   great challenge. And the achievement can be utilized to deploy the
   high-performance diagnosis and treatment of cancer patients. Through
   many studies presented the prognosis of ovarian cancer, there is no
   completely validated clinical model for predicting ovarian cancer
   prognosis and drug response. Therefore, it remains an important
   research issue to identify prognostic and predictive driver patterns
   for improving ovarian cancer treatment.

   Cancer genomes posses a large number of aberrations including somatic
   mutations and copy number variations (CNVs). Many mutations contribute
   to cancer progression from the normal to the malignant state. The
   researches have shown that some aberrations are vital for tumorigenesis
   and most cancers are caused by a small number of driver mutations
   developed over the course of about two decades^[34]1,[35]2. The
   detection of these mutations with exceptionally high association
   between the copy number variations, somatic mutations and gene
   expression can ascertain disease candidate genes and potential cancer
   mechanisms. Several large scale cancer genomics projects, such as the
   Genomic Data Commons (GDC), The Cancer Genome Atlas (TCGA), and
   International Cancer Genome Consortium (ICGC), etc., have produced a
   large volume of data and provided us an opportunity to integrate
   different level of gene expression data to identify candidate driver
   genes and driver pathways and better understand the cancer at the
   molecular level.

   During exploring the large volume of genomics abnormalities generated
   from large-scale cancer projects, many computational and statistical
   methods have been proposed to search for this mutation driver patterns.
   Adib et al. predicted the potential specificity of several putative
   biomarkers by using gene expression microarray profile and examined
   their expression in a panel of epithelial tissues and tumors^[36]3.
   Youn et al. proposed a new method to identify cancer driver genes by
   accounting for the functional impact of mutations on proteins^[37]4.
   Tomasetti et al. combined conventional epidemiologic studies and
   genome-wide sequencing data to infer a number of driver mutations which
   is essential for cancer development^[38]5. Jang et al. identified
   differentially expressed proteins involved in stomach cancer
   carcinogenesis through analyzing comparative proteomes between
   characteristic alterations of human stomach adenocarcinoma tissue and
   paired surrounding normal tissue^[39]6. Xiong et al. proposed a general
   framework in which the feature (gene) selection was incorporated into
   pattern recognition to identify biomarkers^[40]7. Jung et al.
   identified novel biomarkers of cancer by combining bioinformatics
   analysis on gene expression data and validation experiments using
   patient samples and explored the potential connection between these
   markers and the established oncogenes^[41]8. Logsdon et al. identified
   a new category of candidate tumor drivers in cancer genome evolution
   which can be regarded as the selected expression regulators (SERs)–
   genes driving deregulated transcriptional programs in cancer evolution,
   and uncovered a previously unknown connection between cancer expression
   variation and driver events by using a novel sparse regression
   technique^[42]9. Naif et al. integrated exome sequencing data with
   functional RNAi screening data onto a human signaling network to
   predict breast cancer subtype-specific drug targets and survival
   outcome^[43]10,[44]11. Some integrative network approaches on cancer
   omic data were developed to understand the function of genomic
   alterations in cancer^[45]12,[46]13. A cancer genome embodies thousands
   of genomics abnormalities such as single nucleotide variants, large
   segment variations, structural aberrations and somatic
   mutations^[47]14,[48]15. The corresponding malignant state reflects
   aberrant copy numbers (CN) and/or mutations and the expression patterns
   in which the mutation and copy number patterns are embeded^[49]16.
   Integration of copy number/mutation data and gene expression data was
   proposed to discover driver genes by quantifying the impact of these
   aberrations on the transcriptional changes^[50]17,[51]18. The available
   implementations for the integrative analysis of genomics data includes
   regression methods, correlation methods and module network
   methods^[52]19.

   Driver patterns, including driver genes, driver mutations, driver
   pathways and core modules, which are considered as cancer biomarkers,
   are supposed to promote the cancer progression. Thus, identification of
   driver patterns is vital for providing insights into carcinogenic
   mechanism. The integration of gene expression, copy number and somatic
   mutations data to identifying genomics alterations which induce changes
   in the expression levels of the associated genes, becomes a common task
   in cancer mechanism and drug response studies. However, there is still
   a tough work to integrate information across the different
   heterogeneous omics data and distinguish driver patterns which can
   promote the cancer cell to propagate infinitely. Hence, we proposed a
   driver pattern identification method over the gene co-expression of
   drug response in ovarian by integrating high throughput genomics data.
   In the method, we integrated different level genomics data including
   gene expression, copy number and somatic mutation to identify drivers
   of resistance and sensitivity to anti-cancer drugs. First, weighted
   correlation network analysis is applied to acquire the co-expression
   gene modules. These modules are utilized as initial inputs for the
   following module networks learning. Then, mutation network is
   constructed from the mutation matrix generated by integrating the CNVs
   and somatic mutations. The candidate modulators are selected from the
   clusters of the vertexes of the mutation network. Finally, an
   optimization model is used to identify the driving pattern and explore
   the modulator regulatory. We used publicly available and clinically
   annotated gene expression, CNVs, and somatic mutation datasets from
   TCGA. The experimental results show that in the driver patterns many of
   the identified candidate modulators are known to be involved in
   biological meaningful processes associated with ovarian cancer, which
   can be regard as potential driver genes.

Results

   We conducted the driver patterns identification and the driving
   regulatory analysis as following procedure. Firstly, co-expression
   network were constructed and the gene modules for gene expression
   profiles were explored. Then, CNVs and somatic mutations were
   integrated to build the mutation network. The candidate modulators were
   selected from clusters of the vertex of network. A regulation procedure
   was explored for candidate modulators and gene modules using network
   learning combined with regression tree model. At last, local polynomial
   regression fitting model was applied to conduct dosage-sensitive and
   dosage-resistant analysis.

Data preprocessing

   To select a subset of genes whose aberration/expression profiles were
   significantly different between sensitive and resistant tumors, we
   applied differential analysis to the gene expression and CNV data
   between these two groups. For different gene expression analysis, we
   used the EMD approach: EMDomics, which is designed especially for
   heterogeneous data. We selected genes with a q-value- < 0.1. For CNV
   data, firstly, we mapped genes to the CNV regions to obtain CNV genes;
   then, we used the EMD approach for the differential analysis and
   ranking of the CNV genes. We also calculated the frequency of
   amplification and deletion for each gene in the two groups and selected
   genes for which the difference between their frequencies is more than
   20%. We used a threshold of log2 copy number ratio of 0.3/−0.3 to call
   amplified/deleted genes. For somatic mutation data, we also calculated
   the frequency of mutations for all genes across the samples in each
   groups and selected the genes that were mutated in more than 2% of the
   tumors.

Gene modules with co-expression patterns

   After the data preprocessing, 2690 genes are chosen for co-expression
   gene modules exploring. We constructed a weighted correlation network
   and identified co-expression modules, which was conducted by WGCNA,
   associated with cis-platinum sensitive and resistant. The gene
   co-expression modules and the corresponding hierarchical clustering
   dendrograms of these genes are shown in Fig. [53]1. The color bars
   correspond to the clusters of genes which can be seen as gene module.
   We identified thirty-nine modules are described in thirty-nine colors
   with turquoise, blue, brown, yellow, green and so on. The top 10
   co-expression modules are shown in Table [54]1, The first row
   represents the color of the modules, and the second row represents the
   number of genes which contained in the corresponding module. The
   remaining genes are in grey. These grey genes are not clustered into
   any modules. The modules list is shown in Supplementary Table [55]S1.

Figure 1.

   Figure 1
   [56]Open in a new tab

   Gene co-expression modules for gene expression profiles. A network
   heatmap plot (interconnectivity plot) of a gene network together with
   the corresponding hierarchical clustering dendrograms, with
   dissimilarity based on topological overlap, together with assigned
   module colors.

Table 1.

   The top 10 co-expression modules.
   color turquoise blue brown yellow green magenta red black purple
   greenyellow
   count 357       288  145   130    115   78      62  61    29     23
   [57]Open in a new tab

Candidate modulators selected after long gene filtering

   There may be some genes which have mutation only because they are long.
   So, we filtered the frequently mutated genes by gene length. Finally we
   obtained 2334 genes including 1942 genes of copy number variation and
   392 genes of somatic mutation.

   For CNVs data, the mapping of CNVs can promote the search for causal
   links between genetic variation and disease susceptibility. Thus we
   mapped gene expression to CNVs regions to obtain CNVs genes, for a
   specific gene in a given sample, the value of amplified is assigned the
   number 1, the deleted is assigned the number -1, the normal is assigned
   the number 0; for somatic mutation data, we first obtained mutated
   genes and mapped the correlations between somatic mutations and their
   residing genes.

   Then, a mutation matrix A is generated by integrating the CNVs and
   somatic mutations. The mutation matrix is binary: if any variation
   region in a given gene of the particular sample is in a statistically
   significant variation or any mutation arises in the specific gene in
   the given sample, the value of mutation is assigned the number 1;
   otherwise the value is assigned the number 0. The matrix rows and
   columns correspond to samples and genes, respectively. A mutation
   network (MN) was constructed according to the mutation matrix. For a
   mutated gene, m [i] describes the number of mutations in gene i across
   the samples in the mutation matrix, i.e.,
   [MATH:
   <msub><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></msub><mo>=</mo><m
   underover><mo>∑</mo><mrow><mi>i</mi><mn>=1</mn></mrow><mi>s</mi></munde
   rover><msub><mrow><mi>a</mi></mrow><mrow><mi>s</mi><mi>i</mi></mrow></m
   sub> :MATH]
   , where m is the number of samples and a [ij] is the value of mutation
   matrix of the gene i in the sample j. The network vertex weight is
   defined as h [i] = m [i]/m. Also the edge weight V [ij] is defined as
   the number of samples in which exactly one of the pair is mutated
   divided by the number of samples in which at least one of the pair is
   mutated.

   We selected a subgroup of significant genes by clustering the vertex of
   the mutation network. These genes can be regarded as candidate
   modulators (genes that regulate other genes in their module). We
   identified a list of 624 initial candidate modulators for the mutation
   genes shown in Supplementary Table [58]S2, among which includes 40 data
   from somatic mutations. The gene expression heatmap of the top 50
   candidate modulators is shown in Fig. [59]2.

Figure 2.

   [60]Figure 2
   [61]Open in a new tab

   The gene expression heatmap of the top 50 candidate modulators.

Dosage-sensitive genes and dosage-resistant genes and driver patterns

   We run the module networks learning using the gene modules and
   candidate modulators obtained from the above steps. It generally
   assumed that the modulators most likely regulate the expression of the
   genes in the corresponding gene modules. Because the gene modulators
   would have distortions in a amount of samples and they are likely to be
   the roots of the regression (drive) trees, the modulator gene are
   generally treat as a driver gene. Through the module network learning
   we obtained 456 modules and 93 modulators (see Supplementary
   Table [62]S3). 5 out of these modulators are from somatic mutations.
   Some of modules and modulators identified by our method were also
   detected by previous studies^[63]20,[64]21. Magically, several modules
   shared the same modulators, in which only one modulator is derived from
   somatic mutations.

   We applied a local polynomial regression fitting model with the R
   package loess to the scatter diagram for 93 regulatory genes with 324
   samples. DS threshold is set to (−0.25, 0.25) as the filtering criteria
   for dosage-sensitive genes(DSGs) and the others were selected as
   dosage-resistant genes(DRGs). Therefore, we obtained 44 DSGs and 49
   DRGs (see Supplementary Fig. [65]S4). The LRG1, HYDIN, POLRMT and
   TNFSF10 gene data scatter diagram for CNVs or somatic mutations vs.
   gene expression with local polynomial regression fitting model are
   shown in Fig. [66]3, in which HYDIN is derived from somatic mutations.
   Each red point represents an independent sample, the black line is the
   linear regression result, and the blue curve is the local polynomial
   regression fitting result. In Fig. [67]3A and B the slope of the black
   lines are close to 0, so the dosage sensitivity score are nearly equal
   to 0. In Fig. [68]3C and D the slope’s absolute value of the black line
   are above 1, the blue curves are mostly overlapped with the black line,
   so the dosage sensitivity score are above 1. It can be concluded from
   the graph that the genes for Fig. [69]3A and B are dosage-resistant
   genes, whereas the genes for Fig. [70]3C and D are dosage-sensitive
   genes.

Figure 3.

   [71]Figure 3
   [72]Open in a new tab

   Scatter diagram with local polynomial regression fitting model. Each
   red point represents an independent sample. The black line is the
   linear regression, and the blue curve is the LOESS regression. (A)
   LRG1. (B) HYDIN. (C) POLRMT. (D) TNFSF10.

   We identified three major driver patterns in which the top 3 gene
   modules with the largest number of regulatory genes (modulators) are
   explored. Fig. [73]4 shows these driver patterns with gene modulators
   and the corresponding gene modules, gene modulators shown in the left
   side with blue background ovals are the dosage-resistant genes. Gene
   modulators are in the right side are dosage-sensitive genes. In the
   first driver pattern with color pale blue, there are 15 genes in gene
   co-expression module 3 and 13 gene modulators. In the second driver
   pattern with color green, there are 7 genes in module 30 and 18 gene
   modulators. In third pattern with color red, 6 genes are in module 31
   and 14 gene modulators are for regulatory. Remarkably, it shows that
   CCL11 regulate all the 3 modules; SLC18A1 and C9orf82 regulate the
   module 3 and module 30; PLXDC1, CCL16, CDC34 and HYDIN regulate the
   module 3 and module 31; ZFHX4 regulates the module 30 and module 31.
   Hence, these important regulatory genes regulate many gene modules, and
   these gene modulators are highly correlated with ovarian
   cancer^[74]22–[75]28.

Figure 4.

   [76]Figure 4
   [77]Open in a new tab

   The 3 major driver patterns for dosage-sensitive and dosage-resistant.
   The graph depicts inferred regulators (left; ovals with blue background
   represents the DRGs, and ovals with orange background represents the
   DSGs) and their corresponding regulated gene modules (right).

   In the above-mentioned analysis, we constructed gene co-expression
   modules using WGCNA. WGCNA is a well-established method for identifying
   significant genes from transcriptomic data. A gene significance(GS)
   measure is assigned to each gene. The higher the absolute value of GS
   [i], the more biologically significant is the i th gene. Gene
   significance of 0 indicates that the gene is not significant with
   regard to the biological question of interest. We applied WGCNA to
   select biologically and clinically significant genes with GS
   [i]-value > 0.7. The result is compared with the genes identified by
   the proposed method. We found that there are 17 common genes including
   CCL5,CCL8,CCL4,CCL23,CCDC135 and so on. The comparison is shown in the
   Fig. [78]5. From the figure, we can see that 6 common genes are in the
   intersection between the key genes selected from WGCNA and the driver
   genes of dosage-resistant from the proposed method,and 11 common genes
   are in the intersection between the key genes selected from WGCNA and
   the driver genes of dosage-sensitive from the proposed method.

Figure 5.

   [79]Figure 5
   [80]Open in a new tab

   The comparison of significant genes between WGCNA and the proposed
   method.

Enrichment analysis for driver patterns

   We applied the Gene Ontology (GO) category to expanded the results of
   DRGs and DSGs with the R package clusterProfiler, in which the
   significance threshold are set as p value < 0.05 and q-value < 0.05
   separately. The top 12 enriched GO terms for DRGs and DSGs are
   presented in Tables [81]2 and [82]3, respectively. The DRGs are mainly
   enriched in receptor binding, signaling pathway, cell migration,
   cellular response, cell chemotaxis, activator activity, etc. These
   processes are highly relevant to the trait of tumor and contribute to
   the chief progression of tumor. And the DSGs are mainly enriched in
   receptor binding, signaling pathway, template transcription, cell
   proliferation, cell differentiation, tissue development, cell division,
   etc. Compared with DRGs, the GO annotated to DSGs are mostly management
   functions and basic functions of the cell and tissue. Figure [83]6
   shows the top 12 biological process annotation analysis results on GO
   terms for DRGs and DSGs, respectively. The length of the bar represents
   the number of selected genes annotated onto this GO term, and the
   colors of the bar represents the significance levels of enrichment. The
   GO enrichment analysis results for DRGs and DSGs are in Supplementary
   Table [84]S5 and Supplementary Table [85]S6, respectively. From GO
   biological process enrichment analysis, we achieved CCL11, CCL16,
   CCL18, CCL23, CCL8, CCL5, APOB, BRCA1 play extremely significant role
   in DRGs, and EEF2, GNA11, GNA15, MED1, MYH8 and TCAP act very important
   role in DSGs. CCL11, CCL16, CCL18, CCL23, CCL8 and CCL5 have effective
   chemical attraction on eosinophils catalysing their accumulation at
   allergic inflammation sites. CCL11 has been shown to negatively
   regulate neurogenesis with certain human tumors^[86]22,[87]29. CCL16 is
   a chemokine prominently expressed in the liver, but also in ovarian and
   breast cancer^[88]25,[89]30. CCL18 is associated with some human cancer
   types including ovarian cancer^[90]31,[91]32. CCL23 plays a role in
   subclinical systemic inflammation and associated with
   atherogenesis^[92]33. CCL8 is strengthened in stromal fibroblasts at
   the tumor border and in tissues at which breast cancer cells incline to
   metastasize such as the lungs and the brain^[93]34,[94]35. CCL5 is a
   chemokine that boosts cancer progression by arousing and adjusting the
   inflammatory diseases, which sequently reconstruct the cancer
   microenvironment. Moreover, CCL5 also boosts metastasis in ovarian and
   breast cancer cells^[95]35,[96]36. APOB mutation is associated with
   some human cancer types such as steatosis, liver,
   hypocholesterolemia^[97]37,[98]38. BRCA1 mutation confers high risks of
   ovarian and breast cancer, encodes a tumor suppressor^[99]39. EEF2 is a
   key components for protein synthesis, and ubiquitously expressed in
   normal cells^[100]40. GNA11 is associated with Congenital
   Hemangioma^[101]41. GNA15 is associated with inhibition of
   proliferation, activation of apoptosis and differential
   effects^[102]42. MED1 is associated with some biological process such
   as prostate cancer cell growth^[103]43. MYH8 is a key components for
   muscle development and regeneration^[104]44. TCAP is involved with
   energy regulation and metabolism, and is implicated in the regulation
   of stress-related behaviors^[105]45.

Table 2.

   The top 12 enriched GO terms for DRGs.
   GO-BP Term pvalue qvalue gene
   lymphocyte chemotaxis 3.43E-09 2.12E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   monocyte chemotaxis 4.76E-09 2.12E-06 CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   mononuclear cell migration 1.95E-08 5.80E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   chemokine-mediated signaling pathway 3.4E-08 5.87E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   lymphocyte migration 3.67E-08 5.87E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   neutrophil chemotaxis 3.95E-08 5.87E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   cellular response to interleukin-1 4.91E-08 6.25E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   neutrophil migration 6.05E-08 6.25E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   myeloid leukocyte migration 6.31E-08 6.25E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5/AZU1
   granulocyte chemotaxis 1.09E-07 9.41E-06
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   cellular response to tumor necrosis factor 1.16E-07 9.41E-06
   APOB/CCL11/CCL16/CCL18/CCL23/CCL8CCL5/BRCA1
   granulocyte migration 1.84E-07 1.36E-05
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   [106]Open in a new tab

Table 3.

   The top 12 enriched GO terms for DSGs.
   GO-BP Term pvalue qvalue gene
   dopamine receptor signaling pathway 2.38E-06 0.001887
   PALM/GNA15/GNA11/KLF16
   skeletal muscle contraction 0.000127 0.032678 EEF2/TCAP/MYH8
   multicellular organismal movement 0.000239 0.032678 EEF2/TCAP/MYH8
   musculoskeletal movement 0.000239 0.032678 EEF2/TCAP/MYH8
   trabecula morphogenesis 0.000255 0.032678 MED1/SBNO2/UBE4B
   DNA-templated transcription, initiation 0.000289 0.032678
   MED1/MED16/POLR2E
   MED24/POLRMT
   phospholipase C-activating dopamine receptor signaling pathway 0.000289
   0.032678 GNA15/GNA11
   natural killer cell chemotaxis 0.000352 0.03489 CCL7/CCL4
   negative regulation of keratinocyte proliferation 0.000498 0.038042
   MED1/KDF1
   androgen receptor signaling pathway 0.000544 0.038042 MED1/MED16/MED24
   cardiac muscle tissue morphogenesis 0.000544 0.038042 MED1/TCAP/UBE4B
   response to radiation 0.000691 0.038042
   PPP1R1B/CCL7/GNA11/MYCECT2/UBE4B
   [107]Open in a new tab

Figure 6.

   [108]Figure 6
   [109]Open in a new tab

   The analysis of GO biological process annotation analysis for DSGs and
   DRGs. (A) The GO biological process enrichment of DRGs. (B) The GO
   biological process enrichment of DSGs.

Regulatory pathway analysis for driver patterns

   To further elucidate the biological pathway of gene modulators in the
   driver pattern, we performed the Kyoto Encyclopedia of Genes and
   Genomes (KEGG)^[110]46–[111]48, pathway enrichment analysis of the
   results of DSGs and DRGs list with the R package clusterProfiler
   separately. The top 12 pathways for DRGs and DSGs are presented
   respectively in Tables [112]4 and [113]5. For DRGs, the most
   significant canonical pathways are mainly bound up with Chemokine
   signaling, Cytokine – cytokine receptor interaction, Parkinson’s
   disease, Breast cancer, onset diabetes, Asthma, Prion diseases, etc.
   And the pathways for DSGs are major related to Thyroid hormone
   signaling, Apoptosis, Cytosolic DNA-sensing, Amoebiasis, Chagas
   disease, Cell cycle, WNT signaling, etc. Remarkablely for DRGs,
   Cytokine – cytokine receptor interaction is embedded in the Chemokine
   signaling pathway. For these two pathways, gene CCL11, CCL16, CCL18,
   CCL23, CCL8 and CCL5 have a key function (see Fig. [114]7). Chemokines
   play a role in the migration of many cells during development and are
   critical to nervous system development^[115]49. Cytokine interactions
   play an important role in health and are vital to many cancer during
   immunological and inflammatory responses in disease. It can lead to
   antagonist, additive, or synergistic activities in keeping
   physiological functions such as body temperature, feeding and somnus,
   as well as in anorectic, ardent fever, and sleepiness neurological
   representations of acute and chronic disease^[116]50. Parkinson’s
   disease is a pathway specifically associated to diseases of the central
   nervous system, NDUFB5 and SLC18A1 are enriched in this pathway (see
   Fig. [117]8). NDUFB5 and SLC18A1 are associated with multiple human
   diseases^[118]23,[119]51. Breast cancer is the major cause of cancer
   death among the female worldwide, gene BRCA1 and FGF22 are enriched in
   this pathway (see Fig. [120]9). FGF22 is required for brain development
   and associated with hereditary and neoplastic disease^[121]52. For
   DRGs, the thyroid hormones are key regulators of metabolism, growth and
   other body systems, MED1, MED16, MED24 and MYC are play a role in this
   pathway (see Fig. [122]10). MYC is involved in cell cycle progression,
   apoptosis, and cellular transformation and is related to many
   tumors^[123]53. Apoptosis is an evolutionarily conserved signaling
   pathway, which plays a fundamental role in regulating cell number and
   eliminating damaged or redundant cells^[124]54, GADD45B, TNFSF10, LMNB2
   are enriched in this pathway (see Fig. [125]11). TNFSF10 is associated
   with some human cancer types including ovarian cancer^[126]55. LMNB2
   acts as regulators of cell proliferation and differentiation, regards
   as a cancer risk biomarker in several cancer subtypes^[127]56. Cell
   cycle progression is completed through a reproducible sequence of
   events. It is a significant pathway associated with many diseases
   including ovarian cancer^[128]57. GADD45B and MYC are enriched in this
   pathway (see Fig. [129]12). GADD45B is implicated in some responses to
   cell injury including cell cycle checkpoints, apoptosis and DNA
   repair^[130]58. WNT signaling pathway is required for basic
   developmental processes and play an important role in human stem cells
   and cancers^[131]59. MYC and TBL1XR1 are enriched in this pathway (see
   Fig. [132]13). The KEGG pathway enrichment analysis result for DRGs and
   DSGs are shown in Supplementary Table [133]S7 and Supplementary
   Table [134]S8, respectively.

Table 4.

   The top 12 important pathways for DRGs.
   pathway pvalue gene
   Chemokine signaling pathway 4.80E-06 CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   Cytokine-cytokine receptor interaction 4.58E-05
   CCL11/CCL16/CCL18/CCL23/CCL8/CCL5
   Parkinson’s disease 0.052537 NDUFB5/SLC18A1
   Breast cancer 0.053866 BRCA1/FGF22
   Other types of O-glycan biosynthesis 0.056251 MFNG
   Ribosome 0.060691 RPS15/RPL19
   Vitamin digestion and absorption 0.061213 APOB
   Maturity onset diabetes of the young 0.066151 HNF1B
   NOD-like receptor signaling pathway 0.070724 MFN1/CCL5
   Phototransduction 0.071064 RCVRN
   Asthma 0.078387 CCL11
   Prion diseases 0.088067 CCL5
   [135]Open in a new tab

Table 5.

   The top 12 important pathways for DSGs.
                    pathway                   pvalue          gene
   Thyroid hormone signaling pathway         0.000438 MED1/MED16/MED24/MYC
   Apoptosis                                 0.009077 GADD45B/TNFSF10/LMNB2
   Cytosolic DNA-sensing pathway             0.017271 POLR2E/CCL4
   NF-kappa B signaling pathway              0.036048 GADD45B/CCL4
   Amoebiasis                                0.036744 GNA15/GNA11
   Chagas disease (American trypanosomiasis) 0.041034 GNA15/GNA11
   Cholinergic synapse                       0.048583 PIK3R5/GNA11
   Cytokine-cytokine receptor interaction    0.05237  CCL7/CCL4/TNFSF10
   Cell cycle                                0.058258 GADD45B/MYC
   FoxO signaling pathway                    0.065054 GADD45B/TNFSF10
   Ubiquitin mediated proteolysis            0.069434 CDC34/UBE4B
   Wnt signaling pathway                     0.074816 MYC/TBL1XR1
   [136]Open in a new tab

Figure 7.

   [137]Figure 7
   [138]Open in a new tab

   The six genes are enriched in the top 2 significant pathways(red,
   up-regulated). (A) Chemokine signaling pathway. (B) Cytokine-cytokine
   receptor interaction.

Figure 8.

   [139]Figure 8
   [140]Open in a new tab

   Parkinson disease pathway(red, up-regulated). NDUFB5 and SLC18A1 are
   enriched in this pathway. NDUFB5 is a gene alias of the complex 1
   deficiency(Cx1) and SLC18A1 is a gene alias of the vesicular amine
   transporter(VMAT).

Figure 9.

   [141]Figure 9
   [142]Open in a new tab

   Breast cancer pathway(red, up-regulated). MED1, MED16, MED24 and MYC
   are enriched in this pathway.

Figure 10.

   [143]Figure 10
   [144]Open in a new tab

   Thyroid hormones pathway (red, up-regulated). BRCA1 and FGF22 are
   enriched in this pathway.

Figure 11.

   [145]Figure 11
   [146]Open in a new tab

   Apoptosis pathway(red, up-regulated). GADD45B, TNFSF10, LMNB2 are
   enriched in this pathway. TRAIL is a gene alias of TNFSF10, LMNB2 is a
   gene alias of the lamin.

Figure 12.

   [147]Figure 12
   [148]Open in a new tab

   Cell cycle pathway(red, up-regulated). GADD45B and MYC are enriched in
   this pathway.

Figure 13.

   [149]Figure 13
   [150]Open in a new tab

   Wnt signaling pathway pathway(red, up-regulated). MYC and TBL1XR1 are
   enriched in this pathway.

Discussion

   The main purpose of this research was to distinguish the candidate
   driver genes and the corresponding driving mechanism for resistant and
   sensitive tumor from the heterogeneous data. We presented a
   machine-learning approach to integrate somatic mutations, CNVs and gene
   expression profiles to distinguish interactions and regulations for
   dosage-sensitive and dosage-resistant genes of ovarian cancer. We
   developed a general framework for integrating high dimensional
   heterogeneous omics data and can potentially lead to new insight into
   many related studies at the systems level. In the framework, weighted
   correlation network analysis (WGCNA) was applied to the co-expression
   network analysis; the mutation network was constructed by integrating
   the CNVs and somatic mutations and the initial candidate modulators was
   selected from the clustering vertexs of network; the regression tree
   model was utilized for module networks learning in which the obtained
   gene modules and candidate modulators were trained for the modulators
   regulatory mechanism; finally, a local polynomial regression fitting
   model was applied to identify dosage-sensitive and dosage-resistant
   driver patterns. From the Gene Ontology and pathway enrichment
   analysis, we obtained some biologically meaningful gene modulators,
   such as CCL11, CCL16, CCL18, CCL23, CCL8, CCL5, APOB, BRCA1, SLC18A1,
   FGF22, GADD45B, GNA15, GNA11, and so on, which can be conducive to
   appropriate diagnosis and treatment to cancer patients.

Methods

   We integrated gene expression, CNV, and somatic mutation data to
   identify the driver patterns for resistant and sensitive tumors. The
   overview of the overall integrative analysis is shown in Fig. [151]14.
   As can be seen in Fig. [152]14, the procedure takes these datasets as
   inputs and then produces a short list of genes as candidate divers. The
   method contains the following steps: (1) extract a pool of candidate
   modulators as initial driver genes, which have significant CNVs/Somatic
   mutations in ovarian cancer samples and achieve the initial gene
   modules from the gene expression by weighted correlation network
   analysis (WGCNA); (2) construct the heterogeneous network by module
   networks learning and evaluate the regulatory mechanism between these
   different omics data by liner regression model; (3) identify driver
   patterns and modulators interactions in the constructed regulatory
   networks using gene ontology and pathway enrichment analysis.

Figure 14.

   [153]Figure 14
   [154]Open in a new tab

   The scheme of driver patterns identification. (1) Acquire the initial
   gene modules for gene expression profiles via WGCNA network analysis
   (on the left) and extract a pool of candidate modulators as initial
   driver genes integrate CNVs and Somatic mutations in ovarian cancer
   samples (on the right); (2) Construct the heterogeneous network by
   module networks learning and evaluate the regulatory relationships by
   liner regression model; (3) Identify modulators interactions in the
   constructed regulatory networks and analyze the driver patterns through
   and gene ontology analysis and pathway enrichment analysis.

Data availability

   We downloaded gene expression data from the Agilent 244 K Custom Gene
   Expression platform, CNV data from the Affymatrix Genome-Wide Human SNP
   Array 6.0 platform, and somatic mutation data from whole exome
   sequencing data obtained using Illumina Genome Analyzer DNA Sequencing.
   The clinical data of these patients are analyzed to determine
   cis-platinum chemotherapy response samples and these tumor samples are
   classified into resistant and sensitive groups, sensitive tumors have a
   platinum-free space of six months or more after the last primary
   treatment, no sign of residue or relapse, and the follow-up will at
   least six months later; and resistant tumors are recurred within six
   months after the last treatment^[155]60. According to the above
   definition, we identified 93 platinum-resistant and 231
   platinum-sensitive primary tumors from the TCGA website, for which the
   gene expression profiles, CNV, and somatic mutation data were available
   as well. All the dataset used in this paper can be downloaded from
   Broad GDAC Firehose website
   ([156]http://gdac.broadinstitute.org/runs/analyses_2014_10_17/data/OV/2
   0141017).

Filtering mutation genes by gene length

   The length of genes in human are very different and so the mutation
   probabilities of different genes are in vast difference. There may be
   some genes which have mutation only because they are long. In order to
   filter those long but non-qualified genes (genes which have mutations
   only because they are long yet they are not driver mutations) we
   adopted the filtering strategies of VarWalker^[157]61 and computed a
   probability weight vector (PWV) by formulating a generalized additive
   model and estimated a relative mutation rate for each gene. The vector
   of X represents the gene length of cDNA. We applied the following model
   to estimate the mutated probability based on the cDNA length of genes,
   [MATH: <mi>log</mi><mspace
   width=".10em"></mspace><mi>i</mi><mi>t</mi><mo
   stretchy="false">(</mo><mi>π</mi><mo
   stretchy="false">)</mo><mo>=</mo><mspace
   width=".25em"></mspace><mi>log</mi><mrow><mo
   stretchy="true">(</mo><mrow><mfrac><mi>π</mi><mrow><mn>1</mn><mo>−</mo>
   <mi>π</mi></mrow></mfrac></mrow><mo
   stretchy="true">)</mo></mrow><mo>~</mo><mi>f</mi><mo
   stretchy="false">(</mo><mi>X</mi><mo stretchy="false">)</mo> :MATH]
   1

   where
   [MATH:
   <mi>π</mi><mo>=</mo><mfrac><mrow><mo>#</mo><mi>m</mi><mi>u</mi><mi>t</m
   i><mi>a</mi><mi>t</mi><mi>i</mi><mi>o</mi><mi>n</mi><mspace
   width=".5em"></mspace><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>s</mi
   ></mrow><mrow><mo>#</mo><mi>C</mi><mi>C</mi><mi>D</mi><mi>S</mi><mspace
   width=".5em"></mspace><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mi>s</mi
   ></mrow></mfrac> :MATH]
   represents the proportion of mutant genes in the investigated samples.
   And the
   [MATH: <mi>f</mi><mo stretchy="false">(</mo><mo>⋅</mo><mo
   stretchy="false">)</mo> :MATH]
   represents an unspecified smooth function. The function is then solved
   using a monotonic cubic spline with six knots. Through the above
   process of fitting, each gene was assigned a relative weight which
   would be used to select the important genes in the next resampling
   procedure.

   Then a resampling test was applied to random gene sets for each
   samples. In the random selection process, the probability of each gene
   to be selected is based on the probability weight calculated in the
   above procedure. The weight resampling process was performed 1000 times
   in each sample. The mutation frequency was computed for each gene using
   following formula:
   [MATH:
   <mi>f</mi><mi>r</mi><mi>e</mi><mo>=</mo><mfrac><mrow><mo>#</mo><mo
   stretchy="false">(</mo><mi>s</mi><mi>e</mi><mi>l</mi><mi>e</mi><mi>c</m
   i><mi>t</mi><mi>i</mi><mi>n</mi><mi>g</mi><mspace
   width=".5em"></mspace><mi>t</mi><mi>h</mi><mi>e</mi><mspace
   width=".5em"></mspace><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mspace
   width=".5em"></mspace><mi>i</mi><mi>n</mi><mspace
   width=".5em"></mspace><mi>r</mi><mi>e</mi><mi>s</mi><mi>a</mi><mi>m</mi
   ><mi>p</mi><mi>l</mi><mi>i</mi><mi>n</mi><mi>g</mi><mspace
   width=".5em"></mspace><mi>t</mi><mi>e</mi><mi>s</mi><mi>t</mi><mo
   stretchy="false">)</mo></mrow><mrow><mn>1000</mn></mrow></mfrac> :MATH]
   2

   where
   [MATH: <mo>#</mo><mo
   stretchy="false">(</mo><mi>s</mi><mi>e</mi><mi>l</mi><mi>e</mi><mi>c</m
   i><mi>t</mi><mi>i</mi><mi>n</mi><mi>g</mi><mspace
   width=".5em"></mspace><mi>t</mi><mi>h</mi><mi>e</mi><mspace
   width=".5em"></mspace><mi>g</mi><mi>e</mi><mi>n</mi><mi>e</mi><mspace
   width=".5em"></mspace><mi>i</mi><mi>n</mi><mspace
   width=".5em"></mspace><mi>r</mi><mi>e</mi><mi>s</mi><mi>a</mi><mi>m</mi
   ><mi>p</mi><mi>l</mi><mi>i</mi><mi>n</mi><mi>g</mi><mspace
   width=".5em"></mspace><mi>t</mi><mi>e</mi><mi>s</mi><mi>t</mi><mo
   stretchy="false">)</mo> :MATH]
   represents the times and fre represents the frequency of the gene being
   selected across 1000 times in resampling test. Then we filter genes
   with fre >5% that indicates these genes may occurs at random. Those
   genes with fre >5% represents they are unlikely mutated at random.
   After resampling test, we obtained these significant genes as candidate
   modulators.

Gene co-expression modules analysis using WGCNA

   Weighted correlation network analysis is a framework for co-expression
   analysis to explore gene modules of high correlation to external sample
   traits^[158]62,[159]63. The gene expression correlation networks are
   based on quantitative measurements correlations. Assume that we are
   given an n × m expression matrix X where the row indices correspond to
   genes (
   [MATH: <mi>i</mi><mo>=</mo><mn>1, ...,</mn><mspace
   width=".25em"></mspace><mi>n</mi> :MATH]
   ) and the column indices (
   [MATH: <mi>j</mi><mo>=</mo><mn>1, ...,</mn><mspace
   width=".25em"></mspace><mi>m</mi> :MATH]
   ) correspond to samples:
   [MATH: <mi>X</mi><mo>=</mo><mrow><mo
   stretchy="true">(</mo><mrow><mtable><mtr><mtd><mrow><msub><mrow><mi>X</
   mi></mrow><mrow><mn>11</mn></mrow></msub></mrow></mtd><mtd><mrow><msub>
   <mrow><mi>X</mi></mrow><mrow><mn>12</mn></mrow></msub></mrow></mtd><mtd
   ><mo>⋯</mo></mtd><mtd><mrow><msub><mrow><mi>X</mi></mrow><mrow><mn>1</m
   n><mi>m</mi></mrow></msub></mrow></mtd></mtr><mtr><mtd><mrow><msub><mro
   w><mi>X</mi></mrow><mrow><mn>21</mn></mrow></msub></mrow></mtd><mtd><mr
   ow><msub><mrow><mi>X</mi></mrow><mrow><mn>22</mn></mrow></msub></mrow><
   /mtd><mtd><mo>⋯</mo></mtd><mtd><mrow><msub><mrow><mi>X</mi></mrow><mrow
   ><mn>2</mn><mi>m</mi></mrow></msub></mrow></mtd></mtr><mtr><mtd><mo>⋮</
   mo></mtd><mtd><mo>⋮</mo></mtd><mtd><mo>⋱</mo></mtd><mtd><mo>⋮</mo></mtd
   ></mtr><mtr><mtd><mrow><msub><mrow><mi>X</mi></mrow><mrow><mi>n</mi><mn
   >1</mn></mrow></msub></mrow></mtd><mtd><mrow><msub><mrow><mi>X</mi></mr
   ow><mrow><mi>n</mi><mn>2</mn></mrow></msub></mrow></mtd><mtd><mo>⋯</mo>
   </mtd><mtd><mrow><msub><mrow><mi>X</mi></mrow><mrow><mi>n</mi><mi>m</mi
   ></mrow></msub></mrow></mtd></mtr></mtable></mrow><mo
   stretchy="true">)</mo></mrow> :MATH]
   3

   The co-expression similarity s [ik] is applied to calculate the
   adjacency matrix:
   [MATH:
   <msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><m
   o>=</mo><mo stretchy="false">|</mo><mi>c</mi><mi>o</mi><mi>r</mi><mo
   stretchy="false">(</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></m
   row></msub><mo>,</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>j</mi></mro
   w></msub><mo stretchy="false">)</mo><mo stretchy="false">|</mo> :MATH]
   4

   where s [ij] is the absolute value of the correlation coefficient
   between gene i (
   [MATH: <mi>i</mi><mo>=</mo><mn>1, ...,</mn><mspace
   width=".25em"></mspace><mi>n</mi> :MATH]
   ) and gene j (
   [MATH: <mi>j</mi><mo>=</mo><mn>1, ...,</mn><mspace
   width=".25em"></mspace><mi>n</mi> :MATH]
   ) in the expression profiles matrix X. The value of s [ij] is in [0,
   1]. The adjacency function α [ij] is defined with the co-expression
   similarity s [ij] by soft threshold β as following,
   [MATH:
   <msub><mrow><mi>α</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><m
   o>=</mo><mi>p</mi><mi>o</mi><mi>w</mi><mi>e</mi><mi>r</mi><mo
   stretchy="false">(</mo><msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi><mi
   >j</mi></mrow></msub><mo>,</mo><mi>β</mi><mo
   stretchy="false">)</mo><mo>=</mo><msubsup><mrow><mi>s</mi></mrow><mrow>
   <mi>i</mi><mi>j</mi></mrow><mrow><mi>β</mi></mrow></msubsup> :MATH]
   5

   where β ≥ 1. To measure the topological structure with adjacency
   function, topological overlap matrix (TOM) is applied to build the
   matrix
   [MATH: <mi mathvariant="normal">Ω</mi><mo>=</mo><mo
   stretchy="false">[</mo><msub><mrow><mi>ω</mi></mrow><mrow><mi>i</mi><mi
   >j</mi></mrow></msub><mo stretchy="false">]</mo> :MATH]
   ,
   [MATH: <mtable
   columnalign="left"><mtr><mtd><msub><mrow><mi>ω</mi></mrow><mrow><mi>i</
   mi><mi>j</mi></mrow></msub><mo>=</mo><mfrac><mrow><msub><mrow><mi>α</mi
   ></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>+</mo><msub><mrow><
   mi>u</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow><mrow><m
   i>m</mi><mi>i</mi><mi>n</mi><mo
   stretchy="false">(</mo><msub><mrow><mi>m</mi></mrow><mrow><mi>i</mi></m
   row></msub><mo>,</mo><msub><mrow><mi>m</mi></mrow><mrow><mi>j</mi></mro
   w></msub><mo
   stretchy="false">)</mo><mo>+</mo><mn>1</mn><mo>−</mo><msub><mrow><mi>α<
   /mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow></mfrac></mtd
   ></mtr><mtr><mtd><msub><mrow><mi>m</mi></mrow><mrow><mi>i</mi></mrow></
   msub><mo>=</mo><mrow><munder><mo>∑</mo><mrow><mi>k</mi><mo>,</mo><mi>k<
   /mi><mo>≠</mo><mi>i</mi></mrow></munder></mrow><msub><mrow><mi>α</mi></
   mrow><mrow><mi>i</mi><mi>k</mi></mrow></msub></mtd></mtr><mtr><mtd><msu
   b><mrow><mi>u</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><mo>=<
   /mo><msub><mrow><mo>∑</mo></mrow><mrow><mi>k</mi></mrow></msub><msub><m
   row><mi>α</mi></mrow><mrow><mi>i</mi><mi>k</mi></mrow></msub><msub><mro
   w><mi>α</mi></mrow><mrow><mi>k</mi><mi>j</mi></mrow></msub></mtd></mtr>
   </mtable> :MATH]
   6

   in which the network structure is not only based on the two genes with
   direct adjacency correlation, but also on the indirect adjacency
   relationship.

   The gene expression network is constructed based on matrix Ω. Gene
   modules are detected by unsupervised hierarchical clustering method.
   Give the q th gene co-expression module, the eigengene expression is E
   ^(q). The correlation coefficient for the gene x [i] and the module E
   ^(q) can be defined as:
   [MATH: <msubsup><mrow><mi>k</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>=</mo><mi>c</mi><mi>o</mi><
   mi>r</mi><mo
   stretchy="false">(</mo><msub><mrow><mi>x</mi></mrow><mrow><mi>i</mi></m
   row></msub><mo>,</mo><msup><mrow><mi>E</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msup><mo stretchy="false">)</mo> :MATH]
   7

   where
   [MATH: <msubsup><mrow><mi>k</mi></mrow><mrow><mi>i</mi></mrow><mrow><mo
   stretchy="false">(</mo><mi>q</mi><mo
   stretchy="false">)</mo></mrow></msubsup><mo>∈</mo><mo
   stretchy="false">[</mo><mspace
   width="-.25em"></mspace><mo>−</mo><mspace
   width="-.25em"></mspace><mn>1,</mn><mspace
   width=".25em"></mspace><mn>1]</mn> :MATH]
   .

Modulators identification using module networks learning

   From the initial gene co-expression modules, we focused on the
   candidate modulators’ regulatory by regression tree model. Each
   candidate modulator is connected to a group of genes by maximizing the
   normal-gamma scoring function which describes the qualitative behavior
   of a candidate modulator that regulate the gene expression modules. A
   regression tree is formed with two basic blocks: decision nodes and
   leaf nodes. Each node corresponds to one of the selected modulators
   above. Our learning is an iterative process. In each iteration, a
   regulation procedure is found for each gene module and then each gene
   reassigned to the corresponding module due to prediction with the best
   regulatory function. We searched for the model with the highest score
   by using the Expectation Maximization (EM) algorithm. The scoring
   function defined as:
   [MATH: <mi>log</mi><mspace width=".10em"></mspace><mi>P</mi><mo
   stretchy="false">(</mo><mi>C</mi><mo>,</mo><mi>N</mi><mo
   stretchy="false">)</mo><mo>=</mo><mspace
   width=".25em"></mspace><mi>log</mi><mspace
   width=".10em"></mspace><mi>P</mi><mo
   stretchy="false">(</mo><mi>C</mi><mo
   stretchy="false">|</mo><mi>N</mi><mo
   stretchy="false">)</mo><mo>+</mo><mspace
   width=".25em"></mspace><mi>log</mi><mspace
   width=".10em"></mspace><mi>P</mi><mo
   stretchy="false">(</mo><mi>N</mi><mo stretchy="false">)</mo> :MATH]
   8

   where C is the candidate data and N the model structure of the network.
   The first term is the candidate data for a given model which meets
   normal gamma distribution. The second part is a penalty score on
   network complexity. The EM algorithm ensures the reliability of the
   model until convergence to a local maximum score. The essence of the
   algorithm consists of two steps: the first procedure is learning the
   best regulation program (regression tree) for each module. The tree is
   constructed from the root to its leaves. The second procedure is to
   find the association rules for corresponding module.

Dosage-sensitive and dosage-resistant analysis

   From the module networks learning, the candidate genes were generated.
   Then we applied a local polynomial regression fitting model to conduct
   dosage-sensitive and dosage-resistant analysis. In the model, a predict
   function was adopted to obtain n isometric points on the LOESS
   curve^[160]64. To achieve the interrelation between the copy number
   variation and the gene expression, a monotonicity function is defined
   as M:
   [MATH: <mi>M</mi><mo>=</mo><mfrac><mn>2</mn><mrow><mi>n</mi><mo
   stretchy="false">(</mo><mi>n</mi><mo>−</mo><mn>1)</mn></mrow></mfrac><m
   o>×</mo><munderover><mo>∑</mo><mrow><mi>j</mi><mn>=1</mn></mrow><mi>n</
   mi></munderover><munderover><mo>∑</mo><mrow><mi>i</mi><mn>=1</mn></mrow
   ><mi>j</mi></munderover><msub><mrow><mi>S</mi></mrow><mrow><mi>i</mi><m
   i>j</mi></mrow></msub> :MATH]
   9

   in which
   [MATH:
   <msub><mrow><mi>S</mi></mrow><mrow><mi>i</mi><mi>j</mi></mrow></msub><m
   o>=</mo><mrow><mo stretchy="true">{</mo><mrow><mtable
   columnalign="right"><mtr columnalign="right"><mtd
   columnalign="right"><mrow><mn>1,</mn></mrow></mtd><mtd
   columnalign="right"><mrow><msub><mrow><mi>s</mi></mrow><mrow><mi>j</mi>
   </mrow></msub><mo>></mo><msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mspace width=".1em"></mspace></mrow></mtd></mtr><mtr
   columnalign="right"><mtd
   columnalign="right"><mrow><mn>0,</mn></mrow></mtd><mtd
   columnalign="right"><mrow><msub><mrow><mi>s</mi></mrow><mrow><mi>j</mi>
   </mrow></msub><mo>=</mo><msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mspace width=".1em"></mspace></mrow></mtd></mtr><mtr
   columnalign="right"><mtd
   columnalign="right"><mrow><mo>−</mo><mn>1,</mn></mrow></mtd><mtd
   columnalign="right"><mrow><msub><mrow><mi>s</mi></mrow><mrow><mi>j</mi>
   </mrow></msub><mo><</mo><msub><mrow><mi>s</mi></mrow><mrow><mi>i</mi></
   mrow></msub><mspace
   width=".1em"></mspace></mrow></mtd></mtr></mtable></mrow></mrow> :MATH]
   10

   and n is the number of samples. To quantize the association between
   gene expression and CNVs (somatic mutations), another linear fitting
   model was adopted and a straight line with slope of K was obtained. The
   slope of linear model can reflect the relationship of variables in LOSS
   model to some extent. Hence the dosage sensitivity (DS) score can be
   defined as:
   [MATH: <mi>D</mi><mi>S</mi><mo>=</mo><mi>M</mi><mo>×</mo><mo
   stretchy="false">|</mo><mi>K</mi><mo stretchy="false">|</mo> :MATH]
   11

   Herein, for a gene, a larger DS indicates stronger dosage sensitivity.

Electronic supplementary material

   [161]Supplementary information^ (3.2MB, pdf)

Acknowledgements