Abstract

   In many cancers, significantly down- or upregulated genes are found
   within chromosomal regions with DNA copy number alteration opposite to
   the expression changes. Generally, this paradox has been overlooked as
   noise, but can potentially be a consequence of interference of
   epigenetic regulatory mechanisms, including microRNA-mediated control
   of mRNA levels.

   To explore potential associations between microRNAs and paradoxes in
   non-small-cell lung cancer (NSCLC) we curated and analyzed lung
   adenocarcinoma (LUAD) data, comprising gene expressions, copy number
   aberrations (CNAs) and microRNA expressions. We integrated data from
   1,062 tumor samples and 241 normal lung samples, including
   newly-generated array comparative genomic hybridization (aCGH) data
   from 63 LUAD samples.

   We identified 85 “paradoxical” genes whose differential expression
   consistently contrasted with aberrations of their copy numbers.
   Paradoxical status of 70 out of 85 genes was validated on sample-wise
   basis using The Cancer Genome Atlas (TCGA) LUAD data. Of these, 41
   genes are prognostic and form a clinically relevant signature, which we
   validated on three independent datasets. By meta-analysis of results
   from 9 LUAD microRNA expression studies we identified 24
   consistently-deregulated microRNAs. Using TCGA-LUAD data we showed that
   deregulation of 19 of these microRNAs explains differential expression
   of the paradoxical genes.

   Our results show that deregulation of paradoxical genes is crucial in
   LUAD and their expression pattern is maintained epigenetically, defying
   gene copy number status.

   Keywords: lung adenocarcinoma, copy number aberrations, microRNA, gene
   regulatory network, prognostic signature

INTRODUCTION

   Integration of array comparative genomic hybridization (aCGH) with mRNA
   microarray data has revealed significant associations between
   occurrence of copy number aberrations (CNAs) and differential gene
   expression in diverse cancers [[46]1–[47]7]. However, significantly
   downregulated genes have been often found to reside within chromosomal
   regions with increased number of copies (gains) and vice versa,
   creating a paradoxical signal. For example, Phillips et al. reported
   that 14% of the genes downregulated in prostate cancer reside within
   regions of DNA copy number gains, and approximately 9% of upregulated
   ones reside in regions of DNA copy number loss [[48]1]. Usually, this
   paradox is ignored as a noise, but can potentially be a consequence of
   interference of other regulatory mechanisms controlling mRNA
   transcription [[49]8].

   In recent years, the cancer research community has investigated how
   epigenetic regulators, known as microRNAs (miRNAs), form prognostic
   signatures and affect regulatory pathways that can lead to
   tumorigenesis. miRNAs are short non-coding RNAs that regulate the
   translation of mRNA by serving as guide molecules in mRNA silencing,
   mediated by various associated proteins [[50]9]. Targeting most
   protein-coding transcripts [[51]10], miRNAs are involved in diverse
   biological processes, including development and homeostasis [[52]11,
   [53]12]. Moreover, growing evidence implicates miRNAs as factors
   associated with major human pathologies, including cancer
   [[54]13–[55]16].

   In 2012 approximately 13% of all new cancer cases worldwide were
   cancers of lung (and bronchus), making lung cancer one of the most
   frequent cancer type (surpassed only by breast cancer in women)
   [[56]17]. Despite smoking cessation, and advances in detection and
   treatment, lung cancer remains the main cause of cancer-related death
   worldwide for both men and women [[57]18]. With nearly 160,000 deaths
   annually it kills more people than other common cancers combined,
   including colon, breast and prostate [[58]18]. The most common type of
   lung cancer is lung adenocarcinoma (LUAD), comprising approximately 45%
   of all lung cancer cases [[59]19, [60]20].

   In this paper, we integratively analyzed gene expression and CNA data
   from 12 publicly available LUAD datasets, and new CNA data obtained
   from 63 LUAD samples profiled at our institution (Figure [61]1). By
   combining and analyzing data from 1,062 tumor tissue samples and 241
   normal samples we identified genes whose differential expression
   consistently was in contrast to aberrations of their copy numbers.
   Paradoxical status of these genes was then validated on the sample
   level, using TCGA LUAD gene expression and CNA data. Furthermore, to
   assess whether the paradoxical expression patterns were caused by
   epigenetic disruptions in lung tumors, we compiled miRNA expression
   data from 406 LUAD samples and 321 normal lung samples. Using
   miRNA:gene associations from mirDIP [[62]21] and the measure of
   co-expression, we showed that this paradox can be explained by 19
   miRNAs consistently deregulated in LUAD.

Figure 1. Flowchart depicting sequence of analyses/computational steps as
performed and datasets as used.

   [63]Figure 1
   [64]Open in a new tab

   For more details see Materials and Methods section.

RESULTS

Identification of the paradoxical genes

   First, we examined the frequency and statistical significance of
   autosomal CNAs across 3 LUAD aCGH datasets, including our new data and
   two publicly available datasets (see Materials and Methods). We
   identified multiple chromosomal regions with significantly (p < 0.05,
   randomization test) high frequency of gains (more than two copies) or
   losses (less than two copies); frequencies and corresponding p-values
   are listed in [65]Supplementary Data 1. The most extensive positive
   aberrations identified occur on the q-arm of chromosomes 1, 7 and 8 as
   well the p-arm of chromosomes 5 and 7 (Figure [66]2A). The most
   significant copy number losses occurred on the q-arm of chromosomes 6,
   9, 13, 15, and 18, along with the p-arm of chromosomes 8 and 9.

Figure 2. Association between chromosomal copy number aberrations and
differential expression of genes.

   [67]Figure 2
   [68]Open in a new tab

   (A) Frequencies of gains (pointing outbound) and losses (pointing
   inbound) of the given chromosomal region as obtained from integrative
   analysis of three aCGH datasets. The aberration frequencies are
   depicted in range from 0-50% and regions with significant frequency of
   aberrations are highlighted by color (orange – losses, green – gains).
   Precise chromosomal locations of these paradoxical genes are depicted
   in the circular plot. (B) Tumor-vs-normal expression fold change of the
   paradoxical genes, obtained across 10 publicly available datasets. In
   Figures A and B, symbols of the downregulated genes are labeled red,
   while the symbols of the upregulated genes are labeled blue. (C) Venn
   diagram showing overlaps between up-/downregulated genes and genes
   residing within the regions of chromosomal copy number gain, or loss.

   To identify genes whose differential expression remained consistent
   across patient cohorts, we performed integrative analysis of 10
   publicly available gene expression datasets, comprising 740 LUAD
   samples and 241 normal tissue samples. Among 15,323 genes that were
   subjected to the robust rank analysis, we identified 1,309 genes that
   were significantly deregulated across the datasets (p < 0.01, robust
   rank aggregation), where 701 of these were downregulated genes, and 608
   are upregulated. Excluding 9 non-protein-coding genes, reduced the
   numbers to 600 upregulated and 700 downregulated genes (see
   [69]Supplementary Data 2). Non-protein coding genes involve
   downregulated C17orf91, and eight upregulated miRNA sequences: MIR7112,
   MIR6847, MIR7113, MIR671, MIR4647, MIR93, MIR25, MIR4721, all of which
   are intragenic miRNAs residing within upregulated genes. In subsequent
   meta-analysis of miRNA expression in LUAD, we found none of these
   miRNAs to be significantly deregulated.

   We identified 132 (14.6%) downregulated genes residing in regions with
   decreased number of copies and 102 (22%) upregulated ones residing in
   regions with increased number of copies (p < 2.2E-16, Chi-squared
   test). Importantly, 63 consistently downregulated and 22 consistently
   upregulated genes reside on chromosomal regions with opposite direction
   of aberration – gains and losses, respectively (Figure [70]2).
   Hereafter, we refer to these 85 genes as paradoxical genes
   ([71]Supplementary Table 1).

Validation of the CNAs and differential expression of the paradoxical genes

   While we identified paradoxical genes using data from diverse patient
   cohorts, we sought to validate our findings in an independent,
   homogeneous datasets. We used data from TCGA comprising: CNA, mRNA-seq
   and miRNA-seq LUAD data from 514 LUAD and 57 normal samples. This
   dataset was selected for validation solely on the basis of the CNA,
   mRNA and miRNA expression data availability, without considering
   clinicopathological characteristics of the data. Five of the 85
   paradoxical genes could not be evaluated due to the missing CNA or
   expression data. From the remaining 80, we successfully validated 70
   genes (p < 1E-4, randomization test), whose copy aberration status and
   differential expression confirmed results from the integrative analysis
   ([72]Supplementary Figure 1, [73]Supplementary Table 1). Further
   analysis only considers these 70 validated paradoxical genes.

   To test whether paradoxical deregulation occurs in the individual
   samples, we measured frequencies of paradoxical co-occurrence of
   up-/downregulation (expression z-score >/< +/−1.647) and losses/gains
   (log2 CNA >/< +/−0.2) of the paradoxical genes across individual TCGA
   LUAD samples (Figure [74]3). We found that frequency of paradoxical
   deregulation ranges from 9% (NFKBIA) to 74% (NPR1) of samples, median
   frequency equal to 46% and mean 44%. For all the 70 genes frequency of
   paradoxical deregulation exceeds the frequency of regular
   (non-paradoxical) deregulation.

Figure 3. Frequency of deregulation and CNAs of paradoxical genes.

   [75]Figure 3
   [76]Open in a new tab

   Barplot at the left shows frequencies of paradoxical and
   non-paradoxical co-occurrence of deregulated expression and CNAs.
   Barplots depicting frequencies of up- and downregulation (middle), gain
   and losses (right) of 70 validated paradoxical genes, as occur across
   TCGA LUAD samples. Colors of the gene labels indicate their
   deregulation/CNA status as obtained from the integrative analysis.

Co-occurrence between deregulated miRNAs and paradoxical genes

   We hypothesize that differential expression of paradoxical genes can be
   to a large extent explained by deregulation of the miRNAs that target
   these genes, either directly or through regulatory mediators, such as
   transcription factors. We thus performed meta-analysis of 9 papers
   reporting differentially expressed miRNAs in LUAD to identify
   consistently deregulated miRNAs. We found 24 such miRNAs (p < 0.05,
   robust rank analysis, see Methods section), 13 of which were
   upregulated (hsa-mir-21, hsa-mir-182, hsa-mir-210, hsa-mir-9,
   hsa-mir-183, hsa-mir-135b, hsa-mir-130b, hsa-mir-200b, hsa-mir-191,
   hsa-mir-31, hsa-mir-196b, hsa-mir-196a, hsa-mir-200a, ordered by
   significance) and 11 downregulated (hsa-mir-126, hsa-mir-145,
   hsa-mir-30a, hsa-mir-218, hsa-mir-139, hsa-mir-195, hsa-mir-486,
   hsa-mir-143, hsa-mir-144, hsa-mir-34c, hsa-mir-16) (Figure [77]4,
   [78]Supplementary Data 3).

Figure 4. Ranking of the differentially expressed miRNAs as reported across 9
LUAD miRNA studies.

   [79]Figure 4
   [80]Open in a new tab

   The lower the rank the greater the reported significance (and/or
   expression fold change) of the corresponding miRNA. Height of the bars
   denotes total number of reported miRNAs in each study.

   We then tested co-occurrence between the miRNAs deregulation and
   emergence of the paradoxical genes within the same cohort of patients.
   We first assessed differential expression of the 24 consistently
   deregulated miRNAs using TCGA-LUAD miRNA-seq data. Nineteen
   significantly deregulated miRNAs (out of 24, p = 1.18E-8,
   hypergeometric test) validated the results of the meta-analysis
   (hsa-mir-130b, hsa-mir-135b, hsa-mir-139, hsa-mir-143, hsa-mir-144,
   hsa-mir-182, hsa-mir-183, hsa-mir-195, hsa-mir-196a, hsa-mir-196b,
   hsa-mir-200a, hsa-mir-200b, hsa-mir-21, hsa-mir-210, hsa-mir-218,
   hsa-mir-30a, hsa-mir-31, hsa-mir-486, hsa-mir-9). While five miRNAs
   (hsa-mir-191, −34c, −126, −145, −16) did not validate; hsa-mir-191,
   -34c failed to pass the validation criteria only due to insufficient
   expression fold change, although their expression was altered
   significantly ([81]Supplementary Table 2). Therefore, only 19 validated
   miRNAs are used for further analysis.

Association between deregulated miRNAs and paradoxical genes

   According to mirDIP [[82]21], 46 of the 70 paradoxical genes (65.7%)
   are targeted by subsets of the 19 deregulated miRNAs (p = 4.8E-3,
   hypergeometric test; Figure [83]5). Moreover, for 35 of these 46
   targeted paradoxical genes, we found at least one deregulated miRNA
   that targets the given gene with its expression status contrasting the
   expression status of the given gene, implying that paradoxical
   expression of this gene could be explained by the miRNA deregulation.

Figure 5. The network of interactions between deregulated miRNAs and their
paradoxical gene targets as obtained from mirDIP.

   [84]Figure 5
   [85]Open in a new tab

   Rectangles and circles represent miRNAs and genes, respectively. Red
   color denotes downregulated transcripts, while blue denotes upregulated
   ones. Size of nodes corresponds to number of interactions (degree).
   Solid red lines indicate miRNA:gene interactions between inversely
   deregulated transcripts, indicating potential causal associations.

   To further asses the association between expression of individual
   miRNAs and paradoxical genes we calculated partial correlation between
   them [[86]22], using copy number status of the paradoxical genes as a
   controlling variable. We found 369 significantly co-expressed
   miRNA:gene pairs (27% of all miRNA:gene combinations), 362 (98.1%) of
   which are explanatory, i.e., there is a positive correlation between
   miRNAs and genes deregulated in the same direction, or negative
   correlation between inversely deregulated ones (Figure [87]6A). We
   found 64 paradoxical genes (91.4%) whose correlation with at least one
   of the 19 validated miRNAs is among the top 5% of the correlations
   measured between 10E+4 random pairs of genes and miRNAs, and whose
   paradoxical expression can be explained by deregulation of the upstream
   miRNA.

Figure 6. Correlation between deregulated miRNAs and paradoxical genes.

   [88]Figure 6
   [89]Open in a new tab

   (A) Partial correlations between deregulated miRNAs and paradoxical
   genes as measured across TCGA LUAD data (red denotes negative
   correlation, blue denotes positive correlation, darker shade indicates
   significant correlations, p < 0.05). Plus signs denote partial
   correlation with causal explanation of gene deregulation, minus signs
   denote correlations that are significant but non-explanatory. (B)
   Barplot showing multiple correlations between paradoxical genes and en
   block deregulated miRNAs as calculated across TCGA LUAD data. Curve on
   the right depicts distribution of the same measure across all the genes
   in the TCGA LUAD data. Dashed line denotes 95th percentile of the
   distribution; there are 23 (32.9%) paradoxical genes whose multiple
   correlation coefficient falls among the top 5% of the highest values.

   As downstream effects of the deregulation of individual miRNAs may be
   combined, we aimed to evaluate potential associations between
   expression of individual paradoxical genes and en bloc expression of
   the deregulated miRNAs. We calculated coefficients of multiple
   (multivariate) correlation (CMC) between the paradoxical genes and
   deregulated miRNAs. The value of CMC can be interpreted as the
   correlation between dependent variable (gene expression) and its best
   prediction that can be computed linearly from the set of independent
   variables (expression of miRNAs). We found 23 paradoxical genes (32.9%)
   with CMCs in the top 5% of the values of the same measure as calculated
   across 17,745 genes covered by TCGA-LUAD RNA-seq data (p = 1.7E-14,
   hypergeometric test; Figure [90]6B).

Clinical significance of the paradoxical genes

   Using KMplot [[91]23], we assessed the association of the 70
   paradoxical genes with patient disease-free survival. We found 41
   (58.6%) of these genes as significantly associated with survival (FDR <
   0.05). We assume that down-regulation of significantly positive genes
   (HR < 1, FDR < 0.05) as well as the up-regulation of significantly
   negative ones (HR > 1, FDR < 0.05) worsens the survival prognosis.
   Under this assumption, with the exception of three genes (ELMO1,
   DENND3, SPINT1), the actual deregulation of paradoxical genes is
   associated with a negative impact on patient survival (Figure [92]7A;
   [93]Supplementary Table 3). This implies that the gene expression
   paradoxes we identified here mostly worsen the prognosis of LUAD
   patients.

Figure 7. Clinical significance of the paradoxical genes.

   [94]Figure 7
   [95]Open in a new tab

   (A) Venn diagram showing overlapping subsets of paradoxical genes with
   up-/ downregulated expression and subsets with positive (HR < 1, FDR <
   0.05) and negative (HR > 1, FDR < 0.05) association with LUAD
   prognosis, as obtained from KMplot. (B–D) Kaplan-Meier plots showing
   survival curves in the three independent validation cohorts, as
   stratified based on the Cox proportional hazards calculated from
   paradoxical genes expression. Numbers in the bottom left, indicate
   resulting hazard ratio (HR), associated statistical significance of
   patient stratification (p), concordance index (c-index), area under ROC
   curve (AUC) calculated at five years.

   Using TCGA-LUAD RNA-seq and matching clinical data, we constructed a
   multivariate Cox prognostic model, where expressions of the paradoxical
   genes served as prognostic variables. The model was validated using
   three independent publicly available gene expression datasets and
   associated clinical data: Botling et al. [[96]24], Okayama et al.
   [[97]25] and Der et al. [[98]26] (Figures [99]7B-7D). The resulting
   concordance index, area under ROC curve, hazard ratio between risk
   groups and associated P-value, demonstrated robust prognostic potential
   of paradoxical genes signature.

Pathway enrichment analysis of the paradoxical genes

   To elucidate biological functions of the paradoxical genes we performed
   a comprehensive pathway enrichment analysis. Using Pathway Data
   Integration Portal (pathDIP) [[100]27] we identified 22 pathways
   significantly enriched by the 70 paradoxical genes (FDR < 0.05,
   hypergeometric test). A list of all pathways and respective gene
   memberships is provided in [101]Supplementary Data 4. Interestingly,
   several of the enriched pathways are related to lipid metabolism and
   signaling (adipogenesis, LPA receptor mediated events, regulation of
   lypolysis) are key players in carcinogenesis [[102]28]. Moreover,
   several enriched guidance molecule pathways (ephrin signaling,
   semaphorin interactions, integrin, DCC-mediated attractive signaling)
   are noted as cancer-drug targets [[103]29]. PTEN-dependent cell cycle
   arrest, apoptosis pathway as well as hemostasis are known to play a
   role in cancer [[104]30–[105]32].

DISCUSSION

   Integrating three LUAD aCGH datasets, we identified several chromosomal
   regions with extensive copy number aberrations. Weir and colleagues
   [[106]33] reported similar profile of copy number aberrations in 371
   LUAD samples, confirming gains on 1q, 5p, 7p, 7q and 8q as well as
   deletions on 6q, 8p, 9p, 9q, 13q, 18q (Table [107]1). Lee et al.
   [[108]34] obtained similar results using Molecular Inversion Probe
   assays on 12 LUAD samples, confirming gains on 1q, 5p, 7p, 7q and 8q
   and losses on 6q, 8p, 18q (but not losses on 9p, 9q, 13q and 15q).
   While the individual aberrations vary greatly among individuals, as
   even the most frequent aberrations appear only in less than 50% of
   samples, the overall CNA profile of LUAD is conserved across the
   patient cohorts.

Table 1. List of the large scale copy number aberrations in LUAD.

   CNA Arm Frequency [%] Weir et al., 2007 [[109]33] Lee et al., 2012
   [[110]34]
   Gain: 1q 30-45 x x
   5p 22-45 x x
   7p 19-34 x x
   7q 16-20 x x
   8q 15-35 x x
   Loss: 6q 7-13 x x
   8p 7-26 x x
   9p 13-20 x
   9q 7-12 x
   13q 7-12 x
   15q 7-8 x
   18q 8-21 x x
   [111]Open in a new tab

   Frequency indicates range of relative number of aberrations occurring
   across the span of the given region. Checkmarks indicate aberrations
   confirmed by the above studies.

   By integrative analysis of multiple gene expression and copy number
   datasets, we found significant association between CNA status and
   differential expression of genes. Similar associations were previously
   reported in other cancer types [[112]2–[113]5, [114]7]. However, we
   also discovered 85 paradoxical genes whose expression was in opposite
   direction to their CNA. Seventy of these genes were subsequently
   validated across a homogeneous LUAD data cohort from TCGA. Paradoxical
   expression of these genes was validated even on the individual samples,
   proving that such paradoxical gene expression is a well preserved
   feature of the molecular profile of LUAD.

   Expression of paradoxical genes is associated with miRNAs consistently
   deregulated across multiple LUAD patient cohorts. Deregulation of these
   miRNAs inverts the effects of the genomic aberrations occurring in
   tumors. This is demonstrated by significant overlap between paradoxical
   genes and targets of these miRNAs, as well as by the two methods we
   applied here to measure correlation between the paradoxical genes and
   given miRNAs. Although correlation does not imply causality, the
   results of our analysis strongly suggest that differential expression
   of paradoxical genes is caused by deregulation of miRNAs.

   We tested prognostic relevance of the paradoxical genes and found 41
   (58.6%) paradoxical genes significantly associated with patient
   survival. Paradoxical expression of 39 genes has negative impact on
   prognosis. We also developed paradoxical gene signature that was
   validated on a three independent validation datasets [[115]24–[116]26].

   While the majority of the paradoxical genes are novel and their
   association with LUAD prognosis has not been investigated thoroughly,
   there are 17 validated paradoxical genes that have been previously
   associated with prognosis of other cancers: ADAM28 [[117]35, [118]36],
   ANGPT1 [[119]37], CA2 [[120]38], CA24 [[121]39], DAB2 [[122]40], HOXA5
   [[123]41, [124]42], IL6 [[125]43], KLF10 [[126]44], MCL1 [[127]45], NES
   [[128]46], NUSAP1 [[129]47], PDZD2 [[130]48], RGS17 [[131]49, [132]50],
   RUNX1T1 [[133]51, [134]52], SELC14L1 [[135]53], SEMA5A [[136]54] and
   SEPP1 [[137]55] (Table [138]2).

Table 2. List of validated paradoxical genes whose association with cancer
prognosis has previously been reported.

   Gene Association to cancer prognosis Ref.
   ADAM28 Overexpression correlates with cell proliferation and lymph node
   metastasis [[139]35]
   Serological and histochemical marker for NSCLC [[140]36]
   ANGPT1 Role in the prognosis of patients with oral squamous cell cancer
   [[141]37]
   CA2 Downregulated in gastric cancer and proposed as an independent
   prognostic factor for patient survival [[142]38]
   CD24 Expression at early stages of breast cancer indicates a highly
   invasive tumor [[143]39]
   DAB2 An important tumor suppressor, frequently downregulated in various
   tumors [[144]40]
   HOXA5 Downregulation is associated with poor prognosis in NSCLC
   [[145]41]
   Shown to prevent tumor progression and metastasis in colon cancer
   [[146]42]
   IL6 SNP associated with risk of multiple cancers [[147]43]
   KLF10 Potential clinical predictor for progression of pancreatic cancer
   [[148]44]
   MCL1 Key molecule for acquiring epithelial-to-mesenchymal
   transition-associated chemo-resistance in NSCLC [[149]45]
   NES Marker of cancer stem cells [[150]46]
   NUSAP1 Encodes protein that is proposed biomarker for prostate cancer
   recurrence [[151]47]
   PDZD2 Shown to induce senescence or quiescence of prostate, breast and
   liver cancer cells via transcriptional activation of p53 [[152]48]
   RGS17 Induces tumor cell proliferation lung and prostate cancers
   [[153]49]
   Regulator of cell survival and chemoresistance in ovarian cancer
   [[154]50]
   RUNX1T1 Predictor of liver metastasis in pancreatic endocrine tumours
   [[155]51]
   Associated with proliferation and senescence inhibition in
   t(8;21)-positive leukaemic cells [[156]52]
   SELC14L1 Proposed marker for predicting prognosis and progression of
   prostate cancer [[157]53]
   SEMA5A Over-expressed pancreatic cancer cells, regulates tumorigenesis,
   proliferation, invasion and metastasis, and serve as a target for
   diagnosis and treatment of pancreatic cancer [[158]54]
   SEPP1 Inversely related to pancreatic cancer risk [[159]55]
   [160]Open in a new tab

   From the panel of consistently deregulated miRNAs, several were
   recently associated with cancer progression and prognosis. Most
   notably, miRNAs hsa-mir-21 and -196b are well established oncomirs
   [[161]56], and hsa-mir-196a is known to be associated with NSCLC
   [[162]57]. The role of hsa-mir-30a varies across human cancer types
   [[163]58]. Hsa-mir-135b reverses chemoresistance of NSCLC [[164]59].
   Hsa-mir-139 is associated with aggressive tumor behavior and disease
   progression in breast cancer [[165]60], and is believed to inhibit
   bladder cancer proliferation and self-renewal [[166]61]. Hsa-mir-143
   inhibits tumor growth of breast cancer [[167]62]. Hsa-mir-144 has been
   shown to induce cell cycle arrest and apoptosis in pancreatic cancer
   cells [[168]63]. Hsa-mir-182 promotes prostate cancer progression
   [[169]64]. Hsa-mir-195 inhibits the proliferation and invasion of
   pancreatic cancer cells [[170]65]. Hsa-mir-218 downregulation
   contributes to epithelial-mesenchymal transition and tumor metastasis
   in lung cancer [[171]66]. Hsa-mir-130b is a part of the new prognostic
   marker for patient risk assessment and as an indicator of therapy
   resistance in prostate cancer [[172]67]. Similarly, hsa-mir-183 in
   combination with hsa-mir-19b were recently proposed as biomarkers of
   lung cancer [[173]68], and miRNAs hsa-mir-145 and -9 as biomarkers for
   early-stage cervical cancer [[174]69].

   Genomic instability is one of the cancer hallmarks [[175]70] that
   results in CNAs and differential expression of various genes. However,
   paradoxical genes with expression patterns opposite to gene dosage
   status often are dismissed as noise and overlooked in genome-wide
   cancer gene discovery efforts [[176]8]. Prognostic significance of the
   paradoxical genes suggests that their deregulation in LUAD is crucial
   for cancer progression and is maintained by the cancer cells despite
   the CNAs affecting expression of these genes in an inverse manner. We
   found that deregulation of the paradoxical genes is maintained at the
   epigenetic level by a group of deregulated miRNAs. These findings
   highlight the importance of integrative analysis, which combines
   information across diverse types of high-throughput data.

MATERIALS AND METHODS

Copy number aberrations: sample collection, preparation and data processing

   The samples used in this study were from the banked resected tumors
   collected in the BR.10 adjuvant chemotherapy trial [[177]71]. The study
   has received approval by the Institutional Research Ethics Board. A
   total of 142 formalin-fixed paraffin embedded (FFPE) and 16 snap-frozen
   samples were included. Haematoxylin and eosin stained slides from FFPE
   blocks first were reviewed by a lung pathologist to locate tumor rich
   areas (tumor cellularity > 60%), and then the block was cored at this
   area. Cored specimens were de-paraffinized by incubation in xylene
   overnight, and then washed with ethanol and air dried. Qiagen ATL
   buffer (QIAamp^© DNA extraction kit cat. 51306, Germantown, MD) were
   added. Specimens were digested by proteinase K at 55°C overnight at
   450rpm (Eppendorf^© Thermomixer R, Fisher Scientific). DNA isolation
   followed the manufacturer’s protocol (Qiagen, Cat. 51306, Germantown,
   MD). Samples of isolated genomic DNA were quantified by Nanodrop 1000
   (Thermo Scientific, Wilmington, DE) and electrophoresed in 0.8% agarose
   gel to visualize DNA size distribution. Severely degraded samples (80%
   of DNA fragments with size < 20bp) were excluded. Eight out of 142 FFPE
   and none of the snap-frozen samples were excluded. The final cohort
   contains 134 FFPE and 16 snap-frozen samples.

   Test and reference DNA were labeled using Cy3 and Cy5 dCTPs
   respectively; 200 ng of genomic DNA was labeled using the BioPrime DNA
   labeling system (Invitrogen). Prior to hybridization, test and
   reference labeled DNA were combined and purified using a ProbeQuant
   Sephadex G-50 Column (Amersham, GE Healthcare Life Sciences, Chicago,
   IL) to remove unincorporated nucleotides. Then 100 μg of Human Cot-1
   DNA (Invitrogen) was added to the labeled sample prior to precipitation
   with 0.1 volume 3M sodium acetate and 2.5 volumes of ethanol. The DNA
   pellet was resuspended in 20 μl DIG Easy hybridization solution (Roche,
   Indianapolis, IN), 2.5 μl (20 μg/μl) sheared herring sperm DNA and 2.5
   μl (100μg/μl) yeast tRNA (Calbiochem, San Diego, CA). DNA was denatured
   at 85°C for 10 minutes and repetitive sequences were blocked at 37°C
   for one hour prior to hybridization.

   Prehybridization was carried out using 20 μl DIG Easy hybridization
   buffer (Roche), 2.5 μl 10% BSA and 2.5 μl (20 μg/μl) sheared herring
   sperm DNA, at 45°C for 1 hour. Hybridization was carried out at 45°C
   for 24-48 hours. Arrays were washed for 5 × 5 min., in 0.1 x SSC, 0.1%
   SDS at room temperature in the dark with agitation. Each array was then
   rinsed 5 times in a clean slide box containing 0.1 x SSC with
   agitation. Slides were then dried with (oil free) nitrogen air stream
   and stored in the dark until imaging.

   Array image capture and data normalization were performed as previously
   described [[178]72]. Briefly, post-hybridization arrays were scanned
   using a CCD-based imaging system (Virtek ChipReader), and quantitated
   using Soft-Worx Tracker spot analysis software (Applied Precision,
   Issaquah, WA).

   Data were log2 transformed, and replicate clones having standard
   deviations > 0.075 or signal-to-noise ratios in each dye channel of < 3
   were filtered out. A multi-step normalization was then performed to
   control for biases caused by the array (e.g., spatial biases or
   differences in background signal), the dyes used for labeling, or the
   DNA sample quality [[179]73, [180]74]. The amount of “copycat'”
   correction required for each sample was plotted in a histogram of all
   samples; those that required too much correction and did not lie within
   a normal distribution were deemed to be poor quality DNA, and were
   eliminated from analysis. By these criteria, 35 samples were
   eliminated, leaving 115 samples (including 63 LUAD samples used here)
   from 113 patients for further analysis. Data from all 115 samples are
   publicly accessible through: [181]http://ophid.utoronto.ca/aCGH/.

Analysis of the copy number aberrations data

   In addition to our newly-produced CNA data, we analyzed two publicly
   available aCGH datasets acquired from LUAD tumor samples (see Table
   [182]3). Each of the public datasets was first normalized, segmented
   and additionally underwent post-segmentation normalization using
   methods provided by Bioconductor package CGHcall (v2.22.0) [[183]75].
   All three datasets then underwent a “calling” process using the CGHcall
   method from the same package, converting the continuous log-ratios on
   each probe, to one of the three discrete values (calls) corresponding
   to: (i) decreased number of copies (loss), (ii) normal copy and (iii)
   increased number of copies (gain) [[184]75].

Table 3. Summary of the public datasets used in this study.

   Author & Year No. of samples (total/normal) Source & Notes
   CNAs: Chitale et al., 2009 [[185]93] 199
   [186]http://cbio.mskcc.org/public/
   Job et al., 2010 [[187]97] 60 E-TABM-926, ArrayExpress
   Gene expression: Bhattacharjee et al., 2001 [[188]98] 207/17
   [189]http://www.broadinstitute.org/MPR/lung/
   Beer et al., 2002 [[190]99] 96/10 [191]GSE68571
   Stearman et al., 2005 [[192]100] 39/19 [193]GSE2514
   Yap et al., 2005 [[194]101] 58/9 E-MEXP-231, ArrayExpress
   Su et al., 2007 [[195]102] 54/27 [196]GSE7670
   Landi et al., 2008 [[197]103] 107/50 [198]GSE10072
   Rohrbeck et al., 2008 [[199]85] 15/5 [200]GSE6044
   Hou et al., 2010 [[201]86] 109/64 [202]GSE19188
   Girard et al., 2011 50/20 [203]GSE31547, Unpublished
   Okayama et al., 2012 [[204]25] 246/20 [205]GSE31210
   miRNA expression: Yanaihara et al., 2006 [[206]87] 208/104
   Cho et al., 2009 [[207]88] 20/10
   Crawford et al., 2009 [[208]89] 20/8
   Dacic et al., 2010 [[209]90] 12/6
   Yu et al., 2010 [[210]91] 40/20
   Lee et al., 2011 [[211]92] 12/6
   Jang et al., 2012 [[212]94] 206/103
   Ma et al., 2014 [[213]95] 108/54
   Nadal et al., 2014 [[214]96] 101/10
   [215]Open in a new tab

   As the individual datasets come with different probesets, obtained copy
   numbers calls correspond to different chromosomal segments and cannot
   be compared directly. We then integrated results acquired from
   individual datasets by assigning a set of chromosomal positioning
   “anchors” that comprised the starts and ends of chromosomal locations
   of all the probes in the four datasets, as described by Guo et al.
   [[216]76]. Then for each anchor, if the anchor was within the
   chromosomal location of the probe from any of the datasets, acquired
   vector of states corresponding to the probe were assigned to this
   anchor. Conversely, if an anchor was outside of any of the probes of
   the given dataset, a vector of missing values was created and assigned
   to the anchor. Anchors with missing values from more than one datasets
   were removed. Number of losses and gains were calculated across the
   anchors and their statistical significance was evaluated by p-values,
   calculated by comparing actual gains/losses counts to those obtained
   from 10^6 permutations.

Analysis of gene expression datasets

   We analyzed 10 publicly available gene expression datasets (Table
   [217]1) which satisfied the following criteria: (i) were originally
   from studies on tissue samples from surgically resected human LUAD
   tumors, (ii) contained at least one sample of noncancerous normal
   tissue for comparison, and (iii) were produced by using Affymetrix
   platforms to enable uniform processing and analysis of all the
   datasets. We first normalized and summarized each dataset by Gene Chip
   Robust Multiarray Averaging (gcrma, v2.38.0) [[218]77]. For each
   individual dataset, we then evaluated differential expression of the
   genes using Bioconductor package limma (v3.32.7) [[219]78]. Based on
   the expression fold change, genes were classified as either up- or
   downregulated, and then ranked according to statistical significance,
   which was evaluated by FDR-adjusted p-value. Analyzing 10 datasets
   resulted in 10 rankings for upregulated genes and 10 for downregulated
   ones. To identify consistently deregulated genes, obtained rankings
   were subjected to robust rank aggregation analysis implemented in R
   package RobustRankAggreg (v1.1) [[220]79]. This analysis detects genes
   that are ranked consistently better than expected under the null
   hypothesis of uncorrelated inputs, and assigns a p-value as a
   significance score for each gene. The stability of the resulting
   significance scores was assessed by the leave-one-out correction, in
   which the same analysis was repeated 10 times, each time excluding one
   of the rankings. Acquired p-values from each round were averaged into a
   corrected p-value. Genes whose significance score was smaller than
   chosen threshold (corrected p < 0.01) were further considered as
   consistently significantly deregulated genes.

Meta-analysis of the miRNA expression

   Compared to gene expression studies, fewer miRNA expression profiles
   from LUAD are available, and various platforms are used, often
   including custom arrays. Therefore, to provide analysis of miRNA
   expression in LUAD, instead of acquiring and processing expression
   data, we summarized reported results of 9 published miRNA expression
   studies (Table [221]1). Full text and (if applicable) supplementary
   data of each of the studies were carefully examined, and miRNAs with
   significantly altered expression were selected for further analysis.
   miRNA names were standardized according to miRBase (release 21)
   [[222]80]. All miRNAs were classified as either up- or downregulated
   and ranked according to their reported statistical significance. If
   this was not reported, expression fold change was used instead.
   Examining 9 studies we obtained 9 rankings for upregulated miRNAs and 9
   for downregulated ones. Analogously to gene expression analysis
   described in the previous section, obtained miRNA rankings subsequently
   were subjected to the robust rank aggregation analysis and
   leave-one-out correction of the obtained p-values. miRNAs whose
   significance score was smaller than a chosen threshold (corrected p <
   0.05), comprised the resulting list of consistently significantly
   deregulated miRNAs.

Acquiring chromosomal locations and copy number status of the deregulated
genes

   Using Bioconductor package biomaRt (v2.22) [[223]81] we determined the
   chromosomal locations of the deregulated genes from the Ensembl (v75,
   Feb. 2014) database and compared these locations with chromosomal
   coordinates of the aberrant regions. Deregulated genes whose
   chromosomal locations overlapped with aberrant regions were counted and
   statistical significance of the association between the aberrations and
   differential gene expression was then evaluated using Chi-square test.

Identification of the miRNA-target pairs

   We used microRNA Data Integration Portal, v2.3.2.0 (mirDIP;
   [224]http://ophid.utoronto.ca/mirDIP) [[225]21] to acquire data on
   human miRNAs and their respective targets. mirDIP integrates data from
   14 miRNA resources and supports a search for miRNA-target pairs under
   user-defined filters, including a number of independent confirmations
   of given pairs, confidence criteria, etc. We restricted our search to
   only miRNA-target pairs that fell among the top third of the most
   confident predictions from at least two different sources. miRNA names
   were standardized as described above, and symbols of their gene targets
   were standardized by HGNC symbol checker
   ([226]http://www.genenames.org, version from September 2015). To assess
   the significance of overlap between targets of deregulated miRNAs and
   paradoxical genes, we performed hypergeometric testing, using 15,323
   genes that were subjected to robust rank analysis as a population and
   7,836 of these genes that, according to mirDIP are targeted by at least
   one of the deregulated miRNAs as a number of “successes” in the
   population.

Calculation of the miRNA:target partial correlation and multiple correlation
coefficients

   Partial correlation coefficients between gene and miRNA expressions
   were calculated using R package ppcor v1.1, using copy number of the
   given gene as a third – controlling variable. Statistical significance
   of the obtained values was calculated by two-sided comparison with
   distribution of the same measure obtained across 10^4 random miRNA:gene
   pairs. To distinguish whether correlations between miRNA and given
   genes may also imply causal association, we compared the sign of the
   correlation and copy number status of the miRNA and gene. If the miRNA
   and gene were deregulated in the same direction, mutual correlation
   must be positive to indicate a causal association. If the miRNA and
   genes were inversely deregulated, mutual correlation must be negative
   to indicate causal association.

   Multiple correlation coefficients between gene and en bloc miRNA
   expression C was calculated as follows:

   C ^2 = c^TR^−1c

   where c denotes a vector of Pearson coefficients of correlation between
   a given genes and miRNA expressions, R denotes a matrix of Pearson
   coefficients of correlation between miRNA expressions.

Evaluation of the prognostic significance of the paradoxical genes

   Prognostic properties of the individual genes were evaluated by KMplot
   ([227]http://kmplot.com/analysis/) [[228]23], version 2015, using only
   LUAD patient data and corresponding disease-free survival censored at
   10 years. If multiple probe sets were mapped to the same gene, we used
   only JetSet probes mapping to a given gene. Obtained hazard ratios (HR)
   and associated p-values were then summarized and multiple testing
   adjustment of the p-values was subsequently computed using false
   discovery rate (FDR) method.

   To evaluate the multivariate prognostic potential of the paradoxical
   genes we developed a Cox proportional hazards model, where expressions
   of 70 validated paradoxical genes served as covariates. The model was
   derived using R package glmnet [[229]82] (v2.0.2), applying LASSO (L1)
   regularization to prevent over-fitting. TCGA-LUAD RNA-seq data were
   standardized by converting to z-scores and along with the corresponding
   clinical data were used as “training data”. The resulting model was
   then validated on three independent datasets, and its predictive
   performance was first evaluated by a concordance index (function
   survConcordance from R package survival [[230]83], v2.38.3), and an
   area under receiver operating characteristics curve (AUC), measured at
   the fifth year after initial time point (function AUC.cd from the R
   survAUC package, v 1.0.5). Patients were then separated into two groups
   based on the predicted risk score, using its 40^th percentile as a
   threshold. This threshold was selected based on ROC analysis of the
   model using training data. Validated HR between these two groups, as
   well as associated statistical significance (log-rank test) were
   calculated (function survdiff from the survival package) and
   Kaplan-Meier survival curves of both groups were plotted (for more
   details see [[231]84]).

Pathway enrichment analysis

   Using Pathway Data Integration Portal v2.5.1.2
   ([232]http://ophid.utoronto.ca/pathDIP), we performed comprehensive
   pathway enrichment analysis across 20 major pathway databases
   [[233]27]. We considered literature curated gene:pathway memberships as
   well as those predicted according to experimentally detected
   protein-protein interactions (including interactions experimentally
   detected between orthologues plus FpClass interactions with minimum
   confidence level for predicted associations equal 0.95; for more
   details see pathDIP documentation).

SUPPLEMENTARY MATERIALS FIGURE AND TABLES

   [234]oncotarget-09-9137-s001.pdf^ (2.3MB, pdf)
   [235]oncotarget-09-9137-s002.doc^ (194KB, doc)
   [236]oncotarget-09-9137-s003.xlsx^ (1.6MB, xlsx)
   [237]oncotarget-09-9137-s004.xlsx^ (1.3MB, xlsx)
   [238]oncotarget-09-9137-s005.xlsx^ (18.7KB, xlsx)
   [239]oncotarget-09-9137-s006.xlsx^ (1.5MB, xlsx)

Acknowledgments