Abstract

   Extracellular vesicles (EVs) mediate intercellular communications in
   the tumor microenvironment and contribute to the aggressive phenomenon
   of cancers. Although EVs in body fluids are supposed to be ideal
   biomarkers for cancer diagnosis and prognosis, it remains difficult to
   distinguish the tumor-derived EVs from those released by other tissues.
   We hypothesized that analyzing the EV-related molecules in tumor
   tissues would help to estimate the prognostic value of tumor-specific
   EVs. Here, we investigate the expression of coding genes of proteins
   carried by small EVs (sEVs) in primary lung adenocarcinoma. Based on
   the protein-protein interaction network, we identified three network
   modules (3-PPI-Mod) as a signature that could predict recurrence. This
   signature was validated in three independent datasets and demonstrated
   better prognostic value than signature generated from gene expression
   alone. Meanwhile, the high-risk subgroup assigned by the signature
   could benefit from adjuvant chemotherapy, although it was not
   beneficial in unselected patients. Two out of three modules were
   enriched by proteins identified in sEVs from non-small-cell lung cancer
   cells. Furthermore, the two modules were remarkably correlated with
   intratumoral hypoxia score. These results suggest that the 3-PPI-Mod
   signature was enriched in tumor-derived sEVs and could serve as a
   prognostic and predictive biomarker for lung adenocarcinoma.

   Keywords: exosome, NSCLC, microenvironment, gene expression, adjuvant
   chemotherapy

Introduction

   Lung cancer is one of the most common malignancies worldwide.[43]^1
   According to pathomorphology, nearly 85% of all lung cancers are
   non-small-cell lung cancers (NSCLCs).[44]^2 Lung adenocarcinoma (ADC)
   accounts for 35% of NSCLCs. Despite the improvement of treatment, the
   5-year overall survival of NSCLC is below 20%.[45]^3 Loco-regional and
   distant relapse are the major causes of cancer-related death for NSCLC.
   Even in stage I NSCLC, about 30% of patients would suffer recurrent
   disease, and many patients would die of it after surgical
   treatment.[46]3, [47]4 Although adjuvant chemotherapy (ACT) improves
   outcome in late-stage NSCLC, the survival benefit of ACT in early-stage
   patients remains controversial.[48]^5 Thus, exploring novel prognostic
   biomarkers would be helpful to select patients for ACT and improve
   prognosis in early-stage NSCLCs.

   Extracellular vesicles (EVs) are cell-released membrane particles that
   regulate intercellular communication by transporting functional
   molecules (e.g., proteins, nucleic acids, and lipids) to recipient
   cells.[49]^6 Cells can secret diverse types of EVs with a wide range of
   sizes and different cellular origins. Exosomes mainly range from
   50–150 nm in diameter, which are generated inside multivesicular
   endosomes.[50]^7 Other EVs (with a diameter from 100 to 1,000 nm), such
   as microvesicles (MVs), ectosomes, or microparticles, are directly
   budded from cellular membrane.[51]^7 Given that the specific
   subcellular origin was not strictly demonstrated in many literatures,
   the generic term EVs is used in this study, as recommended by the
   International Society for Extracellular Vesicles (ISEV).[52]^8

   In recent years, increasing evidences have suggested that EVs
   play an important role in cancer development and progression.
   Cancer-derived EVs are implicated in various carcinogenesis processes,
   including malignant transformation, angiogenesis, immunosuppression,
   invasion, and treatment resistance.[53]9, [54]10, [55]11 On the other
   hand, EVs released by the tumor microenvironment could also influence
   the traits of cancer cells.[56]^12 Hypoxia, a common feature of tumor
   microenvironment, was demonstrated to induce cancer EVs that transport
   aggressive and metastatic phenomenon to the recipient cells.[57]13,
   [58]14, [59]15, [60]16 Moreover, cancer-released EVs could modify the
   distant microenvironment to form a pre-metastatic niche to facilitate
   the formation of metastatic lesions, suggesting that cancer EVs could
   act as both local and distant effects.[61]^17

   Cancer-derived EVs detected in various body fluids are proposed as
   novel noninvasive biomarkers. For instance, circulating small EVs
   (sEVs, with a diameter less than 200 nm) carried microRNAs
   (miRNAs)[62]18, [63]19 and proteins,[64]20, [65]21, [66]22 which are
   promising diagnostic and prognostic biomarkers for NSCLC. Park
   et al.[67]^23 examined the proteomics of sEVs isolated in malignant
   pleural effusions (PEs) from metastatic lung ADCs. The study identified
   a lot of well-known cancer-related proteins, such as EGFR, RAS, and
   Src-family kinase.[68]^23 However, the sEVs identified in body fluids
   may represent a mixed population released by both tumors and other
   tissues. To date, it remains a great challenge to distinguish
   tumor-specific sEVs from those released by other tissues due to the
   lack of specific biomarkers.[69]^24 It was reported that the protein
   levels in sEVs released to body fluids was consistent with protein
   expression of the NSCLC tissues.[70]^25 Therefore, we proposed that
   integrative analysis of primary tumors with secreted sEVs may be
   feasible to identify tumor-specific sEV-related biomarkers for NSCLC.

   In this study, we analyzed the intratumoral gene-expression profile of
   the coding genes of sEV proteins identified in malignant PEs from
   advanced lung ADC.[71]^23 Based on a protein-protein interaction (PPI)
   network approach, a sEV-associated gene signature was established for
   recurrence prediction. This signature was further validated for its
   prognostic value in three retrospective datasets and one prospective
   dataset from the public database ([72]Table S1). Moreover, the gene
   signature was further compared with proteomics of sEVs isolated from
   conditional medium of NSCLC cells as well as peripheral blood from
   NSCLC patients. The association between the signature and tumor
   microenvironment, including hypoxic status and stromal cell abundance,
   were also investigated.

Results

Identifying Relapse-Associated PPI Modules from sEV Proteins

   [73]Figure 1 depicts an overall flow chart of this study. For the sEV
   proteins identified in malignant PEs, 568 proteins were mapped onto the
   reference PPI network. The sEV-associated PPI network was integrated
   with the gene-expression profile of the training dataset. Next, the
   module searching procedure identified 144 modules, which showed locally
   maximal discriminative scores for recurrent status. Three permutation
   methods were used to estimate the significance of the discriminative
   score for each module. As a result, 21 modules showed significantly
   higher discriminative scores for recurrence than those by chance (p <
   0.001 for all three methods, [74]Figure S1). Details of 21 candidate
   modules are presented in [75]Figure S2.

Figure 1.

   [76]Figure 1
   [77]Open in a new tab

   The Workflow of This Study

   The PPI network of sEV proteins isolated from PEs was integrated with
   the gene expression profile of the training dataset. Candidate modules
   with locally maximal relapse scores were identified by a greedy
   searching approach. Random forest algorithm was used to establish a
   network-based signature for relapse risk. The signature 3-PPI-Mod was
   further validated for prognosis in three independent datasets.
   Moreover, the predictive value for adjuvant chemotherapy in ADCs was
   evaluated in a prospective dataset conducted from the JBR.10 trial.
   Biological validation was performed by comparing the signature with
   various databases. Abbreviations: PPI, protein-protein interaction;
   sEVs, small extracellular vesicles; 3-PPI-Mod, 3 PPI modules; ADC,
   adenocarcinoma; RFS, relapse-free survival; DSS, disease-free survival;
   OS, overall survival.

   The expression score profiles of the 21 PPI modules are presented in
   [78]Figure 2A. All modules were clustered into two major clusters by
   unsupervised clustering algorithm. One cluster with 14 modules was more
   likely upregulated in recurrent tumors, and another cluster with 7
   modules was inversely correlated with recurrence. Patients were also
   grouped into two major subgroups, which were significantly associated
   with recurrent status (adjusted p < 0.001) and EGFR mutation (adjusted
   p = 0.036) ([79]Figure 2A). However, there was no statistically
   significant association of patient subgroups with other
   clinicopathological factors, such as age, sex, smoking index, tumor
   stage, KRAS status, or TP53 status.

Figure 2.

   [80]Figure 2
   [81]Open in a new tab

   Establishing a Prognostic Signature Based on a sEV-Associated Network

   (A) Expression score profiles of candidate PPI modules. The upper
   heatmap shows the expression scores of the 21 PPI modules. Unsupervised
   hierarchical clustering analysis was performed for modules (rows) and
   samples (columns). The lower panel indicates clinicopathological
   variables of the tumor samples. The p value was calculated by
   chi-square test and then adjusted by FDR for multiple testing. (B)
   Feature selection based on RF algorithm. The x axis indicates the
   number of used variables (modules). The y axis is the out-of-bag (OOB)
   error of each prediction model. The arrow indicates that the
   combination of 3 PPI modules achieved the selection criteria with the
   smallest feature number and maximal prediction accuracy. (C) Expression
   scores of the 3 modules that were selected in (B). The p value was
   calculated by t test. (D) The recurrent risk predicted by 3-PPI-Mod was
   used to perform ROC analysis. AUROC and 95% CI are presented. (E)
   Scatterplotting of the predictive risk of recurrence. The x axis
   indicates patient index. The y axis indicates the predictive risk of
   recurrence (OOB). A cutoff of 0.5 was used to classify patients into
   low- and high-risk subgroups. Red and green colors indicate patients
   with and without recurrent disease, respectively. (F and G) Survival
   analysis for 3-PPI-Mod in all patients (F) and in stage I only (G).
   Kaplan-Meier curves and log-rank test were used to compare OS between
   different risk subgroups identified in (E). Abbreviations: PPI,
   protein-protein interaction; ADC, adenocarcinoma; FDR, false discovery
   rate; RF, random forest; ROC, receiver operating characteristic curve;
   AUROC, area under the receiver operating characteristic curve; CI,
   confident interval; OS, overall survival.

Establishing a Predictive Signature for Cancer Recurrent Risk

   We then thought to establish an optimal model to predict recurrent risk
   for patients using the 21 PPI models. By using the random forest (RF)
   algorithm, we found that the combination of three PPI modules achieved
   the criteria of the smallest feature number and the best prediction
   accuracy ([82]Figure 2B, out-of-bag [OOB] error = 0.216 ± 0.038). Two
   out of three modules (Mod_8 and Mod_10) were upregulated, and the last
   module (Mod_137) was downregulated in recurrent tumors ([83]Figure 2C).
   Next, the three PPI modules were used to construct a prediction model,
   which was named 3-PPI-Mod signature. The recurrent risk of patients was
   predicted by the signature, with an area under the receiver operating
   characteristic curve (AUROC) of 0.84 (95% confident interval [CI],
   0.77–0.91) ([84]Figure 2D). Using 0.5 as the cut-off, patients were
   divided into two groups with high or low risk for recurrence,
   respectively ([85]Figure 2E). Overall, 78.4% (91/116) of patients were
   correctly classified when compared with the true recurrent status.
   Kaplan-Meier curves confirmed that the 3-PPI-Mod signature could
   predict overall survival (OS) for all patients and those with stage I
   disease (both p < 0.001, [86]Figures 2F and 2G). Multivariate Cox
   regression demonstrated that the 3-PPI-Mod signature was an independent
   prognostic factor for OS (adjusted hazard ratio [HR] = 6.58, 95% CI,
   3–14.4, p < 0.001, [87]Table S2).

   Within the 3-PPI-Mod signature, 5, 6, and 5 genes were included in
   Mod_8, Mod_10, and Mod_137, respectively. One gene (Cofilin-1 [CFL1])
   was overlapped by Mod_8 and Mod_10. Mod_10 was connected with Mod_8 and
   Mod_137 within the PPI network ([88]Figure 3A). Gene Ontology (GO) and
   pathway enrichment analysis showed that all three modules were
   significantly correlated with focal adhesion ([89]Figure 3B). Notably,
   genes in Mod_8 and Mod_10 were enriched in extracellular exosome
   ([90]Figure 3B). Mod_137 showed association with diverse biological
   processes, including protein kinase binding, membrane raft, and T cell
   receptor signaling pathway ([91]Figure 3B).

Figure 3.

   [92]Figure 3
   [93]Open in a new tab

   Biological Characteristics of the 3-PPI-Mod Signature

   (A) The view of interaction of 3-PPI-Mod signature. The presented
   network was generated by mapping the signature proteins on the
   reference PPI network. Three modules are shown by different colors. (B)
   Enrichment analysis for three modules. The significant GO terms or KEGG
   pathways were listed on the left. Bars indicate the enrichment
   significance (log10-transformed FDR). The color legend indicates
   different modules. Abbreviations: PPI, protein-protein interaction;
   3-PPI-Mod, 3 PPI modules; GO, gene ontology; FDR, false discovery rate.

Validation of the 3-PPI-Mod Signature in Independent Cohorts

   For each validation dataset, the recurrent risks of patients were
   predicted by the 3-PPI-Mod signature, which was established in the
   training dataset. Kaplan-Meier curves demonstrated that patients who
   were classified as high risk had significantly shorter relapse-free
   survival (RFS) time than those classified as low risk (log-rank test:
   Gene Expression Omnibus [GEO]: [94]GSE50081, p = 0.005; GEO:
   [95]GSE30219, p = 0.003; GEO: [96]GSE31210, p < 0.001) ([97]Figures
   4A–4C). The adjusted HRs (95% CI) for high-risk versus low-risk
   subgroups were 2.24 (1.07–4.67), 3.74 (1.54–9.07), and 3.34 (1.81–6.15)
   in GEO: [98]GSE50081, [99]GSE30219, and [100]GSE31210, respectively
   ([101]Table S3). Similarly, the high-risk subgroup was significantly
   correlated with unfavorable OS compared with the low-risk subgroup
   (GEO: [102]GSE50081, HR [95% CI] = 1.92 [1.03–3.58], p < 0.001; GEO:
   [103]GSE30219, HR [95% CI] = 2.41[1.25–4.63], p = 0.006; GEO:
   [104]GSE31210, HR [95% CI] = 2.46[1.09–5.58], p = 0.001) ([105]Figures
   4D–4F; [106]Table S3). Subgroup analysis revealed that the 3-PPI-Mod
   signature remained significant in distinguishing RFS and OS for
   patients with stage I disease ([107]Figures S3A–S3F).

Figure 4.

   [108]Figure 4
   [109]Open in a new tab

   Survival Analysis of 3-PPI-Mod Signature in Three Validation Datasets

   (A–F) In each dataset, patients were predicted to be in low- or
   high-risk subgroups by the signature. Kaplan-Meier curves and log-rank
   test were used to compare RFS (A–C) and OS (D–F) of patients in
   different risk subgroups: GEO: [110]GSE50081 (A and D), [111]GSE30219
   (B and E), [112]GSE31210 (C and F). Survival data within 10 years are
   shown. Abbreviations: 3-PPI-Mod, 3 PPI modules; RFS, relapse-free
   survival; OS, overall survival.

   We further validated the 3-PPI-Mod signature in a prospective dataset
   conducted from the JBR.10 clinical trial, which was designed to
   evaluate the advantage of ACT in stage IB to II NSCLCs. When all ADC
   patients were taken into consideration, the ACT arm showed no
   significant improvement of disease-specific survival (DSS) compared
   with the observation (OBS) arm ([113]Figure 5A, log-rank test, p =
   0.503). The 3-PPI-Mod signature classified 32 and 39 of 71 ADC patients
   as low- and high-risk subgroups, respectively. Although ACT did not
   improve outcome in the low-risk subgroup ([114]Figure 5B, log-rank
   test, p = 0.189), it significantly prolonged DSS in the high-risk
   subgroup ([115]Figure 5C, log-rank test, p = 0.019). Similar results
   were observed in OS analysis (ACT versus OBS, log-rank test: all ADCs,
   p = 0.907; low-risk subgroup, p = 0.115; high-risk subgroup, p = 0.064)
   ([116]Figures 5D–5F). Multivariate Cox’s model revealed a significant
   interaction effect between risk groups and ACT (test for interaction:
   DSS, p = 0.011; OS, p = 0.013) ([117]Table S4). For non-ADC patients,
   however, there was no evidence demonstrating that ACT could improve DSS
   or OS, even if in subgroups stratified by the 3-PPI-Mod signature
   ([118]Figure S4; [119]Table S4).

Figure 5.

   [120]Figure 5
   [121]Open in a new tab

   Predictive Value of 3-PPI-Mod Signature for ACT Benefit in ADC Patients
   from the JBR.10 Dataset

   (A–F) Results with lung ADC patients are presented. Postoperative DSS
   (A–C) and OS (D–F) between ACT arm and observation arm (OBS) were
   compared by Kaplan-Meier curves and log-rank test. Patients were
   stratified by 3-PPI-Mod signature. The survival comparisons were
   performed in all patients (A and D), low risk subgroup (B and E), and
   high risk subgroup (C and F), respectively. Abbreviations: ACT,
   adjuvant chemotherapy; 3-PPI-Mod, 3 PPI modules; DSS, disease-specific
   survival; OS, overall survival; OBS, observation.

The sEV-Associated Signature Exhibited Better Prognostic Value Than Signature
Derived from Gene Expression Alone

   The sEV-associated 3-PPI-Mod signature was compared with an 8-gene
   signature, which was constructed based on gene expression alone
   ([122]Figures S5A–S5D). The 8-gene signature was established by using
   the same algorithm as used for the 3-PPI-Mod signature. In the training
   dataset, the 8-gene signature showed similar prediction performance as
   the 3-PPI-Mod signature in distinguishing relapse and OS ([123]Figures
   S5E and S5F). In the validation datasets, however, the C-index of
   3-PPI-Mod signature was higher than that of the 8-gene signature in
   predicting both RFS and OS ([124]Figures S5G and S5H). Kaplan-Meier
   analysis also confirmed that the 8-gene signature could not
   consistently distinguish RFS and OS across different validation cohorts
   when considering neither all patients nor the stage I subgroup
   ([125]Figures S6A and S6B). Meanwhile, the 8-gene signature could not
   predict ACT benefit in the JBR.10 dataset ([126]Figure S7). These
   findings suggest that the sEV-associated signature exhibited improved
   prognostic value compared with signature derived from gene expression
   alone, especially for independent cohorts.

The 3-PPI-Mod Reflects Tumor-Stroma Interaction and Hypoxic Tumor
Microenvironment

   The proteins within the 3-PPI-Mod signature were further validated by
   the proteomics of sEVs isolated from NSCLC cell supernatants[127]^26
   and serums of NSCLC patients.[128]^22 Overall, 10 and 8 proteins of the
   signature were overlapped with those identified in sEVs from NSCLC
   cells and circulating serums, respectively (both p < 0.001, by
   hypergeometric test) ([129]Figures 6A and 6B). More than 90% of the
   overlapped proteins came from Mod_8 and Mod_10. For Mod_137, only
   one protein (CTNNB1) was overlapped with proteomics of NSCLC-derived
   sEVs and none with that of circulating sEVs ([130]Figures 6A and 6B).

Figure 6.

   [131]Figure 6
   [132]Open in a new tab

   The 3-PPI-Mod Reflects Interaction between Cancer Cells and Tumor
   Microenvironment

   (A and B) Comparison of 3-PPI-Mod signature with other proteomic
   databases of sEVs from NSCLCs. The venn diagram shows the overlapping
   of proteins of three modules with those identified in sEVs from NSCLC
   cell supernatants (A) and circulating serum (B). (C) Correlation
   analysis of the signature with stromal cell abundance and
   microenvironment status. The abundance of 8 immune cells and 2
   non-immune cells were calculated by MCP-counter. The microenvironment
   status for hypoxia, angiogenesis, and inflammation were estimated as
   described in [133]Materials and Methods. Circles in the matrix indicate
   the correlation coefficient r, and red and blue colors indicate
   positive and negative correlations, respectively. A larger circle size
   represents a higher correlation. The raw p value was adjusted by FDR.
   Correlations with FDR > 0.05 were not presented. (D) GSEA of hypoxia
   metagene against the risk subgroups classified by 3-PPI-Mod. The NES
   and q value (adjusted p value) are presented. (E) Diagram shows
   Pearson’s correlation between hypoxic score and HIF1A expression level.
   (F) Boxplot shows intratumoral hypoxic scores between low- and
   high-risk subgroups. The p value was calculated by t test. (G) Boxplot
   shows HIF1A expression levels between tumors with low and high risk.
   The p value was calculated by t test. Abbreviations: PPI,
   protein-protein interaction; 3-PPI-Mod, 3 PPI modules; FDR, false
   discovery rate; NSCLC, non-small cell lung cancer; MCP,
   microenvironment cell populations; GSEA, gene set enrichment analysis;
   NES, new enrichment score.

   We hypothesized that Mod_137 may be associated with non-cancer cells.
   The abundance of different stromal cells was estimated by
   microenvironment cell population (MCP)-counter.[134]^27 Surprisingly, a
   higher expression score of Mod_137 was correlated with elevated
   abundance of fibroblasts, endothelial cells, and diverse types of
   immune cells ([135]Figure 6C). In contrast, Mod_8 and Mod_10 expression
   scores did not show a remarkably positive correlation with those of the
   stromal components. Similar results were observed when analysis was
   conducted by using individual genes within each module
   ([136]Figure 6C).

   The association of 3-PPI-Mod signature with tumor microenvironment
   status was analyzed. We found that Mod_8 and Mod_10 were positively
   correlated with hypoxia, while Mod_137 was associated with angiogenesis
   and inflammation ([137]Figure 6C). The hypoxia metagene was
   significantly enriched in high-risk tumors predicted by the 3-PPI-Mod
   signature (new enrichment score [NES] = 1.88, q value = 0.005)
   ([138]Figure 6D). There was a significant correlation between
   intratumoral hypoxic score and hypoxia inducible factor 1 (HIF1)
   expression level, which were both upregulated in the high-risk subgroup
   identified by the 3-PPI-Mod signature ([139]Figures 6E–6G). However,
   the angiogenesis and inflammation status was not associated with
   3-PPI-Mod subgroups, based on gene set enrichment analysis (GSEA)
   ([140]Figures S8A and S8B) or expression scores ([141]Figures S8C and
   S8D).

Discussion

   In the present study, we evaluate the intratumoral gene expression
   level of coding genes of sEV proteins in lung ADC. In the training
   dataset, we identified a 3-PPI-Mod signature that could distinguish
   patients with high risk of recurrence. This signature was also
   validated in three independent datasets and demonstrated better
   prognostic value than the gene signature established by using gene
   expression alone. Moreover, the patients assigned as high risk by the
   signature could benefit from ACT, although no significant benefit was
   achieved for unselected patients. Further analysis revealed that the
   sEV-associated signature is highly correlated with intratumoral hypoxic
   status and might be derived from both cancer cells and stromal cells.
   To the best of our knowledge, this is the first comprehensive analysis
   of sEV-associated genes in predicting outcome of lung ADC based on
   large-number populations from several independent cohorts.

   To date, it remains a great challenge to discriminate exosomes from
   other sEVs such as MVs.[142]^24 Exosomes are derived from endosomal
   multivesicular bodies that fused with the cytoplasma membrane, while
   MVs are usually considered to bud directly from the cytoplasma
   membrane.[143]^9 The PE-derived “MVs”[144]^23 and NSCLC cell-released
   “exosomes”[145]^26 were isolated by using differential ultracentrifuge
   followed by density gradient methodology, which recovers mixed EV
   populations. For these EVs, the multivesicular body origin or purity
   were not further demonstrated. Therefore, we used the general term sEVs
   in this study to refer to these EVs, which were characterized with
   diameters less than 200 nm.[146]23, [147]26

   The sEV cargos are suggested as promising cancer biomarkers for
   non-invasive detection due to the presence in various body fluids
   (e.g., plasma, PEs, sputum, and urine). However, it remains difficult
   to evaluate the role of tumor-specific sEVs, which may be influenced by
   those released from other tissues.[148]^24 It was reported that the
   protein expression of sEVs released to body fluids was consistent with
   that of the tumor tissues, suggesting feasibility for analyzing primary
   tumors to reflect the status of circulating sEVs.[149]^25 In this
   study, by investigating the intratumoral expression of coding genes of
   proteins carried within sEVs,[150]^23 we identified a prognostic gene
   signature for lung ADC. Most proteins of the signature were confirmed
   by sEV proteomics from NSCLC cells,[151]^26 suggesting that the
   signature may be enriched by tumor-specific sEVs. It is possible that
   the gene expression in tumor tissues could influence cancer
   progression. Nevertheless, secretion of the gene products to
   circulating blood through sEVs may further promote recurrence or
   distant metastasis.[152]17, [153]28 This may be explainable by our
   observation that the sEV-associated signature exhibited better
   prognostic value than signature that was established from gene
   expression alone. Therefore, this study provides a promising strategy
   to evaluate the prognostic value of tumor-specific sEVs from complex
   body fluids.

   The human PPI network has been demonstrated to be informative in
   biomarker development for cancer when being integrated with
   gene-expression profiles.[154]^29 Based on the PPI network approach, we
   identified three network modules that were highly correlated with
   cancer recurrence. Although these modules were established in the
   training dataset, their prognostic value was confirmed in external
   cohorts with hundreds of patients. As we previously reported, the PPI
   network approach could identify vital biological themes that reflect
   mechanisms underlying cancer behaviors.[155]^30 In this study, most of
   the modules (e.g., Mod_8 and Mod_10) were associated with exosome,
   focal adhesion, and cytoskeleton regulation, which are involved in
   cancer invasion and metastasis.[156]^31 These results suggest that the
   3-PPI-Mod signature represents malignant traits of cancer and is robust
   in outcome prediction across different cohorts.

   Hypoxia, a frequent feature in solid tumors, could induce lung cancer
   cells to release more sEVs that promote migration via miR-23a
   delivery.[157]^16 In our signature, Mod_8 and Mod_10 were positively
   correlated with a hypoxia score that could predict prognosis of NSCLC
   as reported previously.[158]^32 This was consistent with our findings
   that Mod_8 and Mod_10 were associated with higher recurrent risk. CFL1,
   a common gene between Mod_8 and Mod_10, was correlated with
   invasiveness and unfavorable outcome in NSCLC.[159]33, [160]34
   Functional analysis revealed that CFL1 participates in actin dynamics
   and contributes to membrane extension direction in cell
   migration.[161]^35 Moreover, many other genes (WYHAG, NCKAP1, ACTB,
   GNA13, and RDX) identified in Mod_8 and Mod_10 also play an important
   role in cancer invasion and metastasis.[162]36, [163]37, [164]38,
   [165]39, [166]40 However, the association with hypoxia of these genes
   in NSCLC remains unclear. We found that most of these genes were
   positively correlated with intratumoral hypoxic score. Thus, we propose
   that the expression and prognostic role of the sEV-associated signature
   may be promoted by hypoxic microenvironment, although extended
   experimental validation is required.

   Unlike Mod_8 and Mod_10, Mod_137 had only one protein that was
   presented in sEVs from NSCLC cells. By analyzing the microenvironment
   cell populations,[167]^27 we found that four proteins in Mod_137 were
   correlated with diverse stromal cells. This observation was supported
   by the fact that FYN, PTPRC (CD45), and THY1 (CD90) might be involved
   in lymphocytes,[168]41, [169]42 macrophages,[170]^43 and
   fibroblasts[171]^44 in tumor microenvironment. Moreover, PTPRC was also
   identified in sEVs released by dentritic cells.[172]^45 Interestingly,
   four proteins in Mod_137 were significantly correlated with endothelial
   cell abundance, which was consistent with the fact that Mod_137 was
   associated with angiogenesis score. Overall, the sEV proteins derived
   from cancer cells together with stromal cells may contribute to
   aggressive microenvironment and influence postoperative recurrence in
   lung ADC.

   One limitation of this study is the relative small sample set of the
   sEVs cohort, which was used to establish the gene signature. To address
   this issue, the 3-PPI-Mod signature was further compared with
   proteomics of circulating sEVs from another cohort with 26 NSCLC
   patients.[173]^22 The results confirmed that most proteins of Mod_8 and
   Mod_10 were presented in circulating sEVs, which was isolated by
   anti-CD9 antibodies.[174]^22 Notably, none of the proteins in Mod_137
   were identified in these circulating CD9+ sEVs. This could be explained
   by the fact that Mod_137 is probably a representative for sEVs from
   non-cancer cells. Meanwhile, it was reported that a subpopulation of
   major histocompatibility complex (MHC) class II-bearing sEVs from
   dentritic cells could escape from anti-CD9 capture.[175]^45 Taken
   together, these findings suggest that the 3-PPI-Mod signature is
   consistently present in different sEV proteomic datasets of NSCLC.

   In summary, this study developed a sEV-associated gene-expression
   signature (3-PPI-Mod) that could predict prognosis and guide ACT
   benefit for lung ADC. The signature was enriched by cancer-secreted
   sEVs and was remarkably associated with intratumoral hypoxic status.
   This study provided a promising strategy to evaluate the prognostic
   value of sEV-related molecules in tumor tissues and shed new light on
   the mechanisms underlying metastasis and recurrence of lung ADC.

Materials and Methods

Patients and Clinical Characteristics

   The proteomic database of sEVs isolated from malignant PEs from three
   patients with lung ADC was obtained from Park et al.’s [176]^23 study.
   There were two females and one male. All of the patients suffered stage
   IV disease, and two patients had metastases at distant organs. Another
   proteomic database of serum sEVs from 26 NSCLC patients was employed
   for validation, including 14 cases with lung ADC.[177]^22

   Gene expression profiles of primary lung ADCs from 623 patients were
   collected from five independent datasets in GEO. These datasets were
   characterized with geographical and technical diversity
   ([178]Table S1). GEO: [179]GSE14814 was conducted from the JBR.10
   trial, which included NSCLC patients who were randomly assigned into
   the ACT and OBS arms. For datasets with multiple pathological types of
   NSCLC, we analyzed the subgroup of lung ADC only. The clinical
   parameters and follow-up information of each cohort were obtained from
   GEO. Cases without prognostic information were excluded from analysis.
   Finally, there were 116 lung ADC samples in GEO: [180]GSE13213, 127
   lung ADC samples in GEO: [181]GSE50081, 83 lung ADC samples in GEO:
   [182]GSE30219, 226 lung ADC samples in GEO: [183]GSE31210, and 71 lung
   ADC samples in GEO: [184]GSE14814. Clinical characteristics of patients
   from each dataset are summarized in [185]Table 1. The involvement of
   human subjects from publicly available resources is in accordance with
   the Declaration of Helsinki.

Table 1.

   Clinical Characteristics in Five Datasets of Lung ADC
   Training Validation
   Variables GEO: [186]GSE13213 GEO: [187]GSE50081 GEO: [188]GSE30219 GEO:
   [189]GSE31210 GEO: [190]GSE14814
     __________________________________________________________________

   Age
     __________________________________________________________________

   < 60 ys 43 (37.07) 19 (14.96) 46 (55.42) 96 (42.48) 35 (49.3)
   ≥ 60 ys 73 (62.93) 108 (85.04) 37 (44.58) 130 (57.52) 36 (50.7)
     __________________________________________________________________

   Gender
     __________________________________________________________________

   Female 57 (49.14) 62 (48.82) 18 (21.69) 121 (53.54) 34 (47.89)
   Male 59 (50.86) 65 (51.18) 65 (78.31) 105 (46.46) 37 (52.11)
     __________________________________________________________________

   Smoking
     __________________________________________________________________

   Never 56 (48.28) 23 (18.11) – 115 (50.88) –
   Ever 60 (51.72) 92 (72.44) – 111 (49.12) –
   N/A 0 (0) 12 (9.45) – 0 (0) –
     __________________________________________________________________

   Stage
     __________________________________________________________________

   I 78 (67.24) 92 (72.44) 79 (95.18) 168 (74.34) 42 (59.15)
   II 13 (11.21) 35 (27.56) 3 (3.61) 58 (25.66) 29 (40.85)
   III 25 (21.55) 0 (0) 1 (1.20) 0 (0) 0 (0)
     __________________________________________________________________

   EGFR
     __________________________________________________________________

   Mutation 45 (38.79) – – 127 (56.19) –
   Wild type 71 (61.21) – – 99 (43.81) –
     __________________________________________________________________

   KRAS
     __________________________________________________________________

   Mutation 14 (12.07) – – – –
   Wild type 102 (87.93) – – – –
     __________________________________________________________________

   TP53
     __________________________________________________________________

   Mutation 38 (32.76) – – – –
   Wild type 77 (66.38) – – – –
   N/A 1 (0.86) – – – –
     __________________________________________________________________

   Postsurgery Recurrence
     __________________________________________________________________

   No 58 (50) 87 (68.50) 56 (67.47) 162 (71.68) –
   Yes 58 (50) 37 (29.13) 27 (32.53) 64 (28.32) –
   N/A 0 (0) 3 (2.36) 0 (0) 0 (0) –
     __________________________________________________________________

   ACT
     __________________________________________________________________

   No – – – – 32 (45.07)
   Yes – – – – 39 (54.93)
   [191]Open in a new tab

   The table shows the number (%) of patients in different groups. ys,
   years; ACT, adjuvant chemotherapy; N/A, not available.

Proteomic Database of sEVs

   The proteomics of sEVs isolated from malignant PEs,[192]^23 serum
   samples,[193]^22 and supernatant of NSCLC cells HCC827 and A549[194]^26
   were profiled by mass spectrometer. These proteins were mapped to a PPI
   network and gene expression profiles by Entrez gene IDs.

Gene-Expression Data Processing

   Raw data from the training dataset (GEO: [195]GSE13213) with the
   Agilent platform were processed with “median-scaled” normalization by R
   package “limma”. Raw data fom four validation datasets (GEO:
   [196]GSE50081, [197]GSE30219, [198]GSE31210, and [199]GSE14814) with
   the Affymetrix platform were downloaded from GEO. These data were
   processed and normalized with “RMA” method by R package “affy”. Probe
   annotation of each dataset was downloaded from the GEO database. All
   gene expression profiles were log[2]-transformed.

Processing of PPI Databases

   Three public databases were employed to construct the reference PPI
   network in this study. PPI network from the Human Protein Reference
   Database (HPRD, release 9.0)[200]^46 was composed of 9,466 proteins and
   36,895 interactions. Protein-protein relationships labeled with
   “Compound/Binding” in Kyoto Encyclopedia of Genes and Genomes (KEGG)
   pathways were extracted, resulting in 2,986 proteins and 31,047
   interactions. Protein-protein relationships labeled with “Direct
   Complex” in Reactome pathways were extracted (4,885 proteins and 22,763
   interactions). These databases were merged into a reference PPI
   network, with 11,591 proteins and 77,059 interactions (after removing
   duplicated interactions and protein-self loops). Network mapping across
   different databases was done by Entrez gene IDs.

Identification of Relapse-Associated PPI Modules

   The sEV proteins from PEs were mapped to the reference PPI network. The
   largest component of the mapped network was used for subsequent
   analysis, including 568 proteins and 1,632 interactions. This network
   was integrated with the gene expression profile of the training dataset
   (GEO: [201]GSE13213). PPI modules (sub-networks) that could
   discriminate recurrent status were used as previously
   described.[202]^29 In a given module M with m genes, the expression
   score (e) of M in sample j was defined as:
   [MATH: <mrow><msub><mtext>e</mtext><mi>j</mi></msub><mo
   linebreak="goodbreak">=</mo><munder><mo>∑</mo><mrow><mi>i</mi><mo
   linebreak="badbreak">∈</mo><mi>m</mi></mrow></munder><mrow><mfrac><mrow
   ><msub><mi>Z</mi><mrow><mi>i</mi><mi>j</mi></mrow></msub></mrow><mrow><
   msqrt><mi>m</mi></msqrt></mrow></mfrac></mrow><mo>,</mo></mrow> :MATH]

   where Z[ij] is the Z-transformed gene expression value of gene i.

   Then, the discriminative score S(M) of module M was defined as the
   mutual information (MI) between e and relapse class (c):
   [MATH:
   <mrow><mtext>S</mtext><mrow><mo>(</mo><mtext>M</mtext><mo>)</mo></mrow>
   <mo
   linebreak="badbreak">=</mo><mtext>MI</mtext><mrow><mo>(</mo><mrow><msup
   ><mtext>e</mtext><mtext>'</mtext></msup><mo>,</mo><mtext>c</mtext></mro
   w><mo>)</mo></mrow><mo
   linebreak="goodbreak">=</mo><munder><mo>∑</mo><mrow><mtext>x</mtext><mo
   linebreak="badbreak">∈</mo><msup><mtext>e</mtext><mtext>'</mtext></msup
   ></mrow></munder><mrow><munder><mo>∑</mo><mrow><mtext>y</mtext><mo
   linebreak="badbreak">∈</mo><mtext>c</mtext></mrow></munder><mrow><mi>p<
   /mi><mrow><mo>(</mo><mrow><mtext>x</mtext><mo>,</mo><mtext>y</mtext></m
   row><mo>)</mo></mrow><mtext>log</mtext><mfrac><mrow><mi>p</mi><mrow><mo
   >(</mo><mrow><mtext>x</mtext><mo>,</mo><mtext>y</mtext></mrow><mo>)</mo
   ></mrow></mrow><mrow><mi>p</mi><mrow><mo>(</mo><mi>x</mi><mo>)</mo></mr
   ow><mi>p</mi><mrow><mo>(</mo><mi>y</mi><mo>)</mo></mrow></mrow></mfrac>
   </mrow></mrow><mo>,</mo></mrow> :MATH]

   where e’ represents the discretized form of e, by discretizing
   expression score into 8
   [MATH:
   <mrow><mrow><mo>(</mo><mrow><msub><mrow><mtext>log</mtext></mrow><mn>2<
   /mn></msub><mrow><mo>(</mo><mi>n</mi><mo>)</mo></mrow><mo>+</mo><mn>1</
   mn></mrow><mo>)</mo></mrow></mrow> :MATH]
   (where n is sample numbers) equally spaced bins as described
   previously.[203]^47

   A greedy searching program was performed to identify modules with
   locally maximal recurrent score S(M) using the similar parameters as in
   the previous study.[204]^29 Briefly, each protein in the PPI network
   (568 proteins) was seeded and iteratively expanded with a specified
   distance d (d = 2 in this study). For each expansion, the protein that
   yielded the largest improvement of S(M) among all candidate proteins
   was added to the module. The module expansion stopped when further
   protein addition did not increase the S(M) by a specified improvement
   rate of 10%. Modules with only one protein were removed, and modules
   with >80% common proteins were merged.

   Three permutation methods were utilized to estimate the significance of
   a candidate module M (with m genes): (1) stimulating random PPI modules
   with m proteins in the network; (2) shuffling the expression value of m
   genes; (3) shuffling the recurrent class of samples. Each method was
   performed by 10,000 times. The random discriminative scores were used
   as null distribution of S(M). PPI modules with a p value < 0.001 by all
   three methods were considered to be significant. The gene expression
   score of candidate modules was used in the following modeling process.

Development of Predictive Signature for Recurrence

   To construct a PPI network-based gene signature, the gene expression
   score (as defined above) of candidate modules was used. Feature
   selection and modeling was performed based on RF algorithm using the R
   package “varSelRF”.[205]^48 In detail, the predictive importance for
   recurrence of each candidate module was estimate by an initial RF with
   5,000 trees. A stepwise backward selection procedure was used to
   identify the optimal combination of candidate modules for recurrence
   prediction. In each iteration, 10% of features were excluded, and the
   remaining features were used to build an RF with 3,000 trees. This
   program was stopped when there were only two features left. The final
   RF model was selected with the smallest feature number that achieved
   the minimum OOB error among all iterative results. The recurrence risk
   of patients was defined by the OOB probability in the final RF model.

   To assess the efficiency of the PPI network-based methodology, we
   established a gene-based RF prediction model using the same program and
   parameters as used in the sEV network-based method. The predictive
   results of the two methods were compared by ROC curve analysis. The
   AUROCs with 95% confidence interval (CI) were calculated by the R
   package “pROC”.[206]^49 Also, the two methods were compared by the
   Harrell’s C-index in predicting RFS and OS in the independent datasets.
   A higher C-index means more accuracy in survival predicting.

Estimation of Tumor Microenvironment Status

   A hypoxia metagene across different cancer types was obtained from a
   previous study.[207]^32 A core angiogenesis signature for primary tumor
   was identified by Masiero et al.[208]^50 A set of inflammatory
   cytokines[209]^51 was used to estimate the intratumoral inflammation
   level. The intratumoral hypoxia, angiogenesis, and inflammation scores
   were calculated by averaging the Z-normalized expression value of the
   corresponding signature genes as we described previously.[210]^52 The
   abundance of immune and non-immune cells in the tumor microenvironment
   was calculated by MCP-counter, a robust estimator for
   tissue-infiltrating cell populations by gene expression
   profiles.[211]^27

Other Statistical Methods

   GO and KEGG pathway enrichment analysis for PPI modules was performed
   by the online tool DAVID.[212]^53 The significance of overlapping
   between two gene sets was calculated by hypergeometric test. The
   association between two category variables of patients was calculated
   by Chi-squared test or Fisher’s exact test. GSEA of was used to compare
   interested gene sets against the patient subgroups classified by
   3-PPI-Mod.[213]^54 The correlations of PPI modules with hypoxia,
   angiogenesis, and inflammation scores as well as the stromal cell
   abundance were calculated by Pearson’s correlation. Multiple testing
   was adjusted by false discovery rate (FDR) using the Benjamini and
   Hochberg’s method. A comparison of patient survival between low- and
   high-risk groups assigned by 3-PPI-Mod was performed by Kaplan-Meier
   curve and log-rank test. DSS and OS of patients in the ACT arm and the
   OBS were compared by Kaplan-Meier curve and log-rank test, with
   stratification by the 3-PPI-Mod signature. Multivariate Cox’s models
   were performed to assess the prognostic and predictive value of
   3-PPI-Mod signature. All statistical analyses were performed by R
   software (version 3.3.1). A p value < 0.05 was considered to be
   significant.

Author Contributions

   Conception and design: J.L. and Q.L.; Development of methodology: B.C.
   and W.D.; Acquisition of data: S.M., Q.W., M.L., H.L., and T.C.;
   Analysis and interpretation of data: B.C., W.D., and S.M.; Writing,
   review, and/or revision of the manuscript: B.C., J.L., Q.L., G.Z., and
   X.Y.; Administrative, technical, or material support: G.Z. and X.Y.;
   Study supervision: Q.W., M.L., and H.L. All authors read and approved
   the final manuscript.

Conflicts of Interest

   The authors declare no competing interests.

Acknowledgments