Abstract

   Non-small cell lung cancer (NSCLC) is histologically classified into
   lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LSCC).
   However, some tumors are histologically ambiguous and other
   pathophysiological features or microenvironmental factors may be more
   prominent. Here we report integrative multiomics analyses using data
   for 229 patients from a Korean NSCLC cohort and 462 patients from
   previous multiomics studies. Histological examination reveals five
   molecular subtypes, one of which is a NSCLC subtype with PI3K-Akt
   pathway upregulation, showing a high proportion of metastasis and poor
   survival outcomes regardless of any specific NSCLC histology.
   Proliferative subtypes are present in LUAD and LSCC, which show strong
   associations with whole genome doubling (WGD) events. Comprehensive
   characterization of the immune microenvironment reveals various immune
   cell compositions and neoantigen loads across molecular subtypes, which
   predicting different prognoses. Immunological subtypes exhibit a hot
   tumor-enriched state and a higher efficacy of adjuvant therapy.

   Subject terms: Non-small-cell lung cancer, Cancer genomics, Proteomics,
   Immunogenetics
     __________________________________________________________________

   Subtyping of non-small cell lung cancer can be challenging based on
   pathology. Here, the authors utilise multi-omics analysis of 229
   patients to identify further subtypes, and altered immune composition
   between subtypes.

Introduction

   Lung cancer is a major health concern worldwide, accounting for 18% of
   global cancer-related deaths^[78]1. Surgical resection, with or without
   postoperative adjuvant therapy, is currently the first-line treatment
   for locally advanced or early-stage lung cancers^[79]2. In advanced
   cases, genomics-based targeted therapies and immunotherapies using
   immune checkpoint inhibitors provide promising treatment options^[80]3,
   often in addition to chemotherapy. Although the mortality rates
   associated with lung cancer are continuously decreasing with
   appropriate surgical and medical treatment^[81]4, the 5-year survival
   rates for localized, regional, and metastatic non-small cell lung
   cancer (NSCLC) remain unsatisfactory (64%, 37%, and 8%, respectively),
   mainly due to recurrence after treatment^[82]5. Therefore, tailoring
   treatments based on patient stratification according to molecular
   characteristics is of increasing interest to improve patient survival.

   High-throughput omics approaches have facilitated the classification of
   NSCLC molecular subtypes and candidate molecular targets as well as
   prognostic or predictive biomarkers for lung cancer treatment^[83]6.
   Initially, The Cancer Genome Atlas Research Network identified the
   molecular pathways of lung cancer and classified lung squamous cell
   carcinoma (LSCC) into classical, basal, secretory, and primitive
   subtypes^[84]7, and lung adenocarcinoma (LUAD) into terminal
   respiratory unit, proximal inflammatory, and proximal proliferative
   subtypes^[85]8. Recently, proteomic and proteogenomic
   approaches^[86]9–[87]14 have been introduced to identify novel subtypes
   and their druggable targets.

   Although these efforts have advanced our understanding of cancer
   biology, the clinical implementation of these findings in
   high-throughput approaches remains challenging^[88]15,[89]16. First,
   many proteogenomic studies have focused on a single subtype of NSCLC
   (such as LUAD or LSCC), but the approach excludes cases in which the
   pathological classification is ambiguous and/or
   discrepant^[90]17,[91]18, which eventually leads to the formation of
   study cohort that may not be representative of real NSCLC population.
   Second, the sensitive prediction of postoperative survival beyond those
   achieved using conventional prognostic factors is a prerequisite for
   personalized surgical oncology treatment. Further characterization of
   causative biomarkers associated with prognosis may identify crucial
   treatment targets^[92]19, and requires a multiomics-based analysis of
   well-annotated clinical data. Finally, most multiomics data have been
   generated from resected lung cancer tissues, and adjuvant chemotherapy
   after surgery is the primary treatment for locally advanced lung
   cancer. Adjuvant chemotherapy largely relies on platinum doublet and
   platinum-based chemotherapeutic regimens^[93]20 in addition to a
   cytotoxic agent (e.g., pemetrexed, gemcitabine, vinorelbine, or
   paclitaxel). Although this approach has improved patient
   survival^[94]21, few studies^[95]22 have investigated the molecular
   basis of the response to postoperative adjuvant treatment, and
   consensus biomarkers for predicting treatment efficacy have not yet
   been identified.

   In this study, we conducted a comprehensive multiomics analysis to
   define the molecular subtypes of NSCLC using a Korean NSCLC discovery
   cohort of 229 patients and a replication cohort of 462 NSCLC patients
   from previous multiomics studies. Our aim was to expand the scope of
   histological subtyping and identify molecular subtypes of NSCLC with
   potential prognostic and therapeutic implications. We further
   characterized the subtypes by integrating a large-scale single-cell RNA
   sequencing dataset^[96]23 of NSCLC and evaluated their cellular
   specificity and histopathological relevance. In addition, our study
   included an extensive histological review of patient data for
   tumor-infiltrating lymphocytes (TILs), identified potential neoantigens
   and cryptic peptides characterizing the immune microenvironment, and
   noted the varying efficacy of adjuvant therapies between subtypes.
   Overall, our study represents a considerable advance in the field of
   NSCLC research and has important implications for precision medicine
   and personalized therapies.

Results

Identification of subtypes in NSCLC patients by multiomics

   The multiomics analysis utilized a retrospective cohort of 229 Korean
   patients diagnosed with NSCLC at Asan Medical Center in Seoul, Korea
   (Supplementary Data [97]1a). Tumor samples with matched normal tissues
   or blood samples were collected via surgery between 2010 and 2019.
   Patient demographics, disease parameters (including histology and
   tumor-node-metastasis [TNM] staging), survival, and treatment response
   indicated good coverage of disease severity and patient population.
   Histologically, the tumor samples included 139 adenocarcinomas (LUADs,
   61%), 63 squamous cell lung carcinomas (LSCCs, 27%), and 27 tumors of
   other types (12%) (Fig. [98]1a, Supplementary Data [99]1a). LUAD cases
   were almost equally distributed between the sexes (66 males and 74
   females), whereas LSCC cases were found mostly among males (97%,
   n = 61), which was previously shown in other Korean LSCC
   cohort^[100]24. Self-reported smoking status indicated that 61% of the
   patients had a smoking history, with a higher prevalence of smoking in
   males and patients with LSCC. The TNM-based stages ranged from IA1 to
   IVA, with approximately 40% of the patients having late-stage disease
   (IIIA, n = 75; IIIB, n = 9; IVA, n = 11; Fig. [101]1a). Approximately
   half of the patients had lymph node metastases (51%, n = 116) at the
   time of pathological diagnosis after surgery. Adjuvant therapy,
   including chemotherapy (CTx) or radiation therapy (RTx), was
   administered to 48% of the patients (110/229 patients; 59 CTx, 16 RTx,
   and 35 CTx and RTx) according to the NCCN guidelines. Tumor recurrence
   was observed in 54% of the patients who received adjuvant therapy, and
   the recurrence rate was similar across treatments (CTx, 56%, 33/59;
   RTx, 50%, 8/16; CTx and RTx, 54%, 19/35) (Fig. [102]1a) and
   histological diagnoses (53%, AD, 72/137; 44%, SC, 27/62) (Supplementary
   Fig. [103]1a).

Fig. 1. Identification of multiomics subtypes in Korean NSCLC patients.

   [104]Fig. 1
   [105]Open in a new tab

   a Summary of clinical information for the Korean NSCLC patient cohort.
   Bar plots show NSCLC histology, tumor stages, pathologic N status, and
   tumor recurrence status with adjuvant treatment. b Numbers of
   quantified proteins and PTM sites identified in global proteomic,
   phosphoproteomic, and acetylproteomic analyses. The number of features
   quantified at <30% missing values across 229 samples is represented by
   the yellow line. c Overview of NMF clustering. Other tumor suppressor
   genes consisted of CDKN2A, STK11, KEAP1, RB1, PPP2R1A, and SMARCA4.
   Other oncogene alterations consisted of frameshift deletions; in-frame
   deletion/insertion and missense mutations in KRAS, ERBB2, and PIK3CA;
   exon skipping in MET; and gene fusion in ALK, ROS1, and RET. Copy
   number loss was defined as homozygous deletion (absolute copy number
   <0.5). d Enrichment of the five identified NMF clusters for clinical
   variables. Tests indicating statistical significance (P < 0.05,
   two-sided Fisher’s exact test) are colored according to the odds ratio
   (OR). Box size indicates the proportion of the cohort characterized by
   a given clinical variable. e Pathway enrichment analyses of the five
   subtypes using GSVA and PTM-SEA. Pathways with statistical significance
   (FDR < 0.05, permutation) and positive enrichment scores (z-score) are
   represented by dots.

   We generated genomic, transcriptomic, proteomic, phosphoproteomic, and
   acetylproteomic datasets from the samples. A genomic dataset was
   generated by whole-exome sequencing (WES) of 228 normal adjacent tissue
   (NAT)-matched tumors and one tumor-only sample with a read depth
   sufficient for variant discovery (tumor: ~300X, NAT: ~100X). We
   observed 33,301 somatic small mutations that contained single
   nucleotide variants (SNVs) and indels (on average, 145 per sample),
   470,836 copy-number alterations encompassing amplification, gain,
   heterozygous deletion, and homozygous deletion (on average, 2056 per
   sample) and a 2.7 tumor mutation burden (TMB) score in 229 tumors. For
   transcriptomic analysis, we performed bulk RNA-seq for 205 tumors and
   85 matched NATs for deep coverage (approximately 120 M reads per
   sample), enabling gene expression quantification and alternative
   splicing isoform discovery. We acquired 60,688 transcripts and selected
   20,088 transcripts, based on low-count genes across samples, for
   subsequent analysis. For proteomic analysis using tandem mass tag
   (TMT)-based isobaric labeling, proteomic data were collected from 229
   tumor samples and 26 matched NATs. A total of 10,788 proteins, 40,738
   phosphosites, and 5975 acetylation sites were observed in at least 30%
   of the samples and quantified as a log[2] ratio to the common reference
   (CR) sample (Fig. [106]1b).

   For multiomics analysis, we integrated proteomic, phosphoproteomic, and
   acetylproteomic data and conducted non-negative matrix factorization
   (NMF) clustering to identify multiomics subtypes in the 229 NSCLC
   samples. We identified five multiomics subtypes: metabolic (Subtype 1),
   alveolar-like (Subtype 2), proliferative (Subtype 3), hypoxic (Subtype
   4), and immunogenic (Subtype 5) (Fig. [107]1c), characterized based on
   genetic mutations, clinical phenotypes, and molecular pathways
   (Fig. [108]1d, e, Supplementary Data 1b–d).

   Subtype 1 was composed mainly of LUAD females (64%, 35/55) with EGFR
   and TP53 mutations, as well as a high frequency of whole genome
   doubling (WGD) events (i.e., phenomena in which more than half of the
   chromosomes are gained; Fig. [109]1c), suggesting a chromosomally
   unstable co-driven subtype. Significant enrichment of CDKN2A copy
   number loss in Subtype 1 (OR: 3.63, P = 2.35 × 10−^2, Fisher’s exact
   test) also supports the observation (Fig. [110]1d). Subtype 2 mainly
   comprised patients with LUAD (71%, 32/45) with EGFR mutations (49%,
   22/45) and without WGD events (Fig. [111]1c), suggesting a
   chromosomally stable oncogene-driven subtype. These samples showed a
   significantly lower frequency of TP53 mutations (6%; odds ratio [OR]:
   0.04, P = 3.1 × 10^−9, Fisher’s exact test; Fig. [112]1d) and much
   lower tumor mutational burden (TMB) than samples representing other
   subtypes (Subtype 2 = 0.9 variants per Mb, others = 3.2 variants per
   Mb; P = 3.6 × 10^−10, Wilcoxon ranked sum test) (Supplementary
   Fig. [113]1b, c). Despite the upregulation of the EGFR pathway in both
   subtypes, molecular pathways were specifically enriched for each
   cluster. Subtype 1 exhibited significant upregulation of proteins
   involved in oxidative phosphorylation (adjusted P = 1.7 × 10^−6,
   Wilcoxon rank-sum test), mitochondrial matrix (adjusted P = 8.1 ×
   10^−9, Wilcoxon rank-sum test), and cellular respiration (adjusted
   P = 4.2 × 10^−5, Wilcoxon rank-sum test), indicating an association
   with metabolic pathways. Conversely, Subtype 2 was characterized by a
   significant upregulation of phosphorylation in the IL-33 (adjusted
   P = 1.8 × 10^−15, Wilcoxon rank-sum test) and Notch pathways (adjusted
   P = 2.4 × 10^−16, Wilcoxon rank-sum test) (Fig. [114]1e), implicating
   these molecular features in early tumorigenesis and early-stage
   disease.

   Subtype 3 was significantly associated with WGD events (OR: 13.5,
   P = 4.6 × 10^−10, Fisher’s exact test), TP53 (OR: 13.5, P = 4.6 ×
   10^−10, Fisher’s exact test) and PIK3CA mutations (OR: 3.89, P = 3.16 ×
   10−^2, Fisher’s exact test), and was more prevalent in patients with
   LSCC (65%, 34/52 patients), in males (83%, 43/52 patients), and smokers
   (85%, 44/52 patients) (Fig. [115]1d). Subtype 3 exhibited a highly
   proliferative phenotype, as evidenced by significant enrichment in cell
   cycle-related pathways, including E2F/MYC target (adjusted P = 4.7 ×
   10^−22, 1.3 × 10^−20, Wilcoxon rank-sum test), G2M checkpoint (adjusted
   P = 2.4 × 10^−23, Wilcoxon rank-sum test), and cyclin-dependent kinase
   (CDK) target pathways (adjusted P = 5.3 × 10^−24, Wilcoxon rank-sum
   test) (Fig. [116]1e). Thus, this subtype can be defined as a
   chromosomally unstable tumor suppressor-deficient proliferative
   subtype.

   Subtype 4 was not associated with any specific histological type of
   NSCLC but was significantly enriched for metastasis (OR: 3.0, P = 5.6 ×
   10^−3, Fisher’s exact test). We found that phosphorylated sites in this
   subtype were upregulated in hypoxia (adjusted P = 2.3 × 10^−4, Wilcoxon
   rank-sum test), PI3K-Akt (adjusted P = 4.7 × 10^−3, Wilcoxon rank-sum
   test), and neutrophil degranulation (adjusted P = 1.7 × 10^−4, Wilcoxon
   rank-sum test) pathways, and proteins were also enriched for neutrophil
   degranulation (adjusted P = 4.6 × 10^−6, Wilcoxon rank-sum test),
   suggesting a potential role in the promotion of tumor migration,
   invasion, and metabolism in tumor metastasis. Therefore, Subtype 4 can
   be considered as a chromosomally stable mesenchymal subtype.

   Subtype 5 showed a significantly elevated proportion of
   tumor-infiltrating lymphocyte (TIL)-associated patterns (OR: 3.6,
   P = 1.2 × 10^−2, Fisher’s exact test; Fig. [117]1d) and enrichment of
   immune-related pathways, such as TNFα signaling via NF-κB (adjusted
   P = 1.1 × 10^−6, Wilcoxon rank-sum test, Fig. [118]1e), suggesting that
   this subtype was a high-immune and chromosomally stable
   tumor-suppressor-driven inflammatory subtype. KRAS mutation was
   significantly enriched in Subtype 5 (OR: 4.51, P = 3.53 × 10^−2,
   Fisher’s exact test) (Supplementary Fig. [119]1d), with four cases
   presenting concurrent TP53 mutation. STK11 and KEAP1 mutations were
   each identified only in 1 case, although the trends were not
   statistically significant.

A NSCLC subtype associated with poor prognosis and frequent metastasis

   To replicate our subtype classification, we utilized multiomics or
   proteomics data from 462 patients with NSCLC obtained from previous
   studies, including two LUAD studies by Gillette et al.^[120]10
   (n = 110) and Xu et al.^[121]11 (n = 103), an LSCC study by Satpathy et
   al.^[122]13 (n = 108), and an NSCLC study by Lehtio et al.^[123]14
   (n = 141). We compared the top features of the NMF subtypes between our
   cohort and those of other studies and found significant overlaps
   (FDR < 0.01, Fisher’s exact test): the terminal respiratory unit
   subtype of adenocarcinoma (Subtype 1), inflammatory subtypes (Subtypes
   2 and 5), and proliferative subtypes of LUAD and LSCC (Subtype 3)
   (Fig. [124]2a and Supplementary Data [125]2a). In contrast, Subtype 4
   was distinct, showing enrichment for phosphorylation features
   associated with the EMT-enriched (LSCC)^[126]13 subtype and acetylation
   features from the inflammatory subtypes of LUAD and LSCC (Supplementary
   Fig. [127]2a, b). We also performed a combined NMF analysis for a total
   of 447 patients with NSCLC by integrating proteome, phosphoproteome,
   and acetylome datasets from our study with those from previous
   studies^[128]10,[129]13 (hereafter called the “combined CPTAC
   dataset”). We identified five subtypes, referred to as “Combined NMF”
   (Fig. [130]2b). The combined NMF showed a highly consistent pattern
   with the feature overlap analysis (Fig. [131]2a and Supplementary
   Data [132]2a) and confirmed that the four subtypes (Subtype 1, 2, 3, 5)
   showed a consistency with the previously identified NMF subtypes
   (Fig. [133]2b, Supplementary Fig. [134]2c and Supplementary
   Data [135]2b). Subtype 4 did not cluster with any single subtype of
   LUAD or LSCC in the combined NMF analysis (Fig. [136]2b). A comparison
   without the Korean cohort (“CPTAC NSCLC”), constituting 51% (229/447)
   of the Combined NMF, showed 77% of patients maintained their subtype
   classification when including the Korean cohort (Fig. [137]2c).
   Additionally, Subtype 4 consistently comprised 18.8% (43/229) of the
   Korean NSCLC cohort and 18.3% (40/218) of the combined CPTAC dataset
   without histological types. Collectively, our findings suggest that
   Subtype 4 represents a NSCLC subtype that requires further clinical and
   molecular characterization.

Fig. 2. Novel NSCLC subtype associated with poor survival.

   [138]Fig. 2
   [139]Open in a new tab

   a Overlap of subtype features between the five NMF subtypes in this
   study and subtypes identified in previous NSCLC multiomics studies.
   Protein enrichment is in the heatmap. Full rectangle and asterisk
   indicate significant overlaps (Two-sided fisher’s exact test adjusted P
   ≤ 0.05, Benjamini-Hochberg adjustment), faint rectangle indicates
   overlaps which pass only nominal P value (Two-sided fisher’s exact test
   P ≤ 0.05, two-sided fisher’s exact test adjusted P > 0.05), and blank
   indicates overlaps which is not significant (two-sided fisher’s exact
   test P > 0.05). b, c Reclassification of samples from previously
   defined multiomics subtypes^[140]10,[141]13 according to our combined
   NMF subtypes. The statistical significance of the relationship is
   visually represented by the clarity and transparency of the lines
   (Supplementary Fig. [142]2c). d Subtype 4-specific kinase activity
   scores estimated from phosphoproteomic data and the kinase-substrate
   network database (PHONEMeS). The colors of the points represent the
   estimated kinase activity scores. The sizes of the points represent the
   -log[10](FDR) of the kinase activity estimates. There were two
   significantly upregulated kinases: CSNK2A1 and GSK3B (FDR < 0.05). e
   Expression of poor prognosis markers containing phosphorylated sites on
   SLK (S347) and LRRFIP1 (S581) is shown for our study (Subtype 4,
   n = 43; others, n = 186) and CPTAC (LUAD (Subtype 4, n = 26; others,
   n = 84), LSCC (Subtype 4, n = 15; others, n = 93)). Wilcoxon rank-sum
   test was performed to test the differences in expression. The color of
   the dots in the right panel represents the study type in CPTAC. For
   box-plots, middle line, median: box edges, 25th and 75th percentiles;
   whiskers, most extreme points that do not exceed ±1.5 × IQR. f
   Cancer-specific overall survival length according to the expression of
   poor prognosis markers in our study and the CPTAC dataset (integrated
   with LUAD and LSCC). The p-value was calculated with the log-rank test.
   g Intracellular signaling pathways underlying poor prognosis in Subtype
   4. The blue box represents the main signaling pathways, including the
   HIF-1, VEGF, PI3K-AKT, and NF-κB signaling pathways. The red triangular
   nodes are kinases identified as significantly upregulated in Subtype 4.
   The colors of the points represent the log[2]FC values obtained through
   differential expression (DE) analysis of Subtype 4 and the other
   subtypes. The border style of the point indicates the prognostic
   direction of the feature. h Cancer-specific overall survival length
   between our subtypes indicating significant changes in survival
   probability (y-axis) over time (x-axis). i Survival curves for patients
   with (n = 35) and without metastasis (n = 8) in Subtype 4 (n = 43) and
   ( j) patients without metastasis in each subtype (n = 91). The p-value
   was calculated with the log-rank test (h–j).

   To examine the molecular characteristics of Subtype 4, we extracted
   distinct features (proteins, phosphorylation, or acetylation) of the
   subtypes in our NMF clustering analysis (Supplementary Data [143]1e).
   We found that the majority of NMF features present in Subtype 4 were
   phosphorylated sites (96%, 178/186, Supplementary Data [144]1e),
   indicating that phospho-kinase interactions are a major signature.
   Therefore, we investigated the kinase activity of this subtype using
   phosphoproteomic data. We found significant enrichment of two kinases,
   CSNK2A1 (FDR = 2.3 × 10^−7) and GSK3B (FDR = 1.9 × 10^−2), which are
   known to phosphorylate various proteins in the PI3K-AKT signaling
   pathway, in Subtype 4 compared to other subtypes (Fig. [145]2d,
   Supplementary Fig. [146]2d and Supplementary Data [147]2c). Upon
   evaluating the relationship between the activity and expression levels
   of significant kinases (P < 0.05), we observed a moderate correlation
   (Supplementary Fig. [148]2e). Survival analysis based on feature
   expression (Supplementary Data [149]2d) showed that most of the
   unfavorable prognostic factors were phosphorylated sites differentially
   expressed in Subtype 4 (91%, 104/114) (Supplementary Fig. [150]2f).
   Notably, STE20-like serine/threonine-protein kinase at serine 347 (SLK
   (S347)), a protein phosphorylated by CSNK2A1, was significantly
   upregulated in Subtype 4 (adjusted P = 8.0 × 10^−3, Benjamini–Hochberg
   adjustment, Supplementary Data [151]3a) and associated with unfavorable
   prognostic features (P = 3.0 × 10^−6, log-rank test) (Fig. [152]2e, f).
   In the combined CPTAC dataset, we also found increased phosphorylation
   of SLK (S347) in Subtype 4 and poor survival outcomes (Fig. [153]2e, f,
   Supplementary Fig. [154]2g and Supplementary Data [155]2e). SLK
   mediates apoptosis downstream of the ErbB2 and PI3K pathways^[156]25
   and is activated in a CSNK2A1-dependent manner^[157]26. Recent studies
   have reported that high SLK expression is associated with reduced
   overall survival in HER2-positive patients^[158]27 and in
   glioma^[159]28. Using both our cohort and the CPTAC cohort to check the
   ROC curve, we found that SLK (S347) is an effective marker for
   distinguishing Subtype 4 not only in our cohort but also more
   effectively in the CPTAC cohort (Supplementary Fig. [160]2h).
   Collectively, these results suggest that SLK participates in tumor
   progression and could be a key marker specific to Subtype 4.
   Furthermore, Subtype 4 showed a significant upregulation of
   leucine-rich repeat flightless-interacting protein 1 at serine 581
   (LRRFIP1 (S581)) phosphorylation (adjusted P = 1.3 × 10^−2,
   Benjamini–Hochberg adjustment, Supplementary Data [161]3a), which was
   correlated with unfavorable prognostic features (P = 1.2 × 10^−3,
   log-rank test) (Fig. [162]2e, f). Similarly, in the combined CPTAC
   dataset, elevated phosphorylation of LRRFIP1 (S581) was observed in
   Subtype 4, which coincided with poor survival outcomes (Fig. [163]2e,
   f, Supplementary Fig. [164]2g and Supplementary Data [165]2e). Notably,
   LRRFIP1 stimulates the epithelial–mesenchymal transition (EMT) pathway
   by modulating the Wnt/β-catenin signaling pathway^[166]29. A recent
   study^[167]30 demonstrated that high LRRFIP1 expression was associated
   with reduced overall survival in glioma. These findings collectively
   suggest that LRRFIP1 may also contribute to cellular invasion and
   metastasis and could serve as a key marker specific to Subtype 4.

   Subtype 4 included numerous prognostic features in the HIF-1 and
   PI3K-AKT signaling pathways (Fig. [168]2g and Supplementary
   Fig. [169]2i). Among these unfavorable features, the significantly
   upregulated ANGPT2 protein is known to increase in a hypoxic
   environment, which can promote the release of pro-angiogenic cytokines,
   such as VEGF, through HIF-1 accumulation. We also observed increased
   acetylation of TIMP3 at K65, which is presumably involved in the
   ANGPT2-induced hypertensive response. In contrast, CDKN1B, which was
   significantly downregulated (adjusted P = 7.1 × 10^−11, two-sided
   t-test, Supplementary Data [170]3a), was a favorable prognostic feature
   in our cohort (P = 1.3 × 10^−4, log-rank test, Supplementary
   Data [171]2d) and the combined CPTAC dataset (P = 1.6 × 10^−2, log-rank
   test). Given that the loss of CDKN1B increases the risk of cancer
   metastasis^[172]31, the reduced expression of CDKN1B implies an
   advanced phenotype of Subtype 4.

   We evaluated the clinical significance of Subtype 4 according to the
   survival rates of patients with this subtype. Subtype 4 in the Korean
   NSCLC cohort was associated with a significantly poorer survival rate
   compared to that in the other subtypes (P = 8.4 × 10^−3, log-rank test,
   Fig. [173]2h), possibly influenced by stage distribution. Since Subtype
   4 showed the highest rate of metastasis among the five subtypes,
   metastasis may have an important impact on the poor survival outcomes.
   Interestingly, there was no variation in the survival rates of patients
   with and without metastasis in Subtype 4 (P = 9.5 × 10^−1, log-rank
   test, Fig. [174]2i), suggesting that metastasis may not be the sole
   mechanism leading to poorer survival. When comparing survival rates
   among non-metastatic patients of the five subtypes, we observed a
   significant disadvantage for patients with Subtype 4 (P = 2.6× 10^−2,
   log-rank test, Fig. [175]2j), indicating that other biological factors
   may contribute to the poor prognosis of this subtype. In contrast, we
   did not find any significant difference in overall survival between
   Subtype 4 and the other subtypes in the combined CPTAC cohort (P = 7.9
   × 10^−1, log-rank test, Supplementary Fig. [176]2j), indicating the
   effect of differences in ethnicity and treatment records.

Cellular landscape of the five subtypes of NSCLC

   Exploring the tumor microenvironment is crucial for understanding the
   mechanisms underlying cancer progression and for developing effective
   therapeutic strategies to target not only cancer cells but also the
   surrounding microenvironment. We assessed the tumor microenvironment of
   the five NSCLC subtypes based on cell type specificity through a
   comparative analysis of subtype-specific genes with a range of cell
   type-specific genes using an integrated single-cell RNA (scRNA)
   sequencing dataset^[177]23 of NSCLC patients. First, we computed the
   differentially expressed genes (DEGs) for each subtype compared to the
   NAT samples (Supplementary Data [178]3b). As expected, we found that
   the DEG sets of all five subtypes were significantly upregulated in
   tumor cell types (FDR < 0.001, permutation; Supplementary Fig. [179]3a
   and Supplementary Data [180]3c), highlighting the overall degree of
   tumorigenicity of the subtypes.

   Next, we identified DEGs for each subtype by comparing all subtypes
   (DEG[subtype], e.g., Subtype 1 vs. other subtypes) (Supplementary
   Data [181]3d). In Subtypes 1 and 3, the DEG[subtype] was enriched for
   cell types related to tumors (FDR < 0.01, permutation): Subtype 1
   corresponding to LUAD and LUAD mitotic cell types and Subtype 3
   corresponding to LSCC and LSCC mitotic cell types as defined^[182]23,
   suggesting actively proliferating tumor cell components (Fig. [183]3a,
   Supplementary Fig. [184]3b and Supplementary Data [185]3e). In
   contrast, the DEG[subtype] sets of Subtypes 2, 4, and 5 were enriched
   in neutrophils. While a high proportion of NAT-associated neutrophils
   (NANs) was present in these subtypes (Supplementary Data [186]3e), the
   tumor-associated neutrophil (TAN) cell types were enriched in Subtypes
   4 and 5, with the largest proportion in Subtype 4. Furthermore, the
   DEG[subtype] set of Subtype 2 appeared to be enriched for alveolar-type
   fibroblasts and endothelial cells, implying that these tumors have
   stromal components similar to those of the normal alveolar interstitium
   (Supplementary Data [187]3e). Most immune-related cell types were
   enriched in the DEG[subtype] set of Subtype 5 (Supplementary
   Data [188]3e), reflecting a high proportion of TIL patterns.

Fig. 3. Landscape of cell type-specific subtype characteristics.

   [189]Fig. 3
   [190]Open in a new tab

   a UMAP plot of single-cell types specific to Subtypes 1 to 5, using
   each DEG[subtype]. The color of each point represents the module score
   of each cell; the more relevant the module is to the cell type, the
   higher the score and the redder the color. UMAP information was
   obtained from the original study^[191]23. b Representative histologic
   images of the subtypes. The tumor cell (T) and stromal (S) components
   are separately labeled. Note the irregularly fused tumor glands in
   subtype 1 tumor compared to those in subtype 2 tumors composed of
   small, uniform tumor cells lying within the elastic stroma similar to
   the normal alveolar wall. Dense stromal inflammatory cell infiltration
   in Subtype 5 tumors. Scale bars for b = 100 μm. c Proportions of
   samples with different pathologic diagnoses within each subtype. LUAD
   was predominant in Subtypes 1 and 2, whereas LSCC was predominant in
   Subtype 3. d–f Histologic patterns of LUADs in each subtype. The
   predominant patterns of Subtype 1 and 2 LUADs were most commonly acinar
   or papillary but were quite heterogeneous (d). The proportion of the
   lepidic pattern, considered to indicate noninvasive LUAD, was enriched
   mostly in Subtype 2 LUADs, suggesting that Subtype 2 is most like early
   LUAD (Subtype 1, n = 55; Subtype 2, n = 34, Subtype 3, n = 10; Subtype
   4, n = 26; Subtype 5, n = 20) (e). Consistently, the proportion of
   high-grade histologic patterns (including solid, micropapillary,
   cribriform, and complex glandular patterns) was lowest in Subtype 2
   LUADs. The high-grade histologic pattern was more extensive in Subtype
   1 than in Subtype 2, but these subtypes were remarkably heterogeneous
   compared to Subtype 3–5 LUADs, which were mostly composed of high-grade
   histologic patterns (Subtype 1, n = 55; Subtype 2, n = 34, Subtype 3,
   n = 10; Subtype 4, n = 24; Subtype 5, n = 20) (f). For box-plots,
   middle line, median: box edges, 25th and 75th percentiles; whiskers,
   most extreme points that do not exceed ±1.5 × IQR. g–I Lymphovascular
   invasion (g), lymph node metastasis (h), and tumor necrosis (i) were
   less common in Subtype 2 tumors, which also implies that Subtype 2
   tumors are in a clinically early, nonprogressed stage. j
   Microscopically, the stromal component was more extensive in Subtype 2,
   4, and 5 tumors (Subtype 1, n = 55; Subtype 2, n = 43, Subtype 3,
   n = 52; Subtype 4, n = 43; Subtype 5, n = 34). For box-plots, middle
   line, median: box edges, 25th and 75th percentiles; whiskers, most
   extreme points that do not exceed ±1.5 × IQR. k–l Tumor-infiltrating
   lymphocytes (k) and stromal neutrophilic infiltration (l) were most
   extensive in Subtype 5 tumors. The p-value was calculated using the
   chi-square test (c, d, g–I, k, and l) the Kruskal-Wallis test (e, f,
   and j). m Summary of the histopathologic characteristics of the NSCLC
   subtypes.

   We correlated these findings with the histopathological features of the
   subtypes (Fig. [192]3b–m). Subtypes 1 and 2 were predominantly
   well-to-moderately differentiated adenocarcinomas, in which acinar and
   papillary patterns were predominant (Fig. [193]3c, d). Comparisons of
   the histological patterns for LUAD cases showed that the lepidic
   pattern, reminiscent of the normal alveolar structure of the lung
   parenchyma and a non-invasive component of lung adenocarcinoma^[194]32,
   was more extensive in Subtype 2 (Fig. [195]3e), confirming the
   DEG[subtype] findings. In contrast, Subtype 1 had a higher proportion
   of high-grade histologic patterns, including solid, micropapillary,
   cribriform, and complex glandular patterns^[196]33 compared to those in
   Subtype 2 (Fig. [197]3f). Consistent with these observations, Subtype 2
   tumors exhibited less frequent lymphovascular invasion (Fig. [198]3g)
   and lymph node metastasis (Fig. [199]3h). Subtype 3 mainly consisted of
   squamous cell carcinomas and a subset of solid-predominant
   adenocarcinomas (Fig. [200]3c, d). Subtype 3 tumors had enlarged
   pleomorphic nuclei with high mitotic activity and frequent tumor
   necrosis (Fig. [201]3i), which was consistent with the DEG[subtype]
   analysis. In contrast to Subtypes 4 and 5, the stromal components of
   Subtype 3 tumors were relatively scarce (Fig. [202]3b–j). Subtype 4
   tumors had moderate-to-large amounts of desmoplastic stromal components
   with variable amounts of inflammatory cells, whereas Subtype 5 tumors
   exhibited high tumor infiltration by immune cell components in
   approximately half of the cases (Fig. [203]3k). Likewise, in the
   DEG[subtype] analysis, tumors of Subtypes 4 and 5 were more likely to
   be accompanied by moderate-to-high stromal neutrophilic infiltration,
   but the proportion of such tumors was not high (Fig. [204]3l).
   Collectively, we obtained representative histological images of the
   five subtypes that reflected their histopathological characteristics
   and clinical relevance.

Proteogenomic features underlying whole-genome doubling in NSCLC subtypes

   The proliferative subtype is common in NSCLC and is mainly
   characterized by the upregulation of cell cycle pathways, including E2F
   target, G2M checkpoint, and the MYC targets v1 and v2. Previous
   proteogenomic study^[205]10 reported a proliferative subtype of LUAD,
   the dominant proximal-proliferative cluster, which accounts for 27% of
   patients with LUAD (30/110 samples; Supplementary Fig. [206]4a).
   Similarly, a proliferative subtype, called the proliferative-primitive
   cluster, affects 16% of patients with LSCC (17/108 samples,
   Supplementary Fig. [207]4b). When considering the classical subtype,
   which also exhibits upregulation of cell cycle pathways, proliferative
   subtypes collectively constituted approximately 44% of patients with
   LSCC^[208]13 (47/108 samples).

   Our multiomics analysis identified Subtype 3 as a proliferative subtype
   characterized by a WGD event. In Subtype 3, 75% of the patients (39/52)
   exhibited WGD (Fig. [209]4a and Supplementary Data [210]4a, c).
   Copy-number signature analysis revealed a significant association
   between Subtype 3 and the CN14 signature (OR: 8.4, Fisher’s exact test;
   Fig. [211]4b and Supplementary Fig. [212]4d), which is indicative of
   high ploidy (absolute copy numbers 3 to 8), relatively large segment
   size ( > 40 Mb), and whole-chromosome or chromosome arm-scale losses
   prior to a single genome-doubling event. In particular, Subtype 3
   showed amplification on chromosome 3q and significant enrichment for
   the co-occurrence of SOX2 amplification and TP53 mutations compared to
   the other subtypes (OR: 16.4, Fisher’s exact test; Fig. [213]4c, d and
   Supplementary Data [214]4e, f). Furthermore, Subtype 3 in the combined
   NMF analysis, which integrated the CPTAC LUAD and LSCC datasets also
   showed a high frequency of WGD (64%, 41/64; Fig. [215]4a and
   Supplementary Data [216]4b, d) and enrichment of the CN10 (28%, 18/64)
   and CN16 signatures (17%, 11/64), which indicated focal and chromosomal
   losses before single- and twice-genome doubling, respectively
   (Supplementary Fig. [217]4c and [218]e). Subtype 3 exhibited a greater
   TMB compared to that in the other subtypes in both our (P = 5.7 ×
   10^−11, Wilcoxon rank-sum test) and previous studies (P = 6.9 × 10^−6,
   Wilcoxon rank-sum test). Although LSCC is the most common histology
   (67%, 35/52 in Korean NSCLC and 66%, 42/64 in CPTAC NSCLC), Subtype 3
   is an NSCLC tumor characterized by the presence of WGD events due to
   copy number alterations and cell cycle pathway enrichment.

Fig. 4. Proteogenomic features underlying whole-genome doubling (WGD) in
NSCLC subtypes.

   [219]Fig. 4
   [220]Open in a new tab

   a Barplot showing WGD fraction in each multiomics subtype from our
   study and CPTAC NSCLC patients. b Overlap of copy number signatures for
   the five multiomics subtypes, with the colors indicating the odds ratio
   from one-sided Fisher’s exact test. The COSMIC v3 signature and
   etiology for each signature are indicated on the y-axis. *P < 0.05,
   **P < 0.01, ***P < 0.001, ****P < 0.0001 (one-sided Fisher’s exact
   test). c Top 10 most significantly enriched copy number variations
   (CNVs) in Subtype 3 (FDR < 0.1). The y-axis indicates the cytoband and
   the x-axis shows the log[10]-scaled FDR from linear regressions
   comparing Subtype 3 and other samples. d Enriched co-mutations in each
   subtype in both our and the CPTAC cohorts. EGFR mutation, missense
   mutation, in-frame deletion, frameshift deletion, and amplifications
   were counted. For TP53 and CDKN2A, amplifications were excluded, and
   for SOX2, only amplifications were counted. *P < 0.05, **P < 0.01,
   ***P < 0.001, ****P < 0.0001 (one-sided Fisher’s exact test). e
   Protein-level gene set enrichment analysis (GSEA) revealing upregulated
   and downregulated pathways in Subtype 3 of both cohorts. The x- and
   y-axis are enrichment scores (ES) from the current study and CPTAC
   NSCLC data, respectively. Labeled pathways are the top-5 upregulated
   pathways in Subtype 3. The Molecular Signatures Database (MSigDB)
   hallmark gene set v7.4 was used for GSEA. f Subtype 3-specific kinase
   activity scores. The sizes of the points indicate -log[10](FDR) from
   kinase activity estimation. Significantly up- and downregulated kinases
   are labeled (FDR < 0.05). g Elevated protein expression of XPO1 in
   Subtype 3 is shown for our study (Subtype 1, n = 55; Subtype 2, n = 45;
   Subtype 3, n = 52; Subtype 4, n = 43; Subtype 5, n = 34) and CPTAC LSCC
   (Subtype 1, n = 11; Subtype 2, n = 19; Subtype 3, n = 42; Subtype 4,
   n = 15; Subtype 5, n = 21). Kruskal-Wallis test was performed to test
   the differences in expression. WGD status is marked by red dots and the
   y-axis shows log[2] protein expression levels. For box-plots, middle
   line, median; box edges, 25^th and 75^th percentiles; whiskers, most
   extreme points that do not exceed ± 1.5 × IQR. h Sample information
   (top), drug response curve (middle), and IC50 (bottom) for selinexor
   (XPO1 inhibitor) for lung organoids highlighting a higher sensitivity
   in WGD-positive LSCC organoids. Three technical replicates were tested
   in each organoid sample. For IC50 barplot, dots indicate each
   replicate, and error bars indicate average ± 1 standard deviation. i
   WGD-related pathway underlying Subtype 1 LUAD tumors and Subtype 3 LSCC
   tumors. Significantly upregulated kinases are highlighted with red
   triangles (FDR < 0.05) and mutations are shown in purple boxes. Kinase
   activity scores are estimated from phosphoprotein expression. The
   log[2]fold changes from DE analyses are indicated by the color in each
   box. For the phosphoproteome, only features with FDR less than 0.1 are
   displayed.

   Consistent with previous studies, Subtype 3 showed upregulation of
   proteins, PTMs, and kinases in cell cycle pathways (Fig. [221]4e, f,
   Supplementary Fig. [222]4f, g and Supplementary Data [223]3a,
   [224]4g–j). The key proteins and PTMs involved in this subtype included
   SRSF1 (proteins S199 and K179), SRSF2 (T25 and S26), and XPO1 (protein
   K693). SRSF1 and SRSF2, both proto-oncogenes, modulate the splicing
   patterns of tumor suppressor genes, kinases, and kinase receptors into
   their oncogenic isoforms, thereby promoting cell cycle
   activity^[225]34. XPO1 was upregulated at both the protein and
   acetylated protein levels in this subtype (Fig. [226]4g).
   Overexpression of XPO1 has been observed in many malignancies^[227]35
   and its inhibition could potentially reduce cell proliferation and
   promote cell cycle arrest in proliferative subtypes. To assess this
   hypothesis, we selected tumor organoids from the organoid biobank of SG
   Medical, Inc. (Seoul, Korea) generated from WGD-positive LUAD and LSCC
   patients and WGD-negative controls. After treating the organoids with
   selinexor, a targeted XPO1 inhibitor, we found higher drug sensitivity
   in WGD-positive LSCC organoids than in other samples (Fig. [228]4h and
   Supplementary Data [229]4k). This finding suggests that targeting XPO1
   with selinexor may be a promising therapeutic approach for patients
   with LSCC and WGD tumors, which warrants further investigation into its
   potential clinical applications.

   To a lesser extent, Subtype 1 also included patients with WGD (51%,
   28/55 patients; Fig. [230]4a). Subtype 1 was significantly enriched in
   the CN15 signature (OR: 4.8, Fisher’s exact test; Fig. [231]4b), evenly
   distributed ploidy (absolute copy number from 2 to more than 9), and a
   segment size of 1 to 10 Mb. This subtype was predominantly found in
   patients with LUAD and was associated with co-occurring mutations in
   EGFR and tumor suppressor genes (TP53 or CDKN2A) (OR: 5.3, Fisher’s
   exact test; Fig. [232]4d), implying that subtype 1 is LUAD-prevalent
   and WGD is activated through LUAD-specific tumor evolution events. In
   the combined NMF analysis, we validated the proportion of WGD-positive
   patients (46%, 19/41 patients; Fig. [233]4a) and the enrichment of the
   WGD-related signature CN16 in Subtype 1 from previous LUAD and LSCC
   studies (34%, 14/41 patients; Supplementary Fig. [234]4c).

   WGD characterizes two major subtypes of NSCLC but seems to harbor
   different driver genes and underlying pathways for each subtype
   (Fig. [235]4i and Supplementary Data [236]4l). Subtype 1 was
   characterized by a high rate of in-frame deletion and copy number gain
   in EGFR, with a high mutational burden on tumor suppressor genes.
   Significantly upregulated kinases in Subtype 1 included RPS6KA3, which
   mediates cell growth signaling initiated by EGFR activation, CDK1,
   CDK2, and CDK4, indicating an accelerated cell cycle (FDR < 0.1)
   (Supplementary Fig. [237]4h and Supplementary Data [238]4m).
   Additionally, CDK12 and PIK3C3, which are related to DNA repair and
   autophagy, respectively, were significantly upregulated (FDR < 0.1).
   Since increased cell proliferation induces hypoxia and nutrient
   depletion^[239]36, these kinases may be activated in response to WGD to
   promote the cell’s adaptation to the nutrient-deprived environment and
   stabilize the genome, ultimately leading to cell survival. Similarly,
   together with CDK1, CDK2, and CDK4, kinases related to the DNA damage
   and nutrient deprivation responses were activated in Subtype 3 LSCC
   samples (Supplementary Fig. [240]4i and Supplementary Data [241]4m).
   However, the putative driver mutations of WGD in Subtype 3 are TP53
   mutations and copy number amplification on chromosome 3q, where many
   cell cycle genes reside, including SOX2, ATR, STAG1, GSK3B, TFDP2, and
   MCM2. Since LUSC tumors tend to have copy number gain in the 3q
   arm^[242]37, Subtype 3 may have benefited from a 3q gain after WGD,
   thus having a higher WGD fraction than Subtype 1 samples. In addition,
   unlike Subtype 1, CHEK1 and EEF2K play important roles after undergoing
   WGD. EEF2K is known to inactivate EEF2 by phosphorylation under dietary
   restriction, which in turn contributes to cell survival^[243]38.
   Overall, these results show that Subtypes 1 and 3 represent
   proliferative subtypes that have undergone WGD in LUAD and LSCC tumors,
   respectively, and that selinexor may be an effective treatment for
   patients with Subtype 3 will requires further validation study.

Heterogeneous immune landscapes in NSCLC

   Understanding the tumor immune microenvironment (TIME) is crucial in
   the molecular characterization of cancer subtypes^[244]39,[245]40. To
   profile the TIME in patients with NSCLC, two clustering analyses based
   on cell type^[246]41 and pathways^[247]42 were performed, with
   enrichment scores inferred from curated gene signatures. We identified
   three major immune clusters, namely, hot-tumor-enriched (HTE),
   cold-tumor-enriched (CTE), and NAT-enriched, across 205 tumors and 85
   NATs (Fig. [248]5a and Supplementary Data [249]5a). HTE and CTE immune
   clusters overlapped considerably between the cell type- and
   pathway-based clusters (Supplementary Fig. [250]5a and Supplementary
   Data [251]5b). HTE tumors were mainly enriched in CD8+ and CD4 + T
   cells, regulatory T (Treg) cells, B cells, natural killer (NK) cells,
   neutrophils, dendritic cells (DCs), monocytes, and macrophages at the
   cell-type level, compared to CTE tumors. In the pathway-based
   clustering results, the HTE cluster showed a greater activation of
   immune-related pathways than the CTE cluster. Because the cell-type
   enrichment score was inferred based on curated gene signatures, we
   confirmed the cell-type specificity of the immune clusters using
   scRNA-seq data from multiple NSCLC studies^[252]23. From the top-300
   DEGs in the HTE (DEG[HTE]) and CTE (DEG[CTE]) clusters, we investigated
   the enrichment of each DEG in scRNA-seq data. The DEG[HTE] set was
   enriched in CD8+ and CD4 + T cells, regulatory T cells, B cells, NK
   cells, neutrophils, DCs, monocytes, and macrophages, whereas the
   DEG[CTE] set was primarily enriched in epithelial cells, including LUAD
   and LSCC tumor cells (Fig. [253]5b, top, and Supplementary
   Data [254]5c). Additionally, the DEGs of each cell type inferred from
   the gene signatures were enriched in the corresponding cell types in
   the scRNA-seq data (Supplementary Fig. [255]5b). The cell-type
   enrichment score was also highly correlated with the level of TILs
   (Fig. [256]5b, bottom, and Supplementary Data [257]5d). These results
   indicate that the single-cell and immunohistochemical analyses achieved
   good immune clustering based on the cell type score.

Fig. 5. Landscapes of immune clusters and cell types across NSCLC subtypes
and cohorts.

   [258]Fig. 5
   [259]Open in a new tab

   a Immune subtyping based on cell type and pathway enrichment scores.
   Cell type-based clustering was performed with 205 tumor and 85 normal
   adjacent to the tumor (NAT) samples, and pathway-based clustering was
   performed using only tumor samples. The tumor-infiltrating lymphocyte
   (TIL) pattern, clinical histology (diagnostics, DX), multiomics
   subtype, tumor stage, and tissue information are described. IC, immune
   cluster. b (top) DEG[HTE] and DEG[CTE] were used to generate the UMAP
   plot of scRNA-seq data. A two-sided t-test was conducted to assess the
   statistical significance of the differences in gene expression. The
   color of each point represents the module score of each cell; higher
   scores are shown in red. UMAP information was obtained from multiple
   NSCLC studies^[260]23. (bottom) The correlations of 10 cell types and
   immune clusters with the pattern of TILs were analyzed. The sizes and
   colors of the circles indicate the statistical significance and
   correlation coefficient of the correlations, respectively. The
   horizontal black dotted line indicates P = 0.05. c Hazard ratios for
   overall survival (OS, left) and relapse-free survival (RFS, right)
   related to various cell types in the cell type-based immune cluster. A
   hazard ratio lower than zero (blue box) indicates that the
   hot-tumor-enriched (HTE) status or a high cell type score was
   associated with prolonged survival. Error bars (gray lines) represent
   mean ± 95% confidence interval (CI). Red text indicates statistical
   significance in the survival analysis by the log-rank Mantel‒Cox test
   (n = 174). d Correlations of the RNA expression, protein expression,
   and protein activity of 10 immunomodulators with immune cluster status
   for our cohort as well as other lung cancer multiomics
   cohorts^[261]10,[262]13. Correlation coefficients and p-values were
   obtained from a generalized linear model (GLM). e Correlations between
   the expression or activity of immunomodulators and the status of driver
   mutations in our cohort and the Satpathy and Gillette cohorts. The top
   associations between immunomodulators and known driver genes are
   described. f The left boxplot shows the RNA and protein expression of
   SLAMF7 in samples (n = 205) with wild-type or mutant SMARCA4 (n = 205);
   the right boxplot shows the RNA and protein expression of SLAMF7 in HTE
   and cold-tumor-enriched (CTE) samples (n = 174). The two-sided t-test
   was performed to test the differences in expression. The box represents
   the 25th and 75th percentiles, the central mark denotes the median, and
   the whiskers extend to the most extreme points within ±1.5 × IQR. g
   (left) Box (top) and balloon (bottom) plots showing the mean expression
   of marker genes of HTE and 10 cell types across the multiomics subtypes
   (n = 174). The marker genes were defined as the top−300 and −30 most
   overexpressed genes in HTE samples and highly cell type-enriched
   samples, respectively. (right) The bar (top) and box (middle and
   bottom) plots show the mutation frequency of SMARCA4 and RNA/protein
   expression of SLAMF7 across multiomics subtypes (n = 174),
   respectively. The Kruskal‒Wallis test was performed to assess the
   differences among the multiomics subtypes. The box represents the 25th
   and 75th percentiles, the central mark denotes the median, and the
   whiskers extend to the most extreme points within ±1.5 × IQR.

   To analyze the prognostic value of the immune landscape in patients
   with NSCLC, each immune cell enrichment score, along with the immune
   cluster, was tested against overall and relapse-free patient survival
   rates (Fig. [263]5c, Supplementary Fig. [264]5c and Supplementary
   Data [265]5e). HTE status was the most favorable factor for patient
   survival, and high enrichment of CD8+ and CD4 + T cells, B cells, and
   NK cells were positively correlated with a good prognosis, in contrast
   to the enrichment of epithelial cells or neutrophils, which had
   negative correlations. None of the immune-related pathway-based
   enrichment scores were significantly associated with patient survival,
   although most showed a positive association with prolonged survival
   (Supplementary Fig. [266]5d). HTE tumors with high enrichment of CD8+
   and CD4 + T cells and immune-related pathways are known to be
   associated with favorable prognoses^[267]43. Additionally, Previous
   studies^[268]10,[269]13 reported that Tregs were enriched in HTE
   tumors, but had an immunosuppressive effect and were associated with
   poor prognosis. We found a concordant trend of Treg enrichment in HTE
   tumors (Fig. [270]5a, b); patients with Treg-enriched HTE had worse
   survival rates than those with low Treg levels (Supplementary
   Fig. [271]5e and Supplementary Data [272]5f).

   Further analysis of the immune clusters was performed based on their
   cell-type composition, which showed significant correlations with
   patient survival. We performed a pathway enrichment analysis comparing
   the HTE and CTE immune clusters using multiomics features
   (Supplementary Fig. [273]5f and Supplementary Data [274]5g). A set of
   immune-related and EMT signaling pathways was enriched in HTE tumors,
   whereas cell cycle-related pathways and glycolysis were enriched in CTE
   tumors. Specifically, some metabolic pathways and oxidative
   phosphorylation were enriched in CTE tumors at the phosphoprotein level
   but were enriched in HTE tumors at the acetylated-protein level. Many
   of these results are consistent with the findings of previous lung
   cancer multiomics studies^[275]10,[276]13.

   The differences between HTE and CTE immune clusters may be better
   elucidated by examining their respective regulators. To identify
   putative regulators of patient immune landscapes, we analyzed RNA and
   protein expression data and inferred protein regulatory activity
   according to the systematical influence of a protein in the
   transcription of relevant targets^[277]44. We performed a regression
   analysis of gene expression or activity and enrichment scores for cell
   type or immune clusters. We found that the majority of
   immunomodulators, which mainly comprised cancer cell ligands and immune
   cell receptors, were positively correlated with HTE tumors and most
   immune cells, but not epithelial cells^[278]45 (Supplementary
   Fig. [279]5g and Supplementary Data [280]5h). Notably, the expression
   and activity of VEGF-A were negatively correlated with HTE tumors,
   which contrasts the patterns observed with other immunomodulators.
   VEGF-A was negatively associated with TILs and has potential
   implications for cancer risk^[281]46,[282]47. Specifically, 10
   immunomodulators showed significant correlations with immune cluster
   status, with a consistent correlation direction not only in our cohort,
   but also in other lung cancer multiomics studies^[283]10,[284]13
   (Fig. [285]5d and Supplementary Data [286]5i).

   We profiled the associations between the immunomodulators and known
   driver mutations to determine the putative mechanism(s) underlying
   changes in immunomodulators. Because Subtype 5 is immunogenic, we only
   considered the most activated or repressed immunomodulators to identify
   the potential regulatory mechanisms that distinguish it from the other
   subtypes. Ultimately, the three immunomodulators showed significant
   correlations with mutations in one of the eight driver genes, with
   consistent patterns observed in other multiomics NSCLC
   cohorts^[287]10,[288]13, although these correlations were not
   statistically significant (Fig. [289]5e and Supplementary
   Data [290]5j). Among these immunomodulators, the SMARCA4 mutation was
   positively correlated with the expression of SLAMF7 at both the RNA and
   protein levels, and positively correlated with HTE status in both our
   study cohort and the independent cohort^[291]10 (Fig. [292]5f,
   Supplementary Fig. [293]5h, and Supplementary Data [294]5k). Based on
   these results, SMARCA4 is a potential regulator of the immunomodulator
   SLAMF7, which is associated with HTE status.

   We further analyzed the distribution of the immune landscape among the
   multiomics subtypes (Fig. [295]5g, left, and Supplementary
   Data [296]5l). The expression of HTE feature genes (DEG[HTE]) was
   highest in Subtype 5, which had immunogenic characteristics. This
   subtype also showed the highest activation of CD8+ and CD4 + T cells,
   Treg cells, B cells, NK cells, neutrophils, and macrophage marker
   genes. Neutrophils and Tregs, which are associated with a poor
   prognosis in all patients and specifically in THE patients,
   respectively (Fig. [297]5c and Supplementary Fig. [298]5e), were also
   found to be enriched in Subtype 5. This enrichment may be linked to the
   intermediate survival characteristics observed in this subtype, despite
   Subtype 5 exhibiting the strongest THE tumor signal. Moreover, the
   highest frequencies of SMARCA4 mutations and SLAMF7 overexpression were
   observed in Subtype 5 (Fig. [299]5g, right). We confirmed that the
   distribution of cell types within the immune cluster and their
   correlation with the multiomics subtype were consistent in the
   integrated cohort combining our cohort with those of recent multiomics
   studies^[300]10,[301]13 (Supplementary Fig. [302]5i and Supplementary
   Data [303]5m). Our results showed that HTE tumors were positively
   associated with better prognosis in patients with NSCLC. These tumors
   were enriched in immunogenic Subtype 5, which was also positively
   associated with mutations in the putative regulator SMARCA4, which
   targets SLAMF7, an immunomodulator.

Multiomics profiling of neoantigens and immune clusters

   Neoantigens are tumor-specific antigens generated by tumor cells and
   are key factors affecting the immune landscape of cancer
   patients^[304]48. Therefore, we predicted neoantigens in our NSCLC
   cohort using a multiomics dataset. Neoantigens derived from somatic
   mutations in the coding regions were predicted using WES data and
   confirmed by MS analysis to determine whether they were expressed as
   peptides of the “confirmed neoantigen candidates.” Furthermore, we
   identified peptides derived from non-coding and non-annotated
   transcripts that were distinct from the canonical neoantigens. Among
   them, we selected “cryptic MHC class I-associated peptides (MAPs),”
   which were described as noncanonical peptides predicted to bind to MHC
   class I molecules in a previous study^[305]49–[306]51. We identified
   “confirmed cryptic MAPs” based on a previously determined expression
   threshold^[307]52. We inferred 85,430 neoantigen candidates and 775
   cryptic MAPs (Supplementary Fig. [308]6a) and annotated the origin of
   the cryptic MAPs based on the matched transcripts (Supplementary
   Fig. [309]6b and Supplementary Data [310]6a). Non-annotated isoforms,
   pseudogenes, and untranslated regions (UTRs) accounted for 90.97% of
   the sources of cryptic MAPs, which is consistent with previous
   studies^[311]53–[312]58.

   We tested the associations of the neoantigen candidates, confirmed
   neoantigen candidates, cryptic MAPs, and confirmed cryptic MAPs with
   patient survival (Fig. [313]6a and Supplementary Data [314]6b).
   Interestingly, only confirmed cryptic MAPs showed a strong positive
   correlation with improved survival, although the number of cryptic MAPs
   was notably low across patients (Supplementary Fig. [315]6c and
   Supplementary Data [316]6b). We also found 12 confirmed cryptic MAPs in
   more than three patients, called them recurrent cryptic MAPs, some of
   which were derived from the same gene of origin (Supplementary
   Fig. [317]6d and Supplementary Data [318]6c). Furthermore, the presence
   of recurrent cryptic MAPs was significantly correlated with prolonged
   survival (Fig. [319]6b and Supplementary Data [320]6b), indicating the
   prognostic value of cryptic MAPs in patients with NSCLC.

Fig. 6. Clinical relevance of neoantigens and cryptic MAPs, and their
associations with multiomics subtypes.

   [321]Fig. 6
   [322]Open in a new tab

   a Survival estimated according to the type of neoantigen and cryptic
   MAP. A lower hazard ratio (blue box) indicates that a high load of
   neoantigens or cryptic MAPs is associated with prolonged overall
   survival, and a high hazard ratio (red box) indicates the opposite.
   Hazard ratios for individual trials and overall effects are given with
   95% CIs. Log (HR) values and their corresponding 95% confidence
   intervals (CIs) are depicted in grey. b Kaplan‒Meier curve showing the
   survival of two groups of patients (n = 204) according to whether they
   did (blue line) or did not have (red line) recurrent cryptic MAPs. The
   p-value was derived by comparing the curves with the log-rank
   Mantel‒Cox test. c Correlations between the number of cryptic MAPs and
   enrichment scores of 10 cell types and the immune cluster. Correlation
   coefficients were calculated by a linear regression model with the
   covariates of sample batches and histological diagnosis. The size of
   the dots indicates the degree of the -log[10]-scaled p-value, and the
   color of the dots represents the strength of the correlation
   coefficient. The bold-lined dots indicate statistical significance. d
   Kaplan‒Meier curve showing the survival patterns of four groups of
   patients (n = 174) stratified by cryptic MAP load and immune cluster.
   The p-value was obtained by comparing curves between the two groups
   with the largest difference in the log-rank Mantel‒Cox test. e
   Enrichment analysis of the four groups described in Fig. [323]4d for
   the multiomics subtypes. The x- and y-axis indicate enrichment and
   statistical significance calculated using a two-sided Fisher’s exact
   test with the Benjamini‒Hochberg adjustment, respectively. The size of
   each dot indicates the level of significance. f Features of patients
   with multiomics Subtype 5 disease who had a low cryptic MAP load with
   an HTE status, activated APM, and activated NF-κB pathway. g
   Kaplan–Meier curve of recurrence-free survival according to treatment
   status (chemotherapy [CTx] or chemoradiation therapy [CRTx]) in
   patients categorized by multiomics subtype. The p-value was derived by
   comparing the curves with the log-rank Mantel‒Cox test.

   We also investigated the association between cryptic MAPs, immune
   clusters, and cell-type enrichment. Cryptic MAPs were positively
   correlated with most immune cell types, including HTE status, although
   some correlations were weak or insignificant (Fig. [324]6c and
   Supplementary Data [325]6b). In our cohort, patient prognosis was
   evaluated after stratification according to immune cluster criteria and
   cryptic MAP load. Patients with high cryptic MAP load and HTE status
   showed the longest survival, whereas those with low cryptic MAP load
   and CTE status showed the worst survival (Fig. [326]6d and
   Supplementary Data [327]6b).

   We determined the association between the integrated immune landscape
   and multiomics subtypes by evaluating the distributions of the
   multiomics subtypes based on the combined status of the cryptic MAP
   load and immune clusters. Multiomics Subtype 5 was most enriched in the
   low cryptic MAP load and HTE groups (Fig. [328]6e and Supplementary
   Data [329]6d). Interestingly, a similar survival trend and multiomics
   subtype distribution were obtained when the immune cluster was replaced
   with antigen processing, presentation machinery (APM), and TIL pattern
   (Supplementary Fig. [330]6e–g and Supplementary Data [331]6e, f), which
   was proposed as a critical factor for evaluating the immune-activating
   potential with neoantigens^[332]14. Notably, patients with Subtype 5,
   who showed moderate survival, had a low cryptic MAP load with HTE and
   activated APM. This is consistent with the large proportion of TILs
   observed in single-cell and histopathological analyses
   (Fig. [333]3a–k). Patients with Subtype 5 also showed high activity of
   the TNF-α-pathway via NF-κB in addition to APM. Consequently, the
   moderate survival associated with this subtype appeared to be linked to
   a low load of cryptic MAPs and high enrichment of immunosuppressive
   Tregs and neutrophils despite having an active immune landscape with
   HTE status (Fig. [334]6f and Supplementary Data [335]1a). Remarkably,
   the use of adjuvant chemotherapy or chemoradiation therapy
   substantially enhanced the survival of patients with Subtype 5 tumors
   (Fig. [336]6g). In contrast, no substantial improvement in survival was
   observed in patients with other subtypes who underwent adjuvant
   chemotherapy or chemoradiation therapy. Subtype 5 also demonstrated the
   most favorable prognosis compared to other subtypes in the overall
   population treated with adjuvant chemotherapy, despite the marginal
   statistical significance (data not shown). This underscores that
   subtype 5 could be associated with a clinical benefit from adjuvant
   chemotherapy. These results imply that understanding patient prognosis
   and multiomics subtypes requires a multifactorial consideration of the
   TIME. Furthermore, the distribution of HLA alleles before and after the
   binding prediction was consistent with previous reports on the Korean
   population or lung cancer studies^[337]59–[338]62, thereby
   corroborating the reliability of our results (Supplementary
   Fig. [339]6h–j).

Discussion

   In this study, we conducted a comprehensive multiomics analysis of 229
   patients from a Korean NSCLC cohort and identified five NSCLC subtypes
   enriched for WGD, oncogenes, metastasis, and the immune
   microenvironment. The phosphoproteome dataset was the most informative
   for subtype identification, contributing 80% (911 out of 1,134) of the
   features, while global proteome and acetylome data also played crucial
   roles in decoding signaling pathways across the identified subtypes. By
   extending our analysis to integrate multiomics data from 462 patients
   with NSCLC, we validated our subtype classification, confirming its
   alignment with previously established multiomics subtypes. Utilizing an
   extensive single-institute clinical dataset allowed us to delineate
   detailed histopathological beyond histologic subtype and clinical
   relevance of our multiomics subtypes, such as prevalent alterations of
   targetable oncogenic drivers, histologic grade of LUAD, metastatic
   potential, and tumor immune response, and finally complement previous
   findings on the underlying biology of NSCLC.

   We identified a NSCLC subtype, Subtype 4, which was associated with a
   high frequency of metastasis and poor outcomes, independent of the
   NSCLC histological type. This subtype contained a large proportion of
   desmoplastic stromal components, suggesting that predominant tumor
   invasion and metastasis contribute to its aggressive behavior.
   Increased phosphorylation in hypoxia and activation of the PI3K-Akt
   signaling pathway were key characteristics of this subtype. In
   particular, the upregulation of SLK and LRRFIP1 phosphorylation
   predicted poor survival outcomes and has been consistently observed in
   other NSCLC multiomics studies. Further research on these events may
   identify targets for therapeutic intervention and enhance our
   understanding of poor outcomes in NSCLC.

   Characterizing NSCLC according to multiomics subtypes has important
   implications for personalized treatment strategies. Our study revealed
   that Subtype 3, characterized by high levels of chromosomal instability
   and XPO1 expression, represents a highly proliferative NSCLC subtype
   with co-occurring TP53 and cell cycle gene amplification.
   Patient-derived organoid experiments indicated that selinexor, a
   targeted inhibitor, could be effective against Subtype 3 with WGD. In
   Subtype 1, WGD only occurred in a limited number of patients with LUAD
   (51%, 28/55 patients) and was enriched for co-occurring mutations in
   EGFR and tumor suppressor genes. However, XPO1 expression was not
   remarkably increased and selinexor did not show strong efficacy in this
   subtype, suggesting potential subtype-specific differences in XPO1
   expression and therapeutic response to XPO1 inhibition. These findings
   suggest that WGD events are associated with tumor evolution via genomic
   alterations specific to the subtype, even in WGD-predominant
   subtype^[340]7,[341]63.

   We demonstrated that Subtype 5, enriched in the inflammatory tumor
   microenvironment, exhibited extraordinary improvements in survival with
   conventional adjuvant chemotherapy. This finding implies that the
   selection of postoperative adjuvant treatment for NSCLC could be
   helpful only for the case with high tumor immune response.
   Additionally, patients within this subtype could potentially benefit
   from immune checkpoint inhibitor therapy. Identifying suitable patients
   for coventional adjuvant chemotherapy could be facilitated by various
   biomarkers indicating a positive tumor immune response, including PD-L1
   immunostaining, T-cell marker presence, and the histological nature of
   the tumor immune microenvironment. In contrast, adjuvant chemotherapy
   did not significantly alter the survival of node-positive patients with
   Subtypes 1 or 2, which are enriched in driver oncogene alterations.
   Given the minimal survival improvement with cytotoxic chemotherapy in
   EGFR-mutant LUAD observed in prior trials^[342]64,[343]65, targeted
   therapies aimed at driver mutations may offer better outcomes for these
   patients.

   Multiomics-based profiling has the potential to enhance our
   understanding of the TIME. The HTE signature exhibited significant
   activation in Subtype 2, but was repressed in Subtype 4, which is
   inconsistent with our histopathological observations. This highlights
   the value of multiomics profiling as a complementary tool to
   histopathological examination in the functional characterization of the
   tumor microenvironment. We identified SLAMF7 as an immunomodulator
   significantly associated with HTE and SMARCA4 mutations in Subtype 5.
   Given the efficacy of immune checkpoint inhibitors in NSCLC patients
   with SMARCA4 mutations^[344]66, SLAMF7, and SMARCA4 are promising
   molecular targets for enhancing cancer immunotherapy. Our multiomics
   analyses also revealed cryptic MAPs derived from non-coding and
   non-annotated transcripts, which were subsequently confirmed using
   proteomics. Unlike neoantigen candidates, cryptic MAPs were
   significantly correlated with favorable survival and immune cell
   enrichment. Thus, experimental validation of MHC-binding prediction
   using immunopeptidomics for recurrent cryptic MAPs could facilitate the
   development of a cancer vaccine.

   Our study provides a comprehensive profile of the multiomics subtypes
   of NSCLC, but there are several limitations to consider. First, we used
   exome sequencing to infer whole-genome doubling, which may be less
   accurate than whole-genome sequencing. Ethnic differences may also be
   present in certain subtypes, particularly in Subtype 1 for EGFR female
   patients, although some subtypes were consistently found in both our
   cohort and the CPTAC cohorts. Additionally, the confirmation of cryptic
   MAPs by proteomics is limited owing to the experimental availability of
   normal samples. Further studies using larger normal sample cohorts are
   needed to confirm these findings. A prospective cohort study is also
   needed to test the effectiveness of drugs, such as selinexor, in the
   various subtypes. Finally, more comprehensive studies are needed to
   understand the effectiveness of selinexor in treating the WGD subtype.

Methods

Human subjects with investigation of clinicopathologic features

   A total of 229 samples, self-reported as Korean ethnicity, were
   histologically defined as NSCLC and selected for this study. Tumors and
   normal tissues adjacent to the tumor (NAT) were harvested under the
   Clinical Proteomic Tumor Analysis Consortium (CPTAC) guidelines from
   Asan Medical Center, Korea, with the approval of the Institutional
   Review Board of Asan Medical Center (Approval number: 2019-1210). In
   brief, we retrieved the 408 NSCLC cases whose cold ischemic time of
   fresh frozen tumor and NAT tissue sample was less than 15 minutes from
   the NSCLC cases of the bioresource center of Asan Medical Center
   deposited from January 2010 to March 2019. To investigate the impact of
   proteogenomic findings on post-operative therapy and metastasis, we
   preferentially selected 137 NSCLC patients showing lymph node
   metastasis at the pathologic examination of resection specimens. We
   further included 113 cases with absent lymph node metastasis at the
   time of surgical resection. With the further exclusion of 21 cases
   showing inadequate nucleic acid quality metrics, 229 patients were
   finally enrolled in the study. The clinical information of the enrolled
   patients, including age, gender, smoking history, adjuvant therapy,
   presence of post-operative recurrence, and treatment histories, was
   reviewed and documented by the thoracic oncologists (W.J.J, C.M.C., and
   J.C.L.). All the histopathologic slides of the resection specimens
   removed from the enrolled patients were independently reviewed by two
   thoracic pathologists (H.S.H. and S.J.J.) and documented for pathologic
   findings such as pathologic diagnosis, grade, tumor size, pleural
   invasion, histologic pattern (for non-mucinous adenocarcinoma,
   lymphovascular invasion, spreading through alveolar space, and lymph
   node metastasis. The cases with discrepancy between the independent
   assessments were discussed in consensus meetings. Based on these
   findings, all enrolled patients were re-staged according to the 8th
   lung cancer TNM stage^[345]67. In addition, pattern of
   tumor-infiltrating lymphocytes (TIL) was reviewed by the pathologists
   and classified according to the classification by Saltz et al.^[346]68.

Whole exome sequencing

   Genomic DNA was isolated from a FFPE (formalin-fixed paraffin embedded)
   tumor and normal samples (NATs or blood buffy coat^[347]69 if NATs are
   not available). We generated whole exome sequencing (WES) libraries
   using SureSelect V6-Post (Agilent, CA, USA). Pooled libraries were run
   on the Illumina NovaSeq to obtain an average of 300x depth per tumor
   library and 100x depth per NAT and blood buffy coat library. The raw
   Illumina BCL (base calls) binary files were converted into Fastq files
   using the Illumina package bcl2fastq. Sequencing read data was checked
   for quality and adaptor/overrepresented sequence (ORS) by FastQC
   (v0.11.9). There was no ORS that could be a quality problem, and since
   the adapter sequence was less than 1% in all samples, the additional
   trimming process was skipped. Fastq files were aligned to the human
   reference GENCODE GRCh38.p13 v32 primary assembly genome by BWA
   (v0.7.17). Among the three algorithms of BWA, the latest BWA-MEM was
   selected as it is suitable for Illumina sequences ranging from 70 bp to
   1 mbp with fast and accurate performance. After alignment, we sorted
   the BWA process output by coordinates via Picard SortSam and checked
   duplicate reads using Picard MarkDuplicates (v2.23.1). Two types of
   duplicated reads, optical duplicates (incorrectly detected as multiple
   clusters by the optical sensor of the sequencing machine) and library
   duplicates (created during PCR library preparation), were annotated for
   downstream analyses. For all tools requiring interval files, we used
   V6_S07604514_hs_hg38_S07604514_Covered.bed provided by Agilent. Because
   of systematic errors from the sequencing machine, base quality scores
   were recalibrated by the GATK4 BQSR algorithm (v4.1.8.0). The algorithm
   calculates the average Phred score by assuming the mismatch base is an
   error by matching known variants with the data base. We used the known
   variants VCF from dbSNP
   (broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf,
   broad_hg38_v0_Homo_sapiens_assembly38.known_indels.vcf.gz) and the
   Mills and 1000 G project
   (broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) from
   the GATK4 bucket
   ([348]https://console.cloud.google.com/storage/browser/genomics-public-
   data/resources/broad/hg38/v0/). Then, the empirical Phred score is
   calculated using the average Phred score and then recalibration is
   performed based on the empirical Phred score.

Variant calling

   We called somatic variants from the recalibrated BAMs using GATK4
   Mutect2 (v4.1.8.0). In the Mutect2 pipeline,
   somatic-hg38_1000g_pon.hg38.vcf.gz data was used as panel-of-normals
   and somatic-hg38_af-only-gnomad.hg38.vcf.gz data was used for germline
   variants removal. We performed tumor with paired normal mode to exclude
   germline variants. As one tumor sample had no matching normal sample,
   the tumor only mode was performed. After calling somatic variants from
   GATK4 Mutect2, a read support reference for the well-known variant
   sites was created from the tumor recalibrated BAMs using GATK4
   GetPileupSummaries. The tool requires a common somatic mutation sites
   VCF, so we used the somatic-hg38_small_exac_common_3.hg38.vcf.gz file
   created from the gnomAD resource from the GATK4 bucket
   ([349]https://console.cloud.google.com/storage/browser/genomics-public-
   data/resources/broad/hg38/v0/). The GetPileupSummaries table was used
   in the GATK4 CalculateContamination to calculate the fraction of reads
   deriving from cross-sample contamination. The calculated cross-sample
   contamination data were saved as a table for each sample, and tumor
   segmentation data by minor allele fraction were additionally collected.
   The raw output of Mutect2 was converted into filtered VCF using
   cross-sample contamination table and tumor segmentation data by GATK4
   FilterMutectCalls. The VCFs generated from GATK4 FilterMutectCalls were
   converted to annotated VCF and MAF files via GATK4 funcotator. We used
   GATK4 4.2.0.0 version solely for the annotation step to use the latest
   annotation data source for GATK4 funcotator. The latest pre-packaged
   data sources (funcotator_dataSources.v1.7.20200521 s) was downloaded
   from the GATK4 bucket (gs://broad-public-datasets/funcotator/). All
   annotations were performed in CANONICAL transcript-selection-mode after
   choosing a custom transcript list. Finally, the annotated MAF files
   were merged into one using maftools v2.8.05^[350]70.

   Germline variants were called from the recalibrated BAMs using GATK4
   HaplotypeCaller (v4.1.8.0). The reference confidence scores were
   confirmed in GVCF mode using the option -ERC GVCF, which was a
   reference model emitted with condensed non-variant blocks. For variant
   calls, -G StandardAnnotation and -G AS_StandardAnnotation annotation
   options were applied. All single sample GVCFs from GATK4
   HaplotypeCaller were imported into GenomicsDB by GATK4 GenomicsDBImport
   (v4.1.8.0) for joint genotyping. Joint genotyping was performed using
   GATK4 GenotypeGVCFs (v4.1.8.0) from the constructed GenomicsDB. The
   produced joint VCF was recalibrated in variant quality by two steps.
   The first step was the recalibration of the SNPs. After calculating the
   exome specific recalibration score through GATK4 VariantRecalibrator
   (v.4.1.8.0), it was applied to the joint VCF using GATK4 ApplyVQSR
   (v.4.1.8.0). We used the training sets from the HapMap project
   (broad_hg38_v0_hapmap_3.3.hg38.vcf.gz), Mills and 1000 G project
   (broad_hg38_v0_1000G_omni2.5.hg38.vcf.gz,
   broad_hg38_v0_1000G_phase1.snps.high_confidence.hg38.vcf.gz) and dbSNP
   (broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf) from GATK4 bucket
   ([351]https://console.cloud.google.com/storage/browser/genomics-public-
   data/resources/broad/hg38/v0/). The second step was the recalibration
   of indels. The same tools in the first step were used and the training
   sets from Mills and 1000 Genome Project
   (broad_hg38_v0_Mills_and_1000G_gold_standard.indels.hg38.vcf.gz) and
   the dbSNP genotyping calls
   (broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf). The recalibrated
   joint VCF was converted to annotated VCF via the GATK4 funcotator. The
   4.2.0.0 version of GATK4 was again used at the annotation step to
   utilize the latest annotation data source for GATK4 funcotator. The
   latest pre-packaged data sources
   (funcotator_dataSources.v1.7.20200521 g) was downloaded from the GATK4
   bucket (gs://broad-public-datasets/funcotator/). All annotations were
   performed in CANONICAL transcript-selection-mode.

Identification of copy number alterations

   DNA somatic copy number variations (CNVs) were detected using CNVkit
   v0.9.8^[352]71. We labeled the copy number status with the following
   criteria: genes with an absolute copy number greater than or equal to 4
   were labeled as “amplification”, and genes greater than or equal to 2.5
   and less than 4 were labeled as “gain”. Similarly, genes with an
   absolute copy number from 0.5 to 1.5 were labeled “heterozygous
   deletion”, and those less than 0.5 were labeled “homozygous deletion”.

Copy number signature analysis and WGD detection

   For copy number signature analysis and WGD detection, we used FACETS
   v0.16.0^[353]72 to identify allele-specific copy number information.
   Preprocessed paired tumor-normal BAM files and a VCF file of common and
   germline polymorphic sites downloaded from
   [354]https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
   were used as the input for FACETS. Two samples were dropped out due to
   quality problems; therefore, 228 samples in total were used for copy
   number signature analysis and WGD detection. We defined samples as
   ‘WGD-positive’ if greater than 50% of their autosomal genome had an
   absolute copy number greater than or equal to two.

   Mutational signature analysis was performed using the COSMIC signature
   database v3^[355]73 and R package Sigminer v2.1.5^[356]74, as
   previously described^[357]75. Briefly, three copy number features,
   including total copy number, segment length, and loss-of-heterozygosity
   state were extracted and classified into 48 components that categorize
   the continuous values regarding the value range and biological
   significance. To decide the number of signature groups (or a
   factorization rank), non-negative matrix factorization (NMF) was
   performed using a tumor-by-component matrix with 50 runs checking 2 to
   12 ranks. Based on the cophenetic score plot, we determined to use rank
   seven for copy number variants. Each signature was nominated by the
   COSMIC signature of the highest cosine similarity. Then, by performing
   hierarchical clustering, we assigned the samples to one of the
   signatures based on the consensus matrix.

RNA-seq data generation and preprocessing

   Total mRNA-seq libraries were prepared using Library-TruSeq Stranded
   Total RNA with Ribo-Zero H/M/R_Gold. Pooled libraries were run on the
   Illumina NovaSeq6000 to generate an average of 200 million reads per
   library with 100-bp pair-end. The raw Illumina BCL binary files were
   converted into Fastq files using the Illumina package bcl2fastq. The
   sequenced read data was checked for quality and adaptor/overrepresented
   sequence (ORS) by FastQC (v0.11.9). There was no ORS that could be a
   quality problem, and since the adapter sequence was less than 1% in all
   samples, the additional trimming process was skipped. Fastq files were
   aligned to the human reference GENCODE GRCh38.p13 v32 primary assembly
   genome using STAR (v2.7.3a). After alignment, we sorted the output by
   coordinates via Picard SortSam and checked duplicate reads using Picard
   MarkDuplicates (v2.23.1). The reads that contain Ns in the cigar string
   were split using GATK4 SplitNCigarReads. Regarding systematic errors
   from the sequencing machine, base quality scores were recalibrated by
   the GATK4 BQSR algorithm (v4.1.8.0). The aligned BAMs from STAR were
   sorted by coordinates using Samtools Sort (v1.10). The Salmon algorithm
   (v1.4.0) was applied to quantify transcript-level expression, and raw
   read counts were produced. The output files (quant.sf) produced by
   Salmon, quantified transcript-level estimates, were imported to R using
   tximport and converted to a gene-level expression matrix. Then, we
   estimated the size factors using the “median ratio method” with the
   “estimateSizeFactors” function supported by DESeq2 R packages. With
   estimated size factors, we normalized the gene-level read count matrix
   using the “counts” function in DESeq2 R packages^[358]76, which divides
   the read counts by the size factors.

Isoform expression analysis

   Isoform scale expression was quantified from transcriptome data using a
   pipeline including StringTie^[359]77. STAR-aligned BAM without going
   through GATK4 SplitNCigarReads in RNA preprocessing was used.
   Transcripts of each sample were assembled using StringTie
   v2.1.7^[360]78. The first analyzed outputs were combined into a single
   assembly using merge command of StringTie (stringtie –merge), creating
   a single merged gtf containing the transcript of the entire sample.
   After that, a secondary analysis was conducted to calculate the
   expression in isoform level of each sample based on the merged
   transcript assembly using the StringTie Ballgown command (stringtie
   -eB). The novel isoform in the StringTie outputs were compared and
   annotated using gffcompare v0.11.6^[361]79. Processed isoform gtfs are
   made into a single count matrix using predDE.py, a python script
   provided by StringTie.

   The abundance or exon usage change between normal and cancer tissue was
   analyzed and visualized using IsoformSwitchAnalyzeR pipeline.
   IsoformSwitchAnalyzeR v1.17.4^[362]80 pipeline includes DRIMseq^[363]81
   to find isoform switch by condition and extract the sequence of mRNA or
   amino acid produced from the isoform. We analyzed predicted isoforms or
   ORF lists with additional tools as CPAT^[364]82, PFAM^[365]83,[366]84,
   SignalP^[367]85 and IUPred2A^[368]86.

Fusion gene analysis

   To detect the fusion genes, RNA-seq fastq was aligned, and fusion
   calling and filtering were performed. First, RNA-seq reads were mapped
   to the GENCODE GRCh38.p13 v32 primary assembly genome using STAR
   aligner v2.7.3a^[369]87. Unlike previous STAR mapping, ‘–chimOutType
   WithinBAM’ was included in the STAR output to include chimeric read to
   be suitable for use in arriba fusion calling. To increase sensitivity,
   the parameters were adjusted as recommended by the author as follows:
     * --chimSegmentMin 10
     * --chimOutType WithinBAM SoftClip
     * --chimJunctionOverhangMin 10
     * --chimScoreMin 1
     * --chimScoreDropMax 30
     * --chimScoreJunctionNonGTAG 0
     * --chimScoreSeparation 1
     * --alignSJstitchMismatchNmax 5 -1 5 5
     * --chimSegmentReadGapMax 3

   We found fusion gene by applying Arriba v1.2.0^[370]88 to the STAR
   mapped BAM file. Mismatches were discovered by comparing chimeric reads
   with reference genome assembly fasta and annotation gtf. Additionally,
   fusion genes had bad quality of frequently found healthy tissues were
   excluded from the analysis using the hg38 fusion blacklist provided by
   Arriba. Arriba outputs were labeled with sample name and tissue and
   concatenated with one table.

   From the Arriba output, fusion genes with confidence of “low” was
   removed. At this time, the fusion genes appearing in normal tissues
   were collected and used as a blacklist for tumor tissue fusion genes.
   If the fusion genes found in tumor tissue were included in this
   blacklist, they were excluded from the analysis. Furthermore, fusion
   genes including known cancer-related genes (ALK, ROS1, RET, and
   PTK2)^[371]89 were annotated as “known fusion” and fusion genes found
   across several tumor samples as “recurrent fusion”. They were used for
   downstream analysis. After filtering, we visualized the structure and
   protein domain of fusion genes using the R script (”draw_fusion.R”)
   provided by arriba.

Protein extraction and tryptic digestion

   For in-depth proteomic experiments, fifty milligrams of cryopulverized
   human NSCLC and NAT samples were homogenized in lysis buffer at a ratio
   of about 300 μL lysis buffer for every tissue. The lysis buffer
   consisted of 5% SDS, 50 mM TEAB (pH 7.55, Thermo Fisher Scientific,
   USA, 90114), protease inhibitor cocktail (1:100; Thermo Fisher
   Scientific, USA, 78430), 1 EA of PhosSTOP (Roche, Swiss, 4906845001),
   and 20 μM PUGNAc (Sigma-Aldrich, USA, A7229). Tissue lysis was
   performed with a couple of probe sonications using the Digital Sonifier
   SFX 550(Branson, USA). Sonication time was set at 30 s with a cycle of
   on-time 5 s and an interval of 3 s at 28%. Lysates were centrifuged at
   14,500 g at 4 °C for 5 min, followed by measuring protein concentration
   using the BCA assay (Thermo Fisher Scientific, USA, 23225), and kept at
   -80 °C until used for analysis.

   Protein lysates were reduced at 95 °C for 10 minutes with 20 mM
   1,4-dithiothreitol (Roche, Swiss, 10708984001), followed by alkylation
   with 40 mM iodoacetamide (Sigma-Aldrich, USA, I6125) in the dark at RT
   for 30 minutes. 12% phosphoric acid (Sigma-Aldrich, USA, 695017) of
   1/10 of the sample volume and 7 multiple volumes of S-trap binding
   buffer (100 mM TEAB in 90% MeOH) were added sequentially for formation
   of colloidal status. The solution was loaded on an S-Trap™ spin column
   (Protifi, USA) and centrifuged to 4000 g for 30 s. The captured protein
   was washed three times at 4000 g for 30 s with 400 μL of binding
   buffer. For tryptic digestion, 125 μL of digestion buffer (50 mM TEAB)
   containing Pierce Trypsin/Lys-C Mix (Thermo Fisher Scientific, USA,
   [372]A41007) (w/w ratio of 1:25) was added and incubated at 37 °C for
   6 h. The digested sample was sequentially centrifuged with 80 μL of
   digestion buffer, 80 μL of 0.2% formic acid (Honeywell, USA, 94318) at
   1000 g for 60 s, and 80 μL of 0.2% formic acid in 50% acetonitrile
   (ACN) at 4000 g for 60 s. Eluted peptides were dried and stored at
   −80 °C until the next process. Samples were desalted using a C18 spin
   column (Harvard Apparatus, USA) and dried. A205 application of NanoDrop
   One (Thermo Fisher Scientific, USA) was employed for peptide
   concentration measurements. A baseline was established using 1 µL of
   HPLC water, and dried samples were each diluted with HPLC water and
   measured at 1 µL. For the common reference sample, 10 μg aliquots were
   taken from the sample being analyzed, pooled together, and used for
   each TMT batch.

Construction of the common reference pool

   In-depth proteomic analyses for this study were organized as TMTpro 16
   plex experiments. For comparative quantification between all samples
   across experiments, a common reference (CR) sample was involved at the
   134 N channel in each 16-plex. The CR is a protein mixture that
   contains all samples analyzed in the TMT experiments. All subsequent
   procedures, including digestion, were performed along with other
   individual samples to minimize variation.

TMTpro 16-plex labeling

   Peptides, 250 μg per sample (based on peptide level quantification with
   NanoDrop One), were labeled with 16-plex TMT reagents (Thermo Fisher
   Scientific, USA, [373]A44520) according to the manufacturer’s protocol.
   For each peptide aliquot of an individual sample, 1 mg of labeling
   reagent was used. Dissolve each sample in 50 μL of 100 mM TEAB (pH
   8.5), add 20 μL of labeling reagent dissolved in anhydrous
   acetonitrile, and incubate for 1 h with shaking. 1.8 μg of labeled
   peptides from each channel were taken out and subjected to LC-MS/MS
   analysis to confirm labeling efficiency before pooling. Label
   efficiency criteria were set as having a minimum of 95% fully labeled
   MS/MS spectra in each sample. To quench the reaction, 5 μL of 5%
   hydroxylamine was added and incubated for 15 minutes. Samples were
   pooled in each 16-plex experiment and sequentially desalted with
   Sep-Pak C18 3 cc Vac Cartridge (Waters, USA) and dried.

Mid pH reverse-phase liquid chromatography fractionation

   To minimize sample complexity, samples were fractionated by mid-pH
   reversed phase (RP) separation using Shimadzu LC20A (Shimadzu, Japan)
   with an analytical column (XBridge Peptide BEH C18 Column; 300 Å, 5 µm,
   4.6 × 250 mm, Waters, USA) and a guard column (SecurityGuard cartridge
   C18, 4 ×3.0 mm, Phenomenex, USA). We performed the peptide
   fractionation at a flow rate of 0.5 mL/min. Mobile phases A and B were
   10 mM TEAB and 10 mM TEAB in 90% ACN, and the sample was dissolved in
   0.1% formic acid in 90 μL. The LC gradient of mobile phase B was 5% in
   8 min, 40% in 65 min, 44% in 69 min, 60% from 74 min to 88 min, and 5%
   in 90 min. We collected a total of 96 fractions from 8 min, pooling
   them into 24 non-consecutive fractions every 0.91 min. We then
   concatenated 95% of each fraction into 12 non-consecutive fractions for
   post-translational modifications (PTMs) analysis.

Phosphopeptide enrichment

   For the phosphopeptide enrichment, immobilized metal affinity
   chromatography (IMAC) was used. 300 μL of Ni-NTA agarose bead slurry
   (QIAGEN, Germany, 30410) was resuspended in the tube. The slurry was
   spun down for 1 minute, and the supernatant was removed. The bead was
   washed three times with 1 mL of HPLC water and then incubated with
   1.2 mL of 100 mM EDTA (Sigma-Aldrich, USA, E7889) for 30 minutes by
   end-over-end turning at RT. After three washes with 1 mL of HPLC water,
   the beads were incubated with 1.2 mL of 10 mM FeCl[3] for 30 minutes
   with end-over-end turning at room temperature. 12 fractions for PTM
   analysis are reconstituted in 113 μL of 0.1% TFA in 50% ACN, and then
   337 μL of 0.1% TFA in 95% ACN is added sequentially. Beads were
   resuspended in 140 μL of 1:1:1 ACN:MeOH:0.01% acetic acid solution and
   10 μL of beads were aliquoted into 12 tubes for each fraction. We
   resuspended the beads in the sample solution to bind the
   phosphopeptides, and then gently mixed them for 30 minutes at room
   temperature. The supernatant was collected separately for the next
   acetyl peptide enrichment step. Beads coupled with phosphopeptides were
   dissolved in 180 μL of 0.1% TFA in 80% ACN and desalted using a C18
   stage-tip to elute for LC-MS/MS.

Acetylpeptide enrichment

   For the acetylpeptide enrichment, the PTMScan® Acetyl-Lysine Motif
   [Ac-K] Kit (Cell Signaling Technology, USA, 13416) was used. IMAC
   eluents were concatenated into four fractions and dried. Peptides were
   dissolved in 1.4 mL of 1 x IAP buffer, in which 10 x IAP buffer (5.78 g
   MOPS-NaOH, 1.461 NaCl, 0.8 g dibasic sodium phosphate, and 0.08 g
   monobasic sodium phosphate) were adjusted to pH with acetic acid
   ( ~ 1.4 to 1.6 mL). Agarose beads bound to the acetyl-lysine motif
   antibody were pre-washed a total of 4 times using 1 x IAP buffer under
   ice and split into 2 tubes with equal volume. The peptides were
   transferred to a tube and incubated at 4 °C for 3 hours with
   end-over-end turning. Each tube was centrifuged (2,000 g, 30 sec,
   4 °C), and the supernatant was separated, followed by washing the beads
   twice with 1 mL of ice-cold PBS and three times with chilled HPLC
   water. 100 μL of 0.15% TFA was added to beads coupled with acetyl
   peptides and then incubated at RT for 10 minutes, gently mixed every
   2-3 minutes, and then centrifuged to elute. We desalted the eluted
   acetyl peptides using a C18 stage-tip (IMAC procedure) and dried them
   after a total of two elution steps.

LC-MS/MS for proteomics analyses

   For global proteomic analyses, the Ultimate 3000 RSLC nano system
   (Thermo Fisher Scientific, USA) coupled with the Q Exactive HF-X hybrid
   quadrupole-Orbitrap mass spectrometer (Thermo Fisher Scientific, USA)
   was used. Trap column (Acclaim™ PepMap™ 100 C18 LC Column, C18, 100 um
   x 2 cm, 5 μm, Thermo Fisher Scientific, USA) and analytical column
   (EASY-Spray™ LC Columns, C18, 75 um x 50 cm, 2 μm, Thermo Fisher
   Scientific, USA), which were heated to 50 °C to prevent
   over-pressuring, were equipped with UHPLC separation. Mobile phase flow
   rate was 0.3 μL/min, and solvents A and B were 97% water, 3% ACN, 0.1%
   formic acid, and 90% ACN, 0.1% formic acid, respectively. The gradient
   profile of solvent B was 2% in 12 min, 8% in 16 min, 25% in 140 min,
   35% in 150 min, 85% from 155 min to 163 min, and 2% from 165 min to
   185 min. Data-dependent acquisition was performed at a spray voltage of
   1.5 kV. MS1 spectra were measured with a resolution of 120,000 and a
   mass range of 350 to 1500 m/z. The AGC target of 3e6, maximum injection
   time of 50 ms, and isolation window of 0.7 m/z were set. Top 15 most
   abundant precursors per cycle were selected to trigger MS/MS. MS2
   spectra were measured with a resolution of 45,000, 1e5 of the AGC
   target, 120 ms of maximum injection time, 34% of collision energy, and
   110 of the fixed first mass (m/z). The precursor charge state was 2-5,
   the intensity threshold was 1e4 and the dynamic exclusion time was
   45 s.

   For PTM analysis, the Ultimate 3000 RSLC nano system (Thermo Fisher
   Scientific, USA) coupled with the Exploris 480 orbitrap mass
   spectrometer (Thermo Fisher Scientific, USA) was used. Trap column
   (Acclaim™ PepMap™ 100 C18 LC Column, C18, 100 um x 2 cm, 5 μm, Thermo
   Fisher Scientific, USA) and analytical column (EASY-Spray™ LC Columns,
   C18, 75 um x 50 cm, 2 μm, Thermo Fisher Scientific, USA) were equipped
   with UHPLC separation. The mobile phase flow rate was 0.3 μL/min, and
   solvents A and B were 0.1 % formic acid in water and 0.1 % formic acid
   in ACN, respectively. The LC gradient of solvent B was 2% in 14 min, 4%
   in 17 min, 16% in 120 min, 25% in 145 min, 85% from 150 min to 158 min,
   and 4% from 160 min to 185 min. Data-dependent acquisition was
   performed at a spray voltage of 2.1 kV and the cycle time was set to
   3 sec. MS1 spectra were measured with a resolution of 120,000 and a
   mass range of 350 to 1500 m/z. The normalized AGC target (%) was 300,
   and the maximum injection time was 50 ms. MS2 spectra were measured
   with a resolution of 45,000, 7.5 e4 of AGC target, 120 ms of maximum
   injection time, 38% of collision energy, and 0.7 m/z of isolation
   window. Peptides were selected for a 2-6 precursor charge state with an
   intensity threshold of 1e4. Peptides that triggered MS/MS scans were
   dynamically excluded from further MS/MS scans for 45 sec, with a +/- 10
   ppm mass tolerance.

The database search strategy

   In addition to canonical peptides, to reliably identify variant,
   post-translational modified, and novel peptides in global proteomics,
   we used a multi-stage search strategy: 1) the identification of
   canonical peptides (primary database search), 2) the identification of
   variant and modified peptides by considering 2355 modifications in
   Unimod, and 3) the identification of novel peptides. For each stage,
   the FDR was calculated separately using a target-decoy strategy, and
   identifications were obtained at 1% FDR. Only unidentified spectra from
   the previous stage were subjected to the subsequent stage. Details of
   each stage are described in the following sections: 1) “The Primary
   Database Search”, 2) “Identification of Variant and Modified Peptides”,
   and 3) “Identification of Novel Peptides”.

The primary database search

   All MS raw files were analyzed using Proteome Discoverer v2.4 with the
   SequestHT search engine^[374]90 against the UniProt human protein
   database v2021.01 (97,795 entries), combined with 179 common
   contaminant proteins. The search parameters were set to 10 ppm for
   precursor mass tolerance, 0.02 Da for fragment mass tolerance,
   fully-tryptic enzyme specificity allowing for up to 2 missed cleavages,
   carbamidomethylation ( + 57.021 Da) on Cys residues and TMT
   modification ( + 304.207 Da) on the peptide N terminus and Lys residues
   for static modifications, and oxidation ( + 15.995 Da) on Met residues,
   Met-loss (-131.040 Da) on protein N-terminal Met residues and
   deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic
   modifications. The minimum length of a peptide was set at six residues.

   For phosphoproteome and acetylproteome, the search parameters
   additionally included phosphorylation ( + 79.966 Da) on Ser, Thr, and
   Tyr residues and acetylation ( + 42.016 Da) on the protein N terminus
   and Lys residues for dynamic modifications. The TMT modifications were
   set as dynamic modifications for acetylproteome, to prevent
   co-assignment of TMT and acetylation. All peptide spectrum matches
   (PSMs) were subsequently rescored by Percolator^[375]91 and validated
   at an estimated false discovery rate (FDR) of 1%. For phosphosite and
   acetylsite localization, ptmRS^[376]92 was used, and modified sites
   with ptmRS probability greater than 0.95 were regarded as confident.

Identification of variant and modified peptides

   Using CustomProDB^[377]93, each patient’s germline and somatic
   mutations were used to generate variant protein sequences. The variant
   protein sequences were merged for each batch and combined with 16,130
   UniProt human proteins identified from the primary database search and
   179 common contaminant proteins to generate combined customized
   databases.

   All MS raw files were converted to MGF files using msconvert v3.0.1,
   and the precursor m/z values were replaced with the recalibrated values
   by Proteome Discoverer in the primary database search (this
   recalibrated MS2 spectra were used in all subsequent searches). The
   unidentified MS2 spectra from the primary database search were searched
   against the combined customized databases using MODplus v1.02^[378]94.
   The search parameters were set to 10 ppm for precursor mass tolerance,
   0.01 Da for-fragment mass tolerance, semi-tryptic enzyme specificity
   allowing for up to 2 missed cleavages, -1 to +2 for 13 C isotope error,
   carbamidomethylation ( + 57.021 Da) on Cys residues and TMT
   modification ( + 304.207 Da) on the peptide N terminus and Lys residues
   for static modifications. All modifications in Unimod database v2020.10
   (2355 entries) including all amino acid substitutions were considered
   for dynamic modification (multiply modified peptides were allowed
   within the modified mass range of -150 to +350 Da). All PSMs were
   validated at an estimated FDR of 1% using MODplus FDR toolkit.

   To identify neoantigen candidates, the proteomics-supported somatic
   mutations were filtered by examining the intensity of TMT reporter ion
   matched to the patient for whom the somatic mutation was called (the
   corresponding reporter ion intensity must account for at least 20% of
   the total intensity of all reporter ions). The immunogenicity of
   filtered somatic mutations was also evaluated by predicting binding
   affinity to human leukocyte antigen (HLA)-I molecules.

Identification of novel peptides

   All unidentified MS2 spectra from the primary and subsequent variant
   database searches were analyzed to identify novel peptides originating
   from pseudogenes, long noncoding RNAs (lncRNAs), untranslated regions
   (UTRs), and novel isoforms (including fusion genes). For pseudogenes,
   lncRNAs, and UTRs whose RNA transcripts’ FPKMs are greater than 0, the
   RNA sequences were selected from the GENCODE transcript fasta files
   v32, and their three frame translations were generated for each
   patient, except for UTRs, whose translations were generated by the UTR
   sequence database construction method^[379]55, which assumes that
   alternative cognate and near-cognate start codons and translational
   readthrough in the stop codons can result in abnormal translation.
   These noncoding peptide sequences were merged for each batch and
   matched to MS2 spectra by MS-GF+ v2021.09^[380]95 with the following
   search parameters: 10 ppm for precursor mass tolerance, tryptic enzyme
   specificity allowing up to 2 missed cleavages (also permitting
   non-enzymatic terminals), no 13 C isotope error allowed,
   carbamidomethylation ( + 57.021 Da) on Cys residues and TMT
   modification ( + 304.207 Da) on the peptide N terminus and Lys residues
   for static modifications, oxidation ( + 15.995 Da) on Met residues and
   deamidation ( + 0.984 Da) on Asn and Gln residues for dynamic
   modifications, and TMT protocol. The minimum length of a peptide was
   set at 8 residues. All PSMs were validated at an estimated FDR of 1%.

   The novel transcripts, including fusion genes, were predicted by
   StringTie v2.1.7 (class code =, c, k, m, n, and j)^[381]78,[382]96 and
   Arriba v1.2.0^[383]88, respectively, and their three frame translations
   were generated to compose the novel protein isoform sequences for each
   patient. The novel isoform sequences were merged for each batch and
   matched to MS2 spectra using Comet v2021.02.0^[384]97 with the
   following parameters: 10 ppm for precursor mass tolerance, semi-tryptic
   enzyme specificity allowing up to two missed cleavages, no isotope
   error allowed, carbamidomethylation ( + 57.021 Da) on Cys residues and
   TMT modification ( + 304.207 Da) on the peptide N terminus and Lys
   residues for static modifications, and oxidation ( + 15.995 Da) on Met
   residues and deamidation ( + 0.984 Da) on Asn and Gln residues for
   dynamic modifications. The minimum length of a peptide was set at 8
   residues. All PSMs were subsequently rescored by Percolator and
   validated at an estimated FDR of 1%.

   We rejected PSMs conflicting with both MS-GF+ and Comet (i.e.,
   identical spectra but different peptides assigned). The remaining novel
   peptides were searched using BLAST^[385]98 and filtered out if there
   were peptide sequence matches in the Uniprot (v2022.03, 226,999
   entries), RefSeq (v2022.09, 130,184 entries) and/or Gencode (v41,
   110,224 entries) human protein sequences, allowing no more than a
   single amino acid substitution. We also removed the frameshift peptides
   to concentrate solely on novel peptides derived from noncoding regions
   and novel isoforms. Finally, we used peptides supported by at least one
   RNA-Seq read and a quantifiable PSM to retrieve unique peptides at the
   gene level. In total, we obtained and used 1045 novel sequences for
   further analysis.

Quantification of Protein, Phosphosite and Acetylsite

   Peptide abundances were normalized to have the same total amount of
   peptide abundances between TMT channel values in the same batch by
   Proteome Discoverer v2.4 with the SequestHT search engine^[386]90
   against the UniProt human protein database v2021.01 (97,795 entries),
   combined with 179 common contaminant proteins. The co-isolation
   threshold 50% and the average reporter S/N threshold 10% were used to
   filter out quantification values of low quality. For global proteome
   quantification, proteins with at least one unique and/or razor peptides
   were exported, and normalization between batches was achieved by
   dividing each TMT channel value by that of the common reference (CR)
   channel in the same batch. The log[2] ratio value was used for the
   subsequent quantification analysis.

   The quantification of phosphosites and acetylsites was performed using
   the same method: the sum of peptide abundances containing each
   modification site was calculated to represent modification site
   abundance, and each TMT channel value was normalized using the same
   method as the global proteome data. After applying two-component
   normalization of TMT ratios for each proteomics dataset^[387]99, the
   log[2] ratio value was used for the subsequent quantification analysis.

   We selected features that have less than 30% missing values across all
   tumors for the global proteomics, phosphoproteomics, and
   acetylproteomics dataset using preprocessed normalized log[2] ratio
   values. After then, we performed k-nearest neighbor imputation (k = 5)
   for missing values using the impute R package (10.18129/B9.bioc.impute)
   for the subsequent quantification analysis.

NMF clustering analysis for multiomics subtypes

   To identify patient subtypes, we performed non-negative matrix
   factorization (NMF) analysis on global proteins and PTM sites as
   instructed in previous CPTAC studies^[388]10,[389]13. We concatenated
   preprocessed imputated global proteomics, phosphoproteomics, and
   acetylproteomics dataset into a single matrix. Then, we excluded
   features with the lowest standard deviation (bottom 5th percentile),
   followed by scale and z-score transformation. Using the NMF
   R-package^[390]100, we performed the NMF analysis for factorization
   ranks from 2 to 10 using the standard NMF algorithm from Brunet et
   al.^[391]101 and 2000 max iterations for convergence and repeated 50
   times to compute clustering statistics. We determined the optimal
   factorization rank k, representing the number of clusters, from the
   rank with the maximal cophenetic correlation coefficient value and its
   drastic decrease. With the optimal factorization rank k, we sought to
   robust clusters from the NMF analysis using 200 runs and 5000 max
   iterations. From decomposed NMF matrices, we assigned samples to NMF
   clusters corresponding to the optimal factorization rank k and obtained
   features specific to each cluster according to the row-wise feature
   score^[392]102. We defined a “core sample” for those having a cluster
   membership score ≥ 0.5 as described previously^[393]10,[394]13.

Combined NMF and Comparison of NMF subtypes with previous NSCLC subtypes

   To assess the concurrence between our subtypes and lung cancer subtypes
   defined through recent multi-omics studies, we conducted NMF analysis
   by integrating our study cohort with previous CPTAC study cohort, which
   contains LUAD^[395]10 and LSCC^[396]13. Prior to running NMF, each
   study was subjected to pre-processing using the methodology outlined in
   the “NMF clustering analysis for multiomics subtypes” section. Features
   pertaining to proteomics, phosphoproteomics, and acetylproteomics
   datasets were examined, and only those identified as commonly present
   in all three studies were selected and concatenated into a single
   matrix for NMF clustering. Using the NMF R-package, we sought to robust
   cluster from the NMF analysis using 200 runs and 5000 max iterations
   with rank 5. We analyzed the subtypes derived from NMF decomposition
   matrices for each individual study within the study cohort samples of
   LUAD and LSCC and compared them with the subtypes identified in the
   combined NMF results.

   To validate the robustness of our subtype classification in a recent
   NSCLC study^[397]10,[398]11,[399]13,[400]14 comprising 462 patients, we
   conducted a comparative analysis of subtype features across multiomics
   datasets included in each individual study. To identify the features
   associated with the subtypes of each study, we performed re-clustering
   of NMF features using the omics datasets and parameters employed in
   prior studies and selected features utilizing the “Max”
   method^[401]102. Gillette study was conducted with RNA, global-,
   phospho- and acetyl-proteomics datasets for NMF, and used 4 ranks, 200
   runs and 5000 max iterations. Satpathy study was conducted with CNV,
   RNA, global-, phospho-, and acetyl-proteomics dataset for NMF and used
   5 ranks, 500 runs, and 5000 max iterations. Also, Lehtio study was
   conducted with only global proteomics dataset for NMF, and used 6
   ranks, 100 runs, and 5000 max iterations. For Xu’s (2020) study, the
   up-regulated protein list for each subtype as presented was utilized as
   a subtype-specific features. To confirm a statistically significant
   relationship, an odds ratio and p value were calculated by conducting a
   Fisher’s exact test in R between each study cluster’s features.

Gene set enrichment analysis

   To obtain characteristic analysis information corresponding to NMF
   subtype based on the expression data of multi-omics of tumor samples,
   we performed PTM signature Enrichment Analysis (PTM-SEA^[402]103) for
   phosphoprotoemics dataset and R package GSVA^[403]104 for global-,
   phospho- and acetyl protoemic datasets.

   For GSVA, pathways from hallmark, KEGG, Reactome, GO and Wikipathway
   databases which were downloaded from MSigDB were considered and we used
   only pathways which contains more than 200 genes and lower than 1000
   genes in each pathway.

   database: “h.all.v2023.1.Hs.symbols.gmt.txt”,
   “c2.cp.kegg.v2023.1.Hs.symbols.gmt.txt”,
   “c2.cp.reactome.v2023.1.Hs.symbols.gmt.txt”,
   “c2.cp.wikipathways.v2023.1.Hs.symbols.gmt.txt”,
   “c5.go.v2023.1.Hs.symbols.gmt.txt”

   For PTM-SEA, flank amino acid sequence (+/- 7 aa) was added to PTM data
   as primary identifier and used. We computed normalized enrichment
   scores (NES) of gene sets. We used the implementation which contains
   PTM-SEA available on GitHub
   ([404]https://github.com/broadinstitue/ssGSEA2.0) using the command
   interface R-script (ssgsea-cli.R) using the following parameters:
     * database:”ptm.sig.db.all.flanking.human.v1.9.0.gmt”
     * sample. norm. Type: “rank”
     * output. score. Type = “NES”
     * nperm = 100

   For identifying NMF subtype feature pathway, we select Hallmark pathway
   gene sets (MSigDB), which summarize specific cancer-related biological
   states or processes for ssGSEA and PTM signatures database (PTMsigDB)
   containing a list of modification site-specific signatures.

Identification of differentially enriched copy number variations

   Gene-level copy numbers detected by CNVkit were converted to
   cytoband-level by calculating the mean copy numbers in each cytoband.
   Then we used a linear regression model to identify differentially
   enriched copy number variations in each NMF Subtype. To control the
   effect of the pathological diagnosis classification (DX), we set DX as
   a covariate.
   [MATH:
   <mi>C</mi><mi>y</mi><mi>t</mi><mi>o</mi><mi>b</mi><mi>a</mi><mi>n</mi><
   mi>d</mi><mo>−</mo><mi>l</mi><mi>e</mi><mi>v</mi><mi>e</mi><mi>l</mi><m
   space
   width="0.16em"></mspace><mi>c</mi><mi>o</mi><mi>p</mi><mi>y</mi><mspace
   width="0.16em"></mspace><mi>n</mi><mi>u</mi><mi>m</mi><mi>b</mi><mi>e</
   mi><mi>r</mi><mi>s</mi><mo>=</mo><msub><mrow><mi>β</mi></mrow><mrow><mn
   >0</mn></mrow></msub><mo>+</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>1
   </mn></mrow></msub><mo>⋅</mo><mfenced close=")"
   open="("><mrow><mi>S</mi><mi>u</mi><mi>b</mi><mi>t</mi><mi>y</mi><mi>p<
   /mi><mi>e</mi></mrow></mfenced><mo>+</mo><msub><mrow><mi>β</mi></mrow><
   mrow><mn>2</mn></mrow></msub><mo>⋅</mo><mfenced close=")"
   open="("><mrow><mi>D</mi><mi>X</mi></mrow></mfenced><mo>+</mo><mi>ε</mi
   > :MATH]

   β[0] and β[1] indicate the intercept and the log2FC value,
   respectively. Cytobands with an adjusted p-value (calculated using
   Benjamini–Hochberg method) less than 0.05 were selected as significant
   CNVs.

Identification of differentially expressed genes (DEGs)

   To identify differentially expressed genes between conditions, we used
   “DESeq” function provided by DESeq2 R packages^[405]76. Quantified
   transcript-level estimates from Salmon were imported to R using
   “tximport” and were converted to a gene-level expression matrix. With
   this gene-level expression matrix, DESeqDataSet was generated using
   “DESeqDataSetFromTximport” function. Then, differential expression
   analysis based on the negative binomial distribution was performed
   using “DESeq” function. Notably, we conducted the DEG analysis using
   the formula below.
   [MATH:
   <mi>d</mi><mi>e</mi><mi>s</mi><mi>i</mi><mi>g</mi><mi>n</mi><mspace
   width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><mo>~</mo><mi>G</mi><mi>r</mi><mi>o</mi><mi>u</
   mi><mi>p</mi><mspace width="0.25em"></mspace><mo>+</mo><mspace
   width="0.25em"></mspace><mi>D</mi><mi>X</mi><mspace
   width="0.25em"></mspace><mo>+</mo><mspace
   width="0.25em"></mspace><mi>r</mi><mi>n</mi><mi>a</mi><mo>.</mo><mi>b</
   mi><mi>a</mi><mi>t</mi><mi>c</mi><mi>h</mi> :MATH]

   In this model, Group represent the conditions under comparison. DX is
   the name of the pathological diagnosis classification, including LUAD
   or LSCC. Batch information for RNA (rna.batch) was also used as a
   covariate in DEG analysis to remove batch effects from DEG analysis.
   The adjusted p-value was calculated using the Benjamini-Hochberg
   method, and an adjusted p-value less than 0.05 was used as a criterion
   for differentially expressed genes.

Identification of differentially expressed proteins, phosphorylation, and
acetylation (DEP)

   We identified differentially expressed proteins, phosphorylation, and
   acetylation using linear regression models as below.
   [MATH: <mi>Y</mi><mspace width="0.25em"></mspace><mo>=</mo><mspace
   width="0.25em"></mspace><msub><mrow><mi>β</mi></mrow><mrow><mn>0</mn></
   mrow></msub><mo>+</mo><msub><mrow><mi>β</mi></mrow><mrow><mn>1</mn></mr
   ow></msub><mspace
   width="0.25em"></mspace><mo>⋅</mo><mrow><mo>(</mo><mrow><mi>G</mi><mi>r
   </mi><mi>o</mi><mi>u</mi><mi>p</mi></mrow><mo>)</mo></mrow><mo>+</mo><m
   sub><mrow><mi>β</mi></mrow><mrow><mn>2</mn></mrow></msub><mo>⋅</mo><mfe
   nced close=")"
   open="("><mrow><mi>D</mi><mi>X</mi></mrow></mfenced><mo>+</mo><msub><mr
   ow><mi>β</mi></mrow><mrow><mn>2</mn></mrow></msub><mo>⋅</mo><mfenced
   close=")"
   open="("><mrow><mi>B</mi><mi>a</mi><mi>t</mi><mi>c</mi><mi>h</mi></mrow
   ></mfenced><mo>+</mo><mi>ε</mi> :MATH]

   y[i] is a vector of dependent variables that represents the expression
   values of each proteomic feature. Group is a vector of independent
   variables that represents the state we are trying to compare. DX is the
   name of the pathological diagnosis classification, and Batch is a TMT
   channel number that represents batch information. β[0] is intercept,
   β[1] is beta coefficient for Group variables, and represents the
   log[2]FC value. The linear regression’s calculated p-value tests the
   null hypothesis that the coefficient is equal to zero, indicating no
   difference between groups. The adjusted p-value was calculated using
   Benjamini–Hochberg method, and an adjusted p-value less than 0.05 was
   used as a criterion for differentially expressed proteins,
   phosphorylation, and acetylation.

Survival analysis

   To compare the survival probability across our five NMF subtypes, we
   measured overall survival (OS) length (Supplementary Data [406]1a), the
   time from the date of tumor resection surgery to the time of
   NSCLC-induced death, for 229 patients. With NSCLC-death, we performed a
   survival analysis based on Kaplan–Meier estimation. In the Kaplan–Meier
   estimation model, NSCLC-death was used as an indicator of the ending
   time of the OS length. It is labeled as a right-censored observation in
   the case of patients with no deaths, which causes NSCLC at the end of
   the study. For patients who have died but whose cause of death is not
   NSCLC, we can use partial information that they survived beyond a
   certain point, but the exact date of NSCLC-death is uncertain.

   The Kaplan-Meier estimation was used to measure survival curves for
   each subtype. The Kaplan-Meier survival curve is defined as the
   probability of surviving for each length of time after tumor resection
   surgery while considering time at many small intervals. For each time
   interval, survival probability is calculated as the number of patients
   who survived divided by the number of patients at risk. Patients who
   have died, dropped out, or moved out are not counted as “patients at
   risk”. The total probability of survival until that time interval is a
   cumulative probability, calculated based on the law of multiplication
   of probability by multiplying all the probabilities of survival at all
   time intervals until a certain point^[407]105. We used the log-rank
   test to statistically test the null hypothesis that there is no
   difference between the survival curves of each subtype. For the
   log-rank test, the total number of observed events in each subtype,
   i.e., O[1] and O[2] was used, and the expected number of events in each
   group, i.e., E[1] and E[2], was calculated. The total number of
   expected events is calculated as the sum of the expected number of
   events at the time of each event in any of the subtype, bringing all
   subtypes together. The expected number of events at the time of each
   event is the result of multiplying the total number of patients
   surviving at the time of events in all subtypes by the risk of events
   at time^[408]105. Log-rank test statistic is calculated as below.
   [MATH:
   <mi>L</mi><mi>o</mi><mi>g</mi><mo>−</mo><mi>r</mi><mi>a</mi><mi>n</mi><
   mi>k</mi><mspace
   width="0.16em"></mspace><mi>t</mi><mi>e</mi><mi>s</mi><mi>t</mi><mspace
   width="0.16em"></mspace><mi>s</mi><mi>t</mi><mi>a</mi><mi>t</mi><mi>i</
   mi><mi>s</mi><mi>t</mi><mi>i</mi><mi>c</mi><mo>=</mo><mfrac><mrow><msup
   ><mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi>O</mi></mrow><mrow><mn>1</mn></mrow></ms
   ub><mo>−</mo><msub><mrow><mi>E</mi></mrow><mrow><mn>1</mn></mrow></msub
   ></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><ms
   ub><mrow><mi>E</mi></mrow><mrow><mn>1</mn></mrow></msub></mrow></mfrac>
   <mo>+</mo><mfrac><mrow><msup><mrow><mfenced close=")"
   open="("><mrow><msub><mrow><mi>O</mi></mrow><mrow><mn>2</mn></mrow></ms
   ub><mo>−</mo><msub><mrow><mi>E</mi></mrow><mrow><mn>2</mn></mrow></msub
   ></mrow></mfenced></mrow><mrow><mn>2</mn></mrow></msup></mrow><mrow><ms
   ub><mrow><mi>E</mi></mrow><mrow><mn>2</mn></mrow></msub></mrow></mfrac>
   :MATH]

   Since there are five subtypes in total, the pooled p value is
   calculated and shown in Fig. [409]2g. We perform pairwise comparison
   between Subtypes to decide which Subtypes represent poor prognosis
   subtype of NSCLC and adjusted p-value was calculated using the
   Benjamini–Hochberg procedure.

   To identify the molecular features that result in survival probability
   differences, we also performed feature-wise survival analyses comparing
   survival curves between group of patients with top 50% expression and
   bottom 50% expression for each protein and PTM features based on
   Kaplan-Meier estimation. The prognostic direction, which shows whether
   the prognosis is favorable or unfavorable, was determined by the sign
   of coefficient from Cox proportional-hazard (CPH) model not from
   Kaplan-Meier model. Since in many cases, the survival probability does
   not reach 0.5, it was difficult to compare the median survival time
   that can be used as an indicator of prognostic direction in the
   Kaplan–Meier model. CPH model was used for univariate analysis and
   categorical analysis. h(t) is hazard function determined by the
   univariate, high- or low expression group. h[0] is called the baseline
   hazard. It corresponds to the value of the hazard if x[1] is equal to
   zero. The coefficient b[1] measure the impact of univariate and tells
   us prognostic direction by sign. x[1] is categorical variable, low- or
   high expression group. When sign of the coefficient b[1] for high
   expression group is positive, its feature classified as unfavorable
   prognostic features and vice versa.
   [MATH: <mi>h</mi><mfenced close=")"
   open="("><mrow><mi>t</mi></mrow></mfenced><mo>=</mo><msub><mrow><mi>h</
   mi></mrow><mrow><mn>0</mn></mrow></msub><mfenced close=")"
   open="("><mrow><mi>t</mi></mrow></mfenced><mo>×</mo><mi>exp</mi><mrow><
   mo>(</mo><mrow><msub><mrow><mi>b</mi></mrow><mrow><mn>1</mn></mrow></ms
   ub><msub><mrow><mi>x</mi></mrow><mrow><mn>1</mn></mrow></msub></mrow><m
   o>)</mo></mrow> :MATH]

   For significance, a log-rank test was used, and the nominal
   p-value < 0.001 was used as a criterion for survival-associated
   features. The results of the feature-wise survival analysis are listed
   in Supplementary Data [410]2d. We selected the top 1% of features by
   statistical significance for survival-associated features.

Kinase activity estimation based on phosphoproteomic data

   To understand the Subtype-specific kinase activities, we used decoupleR
   R package to infer kinase activity from the results of differentially
   expressed phosphorylation (DEPP) using prior knowledge network, such as
   databases for kinase/phosphatase and substrate interactions^[411]106.
   We used prior knowledge network provided by PHONEMeS^[412]107. Also, we
   set minimum size of regulons to 5, and provided 1000 deregulated
   phosphorylation (each 500 phosphorylation for up-, down-regulated).
   There are a variety of methods to infer kinase activities and we used
   multivariate linear model (MLM)-based method as below:
   [MATH: <mi>Y</mi><mo>=</mo><mi>β</mi><mi>X</mi><mo>+</mo><mi
   mathvariant="normal">ψ</mi> :MATH]

   where the dependent variable Y represents the t-statistic measurements
   of phosphorylation from the results of the differentially expressed
   phosphorylation analysis, and the independent variable X is the
   connectivity matrix representing the associations with kinases. X[ij]
   equals 1 when phosphosite i is a known substrate of the kinase j, 0
   otherwise. Ψ represents the normally distributed error of the fit and β
   represents the scores of the kinase activity^[413]108. In decoupleR
   packages in R, we used “run_mlm” function to estimate kinase activity.
   P-value was obtained from a multivariate linear model and the adjusted
   p-value was calculated using the Benjamini–Hochberg adjustment.

Single-cell specific subtype distribution

   To assess the tumor microenvironment in five NSCLC subtypes by
   comparing subtype-specific genes with a diverse set of cell
   type-specific genes, using the dataset ‘local_extended.rds’ obtained
   from [414]https://luca.icbi.at. Subtype-specific gene sets were
   generated based on the results of differential gene expression analysis
   (DEGs), including only genes with significantly increased fold change
   and a significant adjusted p-value (<0.05).

   For the tumor versus tumor comparison, we considered only the cell
   type-specific genes originating from ‘tumor_primary’ cells. Conversely,
   for the tumor versus normal adjacent tissue (NAT) comparison, we used a
   subset of cell type-specific genes from both ‘tumor_primary’ and
   ‘normal_adjacent’ cells.

Clustering analysis for immune microenvironment

   We used transcriptome data of NSCLC patients to define the subtypes of
   tumor immune microenvironment. Two clustering approaches were used,
   which were based on the enrichment score of cell types^[415]41 and
   immune-related pathways^[416]42. The cell type-based clustering was
   performed with 205 tumors and 85 normal samples. The enrichment scores
   of 64 cell types were inferred by Xcell^[417]41, which performed gene
   set enrichment analysis based on the curated gene signatures for each
   cell type. The inferred cell enrichment scores were then normalized as
   z-score and consensus clustering was performed using R package
   CancerSubtypes^[418]109. After the consensus clustering, the
   partitioning around medoids (PAM) algorithm was used to divide the
   three clusters. We first assigned a NAT-enriched cluster among the
   three clusters when it was mainly matched to normal samples. After
   that, the immune score calculated by Xcell was used to determine the
   HTE and CTE immune clusters, so that the cluster with the higher immune
   score was defined as having an HTE tumor. We also performed
   pathway-based clustering with 205 tumor samples. For that, GSVA was
   performed across the patients based on seven curated immune-related
   pathways. The enrichment scores of the pathways were normalized as
   z-scores, and k-means clustering was performed based on the z-scores to
   identify two immune clusters. The cluster having a higher pathway
   enrichment score was defined as HTE tumor.

Survival and regression analysis with immune cluster and cell types

   The enrichment score of cell type and the status of the immune cluster
   across the patients were tested for their correlation with a set of
   clinical and molecular features. The pattern of TILs was examined by
   regression analysis using MASS package and GLM package of R. Overall
   survival and relapse-free survival were tested for their correlations
   with cell types and immune clusters by Cox proportional hazards
   regression analysis using CoxPHFitter package of Python. All analyses
   were performed with clinical histology and sample batches as
   covariates.

Identifying putative regulators associated with immune landscape

   To identify putative regulators associated with immune landscape, we
   first inferred protein activity using transcriptome data of the
   patients. For the protein activity, we obtained regulon networks of
   lung cancer from ARACNe package^[419]110 of R and calculated protein
   activity of each patient using viper tool^[420]44 that inferred how
   each protein regulates its target genes based the regulon network. The
   positive value of protein activity indicates that the protein
   positively regulates its target genes to be overexpressed, and the
   negative value indicates vice versa. We then calculated the correlation
   between the protein activity of immunomodulators and the enrichment
   score of cells, or the status of the immune cluster. The RNA and
   protein expression of the same immunomodulators was also analyzed for
   their associations with the immune landscapes. To corroborate the
   putative regulation of the immune landscape, we further analyzed
   whether a specific driver mutation was involved in the association
   between immunomodulators and immune cells or immune clusters. The
   curated driver mutations were derived from Oncovar^[421]111,
   DriverDBv3^[422]112, Intogen^[423]113, and mutation catalogue from
   Martínez-Jiménez et al.^[424]114.

Immune landscape analysis with the integrated cohort

   To test the reproducibility of our analysis about the immune landscape
   across NSCLC patients, we used two independent NSCLC cohorts of
   Satpathy et al.^[425]13 and Gillette et al.^[426]10. The z-score
   normalization was performed for the enrichment scores of 64 cell types
   in each cohort that had 202 (Satpathy et al.), 211 (Gillette et al.),
   and 290 (our cohort) patients. The normalized enrichment scores of cell
   types were then integrated into a single cohort and performed consensus
   clustering. PAM algorithm was used to make three clusters and they were
   mapped to NAT, HTE, and CTE, respectively, according to the method used
   in our original cohort.

Neoantigen and cryptic peptide prediction

   We predicted different types of neoantigens and cryptic peptides
   according to their origin and validation (Supplementary Fig. [427]6a).
   Firstly, canonical neoantigens were predicted from the mutated peptides
   that had a length between 9mer and 12mer amino acids including somatic
   mutations. The binding affinity of the mutated peptides was predicted
   using MHCflurry^[428]115 and NetMHCpan^[429]116 for the patient HLAs
   that were identified by OptiType^[430]117. Only the mutated peptides to
   be predicted as binding to the patient HLAs were defined as the
   potential neoantigen. Among the neoantigen candidates, we further
   defined “confirmed neoantigens” when they had evidence of MS
   experiments (see the section “Identification of variant and modified
   peptides”). The remaining two types were defined by non-canonical
   peptides that originated from unconventional translation of
   pseudogenes, lncRNAs, UTRs, or novel isoform transcripts (Refer to
   section “Identification of novel peptides”). We identified novel
   isoform transcripts that were expressed only in tumor samples when they
   matched normal samples. The mean expression of normal samples was used
   when the tumor sample did not match normal. We treated a degree of FPKM
   below 1 as indicating no expression. Only the novel peptides that
   occurred in more than two tumor samples and not in at least one normal
   sample were predicted as “cryptic peptides” using the same methodology
   for neoantigen candidates in the prediction of binding to HLAs. We
   further determined “confirmed cryptic peptide” when the cryptic
   peptides also showed substantial expression (FPKM > = 1) in the tumor
   samples as described in the previous study^[431]118.

Analysis of multi-omics subtype distribution according to the cryptic peptide
load and immune landscape status

   We analyzed the distribution of the patients of each multi-omics
   subtype across the status of cryptic peptide load and immune landscape.
   Two features were chosen as the surrogates of the immune landscape,
   which were the status of the immune cluster (HTE/CTE) and the antigen
   processing and presentation machinery (APM). The patients were first
   separated into two groups that had a high and low number of cryptic
   peptides. They were also divided into two other groups: HTE and CTE, or
   high APM and low APM. The status of APM was determined based on the
   pathway enrichment score of “Antigen_Processing_and_Presentation” in
   KEGG pathway^[432]119. We created four categories based on the status
   of cryptic peptide load and immune landscape, and the degree of
   over-representation for each multi-omics subtype was calculated by a
   two-sided Fisher’s exact test.

Patient-derived lung tumoroid culture

   Patient-derived lung tumoroid culture and drug screening were performed
   by SG Medical, Inc. (Seoul, Korea) and conducted as previously
   described^[433]120. Briefly, tissues were kept in a cold Hank’s
   balanced salt solution (HBSS) with antibiotics (Gibco, OK, USA) and on
   ice after dissection. Samples were washed three times with cold HBSS
   with antibiotics and sectioned with sterile blades. Sectioned samples
   were incubated with 0.001% DNase (Sigma-Aldrich, MO, USA), 1 mg/ml
   collagenase (Roche, IN, USA), 200 U/ml penicillin, and 200 mg/ml
   streptomycin in DMEM/F12 medium (Gibco, OK, USA) at 37 °C for 1 h with
   intermittent agitation. After incubation, the suspensions were
   repeatedly triturated by pipetting and passed through 40-μm cell
   strainers (BD Falcon, CA, USA). The strained cells were centrifuged at
   112 × g for 3 min, and the pellet was resuspended in lung tumoroid
   culture media (DMEM/F12 supplemented with 20 ng/ml of bFGF (Invitrogen,
   CA, USA), 50 ng/ml human EGF (Invitrogen), N2 (Invitrogen), B27
   (Invitrogen), 10 μM ROCK inhibitor (Enzo Life Sciences, NY, USA), and
   1% penicillin/streptomycin (Gibco, OK, USA). The suspension was mixed
   with matrigel and seeded onto 6-well plates. Culture media were
   replaced every 4 days.

Drug screening

   Lung tumoroids cultured were harvested and dissociated using TrypLE
   Express (Gibco, OK, USA). The dissociated lung tumoroids were diluted
   in a lung tumoroid culture media-matrigel mixture and seeded onto
   384-well plates (250 cells per well). After lung tumoroid generation, 8
   concentrations of Selinexor (Selleckchem, TX, USA) and vehicles (DMSO)
   as a negative control were added in triplicate. After 6 days,
   quantification of cell viability was done by adding 10 μl of
   CellTiter-Glo 3D (Promega) to each well according to the manufacturer’s
   instructions on a Varioskan LUX Multimode Microplate Reader (Thermo
   Fisher Scientific, MA, USA). The determination of IC50 values was
   conducted using GraphPad Prism.

CPTAC NSCLC data download and preprocessing

   Clinical data, MAF file, and gene-level CNV data are downloaded from
   the supplementary tables provided by previous studies^[434]10,[435]13.
   Segment-level CNV data is downloaded from the GDC data portal
   ([436]https://portal.gdc.cancer.gov), and global proteome,
   phosphoproteome, acetylproteome, and clinical data, including survival
   information, were downloaded from LinkedOmics
   ([437]http://www.linkedomics.org). For global, phospho-, and
   acetyl-proteomics data, we performed the imputation in the same way as
   in our cohort. Briefly, features with more than 30% missing values
   across all tumors were discarded, and k-NN imputation using k = 5 was
   performed. We conducted all downstream analyses using these data in the
   same manner as we did with our cohort.

Reporting summary

   Further information on research design is available in the [438]Nature
   Portfolio Reporting Summary linked to this article.

Supplementary information

   [439]Supplementary Information^ (6.2MB, pdf)
   [440]Peer Review File^ (4.8MB, pdf)
   [441]Reporting Summary^ (1.1MB, pdf)
   [442]41467_2024_54434_MOESM4_ESM.pdf^ (181.7KB, pdf)

   Description of Additional Supplementary Files
   [443]Supplementary Data 1^ (317.3KB, xlsx)
   [444]Supplementary Data 2^ (1.6MB, xlsx)
   [445]Supplementary Data 3^ (12.8MB, xlsx)
   [446]Supplementary Data 4^ (231.9KB, xlsx)
   [447]Supplementary Data 5^ (308.8KB, xlsx)
   [448]Supplementary Data 6^ (480.6KB, xlsx)
   [449]Supplementary Data 7^ (120MB, xlsx)

Acknowledgements